Data Science & Analytics Trainee: November 2018

Sydney Crime Spot

Today I will perform an analysis on a dataset related to crime in Sydney. The dataset reflect criminal incidents recorded by the NSW Police Force on their Computerised Operational Policing System (COPS) between January 2013 and March 2016.

The objective of the analysis is predict the type of crime based on local where it ocurred (Intersect, Street,…), day of week, season and suburb.

Variables description:

*Bcsrgrp: top level offence category

*Locsurb: suburb where the incident occurred

*Bcsrgccde: accuracy of the coordinates of the location of the incident

*Incmonth: the month the event occurred in

*Incday: the day of the week the incident occurred on

I will start with a exploratory analysis of the main variables that will be used and after, I will run a Multinomial Logistic Regression to predict type of crime, and give an example of the interpretation of the results obtained.

Set working directory to the folder where I have downloaded the datasets

setwd("C:/Users/Mafalda/Desktop/p-sydney-crime-spot-master/p-d4dsydney-hotspot/output")

Importing Data

library(readxl)
outdoor_crime_data= read_excel("outdoor-crime-data.xlsx",col_types = c("numeric", "numeric", "text", 
        "text", "text", "text", "text", "numeric", 
        "numeric", "numeric", "text", "numeric", 
        "text", "text", "date", "numeric", 
        "text", "text", "text", "numeric"))
data=outdoor_crime_data

Setting variables to factors as they are categorical

data$locsurb=as.factor(data$locsurb)
data$incday=as.factor(data$incday)
#Here i set the levels of factor ordered first by Drug Offences since this is the type of crime that ocurrs more often and I want to perform the model with this level as comparison level 
data$bcsrgrp=factor(data$bcsrgrp,levels=c("Drug Offences","Theft","Assault","Malicious Damage","Robbery"))
data$incmonth=as.factor(data$incmonth)
data$incday=as.factor(data$incday)

Data Featuring

#Setting time tags for times (set intervals of 6 hours)
library(chron)
data$incsttm <- times(format(data$incsttm, "%H:%M:%S"))
time.tag = chron(times= c("00:00:00", "06:00:00", "12:00:00", "18:00:00","23:59:00"))
data$time.tag = cut(data$incsttm, breaks= time.tag,labels = c("00-06","06-12","12-18","18-00"),include.lowest = TRUE)
table(data$time.tag)

## 
## 00-06 06-12 12-18 18-00 
##  5404  3894  6565  7742

#changing variable month to season
library(plyr)
data$season <- as.factor(ifelse(data$incmonth %in% c("March", "April",
                                                     "May"), "spring",
                                ifelse(data$incmonth %in% c("June", "July",
                                                            "August"), "summer",
                                       ifelse(data$incmonth %in% c("September", "October",
                                                                   "November"), "fall",                                                                                   "winter"))))
table(data$season)

## 
##   fall spring summer winter 
##   5457   6041   5211   6896

#adding column suburb region - I created a file with each suburb and respective area
surb <- read_excel("C:/Users/Mafalda/Desktop/p-sydney-crime-spot-master/p-d4dsydney-hotspot/data/surb.xlsx")
suburb_merge=merge(data,surb, by="locsurb")
data$suburbregio=suburb_merge$`suburb region`
table(data$suburbregio)

## 
##             Eastern Suburbs          Inner West Suburbs 
##                        9589                        9316 
##                      Center Inner North Western Suburbs 
##                        4015                           6 
##         Inner South Suburbs       north-eastern suburbs 
##                          35                          81 
##       south-eastern suburbs 
##                         563

Visualization

Prevalence of different crimes seems to be unevenly distributed with Drug offences being much more frequent:

library(ggplot2) 
qplot(data$bcsrgrp, xlab= "Crime", main = "Crimes")+scale_y_continuous("Number of crimes")

It would be interesting to look at how crimes are distributed with respect to location, time of day, day of week, and month:

qplot(data$locsurb, xlab= "Locstion", main = "Crimes over Location") + scale_y_continuous("Number of crimes")+coord_flip()

qplot(data$time.tag, xlab= "Time of day", main = "Crimes over Time") +     
  scale_y_continuous("Number of crimes")

qplot(data$incday, xlab= "Day of Week", main = "Crimes over Day of Week") +  
  scale_y_continuous("Number of crimes")

qplot(data$incmonth, xlab= "Month", main = "Crimes over Month") +scale_y_continuous("Number of crimes")

Most of the crimes occur in Sydney, Surry Hills, Potts Point and Darlinghurst.

There does seem to exist a pattern in the occurrence of crime with respect to the dimension of time. The latter part of the day, Saturdays, and January witness more crime incidents, with respect to other corresponding time periods

Variables Association Chi Square Tests

bcsrgrp	season				Total
bcsrgrp	fall	spring	summer	winter	Total
Drug Offences	2911	3338	2773	3863	12885
Theft	1264	1246	1232	1390	5132
Assault	1030	1150	929	1332	4441
Malicious Damage	51	51	58	54	214
Robbery	201	256	219	257	933
Total	5457	6041	5211	6896	23605
χ²=43.242 · df=12 · Cramer’s V=0.025 · p=0.000

There is association between type of crime and season (p-value~0) but the association is very weak( Cramer’s V = 0,025

sjt.xtab(data$bcsrgrp, data$bcsrgccde)

bcsrgrp	bcsrgccde						Total
bcsrgrp	Address	Intersect	Landmark	Postcode	Street	Suburb	Total
Drug Offences	1015	2821	1180	451	7277	141	12885
Theft	1235	1325	15	355	2009	193	5132
Assault	562	1108	439	273	1968	91	4441
Malicious Damage	31	54	25	11	88	5	214
Robbery	78	233	118	69	414	21	933
Total	2921	5541	1777	1159	11756	451	23605
χ²=1819.969 · df=20 · Cramer’s V=0.139 · Fisher’s p=0.000

There is association between type of crime and local (p-value~0) and the association is weak ( Cramer’s V = 0,139)

sjt.xtab(data$bcsrgrp, data$suburbregio)

bcsrgrp	suburbregio							Total
bcsrgrp	Â Eastern Suburbs	Â Inner West Suburbs	Center	Inner North Western Suburbs	Inner South Suburbs	north-eastern suburbs	south-eastern suburbs	Total
Drug Offences	7274	2729	2327	0	0	0	555	12885
Theft	1725	1764	1635	0	0	0	8	5132
Assault	439	3921	0	0	0	81	0	4441
Malicious Damage	74	87	53	0	0	0	0	214
Robbery	77	815	0	6	35	0	0	933
Total	9589	9316	4015	6	35	81	563	23605
χ²=9863.569 · df=24 · Cramer’s V=0.323 · Fisher’s p=0.000

There is association between type of crime and suburbs (p-value~0) and the association is moderate ( Cramer’s V = 0,323)

sjt.xtab(data$bcsrgrp, data$incday)

bcsrgrp	incday							Total
bcsrgrp	Friday	Monday	Saturday	Sunday	Thursday	Tuesday	Wednesday	Total
Drug Offences	2361	892	3118	1636	2077	1160	1641	12885
Theft	813	689	871	724	687	688	660	5132
Assault	620	429	959	1023	522	452	436	4441
Malicious Damage	37	31	37	38	19	27	25	214
Robbery	133	141	162	163	107	122	105	933
Total	3964	2182	5147	3584	3412	2449	2867	23605
χ²=764.367 · df=24 · Cramer’s V=0.090 · p=0.000

There is association between type of crime and day of week (p-value~0) but the association is very weak ( Cramer’s V = 0,09)

Modeling

As you could notice, there are some suburb regions with very low frequency. These I will exclude from the analysis since they are likely to be outliers and can influence the model.

library(Hmisc)
data_subset=subset(data,suburbregio%nin%"Inner South Suburbs" & suburbregio%nin%"Inner North Western Suburbs" & suburbregio%nin%"north-eastern suburbs" & suburbregio%nin%"south-eastern suburbs")

#Spliting data into train and test dataset
n = nrow(data_subset)
n_train = round(0.90 * n)
set.seed(123) # set a random seed for reproducibility 
train_indices = sample(1:n, n_train)
train= data_subset[train_indices, ]
test = data_subset[-train_indices, ]

#Model
library(nnet)
model =multinom(bcsrgrp ~ bcsrgccde+season+incday+suburbregio, data = train)

## # weights:  90 (68 variable)
## initial  value 33199.485258 
## iter  10 value 19244.382048
## iter  20 value 19148.923039
## iter  30 value 18966.507730
## iter  40 value 18925.506163
## iter  50 value 18877.158584
## iter  60 value 18841.753676
## iter  70 value 18834.134958
## iter  80 value 18829.903720
## iter  90 value 18829.503426
## final  value 18829.502386 
## converged

#Predictions
predicted_class=predict(model,test)
table(predicted_class,test$bcsrgrp)

##                   
## predicted_class    Drug Offences Theft Assault Malicious Damage Robbery
##   Drug Offences             1011   305     135               10      15
##   Theft                       60    95      19                1       4
##   Assault                    172   130     272                7      56
##   Malicious Damage             0     0       0                0       0
##   Robbery                      0     0       0                0       0

#Accuracy
1-mean(as.character(predicted_class) != as.character(test$bcsrgrp))

## [1] 0.6012216

Model Interpretation

library(stargazer)
multi1.rrr = exp(coef(model))
stargazer(model, type="text", coef=list(multi1.rrr), p.auto=FALSE, out="resultsodds.htm")

## 
## ================================================================================
##                                               Dependent variable:               
##                                -------------------------------------------------
##                                  Theft     Assault   Malicious Damage  Robbery  
##                                   (1)        (2)           (3)           (4)    
## --------------------------------------------------------------------------------
## bcsrgccdeIntersect              0.387***   0.648***      0.549**        0.971   
##                                 (0.060)    (0.080)       (0.235)       (0.153)  
##                                                                                 
## bcsrgccdeLandmark               0.011***   0.516***       0.615*        1.094   
##                                 (0.274)    (0.100)       (0.285)       (0.173)  
##                                                                                 
## bcsrgccdePostcode               0.626***    0.983         0.778        1.803*** 
##                                 (0.092)    (0.119)       (0.358)       (0.200)  
##                                                                                 
## bcsrgccdeStreet                 0.221***   0.436***      0.347***      0.688*** 
##                                 (0.055)    (0.073)       (0.215)       (0.143)  
##                                                                                 
## bcsrgccdeSuburb                  1.033      1.205         0.889        1.979**  
##                                 (0.131)    (0.183)       (0.542)       (0.305)  
##                                                                                 
## seasonspring                     1.007     1.276***       0.965        1.658*** 
##                                 (0.054)    (0.064)       (0.211)       (0.111)  
##                                                                                 
## seasonsummer                     1.030      1.028         1.244         1.219*  
##                                 (0.055)    (0.066)       (0.201)       (0.117)  
##                                                                                 
## seasonwinter                    0.897**     1.002         0.837         1.105   
##                                 (0.053)    (0.062)       (0.206)       (0.110)  
##                                                                                 
## incdayMonday                    2.420***   1.915***      2.289***      3.146*** 
##                                 (0.074)    (0.093)       (0.260)       (0.145)  
##                                                                                 
## incdaySaturday                   0.921      1.135*        0.764         0.944   
##                                 (0.063)    (0.073)       (0.247)       (0.134)  
##                                                                                 
## incdaySunday                    1.432***   2.342***       1.509*       1.830*** 
##                                 (0.069)    (0.077)       (0.248)       (0.137)  
##                                                                                 
## incdayThursday                   1.060      0.899         0.641         0.827   
##                                 (0.068)    (0.083)       (0.294)       (0.153)  
##                                                                                 
## incdayTuesday                   1.732***   1.276***       1.459        1.541*** 
##                                 (0.072)    (0.088)       (0.267)       (0.150)  
##                                                                                 
## incdayWednesday                 1.333***    0.865*        1.111         1.084   
##                                 (0.070)    (0.088)       (0.267)       (0.152)  
##                                                                                 
## suburbregio Inner West Suburbs  2.880***  24.119***      3.139***     27.527*** 
##                                 (0.045)    (0.059)       (0.168)       (0.125)  
##                                                                                 
## suburbregioCenter               2.984***   0.00000       2.191***      0.000*** 
##                                 (0.047)    (93.270)      (0.191)       (0.000)  
##                                                                                 
## Constant                        0.585***   0.080***      0.020***      0.008*** 
##                                 (0.075)    (0.103)       (0.292)       (0.209)  
##                                                                                 
## --------------------------------------------------------------------------------
## Akaike Inf. Crit.              37,795.000 37,795.000    37,795.000    37,795.000
## ================================================================================
## Note:                                                *p<0.1; **p<0.05; ***p<0.01

Interpretation is done taking into account the reference category (the one that is not present in the output).

The values in the output represent the relative risk ratios, which allow an easier interpretation of the logit coefficients. They are the exponentiated value of the logit coefficients. As an example:

Keeping all other variables constant, when comparing to the occurrence of Drug Offences:

*The odds of witness a Theft crime will decrease by 60% (0.387-1 x100) if being in an Intersect vs Address

*The odds of witness an Assault will decrease by 35% (0.648-1 x100) if being in an Intersect vs Address

*The odds of witness a Malicious Demage will decrease by 45% (0.569-1 x100) if being in an Intersect vs Address

Data Science & Analytics Trainee

Wednesday, 14 November 2018

7th Project: Multinomial Logistic Regression - Predicting Crime in Sydney