Wednesday 14 November 2018

7th Project: Multinomial Logistic Regression - Predicting Crime in Sydney

Sydney Crime Spot

Today I will perform an analysis on a dataset related to crime in Sydney. The dataset reflect criminal incidents recorded by the NSW Police Force on their Computerised Operational Policing System (COPS) between January 2013 and March 2016.

The objective of the analysis is predict the type of crime based on local where it ocurred (Intersect, Street,…), day of week, season and suburb.

Variables description:

*Bcsrgrp: top level offence category

*Locsurb: suburb where the incident occurred

*Bcsrgccde: accuracy of the coordinates of the location of the incident

*Incmonth: the month the event occurred in

*Incday: the day of the week the incident occurred on

I will start with a exploratory analysis of the main variables that will be used and after, I will run a Multinomial Logistic Regression to predict type of crime, and give an example of the interpretation of the results obtained.

Set working directory to the folder where I have downloaded the datasets

setwd("C:/Users/Mafalda/Desktop/p-sydney-crime-spot-master/p-d4dsydney-hotspot/output")

Importing Data

library(readxl)
outdoor_crime_data= read_excel("outdoor-crime-data.xlsx",col_types = c("numeric", "numeric", "text", 
        "text", "text", "text", "text", "numeric", 
        "numeric", "numeric", "text", "numeric", 
        "text", "text", "date", "numeric", 
        "text", "text", "text", "numeric"))
data=outdoor_crime_data

Setting variables to factors as they are categorical

data$locsurb=as.factor(data$locsurb)
data$incday=as.factor(data$incday)
#Here i set the levels of factor ordered first by Drug Offences since this is the type of crime that ocurrs more often and I want to perform the model with this level as comparison level 
data$bcsrgrp=factor(data$bcsrgrp,levels=c("Drug Offences","Theft","Assault","Malicious Damage","Robbery"))
data$incmonth=as.factor(data$incmonth)
data$incday=as.factor(data$incday)

Data Featuring

#Setting time tags for times (set intervals of 6 hours)
library(chron)
data$incsttm <- times(format(data$incsttm, "%H:%M:%S"))
time.tag = chron(times= c("00:00:00", "06:00:00", "12:00:00", "18:00:00","23:59:00"))
data$time.tag = cut(data$incsttm, breaks= time.tag,labels = c("00-06","06-12","12-18","18-00"),include.lowest = TRUE)
table(data$time.tag)
## 
## 00-06 06-12 12-18 18-00 
##  5404  3894  6565  7742
#changing variable month to season
library(plyr)
data$season <- as.factor(ifelse(data$incmonth %in% c("March", "April",
                                                     "May"), "spring",
                                ifelse(data$incmonth %in% c("June", "July",
                                                            "August"), "summer",
                                       ifelse(data$incmonth %in% c("September", "October",
                                                                   "November"), "fall",                                                                                   "winter"))))
table(data$season)
## 
##   fall spring summer winter 
##   5457   6041   5211   6896
#adding column suburb region - I created a file with each suburb and respective area
surb <- read_excel("C:/Users/Mafalda/Desktop/p-sydney-crime-spot-master/p-d4dsydney-hotspot/data/surb.xlsx")
suburb_merge=merge(data,surb, by="locsurb")
data$suburbregio=suburb_merge$`suburb region`
table(data$suburbregio)
## 
##             Eastern Suburbs          Inner West Suburbs 
##                        9589                        9316 
##                      Center Inner North Western Suburbs 
##                        4015                           6 
##         Inner South Suburbs       north-eastern suburbs 
##                          35                          81 
##       south-eastern suburbs 
##                         563

Visualization

Prevalence of different crimes seems to be unevenly distributed with Drug offences being much more frequent:

library(ggplot2) 
qplot(data$bcsrgrp, xlab= "Crime", main = "Crimes")+scale_y_continuous("Number of crimes")

It would be interesting to look at how crimes are distributed with respect to location, time of day, day of week, and month:

qplot(data$locsurb, xlab= "Locstion", main = "Crimes over Location") + scale_y_continuous("Number of crimes")+coord_flip()

qplot(data$time.tag, xlab= "Time of day", main = "Crimes over Time") +     
  scale_y_continuous("Number of crimes")

qplot(data$incday, xlab= "Day of Week", main = "Crimes over Day of Week") +  
  scale_y_continuous("Number of crimes")

qplot(data$incmonth, xlab= "Month", main = "Crimes over Month") +scale_y_continuous("Number of crimes")

Most of the crimes occur in Sydney, Surry Hills, Potts Point and Darlinghurst.

There does seem to exist a pattern in the occurrence of crime with respect to the dimension of time. The latter part of the day, Saturdays, and January witness more crime incidents, with respect to other corresponding time periods

Variables Association Chi Square Tests

bcsrgrp season Total
fall spring summer winter
Drug Offences 2911 3338 2773 3863 12885
Theft 1264 1246 1232 1390 5132
Assault 1030 1150 929 1332 4441
Malicious Damage 51 51 58 54 214
Robbery 201 256 219 257 933
Total 5457 6041 5211 6896 23605
χ2=43.242 · df=12 · Cramer’s V=0.025 · p=0.000

There is association between type of crime and season (p-value~0) but the association is very weak( Cramer’s V = 0,025

sjt.xtab(data$bcsrgrp, data$bcsrgccde)
bcsrgrp bcsrgccde Total
Address Intersect Landmark Postcode Street Suburb
Drug Offences 1015 2821 1180 451 7277 141 12885
Theft 1235 1325 15 355 2009 193 5132
Assault 562 1108 439 273 1968 91 4441
Malicious Damage 31 54 25 11 88 5 214
Robbery 78 233 118 69 414 21 933
Total 2921 5541 1777 1159 11756 451 23605
χ2=1819.969 · df=20 · Cramer’s V=0.139 · Fisher’s p=0.000

There is association between type of crime and local (p-value~0) and the association is weak ( Cramer’s V = 0,139)

sjt.xtab(data$bcsrgrp, data$suburbregio)
bcsrgrp suburbregio Total
 Eastern Suburbs  Inner West Suburbs Center Inner North Western
Suburbs
Inner South Suburbs north-eastern
suburbs
south-eastern
suburbs
Drug Offences 7274 2729 2327 0 0 0 555 12885
Theft 1725 1764 1635 0 0 0 8 5132
Assault 439 3921 0 0 0 81 0 4441
Malicious Damage 74 87 53 0 0 0 0 214
Robbery 77 815 0 6 35 0 0 933
Total 9589 9316 4015 6 35 81 563 23605
χ2=9863.569 · df=24 · Cramer’s V=0.323 · Fisher’s p=0.000

There is association between type of crime and suburbs (p-value~0) and the association is moderate ( Cramer’s V = 0,323)

sjt.xtab(data$bcsrgrp, data$incday)
bcsrgrp incday Total
Friday Monday Saturday Sunday Thursday Tuesday Wednesday
Drug Offences 2361 892 3118 1636 2077 1160 1641 12885
Theft 813 689 871 724 687 688 660 5132
Assault 620 429 959 1023 522 452 436 4441
Malicious Damage 37 31 37 38 19 27 25 214
Robbery 133 141 162 163 107 122 105 933
Total 3964 2182 5147 3584 3412 2449 2867 23605
χ2=764.367 · df=24 · Cramer’s V=0.090 · p=0.000

There is association between type of crime and day of week (p-value~0) but the association is very weak ( Cramer’s V = 0,09)

Modeling

As you could notice, there are some suburb regions with very low frequency. These I will exclude from the analysis since they are likely to be outliers and can influence the model.

library(Hmisc)
data_subset=subset(data,suburbregio%nin%"Inner South Suburbs" & suburbregio%nin%"Inner North Western Suburbs" & suburbregio%nin%"north-eastern suburbs" & suburbregio%nin%"south-eastern suburbs")

#Spliting data into train and test dataset
n = nrow(data_subset)
n_train = round(0.90 * n)
set.seed(123) # set a random seed for reproducibility 
train_indices = sample(1:n, n_train)
train= data_subset[train_indices, ]
test = data_subset[-train_indices, ]

#Model
library(nnet)
model =multinom(bcsrgrp ~ bcsrgccde+season+incday+suburbregio, data = train)
## # weights:  90 (68 variable)
## initial  value 33199.485258 
## iter  10 value 19244.382048
## iter  20 value 19148.923039
## iter  30 value 18966.507730
## iter  40 value 18925.506163
## iter  50 value 18877.158584
## iter  60 value 18841.753676
## iter  70 value 18834.134958
## iter  80 value 18829.903720
## iter  90 value 18829.503426
## final  value 18829.502386 
## converged
#Predictions
predicted_class=predict(model,test)
table(predicted_class,test$bcsrgrp)
##                   
## predicted_class    Drug Offences Theft Assault Malicious Damage Robbery
##   Drug Offences             1011   305     135               10      15
##   Theft                       60    95      19                1       4
##   Assault                    172   130     272                7      56
##   Malicious Damage             0     0       0                0       0
##   Robbery                      0     0       0                0       0
#Accuracy
1-mean(as.character(predicted_class) != as.character(test$bcsrgrp))
## [1] 0.6012216

Model Interpretation

library(stargazer)
multi1.rrr = exp(coef(model))
stargazer(model, type="text", coef=list(multi1.rrr), p.auto=FALSE, out="resultsodds.htm")
## 
## ================================================================================
##                                               Dependent variable:               
##                                -------------------------------------------------
##                                  Theft     Assault   Malicious Damage  Robbery  
##                                   (1)        (2)           (3)           (4)    
## --------------------------------------------------------------------------------
## bcsrgccdeIntersect              0.387***   0.648***      0.549**        0.971   
##                                 (0.060)    (0.080)       (0.235)       (0.153)  
##                                                                                 
## bcsrgccdeLandmark               0.011***   0.516***       0.615*        1.094   
##                                 (0.274)    (0.100)       (0.285)       (0.173)  
##                                                                                 
## bcsrgccdePostcode               0.626***    0.983         0.778        1.803*** 
##                                 (0.092)    (0.119)       (0.358)       (0.200)  
##                                                                                 
## bcsrgccdeStreet                 0.221***   0.436***      0.347***      0.688*** 
##                                 (0.055)    (0.073)       (0.215)       (0.143)  
##                                                                                 
## bcsrgccdeSuburb                  1.033      1.205         0.889        1.979**  
##                                 (0.131)    (0.183)       (0.542)       (0.305)  
##                                                                                 
## seasonspring                     1.007     1.276***       0.965        1.658*** 
##                                 (0.054)    (0.064)       (0.211)       (0.111)  
##                                                                                 
## seasonsummer                     1.030      1.028         1.244         1.219*  
##                                 (0.055)    (0.066)       (0.201)       (0.117)  
##                                                                                 
## seasonwinter                    0.897**     1.002         0.837         1.105   
##                                 (0.053)    (0.062)       (0.206)       (0.110)  
##                                                                                 
## incdayMonday                    2.420***   1.915***      2.289***      3.146*** 
##                                 (0.074)    (0.093)       (0.260)       (0.145)  
##                                                                                 
## incdaySaturday                   0.921      1.135*        0.764         0.944   
##                                 (0.063)    (0.073)       (0.247)       (0.134)  
##                                                                                 
## incdaySunday                    1.432***   2.342***       1.509*       1.830*** 
##                                 (0.069)    (0.077)       (0.248)       (0.137)  
##                                                                                 
## incdayThursday                   1.060      0.899         0.641         0.827   
##                                 (0.068)    (0.083)       (0.294)       (0.153)  
##                                                                                 
## incdayTuesday                   1.732***   1.276***       1.459        1.541*** 
##                                 (0.072)    (0.088)       (0.267)       (0.150)  
##                                                                                 
## incdayWednesday                 1.333***    0.865*        1.111         1.084   
##                                 (0.070)    (0.088)       (0.267)       (0.152)  
##                                                                                 
## suburbregio Inner West Suburbs  2.880***  24.119***      3.139***     27.527*** 
##                                 (0.045)    (0.059)       (0.168)       (0.125)  
##                                                                                 
## suburbregioCenter               2.984***   0.00000       2.191***      0.000*** 
##                                 (0.047)    (93.270)      (0.191)       (0.000)  
##                                                                                 
## Constant                        0.585***   0.080***      0.020***      0.008*** 
##                                 (0.075)    (0.103)       (0.292)       (0.209)  
##                                                                                 
## --------------------------------------------------------------------------------
## Akaike Inf. Crit.              37,795.000 37,795.000    37,795.000    37,795.000
## ================================================================================
## Note:                                                *p<0.1; **p<0.05; ***p<0.01

Interpretation is done taking into account the reference category (the one that is not present in the output).

The values in the output represent the relative risk ratios, which allow an easier interpretation of the logit coefficients. They are the exponentiated value of the logit coefficients. As an example:

Keeping all other variables constant, when comparing to the occurrence of Drug Offences:

*The odds of witness a Theft crime will decrease by 60% (0.387-1 x100) if being in an Intersect vs Address

*The odds of witness an Assault will decrease by 35% (0.648-1 x100) if being in an Intersect vs Address

*The odds of witness a Malicious Demage will decrease by 45% (0.569-1 x100) if being in an Intersect vs Address

7th Project: Multinomial Logistic Regression - Predicting Crime in Sydney

Sydney Crime Spot Sydney Crime Spot Mafalda Silva 14th November 2018 T...