Sydney Crime Spot
Mafalda Silva
14th November 2018
Today I will perform an analysis on a dataset related to crime in Sydney. The dataset reflect criminal incidents recorded by the NSW Police Force on their Computerised Operational Policing System (COPS) between January 2013 and March 2016.
The objective of the analysis is predict the type of crime based on local where it ocurred (Intersect, Street,…), day of week, season and suburb.
Variables description:
*Bcsrgrp: top level offence category
*Locsurb: suburb where the incident occurred
*Bcsrgccde: accuracy of the coordinates of the location of the incident
*Incmonth: the month the event occurred in
*Incday: the day of the week the incident occurred on
I will start with a exploratory analysis of the main variables that will be used and after, I will run a Multinomial Logistic Regression to predict type of crime, and give an example of the interpretation of the results obtained.
Set working directory to the folder where I have downloaded the datasets
setwd("C:/Users/Mafalda/Desktop/p-sydney-crime-spot-master/p-d4dsydney-hotspot/output")
Importing Data
library(readxl)
outdoor_crime_data= read_excel("outdoor-crime-data.xlsx",col_types = c("numeric", "numeric", "text",
"text", "text", "text", "text", "numeric",
"numeric", "numeric", "text", "numeric",
"text", "text", "date", "numeric",
"text", "text", "text", "numeric"))
data=outdoor_crime_data
Setting variables to factors as they are categorical
data$locsurb=as.factor(data$locsurb)
data$incday=as.factor(data$incday)
#Here i set the levels of factor ordered first by Drug Offences since this is the type of crime that ocurrs more often and I want to perform the model with this level as comparison level
data$bcsrgrp=factor(data$bcsrgrp,levels=c("Drug Offences","Theft","Assault","Malicious Damage","Robbery"))
data$incmonth=as.factor(data$incmonth)
data$incday=as.factor(data$incday)
Data Featuring
#Setting time tags for times (set intervals of 6 hours)
library(chron)
data$incsttm <- times(format(data$incsttm, "%H:%M:%S"))
time.tag = chron(times= c("00:00:00", "06:00:00", "12:00:00", "18:00:00","23:59:00"))
data$time.tag = cut(data$incsttm, breaks= time.tag,labels = c("00-06","06-12","12-18","18-00"),include.lowest = TRUE)
table(data$time.tag)
##
## 00-06 06-12 12-18 18-00
## 5404 3894 6565 7742
#changing variable month to season
library(plyr)
data$season <- as.factor(ifelse(data$incmonth %in% c("March", "April",
"May"), "spring",
ifelse(data$incmonth %in% c("June", "July",
"August"), "summer",
ifelse(data$incmonth %in% c("September", "October",
"November"), "fall", "winter"))))
table(data$season)
##
## fall spring summer winter
## 5457 6041 5211 6896
#adding column suburb region - I created a file with each suburb and respective area
surb <- read_excel("C:/Users/Mafalda/Desktop/p-sydney-crime-spot-master/p-d4dsydney-hotspot/data/surb.xlsx")
suburb_merge=merge(data,surb, by="locsurb")
data$suburbregio=suburb_merge$`suburb region`
table(data$suburbregio)
##
## Eastern Suburbs Inner West Suburbs
## 9589 9316
## Center Inner North Western Suburbs
## 4015 6
## Inner South Suburbs north-eastern suburbs
## 35 81
## south-eastern suburbs
## 563
Visualization
Prevalence of different crimes seems to be unevenly distributed with Drug offences being much more frequent:
library(ggplot2)
qplot(data$bcsrgrp, xlab= "Crime", main = "Crimes")+scale_y_continuous("Number of crimes")
It would be interesting to look at how crimes are distributed with respect to location, time of day, day of week, and month:
qplot(data$locsurb, xlab= "Locstion", main = "Crimes over Location") + scale_y_continuous("Number of crimes")+coord_flip()
qplot(data$time.tag, xlab= "Time of day", main = "Crimes over Time") +
scale_y_continuous("Number of crimes")
qplot(data$incday, xlab= "Day of Week", main = "Crimes over Day of Week") +
scale_y_continuous("Number of crimes")
qplot(data$incmonth, xlab= "Month", main = "Crimes over Month") +scale_y_continuous("Number of crimes")
Most of the crimes occur in Sydney, Surry Hills, Potts Point and Darlinghurst.
There does seem to exist a pattern in the occurrence of crime with respect to the dimension of time. The latter part of the day, Saturdays, and January witness more crime incidents, with respect to other corresponding time periods
Variables Association Chi Square Tests
bcsrgrp | season | Total | |||
---|---|---|---|---|---|
fall | spring | summer | winter | ||
Drug Offences | 2911 | 3338 | 2773 | 3863 | 12885 |
Theft | 1264 | 1246 | 1232 | 1390 | 5132 |
Assault | 1030 | 1150 | 929 | 1332 | 4441 |
Malicious Damage | 51 | 51 | 58 | 54 | 214 |
Robbery | 201 | 256 | 219 | 257 | 933 |
Total | 5457 | 6041 | 5211 | 6896 | 23605 | χ2=43.242 · df=12 · Cramer’s V=0.025 · p=0.000 |
There is association between type of crime and season (p-value~0) but the association is very weak( Cramer’s V = 0,025
sjt.xtab(data$bcsrgrp, data$bcsrgccde)
bcsrgrp | bcsrgccde | Total | |||||
---|---|---|---|---|---|---|---|
Address | Intersect | Landmark | Postcode | Street | Suburb | ||
Drug Offences | 1015 | 2821 | 1180 | 451 | 7277 | 141 | 12885 |
Theft | 1235 | 1325 | 15 | 355 | 2009 | 193 | 5132 |
Assault | 562 | 1108 | 439 | 273 | 1968 | 91 | 4441 |
Malicious Damage | 31 | 54 | 25 | 11 | 88 | 5 | 214 |
Robbery | 78 | 233 | 118 | 69 | 414 | 21 | 933 |
Total | 2921 | 5541 | 1777 | 1159 | 11756 | 451 | 23605 | χ2=1819.969 · df=20 · Cramer’s V=0.139 · Fisher’s p=0.000 |
There is association between type of crime and local (p-value~0) and the association is weak ( Cramer’s V = 0,139)
sjt.xtab(data$bcsrgrp, data$suburbregio)
bcsrgrp | suburbregio | Total | ||||||
---|---|---|---|---|---|---|---|---|
 Eastern Suburbs |  Inner West Suburbs | Center |
Inner North Western Suburbs |
Inner South Suburbs |
north-eastern suburbs |
south-eastern suburbs |
||
Drug Offences | 7274 | 2729 | 2327 | 0 | 0 | 0 | 555 | 12885 |
Theft | 1725 | 1764 | 1635 | 0 | 0 | 0 | 8 | 5132 |
Assault | 439 | 3921 | 0 | 0 | 0 | 81 | 0 | 4441 |
Malicious Damage | 74 | 87 | 53 | 0 | 0 | 0 | 0 | 214 |
Robbery | 77 | 815 | 0 | 6 | 35 | 0 | 0 | 933 |
Total | 9589 | 9316 | 4015 | 6 | 35 | 81 | 563 | 23605 | χ2=9863.569 · df=24 · Cramer’s V=0.323 · Fisher’s p=0.000 |
There is association between type of crime and suburbs (p-value~0) and the association is moderate ( Cramer’s V = 0,323)
sjt.xtab(data$bcsrgrp, data$incday)
bcsrgrp | incday | Total | ||||||
---|---|---|---|---|---|---|---|---|
Friday | Monday | Saturday | Sunday | Thursday | Tuesday | Wednesday | ||
Drug Offences | 2361 | 892 | 3118 | 1636 | 2077 | 1160 | 1641 | 12885 |
Theft | 813 | 689 | 871 | 724 | 687 | 688 | 660 | 5132 |
Assault | 620 | 429 | 959 | 1023 | 522 | 452 | 436 | 4441 |
Malicious Damage | 37 | 31 | 37 | 38 | 19 | 27 | 25 | 214 |
Robbery | 133 | 141 | 162 | 163 | 107 | 122 | 105 | 933 |
Total | 3964 | 2182 | 5147 | 3584 | 3412 | 2449 | 2867 | 23605 | χ2=764.367 · df=24 · Cramer’s V=0.090 · p=0.000 |
There is association between type of crime and day of week (p-value~0) but the association is very weak ( Cramer’s V = 0,09)
Modeling
As you could notice, there are some suburb regions with very low frequency. These I will exclude from the analysis since they are likely to be outliers and can influence the model.
library(Hmisc)
data_subset=subset(data,suburbregio%nin%"Inner South Suburbs" & suburbregio%nin%"Inner North Western Suburbs" & suburbregio%nin%"north-eastern suburbs" & suburbregio%nin%"south-eastern suburbs")
#Spliting data into train and test dataset
n = nrow(data_subset)
n_train = round(0.90 * n)
set.seed(123) # set a random seed for reproducibility
train_indices = sample(1:n, n_train)
train= data_subset[train_indices, ]
test = data_subset[-train_indices, ]
#Model
library(nnet)
model =multinom(bcsrgrp ~ bcsrgccde+season+incday+suburbregio, data = train)
## # weights: 90 (68 variable)
## initial value 33199.485258
## iter 10 value 19244.382048
## iter 20 value 19148.923039
## iter 30 value 18966.507730
## iter 40 value 18925.506163
## iter 50 value 18877.158584
## iter 60 value 18841.753676
## iter 70 value 18834.134958
## iter 80 value 18829.903720
## iter 90 value 18829.503426
## final value 18829.502386
## converged
#Predictions
predicted_class=predict(model,test)
table(predicted_class,test$bcsrgrp)
##
## predicted_class Drug Offences Theft Assault Malicious Damage Robbery
## Drug Offences 1011 305 135 10 15
## Theft 60 95 19 1 4
## Assault 172 130 272 7 56
## Malicious Damage 0 0 0 0 0
## Robbery 0 0 0 0 0
#Accuracy
1-mean(as.character(predicted_class) != as.character(test$bcsrgrp))
## [1] 0.6012216
Model Interpretation
library(stargazer)
multi1.rrr = exp(coef(model))
stargazer(model, type="text", coef=list(multi1.rrr), p.auto=FALSE, out="resultsodds.htm")
##
## ================================================================================
## Dependent variable:
## -------------------------------------------------
## Theft Assault Malicious Damage Robbery
## (1) (2) (3) (4)
## --------------------------------------------------------------------------------
## bcsrgccdeIntersect 0.387*** 0.648*** 0.549** 0.971
## (0.060) (0.080) (0.235) (0.153)
##
## bcsrgccdeLandmark 0.011*** 0.516*** 0.615* 1.094
## (0.274) (0.100) (0.285) (0.173)
##
## bcsrgccdePostcode 0.626*** 0.983 0.778 1.803***
## (0.092) (0.119) (0.358) (0.200)
##
## bcsrgccdeStreet 0.221*** 0.436*** 0.347*** 0.688***
## (0.055) (0.073) (0.215) (0.143)
##
## bcsrgccdeSuburb 1.033 1.205 0.889 1.979**
## (0.131) (0.183) (0.542) (0.305)
##
## seasonspring 1.007 1.276*** 0.965 1.658***
## (0.054) (0.064) (0.211) (0.111)
##
## seasonsummer 1.030 1.028 1.244 1.219*
## (0.055) (0.066) (0.201) (0.117)
##
## seasonwinter 0.897** 1.002 0.837 1.105
## (0.053) (0.062) (0.206) (0.110)
##
## incdayMonday 2.420*** 1.915*** 2.289*** 3.146***
## (0.074) (0.093) (0.260) (0.145)
##
## incdaySaturday 0.921 1.135* 0.764 0.944
## (0.063) (0.073) (0.247) (0.134)
##
## incdaySunday 1.432*** 2.342*** 1.509* 1.830***
## (0.069) (0.077) (0.248) (0.137)
##
## incdayThursday 1.060 0.899 0.641 0.827
## (0.068) (0.083) (0.294) (0.153)
##
## incdayTuesday 1.732*** 1.276*** 1.459 1.541***
## (0.072) (0.088) (0.267) (0.150)
##
## incdayWednesday 1.333*** 0.865* 1.111 1.084
## (0.070) (0.088) (0.267) (0.152)
##
## suburbregio Inner West Suburbs 2.880*** 24.119*** 3.139*** 27.527***
## (0.045) (0.059) (0.168) (0.125)
##
## suburbregioCenter 2.984*** 0.00000 2.191*** 0.000***
## (0.047) (93.270) (0.191) (0.000)
##
## Constant 0.585*** 0.080*** 0.020*** 0.008***
## (0.075) (0.103) (0.292) (0.209)
##
## --------------------------------------------------------------------------------
## Akaike Inf. Crit. 37,795.000 37,795.000 37,795.000 37,795.000
## ================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Interpretation is done taking into account the reference category (the one that is not present in the output).
The values in the output represent the relative risk ratios, which allow an easier interpretation of the logit coefficients. They are the exponentiated value of the logit coefficients. As an example:
Keeping all other variables constant, when comparing to the occurrence of Drug Offences:
*The odds of witness a Theft crime will decrease by 60% (0.387-1 x100) if being in an Intersect vs Address
*The odds of witness an Assault will decrease by 35% (0.648-1 x100) if being in an Intersect vs Address
*The odds of witness a Malicious Demage will decrease by 45% (0.569-1 x100) if being in an Intersect vs Address
No comments:
Post a Comment