Analysis on San Francisco Crime Rate
M.Sc. Student: Jiaying Li, Supervisor: Dr. Ian McLeod
Department of Statistical and Actuarial Sciences, University of Western Ontario
Kaggle

•Kaggle aspires to be "The Home of Data Science". It curates many interesting challenges for modern data-analytic methods. Some challenges offer large monetary awards, while others are aimed at interested students. There are nearly half a million registered users on the website.
Data Description

•This dataset is available from Kaggle and comprises more than 1.7 million records of crimes in the city during the period 2003-2015. The response variable to be predicted is the crime type, which has 39 categories as summarized in the barchart.

•The main predictors available were the date (including time) and GPS coordinates. This dataset is an example of "long" data.

[Barchart: "Frequency of Crimes in San Francisco" — counts for each of the 39 categories, ranging from TREA (rarest) up to LARCENY/THEFT (most frequent); count axis labeled 0 to 150,000.]
Preliminary analysis

•The raw response "Category" is highly imbalanced, with lowest and highest frequencies of 1 and 174,588 respectively. In the preliminary analysis, "Category" was therefore aggregated and reduced to 5 levels based on a common legal classification of crimes.
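The aggregation step above can be sketched as a simple lookup. The poster does not list the exact legal grouping, so the mapping below is purely illustrative (the analysis itself was done in R; Python is used here only as a sketch):

```python
# Hypothetical aggregation of raw "Category" values into 5 broader levels.
# The actual legal classification used on the poster is not specified, so
# both the level names and the assignments here are illustrative only.
AGGREGATION = {
    "LARCENY/THEFT": "property",
    "VEHICLE THEFT": "property",
    "BURGLARY": "property",
    "ASSAULT": "violent",
    "ROBBERY": "violent",
    "DRUG/NARCOTIC": "statutory",
    "PROSTITUTION": "statutory",
    "NON-CRIMINAL": "non-criminal",
}

def aggregate(category: str) -> str:
    """Map a raw crime category to one of the aggregated levels,
    with an "other" catch-all for unmapped categories."""
    return AGGREGATION.get(category, "other")

print(aggregate("BURGLARY"))  # property
print(aggregate("GAMBLING"))  # other
```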
Visualization

•As shown in the map and the barchart above, crime types differ across regions, but they share a similar trend by hour of the day.
Methods Used

•Nearest-neighbour classifier, naive Bayes, quadratic discriminant analysis
•Multinomial regression
•One-vs.-rest and one-vs.-one
•Tree methods, and random forests with gradient boosting
•Neural nets, support vector machines
•Two-stage model combining the above

All methods are implemented in well-established R packages on CRAN.
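The one-vs.-rest reduction named above turns a K-class problem into K binary problems: classifier k learns "class k vs. everything else", and prediction takes the class with the highest binary score. A minimal sketch, with scoring functions as hypothetical stand-ins for trained binary classifiers (not the grpreg models from the poster, which were fit in R):

```python
def one_vs_rest_predict(x, scorers):
    """One-vs.-rest prediction: scorers maps each class label to a
    function returning a confidence that x belongs to that class;
    the label with the highest confidence wins."""
    return max(scorers, key=lambda label: scorers[label](x))

# Hypothetical binary scorers for two classes (illustrative only):
scorers = {
    "property": lambda x: x[0],
    "violent":  lambda x: 1 - x[0],
}
print(one_vs_rest_predict([0.8], scorers))  # property
```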
Evaluation

•For each test case, I provided both a posterior probability for each of the 39 categories and a final classification for the incident.

•Logloss = −(1/N) ∑_{i=1}^{N} ∑_{j=1}^{39} y_ij log(p_ij), where N is the number of observations in the test set, y_ij = 1 if observation i is in class j and 0 otherwise, and p_ij is the predicted probability that observation i belongs to class j.

•Classification rate = n_c / N, where n_c is the number of observations correctly classified.
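The two evaluation metrics can be computed directly from their definitions. A self-contained sketch (the poster's evaluation was done in R and via Kaggle's scorer; this Python version simply restates the formulas, including the probability clipping commonly applied to keep log(0) out of the sum):

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Multiclass log loss: -(1/N) * sum over observations of
    log p_{i, true class of i}.  Only the true class's term survives
    because y_ij is 0 for all other classes.
    y_true: true class index per observation; probs: per-observation
    lists of predicted class probabilities."""
    total = 0.0
    for yi, pi in zip(y_true, probs):
        p = min(max(pi[yi], eps), 1 - eps)  # clip away exact 0 and 1
        total += math.log(p)
    return -total / len(y_true)

def classification_rate(y_true, y_pred):
    """Fraction of observations whose predicted class matches the truth."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Tiny worked example: 3 observations, 3 classes.
y_true = [0, 1, 2]
probs = [[0.7, 0.2, 0.1], [0.3, 0.6, 0.1], [0.25, 0.25, 0.5]]
y_pred = [max(range(3), key=p.__getitem__) for p in probs]
print(round(log_loss(y_true, probs), 4))      # 0.5202
print(classification_rate(y_true, y_pred))    # 1.0
```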
Model outputs

•Predicting the raw response with 39 classes directly:

Model        Variables                  Classification rate  Logloss
naive Bayes  Pd,Hr,Yr,Week,DayOfWeek    22.85%               2.56
kNN          PdDistrict,Hr,Yr,X,Y       22.33%               11.17
QDA          X,Y                        8.48%                3.47
NNet         X,Y                        19.97%               2.68
SVM          PdDistrict,Hr,Yr,X,Y       22.41%               Not applicable
C5.0         X,Y                        28.07%               2.50
CART         X,Y                        28.01%               2.87
RF           PdDistrict,Hr,Yr,X,Y       23.78%               6.39
Boosting     PdDistrict,X,Y             25.34%               5.12

•Predicting the 5 aggregated levels only:

Model        Variables               Classification rate
naive Bayes  PdDistrict              38.78%
kNN          Pd,X,Y,Hr,Yr            30.05%
QDA          X,Y                     30.48%
NNet         DOW,Pd,X,Y,Hr,Yr        37.92%
ovo grpreg   DOW,Pd,X,Y,Hr,Yr        38.81%
ova grpreg   DOW,Pd,X,Y,Hr,Yr        38.78%
SVM          DOW,Pd,X,Y,Hr,Yr        37.92%
C5.0         X,Y                     44.02%
CART         X,Y (stratified)        43.33%
RF           PdDistrict,Hr,Yr,X,Y    39.72%
Boosting     X,Y                     41.61%
2-stage model

•One could group the crimes into 5 categories so that each class has approximately equal numbers, as far as possible, to obtain a more balanced response. The model first classifies incidents into the 5 aggregated levels; a second stage then predicts among the 39 crime categories.

Model            Variables             Classification rate  Logloss
C5.0+naiveBayes  PdDistrict            26.08%               2.94
C5.0+C5.0        X,Y                   27.11%               2.97
C5.0+Boosting    PdDistrict,Hr,Yr,X,Y  27.45%               2.95
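The two-stage idea can be sketched as a small pipeline: a first classifier assigns one of the 5 aggregated levels, then a level-specific classifier refines that into a raw category. The classifiers below are trivial majority-class stand-ins, not the C5.0/boosting models from the poster (which were fit in R):

```python
class MajorityClassifier:
    """Stand-in classifier that always predicts the most frequent
    label seen during training (illustrative only)."""
    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self
    def predict(self, X):
        return [self.label] * len(X)

def two_stage_predict(X, stage1, stage2_by_level):
    """Stage 1 assigns an aggregated level to each observation; stage 2
    refines it to a raw category using the classifier for that level."""
    levels = stage1.predict(X)
    return [stage2_by_level[lv].predict([x])[0] for lv, x in zip(levels, X)]

# Toy usage with hypothetical levels and categories:
X = [[0], [1], [2]]
stage1 = MajorityClassifier().fit(X, ["property", "property", "violent"])
stage2 = {
    "property": MajorityClassifier().fit(X, ["LARCENY/THEFT"] * 3),
    "violent":  MajorityClassifier().fit(X, ["ASSAULT"] * 3),
}
print(two_stage_predict(X, stage1, stage2))
```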
Conclusion

•My best predictor has a classification rate of 28%, which is much better than random guessing. I also submitted my best predictions, using C5.0, to the Kaggle website, where log loss (cross-entropy) is used to evaluate model performance. For my predictor this was 2.50, close to the best score so far on Kaggle of 2.26.

•After some trial and error, location emerged as the most important factor in all models. Time of day is useful to some extent, but models are more likely to suffer from overfitting and a decrease in predictive power if date and time are included as predictor variables.
References

1. Kahle, D. and Wickham, H. (2013). ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
2. Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1). Springer Series in Statistics. Springer, Berlin.
Acknowledgement

I would like to thank San Francisco OpenData for the data source, and Kaggle for the platform.
