CRIME DATASET ANALYSIS
CITY OF CHICAGO (2001-PRESENT)
|Mining of Massive Dataset with MapReduce, Fall 2016|
By
|Stuti Deshpande (G00979218)|
|Amogh Gaikwad (G00979271)|
1. Introduction
1.1 Background
This report outlines how predictive crime analysis can help the Chicago Police Department
prevent criminal activity in the city and thereby reduce the crime rate. The Police
Department of the City of Chicago strives to improve its services to reduce the crime rate, and
this was the motivation behind the project. Our goal is to provide useful insights that
in turn lead to a reduction in the crime rate.
We chose this dataset because it is both interesting and complex to analyze, and because it
lets us study patterns of crime. We worked on prediction and analysis of the data
that could be useful to the Police Department when deciding which areas should receive more
resources. If the department wants to increase the number of arrests for a crime type, in which
area should it focus its efforts? Which type of crime is most likely to happen at a particular
type of location (street, sidewalk, ATM, …)?
This predictive system can be used as an aid that supplements the officers' experience
and helps them prevent a type of crime from occurring at a location in the upcoming
year-week.
1.2 Goals
The main objectives of our project are the following:
1. Classification:
1.1 To predict the probability of a type of crime occurring at a beat location (area
code) in the upcoming week. This task was done using the Random Forest
implementation for regression.
1.2 To predict whether a crime would result in an arrest or not.
2. Clustering:
To partition the dataset into a certain number of clusters (based on type
of crime and location) and find which type of crime is most likely to happen
at a particular location type (sidewalk, street, etc.), using K-means clustering.
1.3 Dataset
The dataset “Crimes (2001-Present)” for the city of Chicago used in this project was taken
from Data.gov, at the following link:
https://catalog.data.gov/dataset/crimes-2001-to-present-398a4
The dataset covers incidents from 2001 to the present and is still being updated.
Format: CSV (comma-separated)
Size: 2 GB
Number of rows: around 7 million
Number of Attributes: 24
1.4 Project Scope
The project scope is limited to predicting the crimes that may happen at a given location
(beat) in the upcoming year-week, predicting whether a given type of crime results in an arrest
or not, and identifying patterns. Plotting the “hotspots” with the highest crime rates on
Google Maps could be examined as part of further study of the subject.
2. Method
2.1 Data Pre-Processing
The crime dataset needed some data pre-processing, namely data cleaning and data normalization.
As part of data cleaning, we filtered out all the attributes that were not relevant to our
data analysis. A few of them are:
ID, Block, Case Number, Ward
The attributes of interest that were part of our data analysis are:
• Date: the date the crime occurred. Date-time format
• Location Description: the type of location where the crime occurred (sidewalk, ATM, …)
• Primary Type: the type of crime. Categorical attribute
• Arrest: whether or not an arrest was made for the crime. Binary attribute
• Domestic: whether or not the crime was a domestic crime, meaning that it was committed against
a family member. Binary attribute
• Beat: the area, or "beat", in which the crime occurred. This is the smallest regional division defined
by the Chicago Police Department. Categorical attribute
• District: the police district in which the crime occurred. Each district is composed of many beats,
and districts are defined by the Chicago Police Department. Categorical attribute
• Community Area, Year, Latitude and Longitude: all categorical attributes
We removed all Null/NA values from the dataset. The label and the other attributes of interest were
categorical, so for the predictions we had to convert the categorical attributes to numerical ones and
build labeled points in order to predict the label. Since the label had multiple classes,
we used multi-class classification methods to carry out the prediction tasks.
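The conversion from categorical attributes to numeric labeled points can be sketched as follows. This is a minimal illustration in plain Python rather than the PySpark code used in the project; the sample rows, the choice of features, and the helper names `build_index` and `to_labeled_point` are assumptions made for the example.

```python
from typing import Dict, List, Tuple

def build_index(values: List[str]) -> Dict[str, int]:
    """Map each distinct category to an integer index (sorted for stability)."""
    return {value: i for i, value in enumerate(sorted(set(values)))}

# Hypothetical rows: (primary_type, beat, location_description, arrest)
rows = [
    ("NARCOTICS", "1115", "SIDEWALK", "true"),
    ("ASSAULT", "524", "STREET", "false"),
    ("NARCOTICS", "524", "STREET", "true"),
]

# One index per categorical attribute
type_index = build_index([r[0] for r in rows])
beat_index = build_index([r[1] for r in rows])
loc_index = build_index([r[2] for r in rows])

def to_labeled_point(row: Tuple[str, str, str, str]) -> Tuple[float, List[float]]:
    """Return (label, features): the crime type is the label here (an assumption)."""
    primary_type, beat, location, arrest = row
    label = float(type_index[primary_type])
    features = [
        float(beat_index[beat]),
        float(loc_index[location]),
        1.0 if arrest == "true" else 0.0,  # binary attribute as 0/1
    ]
    return label, features

labeled_points = [to_labeled_point(r) for r in rows]
print(labeled_points)
```

In a Spark job, the same mapping would typically be applied per record inside a `map` over the RDD, with the index dictionaries built once beforehand.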
2.2 Technology Used
The technologies used were Excel and PySpark. The implementation is in Spark (using RDDs) and the code
is written in Python. The implementation was tested on the Mason server on the Hydra clusters, provided by
the Department of Computer Science at George Mason University.
2.3 Techniques
2.3.1 Classification
2.3.1.1 Naïve Bayes
We used the multiclass classification implementation of Naïve Bayes (RDD-based
API in spark.mllib) to predict whether a particular type of crime would result in an arrest or not, at the
beat level. A beat is the area code assigned by the Police Department to an area.
The reasons we chose this method are as follows:
◦ It is a simple multiclass classification algorithm that assumes independence
between features.
◦ It computes the conditional probability distribution of each feature given the label.
◦ It applies Bayes' theorem to compute the conditional probability distribution of the label given
the observations, and uses it for prediction.
The most important tuning parameter for the Naïve Bayes method is lambda, the smoothing parameter.
The trainer takes as input an RDD of LabeledPoint, an optional smoothing parameter lambda, and an optional
model type parameter (default is “multinomial”), and outputs a NaiveBayesModel, which can be used for
evaluation and prediction.
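To make the role of lambda concrete, the following is a minimal plain-Python sketch of multinomial Naïve Bayes with additive smoothing. It is not the spark.mllib implementation and not the project's code; the function names, the toy data, and the one-hot feature layout are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[int, Sequence[float]]  # (label, non-negative feature counts)

def train_multinomial_nb(samples: List[Sample], lam: float = 1.0):
    """Estimate log priors and log likelihoods with additive smoothing lam."""
    n_features = len(samples[0][1])
    labels = sorted({label for label, _ in samples})
    class_counts = {c: 0 for c in labels}
    feature_sums = {c: [0.0] * n_features for c in labels}
    for label, features in samples:
        class_counts[label] += 1
        for j, value in enumerate(features):
            feature_sums[label][j] += value

    total = len(samples)
    log_priors = {
        c: math.log((class_counts[c] + lam) / (total + lam * len(labels)))
        for c in labels
    }
    log_likelihoods = {}
    for c in labels:
        # lam keeps unseen (class, feature) pairs from getting zero probability
        denom = sum(feature_sums[c]) + lam * n_features
        log_likelihoods[c] = [math.log((s + lam) / denom) for s in feature_sums[c]]
    return log_priors, log_likelihoods

def predict(model, features: Sequence[float]) -> int:
    """Return the label with the highest posterior score."""
    log_priors, log_likelihoods = model
    scores = {
        c: log_priors[c] + sum(x * w for x, w in zip(features, log_likelihoods[c]))
        for c in log_priors
    }
    return max(scores, key=scores.get)

# Hypothetical toy data: label 1 = arrest, 0 = no arrest;
# features are one-hot indicators [is_narcotics, is_theft].
toy = [(1, [1, 0]), (1, [1, 0]), (0, [0, 1]), (0, [0, 1]), (1, [0, 1])]
model = train_multinomial_nb(toy, lam=1.0)
print(predict(model, [1, 0]), predict(model, [0, 1]))
```

With lambda set to 0, a crime type never seen with a given label would receive zero probability; larger values pull the estimates toward a uniform distribution.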
Snapshot of Prediction:
Week number: 49
Predicting the crimes next week at the beat level
[(8.0, ('214', 'HOMICIDE')), (8.0, ('212', 'DECEPTIVE PRACTICE'))]
The output shows that, for example, for beat number 214 there was a homicide that led to an
arrest.
We tested different values of lambda and obtained the best model with lambda = 1.0. On the test
data set, this model achieved an accuracy of 76.11%.
2.3.1.2 Random Forest
We used the regression implementation of Random Forest (RDD-based API in
spark.mllib). Random forests are ensembles of decision trees and are among the most successful
machine learning models for classification and regression. They combine many decision trees in order to reduce
the risk of overfitting. Like decision trees, random forests handle categorical features, extend to the multiclass
classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.
Random forests train a set of decision trees separately, so the training can be done in parallel. The
algorithm injects randomness into the training process so that each decision tree is a bit different. Combining
the predictions from each tree reduces the variance of the predictions, improving the performance on test data.
To make a prediction on a new instance, a random forest must aggregate the predictions from its set
of decision trees. We use regression, which aggregates by averaging: each tree predicts a real value, and the
label is predicted to be the average of the tree predictions. Averaging reduces the variance of the predictions.
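The averaging step can be illustrated with a small plain-Python sketch. The three "trees" below are hypothetical hand-written decision stumps, not trained models, and `forest_predict` is a name chosen for this example.

```python
from statistics import mean
from typing import Callable, List, Sequence

Tree = Callable[[Sequence[float]], float]

def forest_predict(trees: List[Tree], features: Sequence[float]) -> float:
    """Regression forest prediction: the average of the per-tree predictions."""
    return mean(tree(features) for tree in trees)

# Three hypothetical depth-1 trees, each splitting on the first feature
trees: List[Tree] = [
    lambda x: 0.5 if x[0] > 2 else 0.0,
    lambda x: 0.25 if x[0] > 1 else 0.0,
    lambda x: 0.75 if x[0] > 3 else 0.25,
]

prediction = forest_predict(trees, [3.5])
print(prediction)  # the trees predict 0.5, 0.25 and 0.75, so the average is 0.5
```

Because each tree sees a slightly different view of the training data, their individual errors partly cancel when averaged.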
The two important tuning parameters are:
1. numTrees: Number of trees in the forest. Increasing the number of trees decreases the
variance in predictions, improving the model’s test-time accuracy.
2. maxDepth: Maximum depth of each tree in the forest. Increasing the depth makes the model
more expressive and powerful, but deeper trees take longer to train and are more prone to overfitting.
The best model was obtained with numTrees = 8 and maxDepth = 10.
A snapshot of the prediction, showing the type of crime that may happen in each beat (area) in the
upcoming year-week:
Predicting Type of crime at the beat level,
Current Year = 2016
Next Week number: 49
Predicting the crimes next week at the beat level for Next Week,
[(0.43205554276664104, ('1115', 'CRIMINAL DAMAGE')),
(0.36705369702798235, ('524', 'ASSAULT')),
(0.16666666666666666, ('933', 'KIDNAPPING'))]
The output reads as follows: the current year is 2016 and the upcoming week is week number 49. For the
next week in the city of Chicago, we predicted, for example, that a kidnapping can happen in beat 933 with
a probability of 0.167 and an assault can happen in beat 524 with a probability of 0.367.
This model is 84.2% accurate, tested on the Hydra clusters.
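One way the "upcoming year-week" could be derived from the current date is sketched below. It assumes ISO 8601 week numbering, which may differ from the numbering used in the project; `next_year_week` is a hypothetical helper.

```python
from datetime import date, timedelta
from typing import Tuple

def next_year_week(today: date) -> Tuple[int, int]:
    """Return the ISO (year, week) of the week following `today`."""
    iso = (today + timedelta(days=7)).isocalendar()
    return iso[0], iso[1]

# A date in ISO week 48 of 2016 gives week 49 as the upcoming week
print(next_year_week(date(2016, 12, 1)))
```

Adding seven days before reading the ISO calendar also handles the year boundary, where the following week belongs to the next year.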
2.3.1.3 METHOD-3: CLUSTERING
3. DISCOVERIES: DATA VISUALIZATION
Patterns of arrests were analyzed by hour of day, day of week, month of year, and year for the
period 2001-Present. We selected the five most prevalent crimes in the city:
1. Narcotics
2. Assault
3. Robbery
4. Burglary
5. Homicide
Most arrests were made in the evening, from 7:00 pm to 10:00 pm, while the fewest
were made in the early morning, from 4:00 am to 7:00 am.
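An hour-of-day breakdown like the one described above can be computed with a simple count. The sketch below uses plain Python; the timestamp layout in `DATE_FORMAT` and the sample records are assumptions, and `arrests_by_hour` is a name chosen for illustration.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple

# Assumed timestamp layout, e.g. "05/03/2016 11:40:00 PM"
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"

def arrests_by_hour(records: Iterable[Tuple[str, bool]]) -> Counter:
    """Count arrests per hour of day (0-23) from (date_string, arrested) pairs."""
    counts: Counter = Counter()
    for date_string, arrested in records:
        if arrested:
            counts[datetime.strptime(date_string, DATE_FORMAT).hour] += 1
    return counts

# Hypothetical sample records
sample = [
    ("05/03/2016 07:40:00 PM", True),
    ("05/03/2016 09:15:00 PM", True),
    ("05/04/2016 05:05:00 AM", False),
    ("05/04/2016 07:59:00 PM", True),
]
hourly = arrests_by_hour(sample)
print(hourly.most_common(1))  # the hour with the most arrests in the sample
```

The same pattern works for day of week, month, or year by replacing `.hour` with `.weekday()`, `.month`, or `.year`.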
Most arrests for narcotics were made on Sundays. Overall, all types of crime occur about evenly across
the days of the week.
The summer months show that the whole city should be on high alert for all types of crime: the
distribution peaks during May-August and slopes downward on both sides.
The winter months are less prone to criminal activity.
From the year-wise distribution, it is evident that homicides, robberies and burglaries were
comparatively more frequent over the years 2001-2008, with a decreasing trend and a sudden increase again in
2014. Also, the most arrests were made for narcotics in 2011 and for assaults in
2014.
4. CONCLUSION
• The prediction and analysis of the data could be useful to the Police Department when deciding
which areas should receive more resources (this also depends on the crime type, which we have
covered in our analysis).
• If the department wants to increase the number of arrests for a particular crime type, in which area
should it focus its efforts? The beat number in our prediction gives an indication.
• Which type of crime is most likely to happen at a particular location type (sidewalk, street, ATM, …)?
• From the visualization, it is evident that the crimes that led to arrests, such as narcotics and
assaults, were the most frequent crimes on a weekly, monthly and yearly basis, with homicides and
robberies also happening frequently, but not to the same extent. Also,
assaults and narcotics were the most frequent crimes for the years 2011 and 2014.
5. FUTURE WORK
This project can be extended to include map visualization, using the location descriptors given as
attributes (latitude and longitude), which describe the location on the map where the incident
occurred.
We can plot the hotspots, i.e. areas with high crime rates for a given type of crime in a given beat/district.
Analyzing these will help the Police Department of the City of Chicago allocate more resources to red zones
with high criminal activity, improve its services in those areas, and stay alert to crimes
that may happen so that incidents can be avoided.

More Related Content

What's hot

Network Analysis in ArcGIS
Network Analysis in ArcGISNetwork Analysis in ArcGIS
Network Analysis in ArcGISJohn Reiser
 
AIR POLLUTION MONITORING USING RS
AIR POLLUTION MONITORING USING RSAIR POLLUTION MONITORING USING RS
AIR POLLUTION MONITORING USING RSAbhiram Kanigolla
 
Introduction to GIS
Introduction to GISIntroduction to GIS
Introduction to GISEhsan Hamzei
 
Applications of GIS in Municipal Solid Waste Management
Applications of GIS in Municipal Solid Waste ManagementApplications of GIS in Municipal Solid Waste Management
Applications of GIS in Municipal Solid Waste ManagementVignesh Sekar
 
A case study on GIS application
A case study on GIS applicationA case study on GIS application
A case study on GIS applicationAxay Sharma
 
GIS - Project Planning and Implementation
GIS - Project Planning and ImplementationGIS - Project Planning and Implementation
GIS - Project Planning and ImplementationMalla Reddy University
 
TYBSC IT PGIS Unit II Chapter I Data Management and Processing Systems
TYBSC IT PGIS Unit II Chapter I Data Management and Processing SystemsTYBSC IT PGIS Unit II Chapter I Data Management and Processing Systems
TYBSC IT PGIS Unit II Chapter I Data Management and Processing SystemsArti Parab Academics
 
Spatial analysis and modeling
Spatial analysis and modelingSpatial analysis and modeling
Spatial analysis and modelingTolasa_F
 
Applications of gis in planning
Applications of gis in planningApplications of gis in planning
Applications of gis in planningKU Leuven
 
Shortest route and mst
Shortest route and mstShortest route and mst
Shortest route and mstAlona Salva
 
Application of GIS (Geographical information system)
Application of GIS (Geographical information system)Application of GIS (Geographical information system)
Application of GIS (Geographical information system)Fayaz Ahamed A P
 
Part 2 - data informasi, data spasial dan data raster (GIS)
Part 2 - data informasi, data spasial dan data raster (GIS)Part 2 - data informasi, data spasial dan data raster (GIS)
Part 2 - data informasi, data spasial dan data raster (GIS)Feri Nugroho
 
Pedoman Penyusunan Strategi Sanitasi Kabupaten/Kota (SSK) 2014
Pedoman Penyusunan Strategi Sanitasi Kabupaten/Kota (SSK) 2014Pedoman Penyusunan Strategi Sanitasi Kabupaten/Kota (SSK) 2014
Pedoman Penyusunan Strategi Sanitasi Kabupaten/Kota (SSK) 2014infosanitasi
 
Gis applications in tourism a tool for sustainable tourism
Gis applications in tourism  a tool for sustainable tourismGis applications in tourism  a tool for sustainable tourism
Gis applications in tourism a tool for sustainable tourismpankaj kumar
 
Removal of iron &; Manganese
Removal of iron &; ManganeseRemoval of iron &; Manganese
Removal of iron &; ManganeseGAURAV. H .TANDON
 

What's hot (20)

Network Analysis in ArcGIS
Network Analysis in ArcGISNetwork Analysis in ArcGIS
Network Analysis in ArcGIS
 
Pemilihan Lokasi TPA Metode Legrand
Pemilihan Lokasi TPA Metode LegrandPemilihan Lokasi TPA Metode Legrand
Pemilihan Lokasi TPA Metode Legrand
 
AIR POLLUTION MONITORING USING RS
AIR POLLUTION MONITORING USING RSAIR POLLUTION MONITORING USING RS
AIR POLLUTION MONITORING USING RS
 
Introduction to GIS
Introduction to GISIntroduction to GIS
Introduction to GIS
 
L 13 grit chamber
L 13 grit chamberL 13 grit chamber
L 13 grit chamber
 
Introduction to GIS
Introduction to GISIntroduction to GIS
Introduction to GIS
 
Applications of GIS in Municipal Solid Waste Management
Applications of GIS in Municipal Solid Waste ManagementApplications of GIS in Municipal Solid Waste Management
Applications of GIS in Municipal Solid Waste Management
 
A case study on GIS application
A case study on GIS applicationA case study on GIS application
A case study on GIS application
 
GIS - Project Planning and Implementation
GIS - Project Planning and ImplementationGIS - Project Planning and Implementation
GIS - Project Planning and Implementation
 
TYBSC IT PGIS Unit II Chapter I Data Management and Processing Systems
TYBSC IT PGIS Unit II Chapter I Data Management and Processing SystemsTYBSC IT PGIS Unit II Chapter I Data Management and Processing Systems
TYBSC IT PGIS Unit II Chapter I Data Management and Processing Systems
 
Spatial analysis and modeling
Spatial analysis and modelingSpatial analysis and modeling
Spatial analysis and modeling
 
Applications of gis in planning
Applications of gis in planningApplications of gis in planning
Applications of gis in planning
 
Shortest route and mst
Shortest route and mstShortest route and mst
Shortest route and mst
 
Application of GIS (Geographical information system)
Application of GIS (Geographical information system)Application of GIS (Geographical information system)
Application of GIS (Geographical information system)
 
Introduction to GIS
Introduction to GISIntroduction to GIS
Introduction to GIS
 
Part 2 - data informasi, data spasial dan data raster (GIS)
Part 2 - data informasi, data spasial dan data raster (GIS)Part 2 - data informasi, data spasial dan data raster (GIS)
Part 2 - data informasi, data spasial dan data raster (GIS)
 
Pedoman Penyusunan Strategi Sanitasi Kabupaten/Kota (SSK) 2014
Pedoman Penyusunan Strategi Sanitasi Kabupaten/Kota (SSK) 2014Pedoman Penyusunan Strategi Sanitasi Kabupaten/Kota (SSK) 2014
Pedoman Penyusunan Strategi Sanitasi Kabupaten/Kota (SSK) 2014
 
Gis applications in tourism a tool for sustainable tourism
Gis applications in tourism  a tool for sustainable tourismGis applications in tourism  a tool for sustainable tourism
Gis applications in tourism a tool for sustainable tourism
 
Removal of iron &; Manganese
Removal of iron &; ManganeseRemoval of iron &; Manganese
Removal of iron &; Manganese
 
Gis in banking (1) final
Gis in banking (1) finalGis in banking (1) final
Gis in banking (1) final
 

Similar to Crime Dataset Analysis for City of Chicago

San Francisco Crime Prediction Report
San Francisco Crime Prediction ReportSan Francisco Crime Prediction Report
San Francisco Crime Prediction ReportRohit Dandona
 
San Francisco Crime Analysis Classification Kaggle contest
San Francisco Crime Analysis Classification Kaggle contestSan Francisco Crime Analysis Classification Kaggle contest
San Francisco Crime Analysis Classification Kaggle contestSameer Darekar
 
Analysis of Crime Big Data using MapReduce
Analysis of Crime Big Data using MapReduceAnalysis of Crime Big Data using MapReduce
Analysis of Crime Big Data using MapReduceKaushik Rajan
 
LokeshShanmuganandam_BigData_FinalProjectReport
LokeshShanmuganandam_BigData_FinalProjectReportLokeshShanmuganandam_BigData_FinalProjectReport
LokeshShanmuganandam_BigData_FinalProjectReportlokesh shanmuganandam
 
IRJET - Crime Analysis and Prediction - by using DBSCAN Algorithm
IRJET -  	  Crime Analysis and Prediction - by using DBSCAN AlgorithmIRJET -  	  Crime Analysis and Prediction - by using DBSCAN Algorithm
IRJET - Crime Analysis and Prediction - by using DBSCAN AlgorithmIRJET Journal
 
9th may net sci presentation (1)
9th may net sci presentation (1)9th may net sci presentation (1)
9th may net sci presentation (1)Rajath Mahesh
 
Chicago Crime Dataset Project Proposal
Chicago Crime Dataset Project ProposalChicago Crime Dataset Project Proposal
Chicago Crime Dataset Project ProposalAashri Tandon
 
CRIME ANALYSIS AND PREDICTION USING MACHINE LEARNING
CRIME ANALYSIS AND PREDICTION USING MACHINE LEARNINGCRIME ANALYSIS AND PREDICTION USING MACHINE LEARNING
CRIME ANALYSIS AND PREDICTION USING MACHINE LEARNINGIRJET Journal
 
IRJET- Cyber Crime Attack Prediction
IRJET- Cyber Crime Attack PredictionIRJET- Cyber Crime Attack Prediction
IRJET- Cyber Crime Attack PredictionIRJET Journal
 
Crime Data Analysis and Prediction for city of Los Angeles
Crime Data Analysis and Prediction for city of Los AngelesCrime Data Analysis and Prediction for city of Los Angeles
Crime Data Analysis and Prediction for city of Los AngelesHeta Parekh
 
Predictive Modeling for Topographical Analysis of Crime Rate
Predictive Modeling for Topographical Analysis of Crime RatePredictive Modeling for Topographical Analysis of Crime Rate
Predictive Modeling for Topographical Analysis of Crime RateIRJET Journal
 
SENTIMENT ANALYSIS AND GEOGRAPHICAL ANALYSIS FOR ENHANCING SECURITY
SENTIMENT ANALYSIS AND GEOGRAPHICAL ANALYSIS FOR ENHANCING SECURITYSENTIMENT ANALYSIS AND GEOGRAPHICAL ANALYSIS FOR ENHANCING SECURITY
SENTIMENT ANALYSIS AND GEOGRAPHICAL ANALYSIS FOR ENHANCING SECURITYSangeetha Mam
 
Crime Data Analysis, Visualization and Prediction using Data Mining
Crime Data Analysis, Visualization and Prediction using Data MiningCrime Data Analysis, Visualization and Prediction using Data Mining
Crime Data Analysis, Visualization and Prediction using Data MiningAnavadya Shibu
 
Predictive Modeling for Topographical Analysis of Crime Rate
Predictive Modeling for Topographical Analysis of Crime RatePredictive Modeling for Topographical Analysis of Crime Rate
Predictive Modeling for Topographical Analysis of Crime RateIRJET Journal
 
A Survey on Data Mining Techniques for Crime Hotspots Prediction
A Survey on Data Mining Techniques for Crime Hotspots PredictionA Survey on Data Mining Techniques for Crime Hotspots Prediction
A Survey on Data Mining Techniques for Crime Hotspots PredictionIJSRD
 

Similar to Crime Dataset Analysis for City of Chicago (20)

San Francisco Crime Prediction Report
San Francisco Crime Prediction ReportSan Francisco Crime Prediction Report
San Francisco Crime Prediction Report
 
San Francisco Crime Analysis Classification Kaggle contest
San Francisco Crime Analysis Classification Kaggle contestSan Francisco Crime Analysis Classification Kaggle contest
San Francisco Crime Analysis Classification Kaggle contest
 
Analysis of Crime Big Data using MapReduce
Analysis of Crime Big Data using MapReduceAnalysis of Crime Big Data using MapReduce
Analysis of Crime Big Data using MapReduce
 
LokeshShanmuganandam_BigData_FinalProjectReport
LokeshShanmuganandam_BigData_FinalProjectReportLokeshShanmuganandam_BigData_FinalProjectReport
LokeshShanmuganandam_BigData_FinalProjectReport
 
IRJET - Crime Analysis and Prediction - by using DBSCAN Algorithm
IRJET -  	  Crime Analysis and Prediction - by using DBSCAN AlgorithmIRJET -  	  Crime Analysis and Prediction - by using DBSCAN Algorithm
IRJET - Crime Analysis and Prediction - by using DBSCAN Algorithm
 
Netsci
NetsciNetsci
Netsci
 
9th may net sci presentation (1)
9th may net sci presentation (1)9th may net sci presentation (1)
9th may net sci presentation (1)
 
Chicago Crime Dataset Project Proposal
Chicago Crime Dataset Project ProposalChicago Crime Dataset Project Proposal
Chicago Crime Dataset Project Proposal
 
CRIME ANALYSIS AND PREDICTION USING MACHINE LEARNING
CRIME ANALYSIS AND PREDICTION USING MACHINE LEARNINGCRIME ANALYSIS AND PREDICTION USING MACHINE LEARNING
CRIME ANALYSIS AND PREDICTION USING MACHINE LEARNING
 
IRJET- Cyber Crime Attack Prediction
IRJET- Cyber Crime Attack PredictionIRJET- Cyber Crime Attack Prediction
IRJET- Cyber Crime Attack Prediction
 
Crime Data Analysis and Prediction for city of Los Angeles
Crime Data Analysis and Prediction for city of Los AngelesCrime Data Analysis and Prediction for city of Los Angeles
Crime Data Analysis and Prediction for city of Los Angeles
 
Technical Seminar
Technical SeminarTechnical Seminar
Technical Seminar
 
Predictive Modeling for Topographical Analysis of Crime Rate
Predictive Modeling for Topographical Analysis of Crime RatePredictive Modeling for Topographical Analysis of Crime Rate
Predictive Modeling for Topographical Analysis of Crime Rate
 
SENTIMENT ANALYSIS AND GEOGRAPHICAL ANALYSIS FOR ENHANCING SECURITY
SENTIMENT ANALYSIS AND GEOGRAPHICAL ANALYSIS FOR ENHANCING SECURITYSENTIMENT ANALYSIS AND GEOGRAPHICAL ANALYSIS FOR ENHANCING SECURITY
SENTIMENT ANALYSIS AND GEOGRAPHICAL ANALYSIS FOR ENHANCING SECURITY
 
Kyung Kim
Kyung KimKyung Kim
Kyung Kim
 
Crime Data Analysis, Visualization and Prediction using Data Mining
Crime Data Analysis, Visualization and Prediction using Data MiningCrime Data Analysis, Visualization and Prediction using Data Mining
Crime Data Analysis, Visualization and Prediction using Data Mining
 
Predictive Modeling for Topographical Analysis of Crime Rate
Predictive Modeling for Topographical Analysis of Crime RatePredictive Modeling for Topographical Analysis of Crime Rate
Predictive Modeling for Topographical Analysis of Crime Rate
 
Bs4301396400
Bs4301396400Bs4301396400
Bs4301396400
 
A Survey on Data Mining Techniques for Crime Hotspots Prediction
A Survey on Data Mining Techniques for Crime Hotspots PredictionA Survey on Data Mining Techniques for Crime Hotspots Prediction
A Survey on Data Mining Techniques for Crime Hotspots Prediction
 
Final presentation
Final presentationFinal presentation
Final presentation
 

Recently uploaded

GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 

Recently uploaded (20)

GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
1.3 Dataset
The dataset “Crimes (2001-Present)” for the city of Chicago was taken from Data.gov, at the following link:
https://catalog.data.gov/dataset/crimes-2001-to-present-398a4
The dataset covers incidents from 2001 to the present and is still being updated.
Format: CSV (comma-separated)
Size: 2 GB
Number of rows: around 7 million
Number of attributes: 24
1.4 Project Scope
The scope of the project is limited to predicting the crimes that may happen at a given location (beat) in the upcoming year-week, predicting whether a given type of crime results in an arrest, and identifying patterns. Plotting the “hotspots” with the highest crime rates on Google Maps could be examined in a further study of the subject.
2. Method
2.1 Data Pre-Processing
The crime dataset needed some data pre-processing, namely data cleaning and data normalization. As part of data cleaning, we filtered out the attributes that were not relevant to our analysis. A few of them are: ID, Block, Case Number, Ward.
The attributes of interest in our analysis are:
• Date: the date the crime occurred. Date-time format
• Location Description: the location where the crime occurred (sidewalk, ATM)
• Primary Type: the type of crime. Categorical attribute
• Arrest: whether or not an arrest was made for the crime. Binary attribute
• Domestic: whether or not the crime was a domestic crime, meaning that it was committed against a family member. Binary attribute
• Beat: the area, or "beat", in which the crime occurred. This is the smallest regional division defined by the Chicago Police Department. Categorical attribute
• District: the police district in which the crime occurred. Each district is composed of many beats, and districts are defined by the Chicago Police Department. Categorical attribute
• Community Area, Year, Latitude and Longitude: all categorical
We removed all Null/NA values from the dataset. The label and the other attributes of interest were categorical, so for the predictions we had to convert the categorical attributes to numerical ones and build labeled points (LabeledPoint) in order to predict the label.
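The conversion described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' Spark code: the sample rows, the helper names `build_index` and `to_labeled_point`, and the choice of Arrest as the label are all assumptions made for the example. In spark.mllib, each (label, features) pair would be wrapped in a LabeledPoint.

```python
def build_index(values):
    """Map each distinct category to an integer index, in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

# Hypothetical sample rows: (beat, primary_type, arrest)
rows = [
    ("214", "HOMICIDE", True),
    ("212", "DECEPTIVE PRACTICE", False),
    ("214", "ASSAULT", None),   # missing value: dropped during cleaning
    ("524", "ASSAULT", True),
]

# Data cleaning: drop every row that contains a Null/NA value.
clean = [r for r in rows if all(v is not None for v in r)]

# One integer index per categorical attribute.
beat_index = build_index(r[0] for r in clean)
type_index = build_index(r[1] for r in clean)

def to_labeled_point(row):
    """Return (label, features) with every value encoded as a float."""
    beat, ptype, arrest = row
    features = [float(beat_index[beat]), float(type_index[ptype])]
    return (float(arrest), features)

labeled_points = [to_labeled_point(r) for r in clean]
for lp in labeled_points:
    print(lp)
```

Index encoding is only one option. It imposes an arbitrary order on the categories, which tree-based models tolerate when the features are declared as categorical, while other models may need a different encoding.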
Because the label had multiple classes, we used multi-class classification methods to carry out the prediction tasks.
2.2 Technology Used
The technologies used were Excel and PySpark. The implementation is in Spark (using RDDs) and the code is written in Python. The implementation was tested on the Mason server on the Hydra Clusters, provided by the Department of Computer Science at George Mason University.
2.3 Techniques
2.3.1 Classification
2.3.1.1 Naïve Bayes
We used the multiclass classification implementation of Naïve Bayes (RDD-based API in spark.mllib) to predict, at the beat level, whether a particular type of crime would result in an arrest or not. A beat is the area code assigned by the Police Department to an area. We chose this method for the following reasons:
◦ It is a simple multiclass classification algorithm that assumes independence between features.
◦ It computes the conditional probability distribution of each feature, given the label.
◦ It applies Bayes' theorem to compute the conditional probability distribution of the label, given the observations, and uses it for prediction.
The most important tuning parameter for the Naïve Bayes method is lambda. Training takes as input an RDD of LabeledPoint, an optional smoothing parameter lambda and an optional model type parameter (default is “multinomial”), and outputs a NaiveBayesModel, which can be used for evaluation and prediction.
Snapshot of prediction:
Week number: 49
Predicting the crimes next week at the beat level
[(8.0, ('214', 'HOMICIDE')), (8.0, ('212', 'DECEPTIVE PRACTICE'))]
The output shows that, say, for beat number 214, there was a homicide that led to an arrest. We tested different values of lambda, and the best model was obtained with lambda = 1.0. On the test dataset, the model achieved an accuracy of 76.11%.
2.3.1.2 Random Forest
We used the regression implementation of Random Forest (RDD-based API in spark.mllib). Random forests are ensembles of decision trees. Random forests are one of the most successful machine learning models for classification and regression.
They combine many decision trees in order to reduce the risk of overfitting. Like decision trees, random forests handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearity and feature interactions. Random forests train a set of decision trees separately, so the training can be done in parallel. The algorithm injects randomness into the training process so that each decision tree is a bit different. Combining the predictions from each tree reduces the variance of the predictions, improving the performance on test data.
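The variance reduction described above can be illustrated with a small simulation. This is a sketch under strong assumptions, not the spark.mllib implementation: each "tree" is a hypothetical stand-in (`noisy_tree`) that returns the true value plus independent Gaussian noise, and `forest_predict` simply averages those outputs. Real trees in a forest are correlated, so the reduction in practice is smaller than this idealized case suggests.

```python
import random
import statistics

def forest_predict(tree_predictions):
    """Regression-style aggregation: average the per-tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)

def noisy_tree(true_value, rng, noise=1.0):
    """Stand-in for one randomized tree: true value plus independent noise."""
    return true_value + rng.gauss(0.0, noise)

rng = random.Random(0)   # fixed seed so the run is repeatable
true_value = 0.4         # arbitrary target chosen for the illustration
num_trees = 8            # matches the numTrees reported for the best model
trials = 2000

# Predictions from a single tree versus an average over num_trees trees.
single = [noisy_tree(true_value, rng) for _ in range(trials)]
forest = [
    forest_predict([noisy_tree(true_value, rng) for _ in range(num_trees)])
    for _ in range(trials)
]

var_single = statistics.pvariance(single)
var_forest = statistics.pvariance(forest)
print(f"single-tree variance:     {var_single:.3f}")
print(f"{num_trees}-tree average variance: {var_forest:.3f}")
```

With independent noise, the variance of the average shrinks roughly in proportion to the number of trees, which is why adding trees tends to stabilize predictions.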
To make a prediction on a new instance, a random forest must aggregate the predictions from its set of decision trees. We use regression, which aggregates by averaging: each tree predicts a real value, and the label is predicted to be the average of the tree predictions. This reduces the variance of the predictions.
The two important tuning parameters are:
1. numTrees: Number of trees in the forest. Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy.
2. maxDepth: Maximum depth of each tree in the forest. Increasing the depth makes the model more expressive and powerful.
The best model was obtained with numTrees: 8, maxDepth: 10.
A snapshot of the prediction of the type of crime that may happen in a beat (area) in the upcoming year-week:
Predicting Type of crime at the beat level, Current Year = 2016
Next Week number: 49
Predicting the crimes next week at the beat level for Next Week,
[(0.43205554276664104, ('1115', 'CRIMINAL DAMAGE')), (0.36705369702798235, ('524', 'ASSAULT')), (0.16666666666666666, ('933', 'KIDNAPPING'))]
The output reads as follows: the current year is 2016 and the upcoming week is week number 49. For that week in the city of Chicago, we predict that, say, a kidnapping can happen in beat number 933 with a probability of about 0.167, and an assault can happen in beat number 524 with a probability of about 0.367. This model is 84.2% accurate, tested on the Hydra Clusters.
2.3.1.3 METHOD-3: CLUSTERING
3. DISCOVERIES: DATA VISUALIZATION
Patterns of arrests were analyzed by hour of day, day of week, month of the year and year, for the time period 2001-Present. We selected the five most prevalent crimes for the city:
1. Narcotics
2. Assault
3. Robbery
4. Burglary
5. Homicide
Most arrests were made at night, between 7:00 pm and 10:00 pm, while the fewest were made in the early morning, between 4:00 am and 7:00 am. Most arrests for narcotics were made on Sundays. All types of crime occur evenly across the days of the week. The monthly distribution peaks in the summer months, May to August, when the whole city can be on red alert for all types of crime, and slopes downward on either side of the peak. Winters are less prone to criminal activity. From the year-wise distribution, it is evident that homicides, robberies and burglaries were comparatively higher over the years 2001-2008, with a decreasing trend and a sudden increase again in 2014. Also, the most arrests for narcotics were made in 2011, and the most arrests for assaults in 2014.
4. CONCLUSION
• The prediction and analysis of the data could be useful for the Police Department when deciding which areas to allocate more resources to (this also depends on the crime type, which we have covered in our analysis).
• If they want to increase the number of arrests for a particular crime type, where (in which area) should they focus their efforts? The beat number in our prediction gives an indication.
• Which type of crime is more prone to happen at a particular location (sidewalk, street, ATM, ...)?
• From the visualization, it is evident that the crimes that lead to arrests, such as narcotics and assaults, were the highest-occurring crimes on a weekly, monthly and yearly basis, with homicides and robberies happening frequently, but not to the same extent, over time. Also, assaults and narcotics were the highest-occurring crimes for the years 2011 and 2014.
5. FUTURE WORK
This project can be extended to include map visualization, using the location descriptors given as attributes (latitude and longitude), which describe where on the map an incident occurred. We can plot the hotspots, i.e. areas with high crime rates for a given type of crime in a given beat/district. Analyzing these will help the Police Department of the City of Chicago to allocate more resources to red zones with high criminal activity, to improve its services in those areas, and to stay alert to crimes that may happen so that incidents can be avoided.
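One possible first step towards such hotspot maps is to count incidents per grid cell before plotting anything. The sketch below assumes a square grid obtained by dividing latitude and longitude by a fixed cell size. The function names `grid_cell` and `hotspots`, the cell size of 0.01 degrees, and the sample coordinates are all hypothetical and not part of the project.

```python
import math
from collections import Counter

def grid_cell(lat, lon, cell_size=0.01):
    """Snap a coordinate to the integer index of its square grid cell."""
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))

def hotspots(incidents, crime_type, top_n=3, cell_size=0.01):
    """Return the top_n grid cells with the most incidents of crime_type."""
    counts = Counter(
        grid_cell(lat, lon, cell_size)
        for ptype, lat, lon in incidents
        if ptype == crime_type
    )
    return counts.most_common(top_n)

# Hypothetical incidents: (primary_type, latitude, longitude)
incidents = [
    ("ASSAULT", 41.8781, -87.6298),
    ("ASSAULT", 41.8785, -87.6291),
    ("ASSAULT", 41.8789, -87.6295),
    ("ASSAULT", 41.7502, -87.7104),
    ("NARCOTICS", 41.8783, -87.6293),
]

top = hotspots(incidents, "ASSAULT")
for cell, count in top:
    print(cell, count)
```

The cell size controls the trade-off between resolution and noise. Grouping by beat or district instead of a coordinate grid would follow the same counting pattern.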