CRIME DATASET ANALYSIS
CITY OF CHICAGO (2001-PRESENT)
Mining of Massive Dataset with MapReduce, Fall 2016
By
Stuti Deshpande (G00979218)
Amogh Gaikwad (G00979271)
1. Introduction
1.1 Background
This report outlines how predictive crime analysis can help the Chicago Police
Department prevent criminal activities in the city and thereby reduce the crime rate. The
Police Department of the City of Chicago strives to improve its services and reduce crime,
and this was the motivation behind the project. Our goal is to provide resourceful insights
that, in turn, lead to a reduction in the crime rate.
We chose this dataset because it is both interesting and complex to analyze, and it lets us
study patterns of crime. We worked on prediction and analysis of the data that could be
useful to the Police Department when deciding which areas should receive more resources.
If they want to increase the number of arrests for a crime type, where (which area) should
they focus their efforts? Which type of crime is more prone to happen at a particular
location type (streets, sidewalks, ATMs, …)?
This predictive system can be implemented as an aid that supplements officers’ experience
and helps them prevent the occurrence of a given type of crime at a location in the upcoming
year-week.
1.2 Goals
The main objectives of our project are the following:
1. Classification:
1.1 To predict the probability of a type of crime occurring at a beat location (area
code) in the upcoming week. This task was done using the Random Forest
regression implementation.
1.2 To predict whether a crime would result in an arrest or not.
2. Clustering:
To group the given dataset into a certain number of clusters (over crime type
and location) and find which type of crime is more prone to happen
at a particular location type (sidewalk, street, etc.), using K-means clustering.
1.3 Dataset
The dataset “Crimes (2001-Present)” for the city of Chicago for this project has been taken
from Data.gov; from the following link:
https://catalog.data.gov/dataset/crimes-2001-to-present-398a4
The dataset instances have been collected from the year 2001 to the present, and the dataset is still being updated.
Format: csv, comma-separated
Size: 2GB
Number of rows: around 7 million
Number of Attributes: 24
1.4 Project Scope
The project scope is limited to predicting the crime that may happen at a given location
(beat) in the upcoming year-week, predicting whether a given type of crime results in an
arrest or not, and identifying patterns. Plotting the “hotspots” with the highest crime rates
on Google Maps could be examined as part of further study of the subject matter.
2. Method
2.1 Data Pre-Processing
The crime dataset needed some form of data pre-processing, such as data cleaning and data normalization.
As part of data cleaning, we filtered out all the attributes from the dataset that were not relevant for our
data analysis. A few of them are:
ID, Block, Case Number, Ward
The attributes of interest that were the part of our data analysis are:
• Date: the date the crime occurred. Date-time format
• Location Description: the location where the crime occurred (sidewalk, ATM)
• Primary Type: the type of crime. Categorical attribute
• Arrest: whether or not an arrest was made for the crime. Binary attribute
• Domestic: whether or not the crime was a domestic crime, meaning that it was committed against
a family member. Binary attribute
• Beat: the area, or "beat", in which the crime occurred. This is the smallest regional division defined
by the Chicago Police Department. Categorical attribute
• District: the police district in which the crime occurred. Each district is composed of many beats
and is defined by the Chicago Police Department. Categorical attribute
• Community Area, Year, Latitude and Longitude: all categorical
We removed all the null/NA values from the dataset. The label, along with the other attributes of interest,
was categorical. So, for the predictions, we had to convert the categorical attributes to numerical ones and
build LabeledPoints to run predictions on the label. Since there were multiple classes for the label,
we used multi-class classification methods to carry out the prediction tasks.
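The categorical-to-numerical conversion described above can be sketched in plain Python. This is an illustrative stand-in for the PySpark pipeline, not the project's actual code; the field names and values are invented for the example.

```python
# Minimal sketch: encode categorical attributes as integer indices and
# build (label, feature-vector) pairs, analogous to MLlib LabeledPoints.
# The rows below are invented sample records, not real dataset entries.

def build_index(values):
    """Map each distinct categorical value to a stable integer index."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

rows = [
    {"Primary Type": "THEFT",   "Location Description": "STREET",   "Arrest": True},
    {"Primary Type": "ASSAULT", "Location Description": "SIDEWALK", "Arrest": False},
    {"Primary Type": "THEFT",   "Location Description": "SIDEWALK", "Arrest": True},
]

type_idx = build_index(r["Primary Type"] for r in rows)
loc_idx = build_index(r["Location Description"] for r in rows)

# Analogue of a LabeledPoint: (numeric label, numeric feature vector).
labeled_points = [
    (1.0 if r["Arrest"] else 0.0,
     [float(type_idx[r["Primary Type"]]),
      float(loc_idx[r["Location Description"]])])
    for r in rows
]
print(labeled_points)
```

In the real pipeline the same mapping would be broadcast and applied inside an RDD `map`, but the encoding idea is identical.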
2.2 Technology Used
The technology used was Excel and PySpark. The implementation is in Spark (using RDDs) and the coding is
done in Python. The implementation was tested on the Mason server's Hydra clusters, provided by the George
Mason University Department of Computer Science.
2.3 Techniques
2.3.1 Classification
2.3.1.1 Naïve Bayes
We used the multiclass classification implementation of Naïve Bayes (the RDD-based
API in spark.mllib) to predict whether a particular type of crime would result in an arrest or not, at the
beat level. A beat is simply the area code assigned by the Police Department to an area.
The reasons we chose to apply this method are as follows:
◦ It is a simple multiclass classification algorithm, with the assumption of independence
between features.
◦ It computes the conditional probability distribution of each feature given the label.
◦ It applies Bayes' theorem to compute the conditional probability distribution of the label given
the observations, and uses it for prediction.
The most important tuning parameter for the Naïve Bayes method is lambda. The method takes an RDD
of LabeledPoint, an optional smoothing parameter lambda, and an optional model-type parameter
(the default is “multinomial”) as input, and outputs a NaiveBayesModel, which can be used for evaluation and prediction.
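The role of the lambda smoothing parameter can be illustrated with a tiny multinomial Naïve Bayes written in plain Python. This is only a sketch of the idea behind spark.mllib's implementation; the toy data, labels and feature values are invented.

```python
from collections import defaultdict
from math import log

def train_nb(data, lam=1.0):
    """Tiny Naive Bayes trainer. `data` is a list of (label, [feature values]);
    `lam` plays the role of the Laplace smoothing parameter lambda."""
    label_counts = defaultdict(int)
    feat_counts = defaultdict(lambda: defaultdict(int))  # label -> (pos, value) -> count
    feat_values = defaultdict(set)                       # pos -> distinct values seen
    for label, feats in data:
        label_counts[label] += 1
        for pos, v in enumerate(feats):
            feat_counts[label][(pos, v)] += 1
            feat_values[pos].add(v)
    return label_counts, feat_counts, feat_values, lam, len(data)

def predict(model, feats):
    """Pick the label maximizing log P(label) + sum of log P(feature|label)."""
    label_counts, feat_counts, feat_values, lam, n = model
    best, best_lp = None, float("-inf")
    for label, count in label_counts.items():
        lp = log(count / n)
        for pos, v in enumerate(feats):
            num = feat_counts[label][(pos, v)] + lam       # smoothed count
            den = count + lam * len(feat_values[pos])      # smoothed total
            lp += log(num / den)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Toy data: label = arrest (1.0) or not (0.0); feature = a crime-type index.
data = [(1.0, [0]), (1.0, [0]), (0.0, [1]), (0.0, [1]), (1.0, [1])]
model = train_nb(data, lam=1.0)
print(predict(model, [0]))  # -> 1.0: crime type 0 is mostly followed by an arrest
```

Smoothing keeps unseen (label, feature) pairs from zeroing out a probability, which is why tuning lambda (here 1.0, as in the best model above) matters.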
Snapshot of Prediction:
Week number: 49
Predicting the crimes next week at the beat level
[(8.0, ('214', 'HOMICIDE')), (8.0, ('212', 'DECEPTIVE PRACTICE'))]
The output shows that, say, for beat number 214 there was a homicide that led to an
arrest.
We tested different values of lambda, and the best model was obtained with lambda = 1.0. On testing
our model on the test dataset, we achieved an accuracy of 76.11%.
2.3.1.2 Random Forest
We used the regression implementation of Random Forest (the RDD-based API in
spark.mllib). Random forests are ensembles of decision trees, and are one of the most successful
machine learning models for classification and regression. They combine many decision trees in order to reduce
the risk of overfitting. Like decision trees, random forests handle categorical features, extend to the multiclass
classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.
Random forests train a set of decision trees separately, so the training can be done in parallel. The
algorithm injects randomness into the training process so that each decision tree is a bit different. Combining
the predictions from each tree reduces the variance of the predictions, improving the performance on test data.
To make a prediction on a new instance, a random forest must aggregate the predictions from its set
of decision trees. We use regression, so aggregation is done by averaging: each tree predicts a real value,
and the label is predicted to be the average of the tree predictions. This reduces the variance of the predictions.
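The averaging step above can be sketched in a few lines of plain Python. The per-tree predictions below are invented numbers standing in for real tree outputs, not results from the trained model.

```python
def forest_predict(trees, instance):
    """Aggregate a regression forest's output by averaging the
    real-valued prediction of each tree for one instance."""
    preds = [tree(instance) for tree in trees]
    return sum(preds) / len(preds)

# Hypothetical trees: each maps an instance to a real-valued score.
trees = [
    lambda x: 0.40,
    lambda x: 0.50,
    lambda x: 0.30,
    lambda x: 0.44,
]
print(forest_predict(trees, instance=None))  # average of the four scores, ~0.41
```

Because each tree is trained on a randomized subsample, its individual prediction is noisy; the mean of many such predictions has lower variance, which is the point of the ensemble.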
The two important tuning parameters are:
1. numTrees: Number of trees in the forest. Increasing the number of trees will decrease the
variance in predictions, improving the model’s test-time accuracy.
2. maxDepth: Maximum depth of each tree in the forest. Increasing the depth makes the model
more expressive and powerful.
The best model was obtained with numTrees = 8 and maxDepth = 10.
A snapshot of the prediction, i.e. the type of crime that may happen in that beat (area) in the upcoming year-
week:
Predicting Type of crime at the beat level,
Current Year = 2016
Next Week number: 49
Predicting the crimes next week at the beat level for Next Week,
[(0.43205554276664104, ('1115', 'CRIMINAL DAMAGE')),
(0.36705369702798235, ('524', 'ASSAULT')),
(0.16666666666666666, ('933', 'KIDNAPPING'))]
The output reads: the current year is 2016, and the upcoming week is week number 49. For the next
week in the city of Chicago, we have predicted that, say, a kidnapping may happen in beat number 933 with
probability 0.167, and an assault may happen in beat number 524 with probability 0.367.
This model is 84.2% accurate, tested on the Hydra clusters.
2.3.2 Clustering: K-Means
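The clustering goal stated in Section 1.2 (grouping crimes over type and location with K-means) follows the standard assign-then-update iteration. A minimal plain-Python sketch on invented 2-D points is shown below; the project itself would run this at scale via spark.mllib's KMeans, and the points here do not come from the dataset.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def kmeans(points, centers, iters=10):
    """Plain K-means: alternate assigning points to the nearest center
    and moving each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]  # keep an empty cluster's old center
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Two obvious groups of invented points, e.g. scaled (crime-type, location) codes.
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(pts, centers=[(0, 0), (9, 9)])
print(centers)  # two centroids, one per group
```

Each resulting cluster groups incidents whose encoded type/location values are close, which is what lets us read off which crime types dominate which location types.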
3. DISCOVERIES: DATA VISUALIZATION
Patterns of arrests were analyzed by hour of day, day of week, month of year, and year, for the
time period 2001–present. We selected the five most prevalent crimes for the city:
1. Narcotics
2. Assault
3. Robbery
4. Burglary
5. Homicide
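The hour-of-day grouping behind these observations amounts to simple counting. A plain-Python sketch follows; the timestamps are invented sample records, and the MM/DD/YYYY hh:mm:ss AM/PM format is an assumption about how the feed's Date field is laid out.

```python
from collections import Counter
from datetime import datetime

# Invented sample records: (timestamp string, primary crime type).
records = [
    ("03/18/2015 07:44:00 PM", "NARCOTICS"),
    ("03/18/2015 08:10:00 PM", "ASSAULT"),
    ("03/19/2015 04:30:00 AM", "ROBBERY"),
    ("03/22/2015 09:05:00 PM", "NARCOTICS"),
    ("03/25/2015 08:55:00 PM", "ASSAULT"),
]

# Parse each timestamp and tally arrests by 24-hour clock hour.
by_hour = Counter(
    datetime.strptime(ts, "%m/%d/%Y %I:%M:%S %p").hour for ts, _ in records
)
peak_hour, peak_count = by_hour.most_common(1)[0]
print(peak_hour, peak_count)  # -> 20 2 (8 pm is the busiest hour in this toy data)
```

Swapping `.hour` for `.weekday()`, `.month`, or `.year` yields the day-of-week, monthly and yearly distributions discussed below.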
It has been observed that most arrests were made at night, between 7:00 pm and 10:00 pm, while the fewest
were made in the morning, between 4:00 am and 7:00 am.
Most arrests for narcotics were made on Sundays; otherwise, all types of crime occur fairly evenly across the
days of the week.
The summer months suggest the whole city can be on red alert for all types of crime: the distribution
peaks during May–August, with decreasing slopes on either side. Winters are far less prone to criminal
activity.
From the year-wise distribution, it is quite evident that homicides, robberies and burglaries were
comparatively higher over the years 2001–2008, with a decreasing trend and then a sudden increase in
2014. Also, the most arrests were made for narcotics in 2011 and for assaults in 2014.
4. CONCLUSION
The prediction and analysis of the data could be useful to the Police Department when deciding
which areas should receive more resources (this also depends on the crime type, which we have
covered in our analysis).
If they want to increase the number of arrests for a particular crime type, where (which area)
should they focus their efforts? (The beat number in our prediction gives an idea!)
Which type of crime is more prone to happen at a particular location type (sidewalks, street, ATM, ..)?
From the visualization, it is evident that crimes that lead to arrest, such as narcotics and
assaults, were the most frequently occurring crimes on the weekly, monthly and yearly views, with
homicides and robberies also occurring frequently over time, but not to the same extent. Also,
narcotics and assaults were the most frequent crimes in the years 2011 and 2014, respectively.
5. FUTURE WORK
This project can be extended to include map visualization, utilizing the location descriptors,
latitude and longitude, given as attributes, which describe the location on the map where each incident
occurred.
We can plot the hotspots, i.e. areas with high crime rates for a given type of crime in a given beat/district.
This analysis will help the Police Department of the City of Chicago allocate more resources to red zones
with high criminal activity, improve its services in those areas, and stay alert for crimes before
they occur.