This document presents a project outline for classifying San Francisco crime data. The problem is to predict the category of crime based on time, location, and other data. The outline includes sections on data understanding, visualization, prediction methodologies, and validation. Prediction methods to be tested are decision trees with two-way and three-way splits, gradient boosting, and an ensemble model. Visualizing the data spatially by zip code improved the models. The best model was a three-way decision tree with a misclassification rate of 0.135668. Including demographic data and time series analysis may further improve the model.
3. Problem Identification
Current State
• The current crime index of S.F. is 3 (safer than only 3% of the cities in the U.S.).
• 67.67 annual crimes per 1,000 residents.
• There is no model to predict crimes based on location and time.
Future State
• A proper model predicting crime based on date, time, and location.
• Help the corrections department act with appropriate corrective measures based on our model.
Questions
• What are the different metrics that influence response?
• Is the data enough to give us a clear picture of the crime committed?
• What kind of model best fits the data?
4. Problem Statement
• Given time and location, you must predict the category of crime that occurred.
• This competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods.
• It also encourages us to explore the dataset visually.
6. Data Cleansing and Manipulation
Cleaning the Data
• Check for missing values
• Check for entry errors
• Check for duplicates
• Check for outliers
Manipulating the Data
• Time stamp
• Address
• Longitude
• Latitude
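The cleaning and manipulation steps above can be sketched in pandas. This is a minimal illustration, not the project's actual workflow: the column names (Dates, Category, PdDistrict, X, Y) follow the Kaggle SF crime dataset, but the three sample rows and the bounding box used for the outlier check are made-up assumptions.

```python
import pandas as pd

# Made-up three-row sample standing in for the SF crime data;
# column names follow the Kaggle dataset, values are illustrative.
df = pd.DataFrame({
    "Dates": ["2015-05-13 23:53:00", "2015-05-13 23:53:00", "2015-05-14 10:00:00"],
    "Category": ["WARRANTS", "WARRANTS", "LARCENY/THEFT"],
    "PdDistrict": ["NORTHERN", "NORTHERN", "SOUTHERN"],
    "X": [-122.4258, -122.4258, -122.4077],  # longitude
    "Y": [37.7745, 37.7745, 37.7838],        # latitude
})

# Cleaning checks
missing = df.isna().sum()        # missing values per column
dupes = df.duplicated().sum()    # exact duplicate rows
# Entry errors / outliers: coordinates outside a rough SF bounding box
# (the box limits here are an assumption for illustration)
bad_coords = df[(df["X"] < -123) | (df["X"] > -122) |
                (df["Y"] < 37) | (df["Y"] > 38)]

# Manipulation: split the raw time stamp into model-ready features
df["Dates"] = pd.to_datetime(df["Dates"])
df["Hour"] = df["Dates"].dt.hour
df["DayOfWeek"] = df["Dates"].dt.day_name()
```

The same idea extends to the address and coordinate columns, e.g. mapping longitude/latitude pairs to zip codes via a reverse-geocoding lookup.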
14. 1. Decision Tree (Two-Way Split)
• This is a decision tree with the typical two-way split.
• In the properties panel, the method was changed to assessment and the assessment measure to decision, as we are classifying a categorical variable.
15. 1. Decision Tree (Two-Way Split)
• Most important variable for the split -> Zip code
• Number of leaves in the pruned tree -> 6
• Validation misclassification -> 0.273474
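For readers outside the original tool, a two-way (binary) split tree can be sketched with scikit-learn, whose CART trees always split two ways. The features here are synthetic stand-ins for the engineered zip-code and time features, and the ccp_alpha pruning parameter is an assumption standing in for the tool's pruning step, so the leaf count and error rate will not match the slide's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered features (zip code, hour, ...)
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.6, random_state=0)

# Binary-split tree; ccp_alpha prunes the tree after growing
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)

# Validation misclassification = 1 - accuracy
misclassification = 1 - tree.score(X_val, y_val)
```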
17. 2. Decision Tree (Three-Way Split)
• This decision tree uses three-way splits.
• In the properties panel, the maximum branch was changed to three, keeping the same assessment criteria as before.
• This greatly increased model accuracy.
18. 2. Decision Tree (Three-Way Split)
• Most important variable for the split -> Zip code
• Number of leaves in the pruned tree -> 7
• Validation misclassification -> 0.134316
20. 3. Gradient Boosting
• "Gradient boosting is a boosting approach that resamples the data set several times to generate results that form a weighted average of the resampled data set. Tree boosting creates a series of decision trees which together form a single predictive model."
• Here the assessment measure is misclassification.
• The training proportion is 60%.
• Most important variable for the split -> PdDistrict
• Validation misclassification -> 0.34221
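The same setup (60% training proportion, misclassification as the assessment measure, per-variable importances) can be sketched with scikit-learn's gradient boosting. The data is synthetic and the hyperparameters are library defaults, so the numbers will not match the slide's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the crime features (district, hour, ...)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=1)

# 60% training proportion, as on the slide
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.6, random_state=1)

gb = GradientBoostingClassifier(random_state=1)
gb.fit(X_train, y_train)

# Assessment: validation misclassification = 1 - accuracy
val_misclassification = 1 - gb.score(X_val, y_val)

# Per-feature importances, analogous to the PdDistrict finding above
importances = gb.feature_importances_
```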
22. Model Comparison
• The best model is the three-way decision tree, with a misclassification of 0.135668.
• The model improved drastically after converting latitude and longitude to zip codes.