This document presents a project outline for classifying San Francisco crime data. The problem is to predict the category of crime based on time, location, and other data. The outline includes sections on data understanding, visualization, prediction methodologies, and validation. Prediction methods to be tested are decision trees with two-way and three-way splits, gradient boosting, and an ensemble model. Visualizing the data spatially by zip code improved the models. The best model was a three-way decision tree with a misclassification rate of 0.135668. Including demographic data and time series analysis may further improve the model.
3. Problem Identification
Current State
• The current crime index of S.F. is 3 (safer than only 3% of the cities in the U.S.).
• 67.67 annual crimes per 1,000 residents.
• There is no model to predict crimes based on location and time.
Future State
• A proper model predicting crime based on date, time, and location.
• Help the corrections department act with appropriate corrective measures based on our model.
Questions
• What are the different metrics that influence response?
• Is the data enough to give us a clear picture of the crime committed?
• What kind of model best fits the data?
4. Problem Statement
• Given time and location, you must predict the category of crime that occurred.
• This competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods.
• It also encourages us to explore the dataset visually.
6. Data Cleansing and Manipulation
Cleaning the Data
• Check for missing values
• Check for entry errors
• Check for duplicates
• Check for outliers
Manipulating the Data
• Time stamp
• Address
• Longitude
• Latitude
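The cleaning and manipulation steps above can be sketched in pandas. This is a minimal illustration, not the project's actual workflow: the column names (Dates, Category, PdDistrict, X, Y) follow the Kaggle SF crime dataset, but the three sample rows and the bounding box used for the outlier check are made-up assumptions.

```python
import pandas as pd

# Made-up three-row sample standing in for the SF crime data;
# column names follow the Kaggle dataset, values are illustrative.
df = pd.DataFrame({
    "Dates": ["2015-05-13 23:53:00", "2015-05-13 23:53:00", "2015-05-14 10:00:00"],
    "Category": ["WARRANTS", "WARRANTS", "LARCENY/THEFT"],
    "PdDistrict": ["NORTHERN", "NORTHERN", "SOUTHERN"],
    "X": [-122.4258, -122.4258, -122.4077],  # longitude
    "Y": [37.7745, 37.7745, 37.7838],        # latitude
})

# Cleaning checks
missing = df.isna().sum()        # missing values per column
dupes = df.duplicated().sum()    # exact duplicate rows
# Entry errors / outliers: coordinates outside a rough SF bounding box
# (the box limits here are an assumption for illustration)
bad_coords = df[(df["X"] < -123) | (df["X"] > -122) |
                (df["Y"] < 37) | (df["Y"] > 38)]

# Manipulation: split the raw time stamp into model-ready features
df["Dates"] = pd.to_datetime(df["Dates"])
df["Hour"] = df["Dates"].dt.hour
df["DayOfWeek"] = df["Dates"].dt.day_name()
```

The same idea extends to the address and coordinate columns, e.g. mapping longitude/latitude pairs to zip codes via a reverse-geocoding lookup.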
14. 1. Decision Tree (Two-Way Split)
• This is a decision tree with the typical two-way split.
• In the properties panel, the method was changed to assessment and the assessment measure to decision, as we are classifying a categorical variable.
15. 1. Decision Tree (Two-Way Split)
• Most important variable for the split -> Zip code
• Number of leaves in the pruned tree -> 6
• Validation misclassification -> 0.273474
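For readers outside the original tool, a two-way (binary) split tree can be sketched with scikit-learn, whose CART trees always split two ways. The features here are synthetic stand-ins for the engineered zip-code and time features, and the ccp_alpha pruning parameter is an assumption standing in for the tool's pruning step, so the leaf count and error rate will not match the slide's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered features (zip code, hour, ...)
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.6, random_state=0)

# Binary-split tree; ccp_alpha prunes the tree after growing
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)

# Validation misclassification = 1 - accuracy
misclassification = 1 - tree.score(X_val, y_val)
```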
17. 2. Decision Tree (Three-Way Split)
• This decision tree uses three-way splits.
• In the properties panel, the maximum branch was changed to three, keeping the same assessment criteria as before.
• This greatly increased model accuracy.
18. 2. Decision Tree (Three-Way Split)
• Most important variable for the split -> Zip code
• Number of leaves in the pruned tree -> 7
• Validation misclassification -> 0.134316
20. 3. Gradient Boosting
• "Gradient boosting is a boosting approach that resamples the data set several times to generate results that form a weighted average of the resampled data set. Tree boosting creates a series of decision trees which together form a single predictive model."
• Here the assessment measure is misclassification.
• The training proportion is 60%.
• Most important variable for the split -> PdDistrict
• Validation misclassification -> 0.34221
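The same setup (60% training proportion, misclassification as the assessment measure, per-variable importances) can be sketched with scikit-learn's gradient boosting. The data is synthetic and the hyperparameters are library defaults, so the numbers will not match the slide's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the crime features (district, hour, ...)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=1)

# 60% training proportion, as on the slide
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.6, random_state=1)

gb = GradientBoostingClassifier(random_state=1)
gb.fit(X_train, y_train)

# Assessment: validation misclassification = 1 - accuracy
val_misclassification = 1 - gb.score(X_val, y_val)

# Per-feature importances, analogous to the PdDistrict finding above
importances = gb.feature_importances_
```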
22. Model Comparison
• The best model is the three-way decision tree, with a misclassification of 0.135668.
• The model improved drastically after converting latitude and longitude to zip codes.