This white paper describes the analysis and models developed to predict crime outcomes in the city of Chicago. The models are then compared, and the most effective and simplest model is recommended in the conclusion.
OPIM5604-SECB12
Group Project Final Report
Instructor: Iva Stricevic
CRIMES IN CHICAGO
Team Members:
Swati Arora
Ankita Paunikar
Siddharth Rai
Anoop Ramathirtha
Kees Van Haasteren
SUMMARY
The dataset can be found at https://www.kaggle.com/currie32/crimes-in-chicago. It contains 1,456,714 records/rows and 23 variables describing the data. The variables capture some of the important information about each crime in Chicago, such as the date of the crime, the location, the type of crime, and various other descriptive details.
The objective is to analyze the data, find possible patterns between the variables, and predict whether the offender will be arrested for the criminal offense, based on the provided dataset. The variation of the offense rate over time is also explored. With such an understanding, authorities can proactively take measures to prevent some potential crimes. In addition, any other observations, unrelated to the prediction goal, will be recorded and summarized in this paper.
Objectives
I. Data description
II. Data visualization and pattern discovery
III. Data pre-processing
IV. Data Distribution
V. Model Evaluation
VI. Conclusion
I. Data description
This dataset describes criminal behavior in the city of Chicago in a variety of different ways. It is a compilation of criminal activities reported within the city limits, depicting each individual crime, the circumstances surrounding it, and the related activities. With this data, we hope to spot trends in the crimes being committed and in the arresting patterns of officers, to determine where the strengths and weaknesses of the Chicago PD lie. Is there an area or district within the city which has a low arrest rate and requires more officers? We hope to accurately predict whether a crime, given its nature, its location, and other aspects of the report, will result in an arrest and justice for those who have been aggrieved.
The data includes 1,456,714 rows (individual crimes) which took place from 2012 to the early months of 2017. The data set records 23 facets of each crime, which will serve as the columns in our predictive model. These columns can be divided into four categories. The first are identifiers: each value is unique and allows people studying the crimes to find individual cases to focus on. The second are descriptors of the crime type, which state what the crime was and how severe it was. The third are location identifiers, showing where the crime occurred and whose jurisdiction it falls under. The fourth category is a grab bag of the remainder, including the result (arrest/no arrest), the update information, and the time of the crime.
Identifiers: Column 1 is a number based solely on this dataset, simply a running tally of the crimes. ID and Case Number were assigned by the Chicago PD to give each case a unique identification, making crimes easier to track and solve. None of the identifiers should be included in the model when it is created, as they are unique, non-continuous numbers with no predictive meaning.
Crime Type: There are 4 columns, each identifying the type of crime. IUCR and FBI Code each assign a number to different crimes and severities, so that crimes can be grouped at a state or nationwide level. Primary Type divides the crimes into categories described by the police officer. These will need to be grouped in order to use them effectively. Description captures the severity of the crime, whether weapons were used, etc., and was also recorded by the reporting police officer. Because of the large variety of reports, it is difficult to group or label crimes together in this category.
Location: There are 11 different columns describing some variation of the location of the crime. Ward, District, Beat, and Community Area are all different levels of establishing which officers are patrolling which area. The city is divided into police districts, 77 community areas, and 50 wards (precincts or police stations), and each district is further divided into beats, describing where each team of officers is scheduled to patrol. The other type of location columns are exact descriptors: the city block, the longitude and latitude, or a description of the crime scene (street or residence, for example).
Other Columns: Date and Year describe when the crime occurred; we are able to break the time into hour of day and month to show more accurate cycles of crime throughout a day or year.
Predicted Value: The column Arrest is a binary outcome indicating whether or not an arrest was made for this particular crime. We will attempt to predict a yes/no binary outcome (shown as 1 for arrest, 0 for no arrest) in this model.
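As a rough illustration, the sketch below shows how this dataset could be loaded and the target encoded in pandas. The file name, the date format, and the string form of the Arrest column are assumptions based on the published Kaggle extract and may differ in practice.

```python
import pandas as pd

# Load the Kaggle extract (file name assumed; the download ships as
# several CSVs covering 2012 to early 2017).
crimes = pd.read_csv("Chicago_Crimes_2012_to_2017.csv")

# Parse the timestamp; the format is assumed from the published dataset.
crimes["Date"] = pd.to_datetime(crimes["Date"], format="%m/%d/%Y %I:%M:%S %p")

# Encode the binary target: 1 for arrest, 0 for no arrest.
crimes["Arrest"] = (crimes["Arrest"].astype(str).str.lower() == "true").astype(int)

# Derive hour of day and month for the daily and seasonal cycle analysis.
crimes["Hour"] = crimes["Date"].dt.hour
crimes["Month"] = crimes["Date"].dt.month

print(crimes[["Arrest", "Hour", "Month"]].head())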
II. Data visualization and pattern discovery
1. Top ten communities with the most arrests and with the fewest arrests. [Charts: Most Arrests; Fewest Arrests]
2. The primary crime types versus the primary crime types that most often lead to an arrest. [Charts: Primary Crime Types; Primary Crime Types with Most Arrests]
3. Most crimes are committed during summer (July, August, June, May), whereas fewer criminal activities occur during the winter months (February, December, November, and April).
4. Fewer crimes occur during the morning hours than at night.
III. Data pre-processing
1. Datatype: The datatypes of all the variables are correct.
2. Outliers: By plotting the distributions of the variables and inspecting the outlier boxplots, we found that the Latitude and Longitude variables have 77 outlier values, occurring in the same rows of the dataset. These rows are few enough to be safely removed.
3. Missing Data: We used the missing data pattern to check for missing values in the dataset. X Coordinate, Y Coordinate, Latitude, and Longitude had some missing values. The total number of rows with missing values was 38,349, which is less than 5% of the total data. We can therefore remove the rows with missing values, because the variables that have missing values are not useful for our classification.
4. Recode: Several columns have a nominal datatype. To predict Arrest, we need to reduce the number of distinct values in those columns so that the model can give us the best results. Using our understanding of the dataset, we recoded the values to reduce the number of options per column (a rough pandas sketch of these steps follows this list):
● Primary Type: from 29 to 7 values.
● Location Description: from 143 to 9 values.
● Districts of Chicago PD, binned to Zones: from 25 to 3 values.
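A minimal sketch of this pre-processing in pandas is shown below. The exact groupings used in the project are not listed on the slides, so the category sets and the district-to-zone cutoffs here are purely illustrative assumptions.

```python
# Drop rows with missing coordinates (about 38,349 rows, under 5% of the data).
crimes = crimes.dropna(
    subset=["X Coordinate", "Y Coordinate", "Latitude", "Longitude"]
)

# Group the 29 Primary Type values into broader categories. The project's
# actual 7-way grouping is not listed, so this 3-way grouping is illustrative.
violent = {"HOMICIDE", "ASSAULT", "BATTERY", "ROBBERY"}
property_crimes = {"THEFT", "BURGLARY", "MOTOR VEHICLE THEFT", "ARSON",
                   "CRIMINAL DAMAGE"}

def group_primary_type(value):
    if value in violent:
        return "VIOLENT"
    if value in property_crimes:
        return "PROPERTY"
    return "OTHER"

crimes["PrimaryGroup"] = crimes["Primary Type"].map(group_primary_type)

# Bin the police district codes into three broad zones; the cutoffs here
# are assumptions, since the slides do not state the actual binning.
def district_to_zone(district):
    if district <= 8:
        return "SOUTH"
    if district <= 17:
        return "CENTRAL"
    return "NORTH"

crimes["Zone"] = crimes["District"].map(district_to_zone)
```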
Pattern Discovery
Principal Component Analysis: The X and Y coordinates are correlated with the Latitude and Longitude, so only the Latitude and Longitude were considered for prediction. Other variables such as Ward and Beat are not correlated with each other, but all are measures of location with varying degrees of land area coverage. Only the largest of these, District, was considered, and it was binned further into three zones – North, Central and South – to arrive at a better predictive model.
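The redundancy behind this choice is easy to verify directly; a minimal sketch, assuming the column names from the Kaggle file:

```python
# Confirm the redundancy between the projected coordinates and lat/long.
coord_cols = ["X Coordinate", "Y Coordinate", "Latitude", "Longitude"]
print(crimes[coord_cols].corr())
# X Coordinate vs. Longitude and Y Coordinate vs. Latitude come out close
# to perfectly correlated, so only one of the two pairs needs to be kept.
```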
IV. Data Distribution
We tried to create the model using various partitioning methods, such as stratified and random stratified sampling. For the final model, the data is split into training, validation, and test partitions (60:30:10).
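An equivalent 60:30:10 stratified split can be sketched with scikit-learn as below; this mirrors the JMP partitioning only approximately, and reuses the crimes frame from the earlier sketches.

```python
from sklearn.model_selection import train_test_split

# First carve off the 60% training partition, stratified on the target.
train, rest = train_test_split(
    crimes, train_size=0.60, stratify=crimes["Arrest"], random_state=42
)
# Split the remaining 40% into validation (30% overall) and test (10% overall):
# 0.75 of 40% is 30%, leaving 10% for the test partition.
valid, test = train_test_split(
    rest, train_size=0.75, stratify=rest["Arrest"], random_state=42
)
print(len(train), len(valid), len(test))
```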
V. Model Evaluation
Our initial strategy is to select the most effective model by running all three primary model types and comparing and contrasting the benefits of each, specifically analyzing the RSquared values, which represent the percentage of the variation in the predicted value that can be explained by the predictors in the model. We will also compare the misclassification rates, which, in a binary model, give the percentage of rows that were predicted to be true (the crime results in an arrest) but are actually false (no arrest made), or vice versa. The three types of models are logistic regression models, classification tree models, and neural networks. The results of each of these models are displayed below.
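The project itself was built in JMP; as a rough scikit-learn analogue, the sketch below fits the same three model families and reports, for each, the misclassification rate and a McFadden-style pseudo-RSquared (the quantity JMP reports as Entropy RSquared). The feature list and column names are assumptions carried over from the earlier sketches.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import log_loss

features = ["PrimaryGroup", "Zone", "Hour", "Month", "Latitude", "Longitude"]
X = pd.get_dummies(train[features])
y = train["Arrest"]
X_valid = pd.get_dummies(valid[features]).reindex(columns=X.columns, fill_value=0)
y_valid = valid["Arrest"]

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "classification tree": DecisionTreeClassifier(min_samples_leaf=100),
    # Two hidden layers of three tanh nodes, as on the neural net slide.
    "neural net": MLPClassifier(hidden_layer_sizes=(3, 3), activation="tanh",
                                max_iter=300),
}

# Log-likelihood of the intercept-only model, for the entropy pseudo-RSquared.
ll_null = -log_loss(y_valid, np.full(len(y_valid), y_valid.mean()),
                    normalize=False)

for name, model in models.items():
    model.fit(X, y)
    proba = model.predict_proba(X_valid)[:, 1]
    ll_model = -log_loss(y_valid, proba, normalize=False)
    entropy_r2 = 1 - ll_model / ll_null  # McFadden-style "Entropy RSquared"
    misclass = (model.predict(X_valid) != y_valid).mean()
    print(f"{name}: Entropy RSquared={entropy_r2:.4f}, "
          f"misclassification rate={misclass:.4f}")
```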
• Logistic Model
The above chart shows the best available logistic regression model, which displays a disappointingly low Entropy RSquared of only 0.3240, meaning only 32% of the variation in the outcome (although the model is still highly significant according to the Chi-Squared value at the top) can be explained by the predictors we used in the model. The model misclassifies rows at a rate of 16.41%; particularly concerning is the rate at which the model predicted that no arrest would be made when one actually was (false negative misclassifications). After looking at the results of the other available models, logistic regression will be discarded as the least effective model in this particular study.
• Neural Net
Above is the neural net model. After some trial and error, our team decided to use the hyperbolic tangent activation, with two layers of three nodes apiece. Despite this being the best available neural network model, it was not much more effective than the logistic regression model above, with an Entropy RSquared of 35%, not nearly as descriptive as we had hoped our model would be. The model is good in that the validation and test statistics are as low as the training rate, but few other positives can be found. The misclassification rate is still around 15.4% (compared with 16.4% above), and the model still consistently makes the same kind of false negative misclassification.
• Classification Tree Model
Finally, above is the classification tree model, showing the efficacy of the best available model for this particular dataset. The model has close to 500 individual splits, approximately 470 after pruning; however, any number of splits above about 50 would have returned virtually identical results in terms of our peak Entropy RSquared value of 0.3834. By definition, this means that less than 40% of the variation in our results is explained by the data in our model, rendering the dataset less than fully effective at predicting whether an arrest will be made for a particular data point. The misclassification rate remains quite high at 14.9%, but for the first time among our models, false negatives are outnumbered by accurate arrest predictions.
The chart displayed below shows the lift ratio for our chosen model (the classification tree), demonstrating the final test of efficiency. In the lower portions of the chart (the highest-scoring fraction of records), we see that the model identifies arrests nearly 4 times better than other methods. Because of this result, the model may be viewed as a tepid success.
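A lift value of this kind can be checked directly; the sketch below computes lift by decile of predicted arrest probability for the tree model, reusing the assumed names from the previous sketch.

```python
import pandas as pd

# Lift by decile of predicted arrest probability for the chosen tree model.
scores = models["classification tree"].predict_proba(X_valid)[:, 1]
decile = pd.qcut(scores, 10, labels=False, duplicates="drop")

# Lift = arrest rate within a decile divided by the overall arrest rate.
base_rate = y_valid.mean()
lift_by_decile = y_valid.groupby(decile).mean() / base_rate
# The top decile's lift should approach the roughly 4x shown on the slide.
print(lift_by_decile.sort_index(ascending=False))
```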
VI. Conclusion
We were given a very limited dataset in terms of descriptive characteristics for each crime, with three quarters of the data points for any given crime describing its location or time. This limited dataset tended to hold back our model in many respects, as many of the location fields were highly correlated with one another or were not particularly useful predictors. It is remarkably clear that we would need different types of information in order to produce a model that can explain 80 percent of the variance or more.
Simply put, there is a lot more to police work than the location and type of a crime. Some data points that may have been useful include who reported the crime, whether the victim, a witness, or the police officer him/herself; you might imagine that this would show high predictive power. The response time to the reported crime may also affect the arrest rate. Much of the probability of making an arrest can hinge on factors down to the competency level of the investigating officer.
Despite a low explanation rate (RSquared) and a high misclassification rate, we can still learn much from the valuable work done in creating this model. Firstly, we determined that much of this data does have real predictive power, as evidenced by the lift ratio of nearly 4. We can also take knowledge away from the visualization process, which showed many cases where the Chicago PD is having success, such as the high arrest rate for narcotics. Finally, we can take note of the cyclical nature of criminal activity, particularly the drop in outdoor crimes during the winter and fall months and the drop in crimes occurring between 2 and 10 AM, which would allow smaller active forces during those times.