This document outlines a project to analyze crime and census data in London. It describes a multi-phase approach including: 1) loading and visualizing crime data, 2) adding census data to the model and performing clustering and regression analysis, and 3) using the results to inform data mining. Key analysis techniques include k-means clustering of census variables to categorize areas, linear regression of census factors on crime types, and decision tree analysis using both crime and census data. The goal is to understand how socioeconomic factors relate to crime levels and types in different parts of London.
2. Project Roadmap
Phase 1 Crime Data
1A - Dashboard
1B - Heat Map
Phase 2 Census Data
2A - Clustering
2B - Linear Regression
2C - Data Mining
DASHBOARD & ANALYSIS OF LONDON'S CRIME AND CENSUS DATA 2
3. Data Sources
• The Metropolitan Police Service (MPS) releases monthly
anonymised crime data.
• The Office for National Statistics (ONS) conducts censuses
every 10 years and releases demographic statistics based
on the responses. The latest available is from 2011.
• Geographic data was imported from ONS. The latest Local
Authority boundaries date from 2015.
4. Geographic granularity
Hierarchy: MSOAs > LSOAs > Locations (streets)
• MPS Crime location data includes Lower-Layer Super
Output Areas (LSOA).
• LSOA can be aggregated to various local authority units
and to Middle-Layer Super Output Areas (MSOA).
• MSOA tends to be the lowest level of geography at which
Census data is published.
• MSOA is a more consistent geographic level than local
authority. An MSOA is targeted to have between 5000 and
15000 population.
5. Objectives
My objectives are to:
a. Visualise London crime data.
b. Analyse crime stats to highlight any contributory
demographic factors.
The analysis techniques to be employed will include:
• Linear regression
• K-means clustering
• Data mining using decision trees
6. High Level Architecture
• SSIS will be used to load data into an MSSQL database.
• T-SQL queries will be used to populate a data warehouse.
• Power BI will be used for visualisations.
• SSAS will be used to develop a data cube.
• Excel will be used for reporting and linear regression.
• Python will be used for clustering.
• SSAS will be used for data mining and decision tree
analysis.
[Diagram: Cloud, SSIS, MSSQL, SSAS, Python]
8. Data Sources
The Metropolitan Police Service (MPS) releases monthly anonymised
crime data, alongside a history of the previous two years, at
https://data.police.uk/data/
They use street snap-points to geomask the locations and do not
identify the day of the month.
The MPS data includes locations at street level, including the LSOA
(Lower Layer Super Output Area) code, which allows the data to be
aggregated at Local Authority (LAD), Ward or Middle-Layer Super
Output Area (MSOA) level.
Geographic data is sourced from
https://data.gov.uk/dataset/bc0d1720-0275-490d-a7da-d22e69495314/lower-layer-super-output-area-2011-to-ward-2015-lookup-in-england-and-wales
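As a rough illustration of the aggregation described above, the sketch below rolls LSOA-level crime counts up to MSOA level with pandas. The column names (lsoa_code, msoa_code, crime_count) and the sample codes are hypothetical; they are not taken from the MPS or ONS files, and the project itself did this step in SQL Server.

```python
import pandas as pd

# Hypothetical crime counts keyed by LSOA code (illustrative values only).
crimes = pd.DataFrame({
    "lsoa_code": ["L001", "L002", "L003", "L001"],
    "crime_count": [5, 3, 7, 2],
})

# Hypothetical LSOA-to-MSOA lookup, standing in for the ONS lookup file.
lookup = pd.DataFrame({
    "lsoa_code": ["L001", "L002", "L003"],
    "msoa_code": ["M01", "M01", "M02"],
})

# Attach the MSOA code to each crime row, then sum the counts per MSOA.
merged = crimes.merge(lookup, on="lsoa_code", how="left")
msoa_totals = merged.groupby("msoa_code")["crime_count"].sum().to_dict()
print(msoa_totals)
```

The same join-then-group pattern would apply to Ward or LAD level, given a lookup with those codes.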
9. Tools
SSIS will be used to load into SQL Server staging
tables.
T-SQL queries will be used to populate a
SQL Server data warehouse.
(Staging tables are kept separate from the data
warehouse, and transformation is handled by SQL
queries, for reasons of scalability and transparency.)
Power BI will be used to develop the dashboard.
[Diagram: Cloud, SSIS, MSSQL Staging, MSSQL DW, Power BI]
10. ETL Process
Download and extract the required Crime and geographic data
using SSIS.
Create the PoliceStats database and staging tables on SQL Server.
Use SSIS to iterate over the folder structure to upload the
Crime data to SQL Server.
Use SSIS to upload the geographic data to SQL Server.
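The folder iteration above was done in SSIS. For readers without SSIS, a minimal Python sketch of the same idea follows. It assumes one subfolder per month with one CSV in each; that layout, the file names and the column names are illustrative assumptions, and the sketch builds its own throwaway sample files so that it runs on its own.

```python
import csv
import tempfile
from pathlib import Path


def load_crime_rows(root: Path) -> list[dict]:
    """Read every CSV found under root, in path order."""
    rows = []
    for csv_path in sorted(root.rglob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Keep the source file name so each row can be traced back.
                row["source_file"] = csv_path.name
                rows.append(row)
    return rows


# Build a tiny throwaway folder tree (hypothetical layout and columns).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for month in ("2017-01", "2017-02"):
        folder = root / month
        folder.mkdir()
        target = folder / f"{month}-street.csv"
        with target.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["Month", "LSOA code", "Crime type"])
            writer.writerow([month, "L001", "Burglary"])
    all_rows = load_crime_rows(root)

print(len(all_rows))
```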
Stage: Extract (of Extract, Transform, Load)
Number of Months (CSVs) 39
Number of Locations 65,699
Number of Recorded Crimes 3,677,871
11. ETL Process
The data is well formed, and the LSOA codes linking the two data
sources are 100% valid, so no data cleansing was required.
The MPS does not currently use the FallsWithin and Reported
By fields to indicate where other Police Services are involved.
Crimes recorded by the MPS at geographical locations outside
the Met Police area (London Boroughs) are included and could
distort the results.
I added ‘No Location’ records with zero keys to the LAD, Ward,
MSOA, LSOA and Location tables, and updated the Crime data
with missing locations to reference the ‘No Location’ record.
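The ‘No Location’ approach above can be sketched in a few lines of Python. This is only an illustration of the idea; the project implemented it in SQL Server, and the key values, codes and field names below are hypothetical.

```python
# Zero is the surrogate key of the 'No Location' record, as in the text.
NO_LOCATION_KEY = 0

# Hypothetical mapping from LSOA code to surrogate key in the LSOA table.
lsoa_keys = {"L001": 1, "L002": 2}

# Hypothetical staged crime rows; None marks a crime with no location.
staged_crimes = [
    {"crime_id": "a", "lsoa_code": "L001"},
    {"crime_id": "b", "lsoa_code": None},
    {"crime_id": "c", "lsoa_code": "L002"},
]

# Rows with a missing location fall back to the 'No Location' key, so
# every fact row still has a valid foreign key. (Unknown codes would also
# fall back to zero here; the source reports that all codes were valid.)
fact_rows = [
    {
        "crime_id": row["crime_id"],
        "lsoa_key": lsoa_keys.get(row["lsoa_code"], NO_LOCATION_KEY),
    }
    for row in staged_crimes
]
print(fact_rows)
```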
Stage: Transform (of Extract, Transform, Load)
12. ETL Process
Created Dimension and Fact schemas and the tables for the
SQL Server data warehouse. Added foreign key constraints and
indexes.
Merged data from the staging tables into the data warehouse.
To enable a London-only visualisation and analysis of the data, I
created views that limit the selection to LAD codes beginning
with ‘E09’, which indicates a London Borough.
Crime data is released monthly, so new months are added using the
same methods without affecting data from the initial load.
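The London-only selection described above was implemented as SQL views. The equivalent prefix test can be sketched in Python as follows; the function name and the sample codes are illustrative only.

```python
def is_london_borough(lad_code: str) -> bool:
    """Return True when a LAD code starts with 'E09'."""
    return lad_code.startswith("E09")


# Illustrative LAD codes, not taken from the project data.
lad_codes = ["E09000001", "E09000033", "E07000066", "E06000034"]
london_codes = [code for code in lad_codes if is_london_borough(code)]
print(london_codes)
```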
Stage: Load (of Extract, Transform, Load)
14. Dashboard Visualisation Features
Selection by Ward Name (single select only).
Map display based on Latitude and Longitude. Tooltip on rollover to
include Location Name. Size of marker based on Crime Count. Filter
available on location.
Crime Trend stacked area chart, displaying Crime Type illustrated by
colour, with Crime Count on the y-axis and Month on the x-axis. Allow
selection on Year, Quarter and Month, or on Crime Type.
Month slicer to allow selection of any start point and end point.
Selection by Crime Type, allowing multiple selection. Include Select
All option and ability to deselect.
17. London Crime Heat Map
I also took a view of London
using a Heat Map as preparation
for further analysis.
Crime levels are generally higher
towards the centre, but the
patchiness of the pattern is the
more prominent feature. There is
a West End hot spot.
The next phase will attempt to
explain the influencing factors.
19. Phase 2 Census Analysis
Goals of the Analysis
To understand whether and
to what extent crime levels
and different types of crime
are affected by socio-
economic and demographic
factors.
Building on the Crime data
by adding MSOA-level
Census data.
[Diagram: Phase 1 Crime Data ETL, Phase 2 Census Data ETL, Data Warehouse, Location Clustering, Linear Regression, Data Mining]
20. Review of Crime Studies
Crime data has been subject to quite extensive analysis using data
mining and machine learning, with the objective of crime prediction.
A time series analysis using past crime data clearly has the most
predictive power, but it lacks insight or explanatory power.
London Landscape provides combined crime, demographic and
socio-economic datasets for local authority use.
A US study at LSOA level from the 1990s identified Poverty,
Residential instability, Housing and commuting, Income, Population
and Family Disruption as the ‘themes’ whose measures correlate
well.
I decided to see what could be accomplished by loading Census data
and analysing at the level of MSOA within London.
21. Census Data Selection
The first stage of feature selection involves a selection of the
Census data to be loaded.
The census data presented includes stats using combinations of
features. These are not useful here, as we are interested in
separating the impacts of individual variables, so they are ignored.
I ignored stats which offer little chance of displaying much
variability at the level of MSOA. Factors such as Gender and Age
may strongly correlate with an individual’s propensity for
criminality, but people do not tend to concentrate sufficiently
based on those factors at the level of MSOA.
The data was sourced from
https://www.ons.gov.uk/census/2011census/2011censusdata/bulkdata/bulkdatadownloads
22. Census Data Extract
Census Data | Download | CSVs used | Fields | Type of Data
Detailed Characteristics 1 | BulkdatadetailedcharacteristicsmsoaE&Wandinfo3.3 | 6 | 28 | Status, Coupled Y/N, Family Type, No of Cars, No of Bedrooms, Occupation
Labour Market | BulkdataLabourMarketMSOA3.5aandinfo | 1 | 9 | NS-SeC employment classifications
Detailed Characteristics 2 | BulkdataDetailedCharacteristicsMSOAdataforE&WLowerGeographiesandinfo1207 | 5 | 41 | Shared Y/N, Central Heating Y/N, Occupancy, Ethnicity, Religion, Residence Y/N, Dwelling Type
The number of rows populated in each case was the number of MSOAs which is 7201.
The subset of this which will be used are the records that relate to London which number 983.
23. Census Data Processing
Select ONS data of individual stats for MSOAs.
Also download geographic relationships between LSOA
and MSOA.
Use SSIS to populate MSSQL Staging tables.
Enhance the data warehouse to support census data.
Cleanse the MSOA data where relationships are out of date.
[Diagram: Cloud, SSIS, MSSQL, SSAS, Python]
25. Data Cube
A data cube was created in SSAS to bring together crime data
aggregated to the MSOA level, with the census figures.
A Geography Hierarchy was added to relate the Dimensions Location,
LSOA and MSOA.
Crime Per Head is calculated as the Crime Count divided by
Population.
Calculations were created so that the variables are measured
independently of population.
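The Crime Per Head calculation above is a simple ratio. A minimal Python sketch is given below; the project defined it as a calculated measure in the SSAS cube, and the function name, the zero-population guard and the sample figures are assumptions made for illustration.

```python
def crime_per_head(crime_count: int, population: int) -> float:
    """Crime Count divided by Population for one area."""
    if population <= 0:
        # Guard added for this sketch; the source does not describe one.
        raise ValueError("population must be positive")
    return crime_count / population


# Hypothetical MSOA with 7,500 residents and 4,350 recorded crimes.
example = crime_per_head(4350, 7500)
print(round(example, 2))
```

Dividing by population in this way is what lets areas of different sizes be compared on the same scale.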
Hierarchy: MSOAs > LSOAs > Locations (streets)
26. Feature Selection
For each feature selected for the model I created calculated
variables in the data cube.
My intention was that these should be representative of the
socio-economic nature of an area and its social assets:
1. Number of Bedrooms Per Property
2. Number of Cars Per Household
3. Percentage of buildings which are Houses
4. Percentage of persons in AB (Professional/managerial)
Occupations
5. Percentage of Houses over-occupied
I then analysed these measures in Excel.
27. Measure Stats
A quick view of the selected features,
to check that they have a
reasonable distribution of data.
28. Measure Scorecard
The measures provide variety and represent different
aspects of socio-economic asset availability.
1. Bedrooms Per Property - measures the housing stock
including size of properties.
2. Cars Per Household - measures the wealth of the
population and their access to public transport
infrastructure.
3. Percent of houses - measures the nature of the housing
stock including the household living space.
4. Percentage of persons in AB (Professional/managerial)
Occupations – measures the skills and earning potential of
the local population.
5. Percentage of Houses over-occupied – measures the
interior household environment and personal living space.
29. Structure of Phase 2
The measures selected will
be used as inputs to all the
subsequent analyses.
Additionally the outputs of
the clustering and linear
regression exercises will be
combined with the
measures for the data
mining exercise.
[Diagram: the Data Warehouse measures feed 2A - Cluster, 2B - Linear Regression and 2C - Data Mining. 2A outputs Location Clusters and 2B outputs Crime Type Categories, both of which feed 2C, leading to the Conclusion.]
30. 2A - Location Clustering
To supplement our quantitative measures, I wanted to derive
some qualitative data from the census data, to facilitate other
approaches for analysis and visualisation.
I decided to try to categorise the geographic areas (MSOAs) in
the census data.
I had no pre-conceived notion as to what these clusters should
represent, so I chose an unsupervised method.
The approach chosen is k-means clustering.
31. Technical Approach
Scikit-learn provides a Python package with a relevant toolset
for k-means clustering.
As we will be comparing indicators that are measured in
differing units, some pre-processing of the data is required to
counteract that impact. This will be done with scikit-learn's
StandardScaler.
Other Python 3 packages used will be pandas for data processing
and matplotlib for the visualisations.
I also used Excel and Power BI to visualise and display results.
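A minimal sketch of the scaling and clustering steps described above, using scikit-learn. The data is synthetic and stands in for the MSOA measures; the feature values, the choice of three columns and parameters such as n_init and random_state are assumptions for illustration, not the author's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for MSOA measures. Columns (hypothetical values):
# over-occupied %, cars per household, occupation AB %.
rng = np.random.default_rng(0)
centres = np.array([[30.0, 0.5, 15.0], [8.0, 0.6, 45.0], [6.0, 1.4, 30.0]])
features = np.vstack(
    [c + rng.normal(scale=[2.0, 0.05, 3.0], size=(50, 3)) for c in centres]
)

# Rescale each column to zero mean and unit variance so that measures in
# different units contribute comparably to the distance calculation.
scaled = StandardScaler().fit_transform(features)

# Fit k-means and get one cluster label per row.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(scaled)
print(np.bincount(labels))
```

Without the scaling step, a percentage column ranging over tens of points would dominate a cars-per-household column ranging between roughly 0 and 2.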
[Diagram: MSSQL, SSAS, Power BI, Python]
33. Elbow Method
By varying the number of
clusters and rerunning the
clustering algorithm, we can plot
the SSE (sum of squared errors), a
goodness-of-fit measure, against
the number of clusters.
More clusters will give a better
fit, but we can see that the rate
of improvement reduces
significantly after three, so this is
a sensible number of clusters.
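The elbow procedure above can be sketched as a loop over candidate cluster counts. The data below is synthetic (three well-separated groups), so the elbow at three is built in by construction; it does not reproduce the project's result. In scikit-learn the SSE of a fitted model is exposed as the inertia_ attribute.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data with three groups (illustrative only).
rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 6.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(60, 2)) for c in centres])
scaled = StandardScaler().fit_transform(points)

# Refit k-means for each candidate k and record the SSE.
sse_by_k = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    sse_by_k[k] = model.inertia_

# The elbow is where the improvement from adding a cluster falls away.
for k in range(2, 7):
    print(k, round(sse_by_k[k - 1] - sse_by_k[k], 2))
```

Plotting sse_by_k with matplotlib (k on the x-axis, SSE on the y-axis) gives the usual elbow chart.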
35. Categorisation Results
Category | Colour | Percentage of Houses over-occupied | Number of Cars Per Household | Percentage of persons in Professional or Managerial Occupations
Deprived Urban | Red | High | Low | Low
White Collar Urban | Yellow | Low | Low | High
Suburban | Blue | Low | High | Mixed
As a sanity check on the clustering results, I loaded the categories into Power BI, combining them with
the geographic identifiers (latitude and longitude) to display the result as a map.
37. Crime Per Head by Cluster
Crime levels are significantly
lower in Suburban areas.
Are there any significant
differences between the Crime
types recorded that depend on
the nature of the environment?
In Excel I merged the data cube
with the cluster categorisations
and displayed the results as a
series of Pie Charts.
First is a baseline Pie Chart
including all of the locations, with
Crime Types ordered by volume.
Crime Type | Deprived Urban | Suburban | White-collar Urban | All
Anti-social behaviour | 0.184 | 0.097 | 0.167 | 0.155
Violence and sexual offences | 0.146 | 0.082 | 0.111 | 0.118
Vehicle crime | 0.061 | 0.051 | 0.061 | 0.058
Other theft | 0.050 | 0.028 | 0.081 | 0.054
Burglary | 0.037 | 0.031 | 0.046 | 0.038
Criminal damage and arson | 0.033 | 0.021 | 0.027 | 0.028
Public order | 0.030 | 0.016 | 0.029 | 0.026
Drugs | 0.030 | 0.011 | 0.022 | 0.022
Theft from the person | 0.018 | 0.003 | 0.042 | 0.021
Shoplifting | 0.021 | 0.012 | 0.028 | 0.021
Robbery | 0.020 | 0.007 | 0.021 | 0.017
Bicycle theft | 0.010 | 0.002 | 0.018 | 0.010
Other crime | 0.006 | 0.005 | 0.004 | 0.005
Possession of weapons | 0.004 | 0.002 | 0.003 | 0.003
All Crime Per Head | 0.65 | 0.37 | 0.66 | 0.58
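The per-head figures in the table above can be turned into the percentage shares used for a pie chart by dividing each value by the column total. The sketch below does this for the All column, using the values as printed. Because those values are already rounded to three decimals, a computed share can differ by a percentage point from one derived from unrounded data.

```python
# 'All' column of the crimes-per-head table above, as printed.
all_per_head = {
    "Anti-social behaviour": 0.155,
    "Violence and sexual offences": 0.118,
    "Vehicle crime": 0.058,
    "Other theft": 0.054,
    "Burglary": 0.038,
    "Criminal damage and arson": 0.028,
    "Public order": 0.026,
    "Drugs": 0.022,
    "Theft from the person": 0.021,
    "Shoplifting": 0.021,
    "Robbery": 0.017,
    "Bicycle theft": 0.010,
    "Other crime": 0.005,
    "Possession of weapons": 0.003,
}

# Share of each crime type = its per-head rate / the sum of all rates.
total = sum(all_per_head.values())
shares = {
    crime: round(100 * rate / total) for crime, rate in all_per_head.items()
}
for crime, share in shares.items():
    print(f"{crime}: {share}%")
```

Running the same calculation on a cluster column gives that cluster's pie chart shares, which is how proportions can be compared between clusters even though their overall crime levels differ.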
38. Pie Chart: Crime broken down by crime type
Anti-social behaviour 27%
Violence and sexual offences 20%
Vehicle crime 10%
Other theft 9%
Burglary 7%
Criminal damage and arson 5%
Public order 4%
Drugs 4%
Theft from the person 4%
Shoplifting 4%
Robbery 3%
Bicycle theft 2%
Other crime 1%
Possession of weapons 0%
39. Pie Charts by Cluster
Then we have the same analysis, but using the data from the
separate Clusters.
There is a degree of consistency in many of the crime type
proportions regardless of which category of area is being
analysed. Some exceptions are:
In White-Collar Urban areas there is more bicycle theft and
theft from the person.
In Suburban areas there is less crime overall but
proportionately more burglary and vehicle crime.
In Deprived Urban areas there are more drugs offences and more
violence and sexual offences.
40. Pie Charts By Cluster
[Figure: three pie charts of crime type proportions, one each for the White-collar Urban, Suburban and Deprived Urban clusters.]
41. 2B - Linear Regression
To get a feel for the relationships between pairs of variables
and their strength, I chose to analyse the data using linear
regression.
1. The independent variables (demographics from the
census) are displayed on the x-axis.
2. The dependent variable (Recorded Crimes per Head) is
displayed on the y-axis.
I visualised the data in Excel with a trend line added.
I calculated the Slope and Correlation values for each using
the Excel functions.
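For readers who prefer Python to Excel, the slope and correlation of one variable pair can be computed with NumPy as sketched below. The six data points are invented for illustration and are not project data. The results correspond roughly to what Excel's SLOPE and CORREL worksheet functions return, though the source does not say which Excel functions were used.

```python
import numpy as np

# Hypothetical values for six areas (not project data).
bedrooms_per_house = np.array([1.8, 2.1, 2.4, 2.7, 3.0, 3.3])
crimes_per_head = np.array([0.90, 0.70, 0.65, 0.50, 0.45, 0.30])

# Least-squares straight line: crimes_per_head = slope * x + intercept.
slope, intercept = np.polyfit(bedrooms_per_house, crimes_per_head, deg=1)

# Pearson correlation coefficient between the two variables.
correlation = np.corrcoef(bedrooms_per_house, crimes_per_head)[0, 1]

print(f"slope={slope:.3f} intercept={intercept:.3f} r={correlation:.3f}")
```

The slope gives the direction and steepness of the trend line, while the correlation coefficient indicates how tightly the points follow it.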
42. [Figure: four scatter plots of Crimes Per Head (y-axis) against Occupation AB Per Cent, Over-Occupied Per Cent, Cars Per Household and Houses Per Cent (x-axis).]
44. Linear Regression Result
Bedrooms per House was the clearest socio-economic
indicator of those tested to demonstrate a relationship with
crime levels, with a Correlation Coefficient of -0.43.
Cars per Household and Houses Per Cent also showed a
correlation.
All of the indicators except over-occupancy show a negative
trend line, which is as expected.
Over-occupancy had a weak correlation and Occupation
AB Per Cent showed no overall correlation.
The outliers were identified as locations such as the West
End and Heathrow Airport, which are crime locations with a
small resident population.
I also analysed each specific Crime Type against each
variable.
46. Crime Type Categorisation
Each Crime Type can be categorised according to the degree of correlation it
has with the Bedrooms Per House indicator.
A Correlation of greater than 0.4 is categorised as more residential in nature.
A Correlation of less than 0.25 is categorised as no more likely in residential
areas (so I categorise that as Town-Centre).
Correlations between these two thresholds are categorised as Neutral.
Category | Crime Types
Residential | ASB, Public order, Criminal Damage, Drugs, Possession of Weapons, Violence and sexual offences
Neutral | Burglary and robbery
Town-Centre | Shoplifting, Vehicle Crime and theft from the person
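The threshold rule above can be written as a small function. One assumption is made here and should be checked against the original workbook: the thresholds are applied to the strength (absolute value) of the correlation, since the overall Bedrooms Per House correlation reported earlier is negative (-0.43). The function name is hypothetical.

```python
def categorise_crime_type(correlation: float) -> str:
    """Map a correlation with Bedrooms Per House to a category."""
    # Assumption: thresholds apply to the magnitude of the correlation.
    strength = abs(correlation)
    if strength > 0.4:
        return "Residential"
    if strength < 0.25:
        return "Town-Centre"
    return "Neutral"


# Illustrative correlation values, not taken from the project results.
for value in (-0.43, -0.30, -0.10):
    print(value, categorise_crime_type(value))
```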
47. 2C – Data Mining
I wanted to investigate further whether the features chosen
so far would be able to provide a predictive capability using
the SSAS data mining capability.
Data mining would enable decision tree and multi-variate
capability, so it should be possible to tease out more subtle
relationships.
Also, by slicing the data cube by Crime Type Category we can
see how the results vary.
48. The Data Mining Model
I set up a Mining Structure based on the
MSOA dimension, with MSOA Code as the key.
I selected Crimes Per Head as the dependent
variable (to be predicted).
I selected the Census measures, including
Cluster Name, as the independent variables
(Input).
I used the default of reserving 30% of records
for the testing set.
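The model above was built in SSAS. As a stand-in for readers without SSAS, the sketch below sets up a comparable experiment in scikit-learn: a 30% hold-out and a decision tree predicting crimes per head from two measures. It uses a different implementation from the SSAS Decision Trees algorithm, the data is synthetic, and the tree depth and random seeds are arbitrary choices, so its numbers bear no relation to the project's results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic areas: crimes per head falls as bedrooms per house rises.
rng = np.random.default_rng(2)
n_areas = 200
bedrooms = rng.uniform(1.5, 3.5, n_areas)
cars = rng.uniform(0.3, 1.6, n_areas)
crimes_per_head = 1.5 - 0.35 * bedrooms + rng.normal(scale=0.05, size=n_areas)

# Reserve 30% of the records for testing, as in the text.
X = np.column_stack([bedrooms, cars])
X_train, X_test, y_train, y_test = train_test_split(
    X, crimes_per_head, test_size=0.3, random_state=0
)

# Fit a shallow regression tree and score it on the held-out records.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)
r_squared = tree.score(X_test, y_test)
importances = dict(
    zip(["bedrooms_per_house", "cars_per_household"], tree.feature_importances_)
)
print(round(r_squared, 2), importances)
```

The feature importances play a role loosely similar to the influencing variables that SSAS reports, showing which inputs the tree relied on for its splits.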
49. Decision Tree Analysis
I ran the Mining Model using the Decision Tree algorithm.
The tree shows initial splits based on Bedrooms Per
House, and substantial variety in subsequent splits and
influencing variables.
Cluster Name does not feature as an influencing variable
(as you may expect since it is derived) but it is used in
many decision points.
The Mining Legends shown are those with the most cases
from the second level.
51. Dependency Network
The Dependency Network
identifies Bedrooms Per House
as the strongest link, followed by
Cluster Category.
The predictive capability is
genuine but is not strong.
53. Testing the Model
By applying data slices to restrict the data used by Crime Type
category, better predictive capability can be achieved.
This is particularly true for the crime types that we previously
categorised as Residential crime.
Crime Type Category Score
All 0.33
Residential 0.78
Neutral 0.51
Town-Centre 0.41
56. Conclusions
1. The clustering of areas by demographics did provide a
useful additional indicator for the model.
2. The data mining confirms the regression analysis
conclusion that Bedrooms Per House is the strongest
demographic indicator.
3. The Bedrooms Per House impact, and the model generally,
have predictive capability for a subset of crime types.
57. Opportunities for Further Analysis
1. There are other demographic indicators available which I have
not yet explored using this crime dataset.
2. There is an archive of crime data which I have not accessed. It
would be interesting to explore whether there is any substantial
difference by shifting the time frame backwards (closer to the
2011 census date).
3. The Ministry of Housing, Communities & Local Government
publishes its own indexes of deprivation at LSOA level. While
they constitute a more limited set of features, it may be that
relationships at that level of granularity do become more
apparent.