Analysis of London's Crime and Census Data

Pairview BI Developer Project
COLIN BARTRAM

Project Roadmap
Phase 1 Crime Data
1A - Dashboard
1B - Heat Map
Phase 2 Census Data
2A - Clustering
2B - Linear Regression
2C - Data Mining

Data Sources
• The Metropolitan Police Service (MPS) releases monthly
anonymised crime data.
• The Office for National Statistics (ONS) conducts censuses
every 10 years and releases demographic statistics based
on the responses. The latest available is from 2011.
• Geographic data was imported from ONS. The latest Local
Authority boundaries date from 2015.

Geographic granularity
[Diagram: geographic hierarchy of MSOAs, LSOAs and Locations (streets)]
• MPS Crime location data includes Lower-Layer Super
Output Areas (LSOA).
• LSOA can be aggregated to various local authority units
and to Middle-Layer Super Output Areas (MSOA).
• MSOA tends to be the lowest level of geography at which
Census data is published.
• MSOA is a more consistent geographic level than local
authority. An MSOA is targeted to have a population of between
5,000 and 15,000.
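The aggregation from LSOA to MSOA described above can be sketched in pandas. This is a minimal illustration only: the column names and sample codes are assumptions, not the headers of the real MPS or ONS files.

```python
import pandas as pd

# Hypothetical sample rows; column names and codes are illustrative only.
crimes = pd.DataFrame({
    "lsoa_code": ["E01000001", "E01000001", "E01000002", "E01000003"],
    "crime_type": ["Burglary", "Drugs", "Burglary", "Robbery"],
})
lookup = pd.DataFrame({
    "lsoa_code": ["E01000001", "E01000002", "E01000003"],
    "msoa_code": ["E02000001", "E02000001", "E02000002"],
})

# Attach the MSOA code to each crime via its LSOA code, then count per MSOA.
crimes_with_msoa = crimes.merge(lookup, on="lsoa_code", how="left")
crime_count_by_msoa = crimes_with_msoa.groupby("msoa_code").size()
```

The same pattern applies to a ward or local authority lookup by swapping the lookup table.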

Objectives
My objectives are to:
a. Visualise London crime data.
b. Analyse crime stats to highlight any contributory
demographic factors.
The analysis techniques to be employed will include:
• Linear regression
• K-means clustering
• Data mining using decision trees

High Level Architecture
• SSIS will be used to load the data into an MSSQL database.
• T-SQL queries will be used to populate a data warehouse.
• Power BI will be used for visualisations.
• SSAS will be used to develop a data cube.
• Excel will be used for reporting and linear regression.
• Python will be used for clustering.
• SSAS will be used for data mining and decision tree
analysis.
[Diagram: Cloud, SSIS, MSSQL, SSAS, Python]

Data Sources
The Metropolitan Police Service (MPS) releases monthly anonymised
crime data, alongside a history of the previous two years, at
https://data.police.uk/data/
They use street snap-points to geomask the locations and do not
identify the day of the month.
The MPS data includes locations at street level, including the LSOA
(Lower-Layer Super Output Area) code, which allows the data to be
aggregated at Local Authority District (LAD), Ward or Middle-Layer
Super Output Area (MSOA) level.
Geographic data is sourced from
https://data.gov.uk/dataset/bc0d1720-0275-490d-a7da-d22e69495314/lower-layer-super-output-area-2011-to-ward-2015-lookup-in-england-and-wales

Tools
• SSIS will be used to load the data into SQL Server staging
tables.
• T-SQL queries will be used to populate a SQL Server data
warehouse.
• (Staging tables are kept separate from the data warehouse so
that transformation is handled by SQL queries, for reasons of
scalability and transparency.)
• Power BI will be used to develop the dashboard.
[Diagram: Cloud, SSIS, MSSQL Staging, MSSQL DW, Power BI]

ETL Process - Extract
Download and extract the required crime and geographic data
using SSIS.
Create the PoliceStats database and staging tables on SQL Server.
Use SSIS to iterate over the folder structure to upload the
crime data to SQL Server.
Use SSIS to upload the geographic data to SQL Server.
Number of Months (CSVs): 39
Number of Locations: 65,699
Number of Recorded Crimes: 3,677,871

ETL Process - Transform
The data is well formed, and the LSOA codes linking the two data
sources are 100% valid, so no data cleansing was required.
The MPS does not currently use the fields Falls Within and Reported
By to indicate where other police services are involved.
Crimes recorded by the MPS at geographical locations outside
the Met Police area (the London Boroughs) are included and could
distort the results.
I added ‘No Location’ records with zero keys to the LAD, Ward,
MSOA, LSOA and Location tables, and updated the crime records
that have a missing location to reference the ‘No Location’ record.
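The ‘No Location’ substitution can be illustrated in pandas. This is a sketch under stated assumptions: the project performed this step in SQL Server, and the column name location_key and the sample values here are hypothetical.

```python
import pandas as pd

NO_LOCATION_KEY = 0  # key of the placeholder 'No Location' record

# Hypothetical crime rows; the second one has no recorded location.
crimes = pd.DataFrame({
    "crime_id": [1, 2, 3],
    "location_key": [101, None, 205],
})

# Point crimes with a missing location at the placeholder record
# so that every row still satisfies the foreign key constraint.
crimes["location_key"] = (
    crimes["location_key"].fillna(NO_LOCATION_KEY).astype(int)
)
```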

ETL Process - Load
I created Dimension and Fact schemas and the tables for the
SQL Server data warehouse, then added foreign key constraints and
indexes.
I merged the data from the staging tables into the data warehouse.
To enable a London-only visualisation and analysis of the data, I
created views that limit the selection to LAD codes beginning
with ‘E09’, which indicates a London Borough.
Crime data is released monthly, so each new month is added using
the same methods without affecting data from the initial load.
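The London-only rule is a simple prefix test on the LAD code. The project implemented it as SQL views; the Python below is only an equivalent sketch, and the function name and sample codes are hypothetical.

```python
def is_london_borough(lad_code: str) -> bool:
    """Return True when a LAD code starts with the London Borough prefix."""
    return lad_code.startswith("E09")


# Hypothetical sample codes for illustration.
sample_lad_codes = ["E09000001", "E07000072", "E09000033", "W06000015"]
london_only = [code for code in sample_lad_codes if is_london_borough(code)]
```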

Dashboard Visualisation Features
Selection by Ward Name (single select only).
Map display based on Latitude and Longitude. Tooltip on rollover to
include Location Name. Size of marker based on Crime Count. Filter
available on location.
Crime Trend stacked area chart, displaying Crime Type illustrated by
colour, with Crime Count on the y-axis and Month on the x-axis. Allow
selection on Year, Quarter and Month, or on Crime Type.
Month slicer to allow selection of any start point and end point.
Selection by Crime Type, allowing multiple selection. Include Select
All option and ability to deselect.

London Crime Heat Map
I also took a view of London using a Heat Map as preparation
for further analysis.
Crime levels are generally higher towards the centre, but the
more prominent feature is how patchy the distribution is. There
is a hot spot in the West End.
The next phase will attempt to explain the influencing factors.

Phase 2 Census Analysis
Goals of the Analysis
To understand whether, and to what extent, crime levels and
different types of crime are affected by socio-economic and
demographic factors.
Building on the crime data by adding MSOA-level Census data.
[Diagram: Phase 1 Crime Data ETL, Phase 2 Census Data ETL, Data Warehouse, Location Clustering, Linear Regression, Data Mining]

Review of Crime Studies
• Crime data has been subject to quite extensive analysis using data
mining and machine learning with the objective of crime prediction.
• Clearly a time series analysis using past crime data has the most
predictive power, but it lacks insight or explanatory power.
• London Landscape provides combined crime, demographic and
socio-economic datasets for local authority use.
• A US study at LSOA level from the 1990s identified Poverty,
Residential instability, Housing and commuting, Income, Population
and Family Disruption as the ‘themes’ whose measures correlate
well.
• I decided to see what could be accomplished by loading Census data
and analysing at the level of MSOA within London.

Census Data Selection
• The first stage of feature selection is choosing which Census
data to load.
• The census data presented includes stats using combinations of
features. These are not useful here, as we are interested in
separating the impacts of individual variables, so they are ignored.
• I ignored stats which offer little chance of displaying much
variability at the level of MSOA. Factors such as Gender and Age
may strongly correlate with an individual’s propensity for
criminality, but people do not tend to concentrate sufficiently
based on those factors at the level of MSOA.
• The data was sourced from
https://www.ons.gov.uk/census/2011census/2011censusdata/bulkdata/bulkdatadownloads

Census Data Extract
Census Data | Download | CSVs used | Fields | Type of Data
Detailed Characteristics 1 | BulkdatadetailedcharacteristicsmsoaE&Wandinfo3.3 | 6 | 28 | Status, Coupled Y/N, Family Type, No of Cars, No of Bedrooms, Occupation
Labour Market | BulkdataLabourMarketMSOA3.5aandinfo | 1 | 9 | NS-SeC employment classifications
Detailed Characteristics 2 | BulkdataDetailedCharacteristicsMSOAdataforE&WLowerGeographiesandinfo1207 | 5 | 41 | Shared Y/N, Central Heating Y/N, Occupancy, Ethnicity, Religion, Residence Y/N, Dwelling Type
The number of rows populated in each case was the number of MSOAs, which is 7,201.
The subset used is the records that relate to London, which number 983.

Census Data Processing
Select ONS data of individual stats for MSOAs.
Also download the geographic relationships between LSOA
and MSOA.
Use SSIS to populate the MSSQL staging tables.
Enhance the data warehouse to support census data.
Cleanse the MSOA data where relationships are out of date.

Data Cube
A data cube was created in SSAS to bring together the crime data,
aggregated to MSOA level, with the census figures.
A Geography hierarchy was added to relate the Location, LSOA and
MSOA dimensions.
Crime Per Head is calculated as the Crime Count divided by
Population.
Calculations were also created so that the variables are measured
independently of population.
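The population-independent calculations amount to simple ratios per MSOA. The pandas sketch below illustrates the idea; the project built these as calculations in the SSAS cube, and the column names and figures here are made up.

```python
import pandas as pd

# Hypothetical MSOA-level totals; names and figures are illustrative only.
msoa = pd.DataFrame({
    "msoa_code": ["E02000001", "E02000002"],
    "crime_count": [1200, 450],
    "population": [8000, 7500],
    "cars": [3000, 6000],
    "households": [3200, 3000],
})

# Dividing by a base count makes areas of different sizes comparable.
msoa["crime_per_head"] = msoa["crime_count"] / msoa["population"]
msoa["cars_per_household"] = msoa["cars"] / msoa["households"]
```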

Feature Selection
For each feature selected for the model I created calculated
variables in the data cube.
My intention was that these should be representative of the
socio-economic nature of an area and its social assets:
1. Number of Bedrooms Per Property
2. Number of Cars Per Household
3. Percentage of buildings which are Houses
4. Percentage of persons in AB (Professional/managerial)
Occupations
5. Percentage of Houses over-occupied
I then analysed these measures in Excel.

Measure Stats
A quick view of the features selected, to check that they have a
reasonable distribution of data.

Measure Scorecard
The measures provide variety and represent different
aspects of socio-economic asset availability.
1. Bedrooms Per Property - measures the housing stock
including size of properties.
2. Cars Per Household - measures the wealth of the
population and their access to public transport
infrastructure.
3. Percentage of Houses - measures the nature of the housing
stock including the household living space.
4. Percentage of persons in AB (Professional/managerial)
Occupations - measures the skills and earning potential of
the local population.
5. Percentage of Houses over-occupied - measures the
interior household environment and personal living space.

Structure of Phase 2
The measures selected will be used as inputs to all the
subsequent analyses.
Additionally, the outputs of the clustering and linear
regression exercises will be combined with the measures for
the data mining exercise.
[Diagram: measures from the Data Warehouse feed 2A - Cluster (Location Clusters), 2B - Linear Regression (Crime Type Categories) and 2C - Data Mining, leading to the Conclusion]

2A - Location Clustering
To supplement our quantitative measures, I wanted to derive
some qualitative data from the census data, to facilitate other
approaches for analysis and visualisation.
I decided to try to categorise the geographic areas (MSOAs) in
the census data.
I had no pre-conceived notion as to what these clusters should
represent, so I chose an unsupervised method.
The approach chosen is k-means clustering.

Technical Approach
Scikit-learn provides a Python package with a relevant toolset
for k-means clustering.
As we will be comparing indicators that are measured in
differing units, some pre-processing of the data is required to
counteract that impact. This is done with the StandardScaler
class.
The other Python 3 modules used are pandas for set processing
and matplotlib for the visualisations.
I also used Excel and Power BI to visualise and display the results.
[Diagram: MSSQL, SSAS, Python, Power BI]
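The standardisation step can be sketched as follows. The three columns stand in for the clustering measures, but the values are invented for illustration and are not taken from the census data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [% over-occupied, cars per household, % AB occupations].
measures = np.array([
    [30.0, 0.4, 15.0],
    [5.0, 1.5, 35.0],
    [10.0, 0.6, 50.0],
    [8.0, 1.3, 30.0],
])

# Rescale each column to zero mean and unit variance, so that no
# measure dominates the distance calculation because of its units.
scaled = StandardScaler().fit_transform(measures)
```

Without this step, a percentage column ranging over tens of points would outweigh a cars-per-household column that ranges over roughly one unit.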

Measure Pre-Selection
The relationships between Cars Per Household, Bedrooms Per House and Percentage Houses look quite strong.
Including them all may be counter-productive if they are duplicating an existing feature.
I selected Cars Per Household as representative of all three for the clustering exercise.
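One way to check for such overlap between measures is the Pearson correlation between each pair. The sketch below uses invented values; a coefficient close to 1 suggests the two measures carry much the same information.

```python
import numpy as np

# Hypothetical values for five areas; not taken from the census data.
cars_per_household = np.array([0.4, 0.6, 1.0, 1.3, 1.5])
bedrooms_per_house = np.array([1.8, 2.1, 2.6, 3.0, 3.2])

# Pearson correlation coefficient between the two measures.
r = np.corrcoef(cars_per_household, bedrooms_per_house)[0, 1]
```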

Elbow Method
By varying the number of clusters and rerunning the clustering
algorithm, we can plot the SSE (sum of squared errors), a
goodness-of-fit measure, against the number of clusters.
More clusters will always give a better fit, but we can see
that the rate of improvement reduces significantly after three,
so this is a sensible number of clusters.
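A minimal sketch of the elbow calculation with scikit-learn follows. The data is synthetic (three artificial groups) rather than the census measures, and the range of cluster counts tried is an assumption. scikit-learn exposes the SSE of a fitted model as inertia_.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: three well-separated groups in three dimensions.
centres = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 0.0], [0.0, 5.0, 5.0]])
points = np.vstack(
    [centre + rng.normal(scale=0.5, size=(50, 3)) for centre in centres]
)
scaled = StandardScaler().fit_transform(points)

# Fit k-means for each candidate number of clusters and record the SSE.
sse_by_k = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    sse_by_k[k] = model.inertia_
```

Plotting sse_by_k against k gives the elbow chart; the sensible number of clusters is where the drop in SSE from one k to the next flattens out.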

Cluster Results
Having run our three measures through the clustering algorithm using a parameter of 3 clusters, the results can be
viewed using a series of two-dimensional views of the three dimensions.

Categorisation Results
Category | Colour | Percentage of Houses over-occupied | Number of Cars Per Household | Percentage of persons in Professional or Managerial Occupations
Deprived Urban | Red | High | Low | Low
White Collar Urban | Yellow | Low | Low | High
Suburban | Blue | Low | High | Mixed
As a sanity-check on the clustering results I loaded the categories into Power BI, combining them with
the geographic identifiers (latitude and longitude) to display the result as a map.

London Cluster Visualisation
[Map of London coloured by cluster category: Deprived Urban (Red), White Collar Urban (Yellow), Suburban (Blue)]

Crime Per Head by Cluster
Crime levels are significantly lower in Suburban areas.
Are there any significant differences between the Crime Types
recorded that depend on the nature of the environment?
In Excel I merged the data cube with the cluster categorisations
and displayed the results as a series of Pie Charts.
First is a baseline Pie Chart including all of the locations, with
Crime Types ordered by volume.
Crime Type | Deprived Urban | Suburban | White-collar Urban | All
Anti-social behaviour | 0.184 | 0.097 | 0.167 | 0.155
Violence and sexual offences | 0.146 | 0.082 | 0.111 | 0.118
Vehicle crime | 0.061 | 0.051 | 0.061 | 0.058
Other theft | 0.050 | 0.028 | 0.081 | 0.054
Burglary | 0.037 | 0.031 | 0.046 | 0.038
Criminal damage and arson | 0.033 | 0.021 | 0.027 | 0.028
Public order | 0.030 | 0.016 | 0.029 | 0.026
Drugs | 0.030 | 0.011 | 0.022 | 0.022
Theft from the person | 0.018 | 0.003 | 0.042 | 0.021
Shoplifting | 0.021 | 0.012 | 0.028 | 0.021
Robbery | 0.020 | 0.007 | 0.021 | 0.017
Bicycle theft | 0.010 | 0.002 | 0.018 | 0.010
Other crime | 0.006 | 0.005 | 0.004 | 0.005
Possession of weapons | 0.004 | 0.002 | 0.003 | 0.003
All Crime Per Head | 0.65 | 0.37 | 0.66 | 0.58

[Pie Chart: crime broken down by crime type, all locations. Anti-social behaviour 27%, Violence and sexual offences 20%, Vehicle crime 10%, Other theft 9%, Burglary 7%, Criminal damage and arson 5%, Public order 4%, Drugs 4%, Theft from the person 4%, Shoplifting 4%, Robbery 3%, Bicycle theft 2%, Other crime 1%, Possession of weapons 0%]
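The baseline pie chart percentages can be reproduced from the "All" column of the crime per head table: each crime type's share is its per-head rate divided by the All Crime Per Head figure. A short check on the three largest categories, using only figures from that table:

```python
# Per-head rates for all locations, taken from the "All" column of the table.
all_crime_per_head = 0.58
per_head_all_areas = {
    "Anti-social behaviour": 0.155,
    "Violence and sexual offences": 0.118,
    "Vehicle crime": 0.058,
}

# Share of all crime, as a whole-number percentage.
share_percent = {
    crime_type: round(100 * rate / all_crime_per_head)
    for crime_type, rate in per_head_all_areas.items()
}
```

This gives 27%, 20% and 10%, matching the first three slices of the baseline chart.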

Pie Charts by Cluster
Then we have the same analysis but using the data from the
separate Clusters.
There is a degree of consistency in many of the crime type
proportions regardless of which category of area is being
analysed. Some exceptions are:
In White-Collar Urban areas there is more bicycle theft and
theft from the person.
In Suburban areas there is less crime overall but
proportionately more burglary and vehicle crime.
In Deprived Urban areas there are more drugs offences and more
violence and sexual offences.

[Pie Charts by Cluster: White-collar Urban, Suburban and Deprived Urban, each broken down by crime type]

2B - Linear Regression
To get a feel for the relationships between pairs of variables
and their strength, I chose to analyse the data using linear
regression.
1. The independent variables (demographics from the
census) are displayed on the x-axis.
2. The dependent variable (Recorded Crimes per Head) is
displayed on the y-axis.
I visualised the data in Excel with a trend line added.
I calculated the Slope and Correlation values for each using
the Excel functions.
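The slope and correlation coefficient can also be computed outside Excel. The NumPy sketch below uses invented data points and is not the method used in the project, which relied on Excel.

```python
import numpy as np

# Hypothetical values for five areas; not taken from the project data.
bedrooms_per_house = np.array([1.5, 2.0, 2.5, 3.0, 3.5])
crimes_per_head = np.array([0.9, 0.7, 0.6, 0.4, 0.3])

# Least-squares straight line: crimes_per_head = slope * bedrooms + intercept.
slope, intercept = np.polyfit(bedrooms_per_house, crimes_per_head, 1)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(bedrooms_per_house, crimes_per_head)[0, 1]
```

The slope says how much crime per head changes for each extra bedroom per house, while the correlation coefficient says how closely the points follow the trend line.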

[Scatter plots with trend lines of Crimes Per Head (y-axis) against each indicator (x-axis): Occupation AB Per Cent, Over-Occupied Per Cent, Cars Per Household, Houses Per Cent and Bedrooms Per House]

Linear Regression Result
Bedrooms Per House was the clearest socio-economic
indicator of those tested to demonstrate a relationship with
crime levels, with a Correlation Coefficient of -0.43.
Cars Per Household (-0.35) and Houses Per Cent (-0.37) also
showed a correlation.
These three indicators show a negative trend line, as expected.
Over-occupancy shows a positive trend line with a weak correlation
(0.21).
Occupation AB Per Cent showed no overall correlation (0.01), so
its trend line is essentially flat.
The outliers were identified as locations such as the West
End and Heathrow Airport, which are crime locations with a
small resident population.
I also analysed each specific Crime Type against each
variable.

Linear Regression of Demographic Indicators vs Crime per Head
Each indicator has two columns: Slope, then Correlation Coefficient.
Crime Type | Bedrooms Per House | | Cars Per Household | | Occupation AB % | | Houses % | | Overcrowded % |
 | Slope | Corr | Slope | Corr | Slope | Corr | Slope | Corr | Slope | Corr
All | -0.567 | -0.427 | -0.457 | -0.352 | 0.056 | 0.012 | -0.686 | -0.371 | 1.601 | 0.212
Anti-social behaviour | -0.148 | -0.557 | -0.125 | -0.484 | -0.076 | -0.080 | -0.184 | -0.499 | 0.554 | 0.368
Violence and sexual offences | -0.089 | -0.502 | -0.076 | -0.439 | -0.179 | -0.281 | -0.088 | -0.356 | 0.411 | 0.408
Public order | -0.026 | -0.477 | -0.021 | -0.399 | -0.008 | -0.043 | -0.030 | -0.404 | 0.073 | 0.238
Criminal damage and arson | -0.017 | -0.477 | -0.014 | -0.402 | -0.034 | -0.262 | -0.016 | -0.309 | 0.067 | 0.326
Bicycle theft | -0.019 | -0.472 | -0.017 | -0.446 | 0.042 | 0.292 | -0.029 | -0.528 | 0.019 | 0.085
Possession of weapons | -0.004 | -0.467 | -0.003 | -0.432 | -0.006 | -0.195 | -0.004 | -0.382 | 0.017 | 0.374
Drugs | -0.027 | -0.462 | -0.024 | -0.416 | -0.029 | -0.138 | -0.032 | -0.396 | 0.130 | 0.391
Robbery | -0.024 | -0.390 | -0.021 | -0.360 | 0.009 | 0.042 | -0.031 | -0.363 | 0.079 | 0.230
Burglary | -0.020 | -0.382 | -0.016 | -0.318 | 0.044 | 0.236 | -0.030 | -0.411 | 0.016 | 0.052
Other theft | -0.088 | -0.284 | -0.062 | -0.202 | 0.152 | 0.135 | -0.114 | -0.263 | 0.083 | 0.047
Theft from the person | -0.061 | -0.242 | -0.044 | -0.177 | 0.120 | 0.132 | -0.082 | -0.235 | 0.055 | 0.038
Shoplifting | -0.028 | -0.238 | -0.018 | -0.161 | 0.036 | 0.085 | -0.030 | -0.187 | 0.025 | 0.038
Vehicle crime | -0.014 | -0.192 | -0.013 | -0.180 | 0.000 | 0.000 | -0.014 | -0.138 | 0.054 | 0.133
Other crime | -0.003 | -0.074 | -0.001 | -0.033 | -0.014 | -0.089 | -0.001 | -0.017 | 0.018 | 0.072

Crime Type Categorisation
Each Crime Type can be categorised according to the strength of its
correlation with the Bedrooms Per House indicator. The correlations
are negative, so the thresholds apply to their magnitude.
A correlation magnitude greater than 0.4 is categorised as more residential in nature.
A correlation magnitude of less than 0.25 is categorised as no more likely in residential
areas (so I categorise that as Town-Centre).
Category | Crime Types
Residential | ASB, Public order, Criminal damage, Drugs, Possession of weapons, Violence and sexual offences
Neutral | Burglary and Robbery
Town-Centre | Shoplifting, Vehicle crime and Theft from the person
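The thresholds can be written as a small function. This sketch assumes they apply to the magnitude of the correlation, since the Bedrooms Per House correlations in the regression table are negative, and that anything between the two thresholds is Neutral. The function name is hypothetical; the sample values come from the regression table.

```python
def categorise_crime_type(correlation: float) -> str:
    """Map a Bedrooms Per House correlation to a crime type category."""
    strength = abs(correlation)  # assumption: thresholds apply to magnitude
    if strength > 0.4:
        return "Residential"
    if strength < 0.25:
        return "Town-Centre"
    return "Neutral"


# Bedrooms Per House correlations from the regression table.
bedrooms_correlation = {
    "Anti-social behaviour": -0.557,
    "Drugs": -0.462,
    "Robbery": -0.390,
    "Burglary": -0.382,
    "Shoplifting": -0.238,
    "Vehicle crime": -0.192,
}
categories = {
    crime_type: categorise_crime_type(r)
    for crime_type, r in bedrooms_correlation.items()
}
```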

2C - Data Mining
I wanted to investigate further whether the features chosen
so far would be able to provide a predictive capability using
the SSAS data mining capability.
Data mining enables decision tree and multivariate analysis,
so it should be possible to tease out more subtle
relationships.
Also, by slicing the data cube by Crime Type Category we can
see how the results vary.

The Data Mining Model
I set up a Mining Structure based on the MSOA dimension, with
MSOA Code as the key.
I selected Crimes Per Head as the dependent variable, to be
predicted.
I selected the Census measures, including Cluster Name, as the
independent variables (Input).
I used the default of reserving 30% of records for the testing
set.
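For readers without SSAS, a comparable setup can be sketched with scikit-learn. This is a swapped-in technique, not the SSAS mining model: the data is synthetic, the feature names are placeholders, and the R-squared reported by score is not the same measure as the SSAS score.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n_areas = 100

# Synthetic stand-in inputs; not the census measures.
bedrooms_per_house = rng.uniform(1.0, 4.0, n_areas)
cars_per_household = rng.uniform(0.2, 1.8, n_areas)
X = np.column_stack([bedrooms_per_house, cars_per_household])

# Synthetic target: crime per head falls as bedrooms per house rises.
y = 1.5 - 0.3 * bedrooms_per_house + rng.normal(scale=0.05, size=n_areas)

# Reserve 30% of the records for testing, as in the mining structure.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)
r_squared = tree.score(X_test, y_test)  # R-squared on the held-out records
```

Holding back a testing set means the score reflects how well the tree generalises to areas it was not trained on.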

Decision Tree Analysis
I ran the Mining Model based on the Decision Tree algorithm.
The tree shows initial splits based on Bedrooms Per
House, and substantial variety in subsequent splits and
influencing variables.
Cluster Name does not feature as an influencing variable
(as you might expect, since it is derived), but it is used at
many decision points.
The Mining Legends shown are those with the most cases
from the second level.

Decision Tree

Dependency Network
The Dependency Network identifies Bedrooms Per House
as the strongest link, followed by Cluster Category.
The predictive capability is genuine but is not strong.

Testing the Model
By applying data slices to restrict the data used by Crime Type
category, better predictive capability can be achieved.
This is particularly true for the crime types that we previously
categorised as Residential crime.
Crime Type Category | Score
All | 0.33
Residential | 0.78
Neutral | 0.51
Town-Centre | 0.41

Residential Crime Category

Residential Crime Type

Conclusions
1. The clustering of areas by demographics did provide a
useful additional indicator for the model.
2. The data mining confirms the regression analysis
conclusion that Bedrooms Per House is the strongest
demographic indicator.
3. Bedrooms Per House, and the model generally, have
predictive capability for a subset of crime types.

Opportunities for Further Analysis
1. There are other demographic indicators available which I have
not yet explored using this crime dataset.
2. There is an archive of crime data which I have not accessed. It
would be interesting to explore whether there is any substantial
difference by shifting the time frame backwards (closer to the
2011 census date).
3. The Ministry of Housing, Communities & Local Government
publishes its own indexes of deprivation at LSOA level. While
they constitute a more limited set of features, it may be that
relationships at that level of granularity do become more
apparent.
