This document summarizes an analysis of road accident data from the UK from 2005-2014. Some key findings include:
- Accident rates have gradually decreased over time but increased from 2012-2014. Car sales levels impacted accident rates.
- The most accidents occurred in Birmingham and other large cities. The most common accident days were Fridays and times were during morning and evening commutes.
- Factors like road type, speed limits, and age/gender of drivers influenced accident rates and outcomes. 30mph single carriageway roads near junctions had the highest rates.
- Younger drivers (teens) on motorbikes and cars saw increasing accident involvement in recent years. Overtaking maneuvers were a
UK Road Accident Analysis: 2005-2014
Analysis of UK Road Accidents
Author - Krishnendu Das
Student id - 28980980
Tutor – Yalong Yang
Introduction:
Road accidents kill around one million people worldwide every year. This report explores
accident data for the United Kingdom (UK) from 2005 to 2014. Around twenty thousand
records collected over these years provide significant insight into accident behaviour and the
factors influencing accidents in the UK.
Motivation:
During my year-long stay in the United Kingdom in 2015, I had the privilege of making some lifelong
friends who commute to work daily. One of the problems they complained about most often
was frequent road accidents, which delayed their journeys and severely affected their efficiency
at work. As a data scientist, exploring the various aspects of crashes in the UK allows me to
extract critical information that citizens can use to avoid future disasters. The following
questions guided the data analysis and exploration:
• Which conditions affect accidents in the UK?
• How do age, gender, and the day of the week influence accidents?
• Which vehicles are more prone to accidents?
Data Source:
Under the UK’s data collection policy, STATS19 is a set of protocols that lays out the
guidelines for the information collected when a crash happens. The data is not public in its
entirety because of confidentiality. The information recorded in STATS19 has three distinct elements:
• Accidents.csv: Circumstances of the crash, comprising three components:
1. Date and time of the accident
2. Location of the accident with junction details
3. The conditions in the area, such as weather and light
• Vehicles.csv: Record of the vehicle involved in the crash, comprising two
essential elements:
1. Age of the driver
2. Sex of the driver
• Casualities.csv: Details of the casualty, comprising:
1. The severity of the injury
The metadata definitions of the variables in the data can be found at
http://data.dft.gov.uk/road-accidents-safety-data/Road-Accident-Safety-Data-Guide.xls.
Link to data: https://bit.ly/2HwfIBK
Additional data files were used to support the hypotheses derived from the analysis:
• VEH01: UK car sales data from the licensed vehicles and new registrations tables,
produced by the Department for Transport.
Data: https://www.gov.uk/government/statistical-data-sets/all-vehicles-veh01
• RAS51012: UK reported drink-driving data obtained from the Department for
Transport.
Data: https://www.gov.uk/government/statistical-data-sets/ras51-reported-drinking-and-driving
• SPE0111: Estimates of vehicle compliance with speed limits on roads.
Data: https://www.gov.uk/government/statistical-data-sets/spe01-vehicle-speeds
Wrangling:
Wrangling the primary data was a little challenging because of the number of records, but
wrangling the supporting data was harder. The main files were in CSV format and held
around 1.6 million records each. Wrangling was done entirely in R. The data were mostly
structured and contained more than 70 features. The following formatting strategies were followed:
1. The columns were renamed, as several names contained blanks
2. The files were read using read_csv
3. The three files were merged using the dplyr package
4. Once combined, selected columns were extracted into a single data frame
5. The values in the columns are coded factors and had to be joined with the metadata
definitions to be meaningful
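Step 5 can be illustrated with a minimal sketch. The report did this in R by joining against the metadata worksheets; the Python below is only an illustration. The lookup table, the function name `decode_column` and the sample rows are assumptions made for the example, not the author's code, and the code-to-label mapping should be checked against the metadata guide.

```python
# Hypothetical lookup table standing in for one worksheet of the metadata guide.
# The mapping is assumed here; verify it against the guide before relying on it.
SEVERITY_LABELS = {1: "Fatal", 2: "Serious", 3: "Slight"}


def decode_column(rows, column, labels, unknown="Unknown"):
    """Return copies of the rows with the coded value in `column` replaced by its label."""
    decoded = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row[column] = labels.get(int(row[column]), unknown)
        decoded.append(new_row)
    return decoded


# Made-up sample records; real rows would come from the accident file.
sample = [
    {"Accident_Index": "A1", "Accident_Severity": "3"},
    {"Accident_Index": "A2", "Accident_Severity": "1"},
]
decoded_sample = decode_column(sample, "Accident_Severity", SEVERITY_LABELS)
print(decoded_sample)
```

Keeping one lookup table per coded column mirrors the one-worksheet-per-variable layout of the metadata file described in this section.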
The metadata file was difficult to wrangle because it was in .xlsx format and base R has no
built-in function for reading Excel files. After comparing different packages for reading Excel
files, XLConnect was used. The Excel file has more than 40 worksheets containing the metadata
definitions of the column values in the primary data files.
Converting all 40 worksheets into a single data frame would have been difficult because the
dimensions of each sheet differ. Hence a function was developed that reads all the sheets and
stores them as a list of data frames. The worksheet names contained blanks but did not
require any wrangling, as the names were accessed using backticks (``).
All three files share a renamed common column: ‘Accident_Index’. In addition, the vehicle and
casualty files are related through an extra column, ‘vehicle reference’, which was also used to join them.
Merging the files required 15 minutes of processing. Three separate columns were derived from the
Date column in Accidents.csv. Precomputing these columns improves dplyr performance, as
the data size is quite significant.
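The join keys and the date split described above can be sketched as follows. This is an illustrative Python version, not the dplyr code used in the report. The `dd/mm/yyyy` date format, the column spelling `Vehicle_Reference`, the choice of year, month and weekday as the three derived columns, and the sample rows are all assumptions.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def split_date(date_text, fmt="%d/%m/%Y"):
    """Split a date string into (year, month, weekday name)."""
    parsed = datetime.strptime(date_text, fmt)
    return parsed.year, parsed.month, WEEKDAYS[parsed.weekday()]


def join_tables(accidents, vehicles, casualties):
    """Inner-join the three tables.

    Accidents and vehicles are matched on Accident_Index; vehicles and
    casualties on the pair (Accident_Index, Vehicle_Reference).
    """
    accident_by_id = {a["Accident_Index"]: a for a in accidents}
    casualties_by_key = {}
    for c in casualties:
        key = (c["Accident_Index"], c["Vehicle_Reference"])
        casualties_by_key.setdefault(key, []).append(c)

    joined = []
    for v in vehicles:
        accident = accident_by_id.get(v["Accident_Index"])
        if accident is None:
            continue  # vehicle without a matching accident record
        key = (v["Accident_Index"], v["Vehicle_Reference"])
        for c in casualties_by_key.get(key, []):
            row = {**accident, **v, **c}
            row["Year"], row["Month"], row["Weekday"] = split_date(accident["Date"])
            joined.append(row)
    return joined


# Made-up sample rows for illustration only.
accidents = [{"Accident_Index": "A1", "Date": "01/01/2010"}]
vehicles = [
    {"Accident_Index": "A1", "Vehicle_Reference": "1", "Sex_of_Driver": "1"},
    {"Accident_Index": "A1", "Vehicle_Reference": "2", "Sex_of_Driver": "2"},
]
casualties = [{"Accident_Index": "A1", "Vehicle_Reference": "1", "Casualty_Severity": "3"}]
joined_rows = join_tables(accidents, vehicles, casualties)
print(joined_rows)
```

Because this is an inner join, the second vehicle drops out: it has no casualty row with the same vehicle reference.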
Wrangling the supporting files was more difficult because the data were unstructured. The files were in
.ods format and hence required particular attention. They were converted to CSV and then read
into R.
• VEH01: The file has the following structure. Initial rows were skipped.
1. The file was read using XLConnect::loadWorkbook
2. The Date string was cleaned using a regular expression
3. Garbage values (‘…’ or ‘.’) were replaced with 0
4. Rows after 93 columns were removed as the data involved Britain records only
5. Finally, the different vehicle columns were melted to get a single column
• RAS51012: Snapshot of the structure of the file is provided below.
1. The same strategies were used for this file as for the previous one
2. All the data were extracted from each worksheet and then transposed into a single data frame
3. Data was fetched from 2004 to 2015
• SPE0111: Snapshot of the structure of the file is provided below.
1. Wrangling this file was the most challenging of all
2. The same strategies were followed as for the above files
3. The heavy vehicles columns had three sub-columns, which required aggregation and were
converted into a single file
4. Data from all the sheets were merged into a single data frame with the year as one column
5. All the vehicle category columns were then melted into a single column
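Two of the steps listed for the supporting files, replacing placeholder values with 0 and melting the vehicle columns into one, can be sketched in a few lines. This is an illustrative Python version rather than the R code used in the report. The column names, the thousands separators and the sample figures are made up for the example.

```python
PLACEHOLDERS = {"…", "...", "."}  # markers treated as "no value"


def clean_value(text):
    """Turn a cell into a float, mapping placeholder markers to 0."""
    text = text.strip()
    if text in PLACEHOLDERS:
        return 0.0
    return float(text.replace(",", ""))  # drop thousands separators


def melt(rows, id_column, value_columns, var_name="Vehicle_Type", value_name="Count"):
    """Reshape wide rows into long rows: one output row per (id, value column) pair."""
    long_rows = []
    for row in rows:
        for column in value_columns:
            long_rows.append({
                id_column: row[id_column],
                var_name: column,
                value_name: clean_value(row[column]),
            })
    return long_rows


# Hypothetical wide table: one column per vehicle category.
wide = [
    {"Year": "2012", "Cars": "2,000", "Motorcycles": "…"},
    {"Year": "2013", "Cars": "2,250", "Motorcycles": "110"},
]
long_rows = melt(wide, "Year", ["Cars", "Motorcycles"])
for row in long_rows:
    print(row)
```

The long layout gives one row per year and vehicle category, which is the shape most plotting tools expect when colouring or faceting by category.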
Data Checking:
The following data cleaning techniques were used:
1. The categorical values were already factorised
2. Some age bands, with driver ages of 0-6 and 7-11, had quite a high accident
count. Those values were cleaned, since a large number of accidents with drivers
of such ages is implausible. They were replaced with the mode of the
age band distribution
3. There were missing values in all the columns, which were fixed by imputation. Missing
values in categorical data were replaced with the mode of the distribution,
whereas missing quantitative values were replaced with the mean of the distribution
4. A few records had unknown values for features such as Gender, Road
Conditions and Light Conditions. These were also filled by mode imputation
5. The supporting data had missing values, which were imputed with 0 as that would not
affect the overall distribution of the data
6. The percentages in the supporting files were converted to total aggregates
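The mode and mean imputation described in the list above can be sketched as follows. This is a minimal Python illustration, not the author's R code. The function names, the set of markers treated as missing, and the sample values are assumptions.

```python
from collections import Counter
from statistics import mean


def impute_mode(values, missing=(None, "", "Unknown")):
    """Replace missing or unknown categorical values with the most frequent observed value."""
    observed = [v for v in values if v not in missing]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v in missing else v for v in values]


def impute_mean(values):
    """Replace missing (None) numeric values with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Made-up sample columns for illustration.
sex_of_driver = ["Male", "Female", None, "Male", "Unknown"]
engine_capacity = [1000, None, 2000]

print(impute_mode(sex_of_driver))
print(impute_mean(engine_capacity))
```

The implausible age bands from step 2 could be handled the same way, by passing their labels in the `missing` argument so that they are overwritten with the mode.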
Exploration:
The exploration began by analysing the trend in accidents over the years.
The number of accidents has gradually fallen over the years. However, accidents
increased progressively from 2012 to 2014, with a 5% rise in the
accident rate from 2013. Over the decade, the accident rate fell by 26% overall
as of 2014. The number of car sales each year was analysed to explain this
trend. Classified by severity, the total count of serious and fatal crashes
has remained relatively constant over the years, whereas
the number of Slight accidents has fallen by more than 33%. Over the decade, the maximum number of
Serious accidents happened in 2006, while the maximum number of Fatal crashes occurred in 2007. Slight
accidents contribute the most at 87.17%, followed by Serious accidents at 11.81% and Fatal
accidents at 1.02%. Car sales data for the UK was plotted over the years.
The number of car sales per year in the above chart suggests that, owing to the global financial
crisis (recession), sales dropped significantly from 2007 and gradually increased from 2011. In
2012, the UK witnessed a significant surge in car sales, which coincides with the accident trend
over the years. Thus, car sales appear to have significantly impacted UK car accidents.
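The percentage figures quoted in this section (change between years, share by severity) come down to two small calculations. The sketch below shows them in Python. The counts are hypothetical round numbers chosen for illustration, not the report's data.

```python
def percent_change(old, new):
    """Relative change from `old` to `new`, in percent (negative means a decrease)."""
    return (new - old) / old * 100.0


def shares(counts):
    """Share of each category in the total, in percent."""
    total = sum(counts.values())
    return {category: n / total * 100.0 for category, n in counts.items()}


# Hypothetical yearly accident counts and severity counts.
accidents_per_year = {2005: 200, 2013: 140, 2014: 148}
severity_counts = {"Slight": 870, "Serious": 120, "Fatal": 10}

decade_change = percent_change(accidents_per_year[2005], accidents_per_year[2014])
severity_shares = shares(severity_counts)

print(round(decade_change, 2))
print({k: round(v, 2) for k, v in severity_shares.items()})
```

Note that a rise from one year to the next and a fall over the whole decade can coexist, because each figure uses a different base year.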
The maximum number of accidents happens in the district of Birmingham, followed by Leeds,
Manchester and Glasgow. The top 20 cities affected by accidents in the UK are shown in the left
chart. Although Birmingham has the highest number of accidents, its police force ranks lower
in terms of the total number of accidents handled. The Metropolitan Police have dealt with the maximum
number of accidents over the years because they supervise 38 districts positioned around
central London.
The time of the accident was analysed next, by month, day of the week and hour.
The total number of accidents remained more or less the same throughout the
year except in January and December. These two months witnessed more accidents than
the others, possibly because of the long holidays, when many people
take breaks and plan road trips. Thus, holiday seasons appear to have an impact on
the number of accidents in the UK as well.
Next, the days of the week on which accidents occurred are explored. The figure above reveals that
most accidents happen on Friday and the fewest on Sunday. One possible explanation is that many
people return home late on Friday night, after recreation, having been drinking. Drunk driving increases
the rate of accidents, whereas people tend to stay at home more on Sundays. This becomes clearer when
drink-driving data for recent years is explored. The graph below shows hourly drink-drive
data for the days of the week from 2010 to 2016.
Although the data is very recent, it reveals a greater tendency towards drunk driving on Fridays
and Saturdays. The trend suggests that the peak time for drunk driving starts at 6:00 PM
and continues until midnight. Excluding the weekends, the following hourly accident trend is
obtained.
The above graph shows that the accident trend follows office hours. The distribution
is bimodal, with peaks at the start and at the end of the working day. According to media
reports (Refer: http://www.bbc.com/news/uk-38026625), UK residents have an average commute
time of 2 hours, which bolsters the above argument. It may also be inferred that,
owing to work stress, accidents in the evening outnumber accidents in the morning. Thus, office
commuters contribute significantly to the number of accidents in the UK. So which type of accident
causes more casualties, and where does it occur the most?
The index is the number of casualties divided by the number of accidents, expressed as a
percentage. The higher the index, the more severe the category. The index for Fatal and Serious
accidents is relatively high compared to that for Slight accidents. What causes so many casualties
in the first two categories, and why are there so many slight accidents? Factors such as road type,
speed limits, weather, the area of driving, age and sex will be analysed.
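The index defined above can be computed per severity category as sketched below. This Python version is illustrative only. It assumes one input row per casualty, carrying the accident identifier and the accident's severity label, and the sample rows are made up.

```python
from collections import defaultdict


def casualty_index_by_severity(casualty_rows):
    """Casualties divided by distinct accidents, as a percentage, for each severity."""
    accident_ids = defaultdict(set)   # severity -> distinct accident identifiers
    casualty_counts = defaultdict(int)  # severity -> number of casualty rows
    for row in casualty_rows:
        severity = row["Accident_Severity"]
        accident_ids[severity].add(row["Accident_Index"])
        casualty_counts[severity] += 1
    return {
        severity: casualty_counts[severity] / len(accident_ids[severity]) * 100.0
        for severity in casualty_counts
    }


# Made-up sample: one fatal accident with three casualties, two slight accidents with one each.
sample_casualties = [
    {"Accident_Index": "A1", "Accident_Severity": "Fatal"},
    {"Accident_Index": "A1", "Accident_Severity": "Fatal"},
    {"Accident_Index": "A1", "Accident_Severity": "Fatal"},
    {"Accident_Index": "A2", "Accident_Severity": "Slight"},
    {"Accident_Index": "A3", "Accident_Severity": "Slight"},
]
index_by_severity = casualty_index_by_severity(sample_casualties)
print(index_by_severity)
```

An index of 100 means one casualty per accident on average. Values above 100 indicate accidents that typically injure several people.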
Road Conditions:
The above figure shows that the maximum number of accidents occurred on single carriageways,
followed by dual carriageways. However, speed limits played a significant role. On dual
carriageways, 43.05% of crashes happen where the speed limit is 70 mph, whereas on single
carriageways more than 50% of accidents happen at a speed limit of 30 mph and 28%
occur at 60 mph. For single carriageways, the safer speed limit band is 40 to 50 mph. It is
worth noting that the UK government set the speed limit for dual carriageways at
70 mph and for single carriageways at 60 mph in 1977. Given the above findings, this decision
appears to have had an impact on the total number of accidents in the UK.
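The percentages above are shares within each road type: the accidents at a given speed limit divided by all accidents on that road type. A minimal Python sketch of that calculation follows. The column names `Road_Type` and `Speed_limit` and the sample rows are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def share_within_group(rows, group_column, value_column):
    """For each group, the percentage of its rows that fall on each value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[group_column]][row[value_column]] += 1

    result = {}
    for group, value_counts in counts.items():
        total = sum(value_counts.values())
        result[group] = {value: n / total * 100.0 for value, n in value_counts.items()}
    return result


# Made-up sample rows, one per accident.
sample_accidents = [
    {"Road_Type": "Single carriageway", "Speed_limit": 30},
    {"Road_Type": "Single carriageway", "Speed_limit": 30},
    {"Road_Type": "Single carriageway", "Speed_limit": 60},
    {"Road_Type": "Single carriageway", "Speed_limit": 40},
    {"Road_Type": "Dual carriageway", "Speed_limit": 70},
]
speed_shares = share_within_group(sample_accidents, "Road_Type", "Speed_limit")
print(speed_shares)
```

These shares describe where accidents are recorded, not the risk per mile driven. Turning them into a risk estimate would also need traffic volumes for each road type and speed limit.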
Moreover, accidents are most common at T or staggered junctions, followed by areas
within 20 metres of a T junction and crossroads.
The above figure shows that at T junctions 76% of accidents happen where the speed
limit is 30 mph, whereas the figure is 40% for locations within 20 metres of a junction. 30 mph
single carriageway roads near T-junctions are the most affected points in the UK. This also
corroborates the fact that unclassified roads in the UK, which have a speed limit of 30 mph,
experience the maximum number of accidents. The graph below confirms this.
Also, recent research from around the world has revealed that around two-thirds of the crashes in
which people are killed or injured occur on roads with a speed limit of 30 mph or less (refer:
http://www.carsfatal4.com/the-fatal-four/amani/ ). It has been observed that on 30 mph roads in
built-up areas, 45% of car drivers exceed 30 mph and 15% exceed 35 mph. This seriously increases
the risk of fatal injury and crash, by 3.5 to 5.5 times (refer:
https://www.rospa.com/rospaweb/docs/advice-services/road-safety/drivers/inappropriate-speed.pdf )
The above data exploration agrees with the research outcomes. Drivers of all types of vehicle tend
to exceed the speed limit, mostly car drivers, followed by drivers of heavy goods vehicles (HGVs).
Almost 40% of cars and around 38% of HGVs exceed the speed limit
on 30 mph roads. ‘Car’ contributes more than 10% to speeding on 30 mph
roads.
All the above conditions make unclassified roads the most dangerous in the UK.
Vehicles and Driving Manoeuvres:
Vehicles and drivers are explored by driving manoeuvre, age, sex and so on. Examining the data, it
was found that ‘cars’ contribute 76% of the accident count, followed by pedal cycles at 6% and vans
(3.5-ton goods) at 5%. Segregating the accident count by severity, the chart to
the right shows that pedal cycles contribute the most to slight accidents, at 21%. However, fatal
accidents are mostly caused by goods carriers (7.5 tons) at 23%, followed by motorcycles (500 cc)
at 22%. Serious accidents are also mainly caused by motorcycles (500 cc), at 20%.
Thus, the more severe accidents are caused by motorbikes. Exploring the age distribution of
motorbike riders causing accidents in the above histogram on the left, it was found that a huge
number of them are caused by teenagers. The reason could be that in the UK, riders are
initially required to pass only a theory test rather than a practical test to get a two-wheeler
licence. This strategy has probably had the worst repercussions.
Extending the above finding to all vehicles causing accidents gives the yearly trend in the
above right chart. The number of teenagers aged 11 to
15 causing accidents increased substantially from 2011 to 2014, whereas the number aged 16
to 20 has been rising since 2013 after a steep decline over the preceding years. The abrupt decrease
in the number of accidents due to motorbike riders could be attributed to the stringent driving
licence policies enforced over the years, which exclude 125cc motorbikes and pedal
cyclists (Refer: https://bit.ly/2r6Gx4q). As a result, more teenagers are licensing themselves
on bikes below 125cc, thereby gradually increasing their accident count. The graph below
exhibits the same trend.
Consequently, vehicle type alone cannot define the cause of accidents. Exploring the vehicle
manoeuvre, it was found that ‘overtaking’ or ‘going ahead of others’ accounts for the largest share
of accident causes. It is followed by the driving movement ‘turning right’. On analysis, accidents
due to ‘turning right’ were found to be more prevalent on unclassified roads. This could be because
unclassified roads are mostly single carriageways without any central partition, leading to
collisions with oncoming traffic.
Going ahead of others contributes around 46% of the total number of accidents on roads of all speed
limits. Thus, the speed limit does not appear to have any impact on overtaking. On all routes,
drivers have the same tendency to go ahead of others, causing accidents. Exceeding on the offside
on a 15 mph highway is 25%, which is the highest amongst all the sections.
Gender of Driver:
The sex of the driver can play a crucial role in the count of accidents. Over the years, the
trend in the number of male and female drivers has remained the same, with males being the
dominant contributor. Men accounted for more than 63% of UK accidents. The hourly
distribution by gender is explored below:
The first section of the graph (next page) shows the spread of male and female drivers causing an
accident while commuting to work. The distribution follows office hours, which suggests
that the rush starts around 7:00 AM and gradually wanes at around 10:00 AM. The
rush spikes again around 4:00 PM and ends at about 8:00 PM. This suggests that UK
office hours run from 10 AM to 4 PM in most cases. The second section of the graph (next page)
shows the distribution by gender when students/pupils drive themselves to school. The young
girls are much safer drivers than the boys.
The third section reveals something extraordinary. It shows the distribution of accidents by gender
when parents drive their children to and from school. Female drivers are more prone to crashes
than male drivers in this case. To explain this, the first section of the graph is revisited. The
rush for female drivers starts a little later than for male drivers when commuting to work; the
male drivers set off for the office earlier. This could be because women drive their children to
school more often than men and must still reach the office on time, causing accidents
through rash driving. This is a hypothesis drawn from the above trend that needs more research and analysis.
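The hourly breakdown by gender for one journey purpose, as used in this section, amounts to a filtered count. A minimal Python sketch is given below. The column names (`Journey_Purpose`, `Sex_of_Driver`, `Time`), the `HH:MM` time format and the sample rows are assumptions made for illustration.

```python
from collections import Counter


def hourly_counts_by_sex(rows, purpose):
    """Count accidents per (sex, hour of day) for one journey purpose."""
    counts = Counter()
    for row in rows:
        if row["Journey_Purpose"] != purpose:
            continue
        hour = int(row["Time"].split(":")[0])  # "08:45" -> 8
        counts[(row["Sex_of_Driver"], hour)] += 1
    return counts


# Made-up sample rows, one per driver involved in an accident.
sample_drivers = [
    {"Journey_Purpose": "Commuting to/from work", "Sex_of_Driver": "Male", "Time": "07:30"},
    {"Journey_Purpose": "Commuting to/from work", "Sex_of_Driver": "Female", "Time": "08:45"},
    {"Journey_Purpose": "Commuting to/from work", "Sex_of_Driver": "Female", "Time": "08:10"},
    {"Journey_Purpose": "Taking pupil to/from school", "Sex_of_Driver": "Female", "Time": "08:50"},
]
commute_counts = hourly_counts_by_sex(sample_drivers, "Commuting to/from work")
print(commute_counts)
```

Raw counts like these do not account for how many men and women are on the road at each hour, so a comparison of risk between the groups would also need exposure data.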
Light (darkness) and Weather Conditions:
The chart below reveals that 30 mph unclassified roads are the ones most affected by
darkness-related accidents. Around 68% of the accidents in darkness happen where lights are unlit,
whereas 62% of them occur where there is no lighting. Wiltshire is the most affected city with no
lighting, whereas the City of Edinburgh is most impacted by unlit lights.
The graph below represents the impact of different weather conditions on the number of accidents.
It is evident from the chart that ‘Darkness due to no lighting’ and ‘Fine, no high winds’ have
the maximum number of accidents. Contrary to popular belief, ‘Snowing + high winds’
and ‘unlit lights’ do not contribute much to the accident percentage. This could be because people
are reluctant to drive in such snowy weather.
For ‘Snowing and High Winds’ with no lighting, Pembrokeshire is the most accident-prone city
in this category. Analysing the accident distribution over the map reveals that Scotland’s roads are
better lit than those in the rest of Britain. Also, roads connecting London have more unlit lights
than other places, which calls for proper supervision by the area administrators. Moreover, roads
along the coast of the UK do not have lighting, and this requires more investigation.
Conclusion:
From the above analysis, it is evident that Friday evenings experience more accidents than any
other time of the week, although the number of accidents has fallen over the years.
Further, the risk of a crash increases by 70% if the drive is on an unclassified road (30 mph speed
limit). An accident is least likely to occur if the driver maintains a speed of 40-50 mph on either
single or dual carriageway roads.
Besides, drivers should be more careful when turning right on a single carriageway road to
avoid accidents. More stringent practice at driving schools could be a wise way to tackle this
problem. Since the UK has many unclassified roads with a speed limit of 30 mph, it is high time to
introduce more traffic lights near T-junctions for safer driving. To curb accidents on motorbikes,
the UK government should further restrict licensing for 125cc bikes, as their accidents have been
increasing gradually since 2009.
England’s administration needs to supervise street light maintenance more closely on all the roads
connecting the city of London. Lastly, female drivers who drop their children at school on their way
to the office should be more careful. Most accidents were caused by cars where the purpose of
the journey was the daily commute to work. In this regard, the UK must introduce more trains in
Birmingham, Leeds and Manchester to reduce the density of accidents involving private cars.
Reflection:
1. First hands-on experience in data exploration, which allowed me to learn its various aspects
2. Realized how supporting data can be used to correlate trends and draw conclusions
3. Used R thoroughly for wrangling and plotting, and became acquainted with dplyr for
extensive data preparation and analysis
4. An excellent opportunity to explore the data through various charts in Tableau
5. Carried out an in-depth analysis of the time aspects of UK accidents, which helped me
understand the impact of low-granularity data on the study
6. Learned to implement time series plotting of a given dataset
7. Had to follow the road system hierarchy of the United Kingdom in detail
8. The different road aspects of the accidents could have been analysed in a correlation matrix
9. Initially, the gender analysis was not insightful at all. A deep-dive analysis of the same
element over a time series yielded more meaningful insights
10. The exploration took a holistic approach to the United Kingdom as a country. An
interactive analysis of each city/district through rich visualisation would bring
more intelligent insights
11. Only selected features out of the 70+ in the entire dataset were analysed,
owing to a shortage of time and page restrictions
12. Gained confidence to carry out future data exploration on large datasets in personal
projects
Bibliography:
• https://www3.nd.edu/~steve/computing_with_data/24_dplyr/dplyr.html
• http://stat545.com/bit001_dplyr-cheatsheet.html
• https://github.com/tidyverse/dplyr/blob/master/R/colwise-mutate.R
• https://en.wikipedia.org/wiki/Roads_in_the_United_Kingdom
• https://en.wikipedia.org/wiki/Reported_Road_Casualties_Great_Britain
• http://www.sthda.com/english/wiki/ggplot2-quick-correlation-matrix-heatmap-r-software-and-data-visualization
• https://www.statista.com/statistics/633052/share-vehicles-speeds-30-mph-roads-gb/
• https://www.express.co.uk/life-style/cars/790615/car-crash-UK-accidents-most-dangerous-roads-revealed
• https://www.licencebureau.co.uk/wp-content/uploads/road-use-statistics.pdf
• https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/390167/Birmingham_Evidence_Pack__for_publication__FINAL.pdf
• https://www.nomisweb.co.uk/reports/lmp/la/1946157186/report.aspx#tabrespop
• https://www.gov.uk/government/statistics/road-conditions-in-england-2017
• https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/4484/CPR1131-analysis-of-stats-19-data.pdf
• http://www.bbc.co.uk/news/uk-15975564