This document summarizes the process of cleaning and analyzing transportation data from the Quarterly Labour Force Survey to determine the primary modes of transportation used to commute to work across regions in Great Britain. Key steps included removing non-working respondents, simplifying transportation and region categories, and adding an age variable. Cross-tabulations and graphs were produced to show differences in transportation usage between regions and age groups, such as higher car usage in less dense areas and among older commuters. Weighting was applied in some analyses to account for sample sizes.
1. Cobain Schofield 200923027 ENVS450 Assignment1
Page 1
ENVS450 Assignment 1 – Descriptive Statistics
Due: 12:00 24th October 2016
Cross-tabulation to find the mode of transport used to travel to work
The first cross-tabulation was undertaken between two variables of the Quarterly Labour Force
Survey (QLFS) dataset:
- $TravelMode - initially presented as a 10-level factor containing categories of
transportation type used by the individuals surveyed
> levels(qlfs$TravelMode)
[1] "Non-working adult" "Car,van,minibus,works van"
[3] "Motorbike,moped,scooter" "Bicycle"
[5] "Bus,coach,private bus" "Taxi"
[7] "Railway train" "Underground train,light railway,tram"
[9] "Walk" "Other method"
- $GovtRegion - initially presented as a 19-level factor containing different
administrative areas
> levels(qlfs$GovtRegion)
[1] "Tyne and Wear" "Rest of North East"
[3] "Greater Manchester" "Merseyside"
[5] "Rest of North West" "South Yorkshire"
[7] "West Yorkshire" "Rest of Yorkshire & Humberside"
[9] "East Midlands" "West Midlands Metropolitan County"
[11] "Rest of West Midlands" "East of England"
[13] "Inner London" "Outer London"
[15] "South East" "South West"
[17] "Wales" "Strathclyde"
[19] "Rest of Scotland"
The aim of this cross-tabulation was to find the mode of transport used to travel to work by
individuals from all regions of Great Britain.
The first issue arises from $TravelMode, whose level [1] “Non-working adult” is not
applicable to the investigation. Counting the number of instances of each level returns
the following:
> tally(~TravelMode, data=qlfs, margin='col', na.rm=TRUE)
Count
TravelMode All
Non-working adult 34497
Car,van,minibus,works van 10973
Motorbike,moped,scooter 114
Bicycle 461
Bus,coach,private bus 1028
Taxi 49
Railway train 776
Underground train,light railway,tram 372
Walk 1604
Other method 81
Total 49955
This shows that almost 70% of the non-missing responses (34,497 of 49,955) fall outside the
scope of the investigation. The QLFS dataset was then copied to a new dataframe (qlfs_clean),
minus any row containing “Non-working adult”:
> qlfs_clean<-qlfs[qlfs$TravelMode!="Non-working adult",]
> qlfs_clean$TravelMode <- factor(qlfs_clean$TravelMode)
This now meant that all entries in the dataframe were valid candidates for the cross-tabulation.
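As a hedged aside, the filtering step can be sketched on toy data. The example assumes that $TravelMode may contain missing values (the tally above was run with na.rm=TRUE); the data frame `toy` and its values are hypothetical and are not drawn from the QLFS.

```r
# Toy stand-in for qlfs$TravelMode (hypothetical values, one missing response)
toy <- data.frame(
  id = 1:5,
  TravelMode = factor(c("Non-working adult", "Walk", NA, "Bicycle", "Walk"))
)

# A bare `!=` comparison returns NA for missing values, and indexing rows
# with NA produces all-NA rows, so missing cases are excluded explicitly.
keep <- !is.na(toy$TravelMode) & toy$TravelMode != "Non-working adult"
toy_clean <- toy[keep, ]

# droplevels() removes the now-empty "Non-working adult" level
toy_clean$TravelMode <- droplevels(toy_clean$TravelMode)

levels(toy_clean$TravelMode)
nrow(toy_clean)
```

Whether missing responses should be dropped or kept is a judgement for the analyst; the sketch only shows one way to make that choice explicit.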
However, $TravelMode has too many levels for meaningful interpretation, and it can be
simplified by merging similar categories. The following recode() was run, reducing 10
complex levels to 7 simpler ones:
[1] “Non-working adult” --> [x] dropped
[2] “Car,van,minibus,works van” --> [1] “Car/Van”
[3] “Motorbike,moped,scooter” --> [2] “Motorcycle”
[4] “Bicycle” --> [3] “Bicycle”
[5] “Bus,coach,private bus” --> [4] “Bus”
[6] “Taxi” --> [5] “Other method”
[7] “Railway train” --> [6] “Local metro/Tram/Train”
[8] “Underground train,light railway,tram” --> [6] “Local metro/Tram/Train”
[9] “Walk” --> [7] “Walk”
[10] “Other method” --> [5] “Other method”
Most merges above are logical, although merging [6] “Taxi” into [5] “Other method”
may appear strange. The reason is that the count of “Taxi” is so low (49 respondents) that,
when rates are calculated at a later point, taxi usage would appear negligible as a separate
category.
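The mapping above can be sketched in base R by assigning a named list to levels(). This is an illustration only: the report used a recode() call whose exact form is not shown, and the vector `mode` below is a hypothetical sample of responses.

```r
# Hypothetical sample of responses, after "Non-working adult" has been dropped
mode <- factor(c("Taxi", "Railway train", "Walk", "Other method",
                 "Underground train,light railway,tram", "Bicycle"))

# Each list name is a new level; each element lists the old levels merged into it.
# Old levels not mentioned anywhere in the list would become NA.
levels(mode) <- list(
  "Car/Van"                = "Car,van,minibus,works van",
  "Motorcycle"             = "Motorbike,moped,scooter",
  "Bicycle"                = "Bicycle",
  "Bus"                    = "Bus,coach,private bus",
  "Other method"           = c("Taxi", "Other method"),
  "Local metro/Tram/Train" = c("Railway train",
                               "Underground train,light railway,tram"),
  "Walk"                   = "Walk"
)

table(mode)
```

The old level names must match the factor's levels exactly, including capitalisation and spacing, otherwise the affected responses are silently set to NA.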
The next issue is that $GovtRegion contains a mixture of administrative region types. For the
purposes of this cross-tabulation, the 19 categories will be recoded into the 9 principal regions
of Great Britain by running the following recode():
[1] “Tyne and Wear” --> [1] “North East”
[2] “Rest of North East” --> [1] “North East”
[3] “Greater Manchester” --> [2] “North West”
[4] “Merseyside” --> [2] “North West”
[5] “Rest of North West” --> [2] “North West”
[6] “South Yorkshire” --> [1] “North East”
[7] “West Yorkshire” --> [1] “North East”
[8] “Rest of Yorkshire & Humberside” --> [1] “North East”
[9] “East Midlands” --> [3] “Midlands”
[10] “West Midlands Metropolitan County” --> [3] “Midlands”
[11] “Rest of West Midlands” --> [3] “Midlands”
[12] “East of England” --> [4] “East Anglia”
[13] “Inner London” --> [5] “London”
[14] “Outer London” --> [5] “London”
[15] “South East” --> [6] “South East”
[16] “South West” --> [7] “South West”
[17] “Wales” --> [8] “Wales”
[18] “Strathclyde” --> [9] “Scotland”
[19] “Rest of Scotland” --> [9] “Scotland”
Recoding this factor into fewer categories means that it will be easier to interpret tables and
graphs that are produced later. It also means that these tables and graphs will show an
accurate representation of each region because all smaller “child” areas have been combined
into the larger “parent” region. For example, before recoding took place, “Rest of North East”
would have contained data from individuals within the North East, but outside “Tyne and
Wear”, “South Yorkshire”, “West Yorkshire” and “Rest of Yorkshire and Humberside”, leading
to a confusing data output.
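The 19-to-9 region mapping listed above can also be expressed as a lookup vector. The sketch below is one possible base-R approach, not necessarily the recode() call used in the report; the sample vector `region` is hypothetical.

```r
# Old level name = new region, following the mapping listed in the text
region_lookup <- c(
  "Tyne and Wear" = "North East",
  "Rest of North East" = "North East",
  "Greater Manchester" = "North West",
  "Merseyside" = "North West",
  "Rest of North West" = "North West",
  "South Yorkshire" = "North East",
  "West Yorkshire" = "North East",
  "Rest of Yorkshire & Humberside" = "North East",
  "East Midlands" = "Midlands",
  "West Midlands Metropolitan County" = "Midlands",
  "Rest of West Midlands" = "Midlands",
  "East of England" = "East Anglia",
  "Inner London" = "London",
  "Outer London" = "London",
  "South East" = "South East",
  "South West" = "South West",
  "Wales" = "Wales",
  "Strathclyde" = "Scotland",
  "Rest of Scotland" = "Scotland"
)

# Hypothetical sample of original region labels
region <- c("Merseyside", "Inner London", "Outer London",
            "Strathclyde", "South Yorkshire")

# Look up each label, then fix the level order to the 9 regions as listed
region9 <- factor(unname(region_lookup[region]),
                  levels = unname(unique(region_lookup)))

table(region9)
```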
Table 1 shows the cross-tabulation between $TravelMode and $GovtRegion following the
aforementioned data processing. The output is the percentage of individuals surveyed within
each region. The table reveals contrasts that one might expect to find between regions. For
example, Wales shows the highest rate of Car/Van use (80.38%), plausibly owing to its
low-density population and low urbanisation, whereas London shows a far lower rate (33.25%)
because of factors including well-integrated mass transit and discouragement from using
personal vehicles through schemes such as the congestion charge. London is the only region in
Great Britain where public transport and walking combined exceed car and van usage.
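Percentages within each region correspond to column proportions of the cross-tabulation. A minimal base-R sketch, using a hypothetical toy data frame rather than the QLFS, might look like this:

```r
# Hypothetical toy data: 4 respondents in each of two regions
toy <- data.frame(
  Region = c("London", "London", "London", "London",
             "Wales", "Wales", "Wales", "Wales"),
  Mode   = c("Car/Van", "Bus", "Walk", "Bus",
             "Car/Van", "Car/Van", "Car/Van", "Walk")
)

counts <- table(toy$Mode, toy$Region)

# margin = 2 divides each cell by its column (region) total
pct <- round(100 * prop.table(counts, margin = 2), 2)

pct
colSums(pct)
```

Each column of `pct` sums to 100, which is what allows regions with very different sample sizes to be compared side by side.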
Figure 1 shows a graphical representation of the data in Table 1. The data are displayed in a
stacked bar chart to show the differences in transportation usage between regions clearly.
A stacked bar chart was chosen over a clustered bar chart because the data are percentages,
so every bar stacks to the same height (100%), aiding interpretation of the data. The segments
in each bar are stacked in the same order, and the colour scheme is generated using
RColorBrewer. A single-hue palette and a consistent stacking order were used so that the chart
does not disadvantage readers with visual impairments such as colour blindness
(Cynthia et al, 2003).
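A stacked bar chart of this kind can be sketched with base graphics. The figures below are three columns copied from Table 1; `hcl.colors()` with the "Blues 3" palette is used here only as a base-R stand-in for the RColorBrewer palette mentioned above, whose exact name is not given in the report.

```r
# Three regions taken from Table 1 (% of individuals within each region)
pct <- matrix(
  c(74.82, 0.44, 1.91,  8.00, 0.69,  3.24, 10.90,   # North East
    33.25, 0.44, 4.76, 16.28, 0.81, 36.26,  8.20,   # London
    80.38, 0.88, 1.18,  3.39, 1.33,  1.77, 11.06),  # Wales
  nrow = 7,
  dimnames = list(
    c("Car/Van", "Motorcycle", "Bicycle", "Bus",
      "Other method", "Metro/Tram/Train", "Walk"),
    c("North East", "London", "Wales")
  )
)

# Single-hue sequential palette (assumption: stands in for RColorBrewer)
cols <- hcl.colors(nrow(pct), palette = "Blues 3")

# A matrix passed to barplot() with the default beside = FALSE is drawn stacked
mids <- barplot(pct, col = cols, ylab = "% of individuals",
                legend.text = rownames(pct),
                args.legend = list(x = "topright", cex = 0.7))
```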
When processing the data to generate Table 1 and Figure 1, the decision was taken not to
use any re-weighting of the data because the rates of transportation usage are not too
dissimilar to official Department for Transport (2014) usage figures. Given the relatively small
numbers involved in QLFS and their likeness to official statistics, re-weighting is not necessary.

Table 1 – Mode of transport taken to get to work across Great Britain (% of individuals within each region)

Mode of             North   North  Midlands    East  London   South   South   Wales  Scotland  National
transport            East    West            Anglia            East    West                     average
Car/Van             74.82   75.64     79.07   71.64   33.25   72.69   76.76   80.38     70.51     70.53
Motorcycle           0.44    0.55      0.95    0.51    0.44    1.08    1.47    0.88      0.40      0.75
Bicycle              1.91    2.01      2.93    4.18    4.76    3.30    3.95    1.18      2.72      2.99
Bus                  8.00    6.47      4.71    4.57   16.28    3.66    4.34    3.39      9.86      6.81
Other method         0.69    0.67      0.59    0.64    0.81    0.77    0.70    1.33      1.68      0.88
Metro/Tram/Train     3.24    3.24      1.98    8.23   36.26    8.22    1.08    1.77      4.25      7.59
Walk                10.90   11.42      9.77   10.23    8.20   10.29   11.70   11.06     10.58     10.46
Total count          2037    1638      2527    1555    1597    2215    1291     678      1248      1642

Source: Quarterly Labour Force Survey
Figure 1 – Mode of transport taken to get to work across Great Britain
Post-stratification reweighting was nevertheless carried out to evaluate the overall impact of
reweighting on the dataset, and the resulting re-weighted data showed negligible differences
to the raw data, as Figure 2 illustrates.
Figure 2 – A re-weighted version of Figure 1
A potential reason for the lack of difference between weighted and non-weighted data could
be that only valid data was used from the outset, with all non-workers removed from
the dataset before any further data processing started. However, even including such
individuals within the weighting calculation proved to have very little influence on the new rates
computed.
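The idea behind post-stratification weighting can be sketched as follows: each respondent is weighted by the ratio of their stratum's population share to its sample share. The toy data and the population shares below are hypothetical, and the report's actual weighting procedure is not shown, so this is an illustration of the general technique only.

```r
# Hypothetical toy sample: London is over-represented (3 of 4 respondents)
toy <- data.frame(
  Region = c("London", "London", "London", "Wales"),
  Mode   = c("Car/Van", "Car/Van", "Walk", "Walk"),
  stringsAsFactors = FALSE
)

# Hypothetical population shares for the two strata
pop_share <- c(London = 0.5, Wales = 0.5)

# Sample share of each stratum
tab <- table(toy$Region)
sample_share <- setNames(as.numeric(tab) / sum(tab), names(tab))

# Post-stratification weight = population share / sample share
toy$weight <- unname(pop_share[toy$Region] / sample_share[toy$Region])

# Compare unweighted and weighted mode shares (in %)
unweighted <- 100 * prop.table(table(toy$Mode))
weighted   <- 100 * prop.table(xtabs(weight ~ Mode, data = toy))

round(unweighted, 1)
round(weighted, 1)
```

When sample shares are already close to population shares, the weights are all close to 1 and the weighted and unweighted rates barely differ, which is consistent with the negligible change reported for Figure 2.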
Adding a new variable to the initial bivariate table
Following the investigation of the relationship between $TravelMode and $GovtRegion, a
third variable was added, to investigate how $TravelMode differs with $Age within each
$GovtRegion.
> str(qlfs$Age)
atomic [1:84692] 42 43 18 75 69 73 40 42 72 66 ...
> range(qlfs$Age)
[1] 16 99
The above code shows that the QLFS dataset records respondents' ages from 16 to 99 as a
continuous numeric variable. This is obviously a very broad spectrum, and it is not practical
to tabulate every individual age. The QLFS also contains $AgeGroup, which categorises
individuals into pre-defined age categories:
> levels(qlfs_clean$AgeGroup)
[1] "16-29" "30-44" "45-64" "65+"
This shows 4 distinct age groups of broadly similar range, which is still a little too many to
generate a readable cross-tabulation. 3 age groups would produce a table that is easier to read
without an overwhelming amount of numbers to interpret. 3 new age categories were
therefore formed:
- 16 to 34
- 35 to 59
- 60 +
> qlfs_clean$AgeCat <- group.data(qlfs_clean$Age, breaks=c(16,34,59,Inf))
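group.data() is not part of base R and is presumably a helper supplied elsewhere. As a hedged alternative, the same three bands can be sketched with base R's cut(); the vector `age` below is hypothetical test data chosen to sit on the band boundaries.

```r
# Hypothetical ages on the boundaries of the three bands
age <- c(16, 34, 35, 59, 60, 99)

# With the default right = TRUE the intervals are (16,34], (34,59], (59,Inf];
# include.lowest = TRUE closes the first interval so that age 16 is kept.
age_cat <- cut(age,
               breaks = c(16, 34, 59, Inf),
               labels = c("16-34", "35-59", "60+"),
               include.lowest = TRUE)

table(age_cat)
```

Without include.lowest = TRUE, respondents aged exactly 16 would fall outside every interval and be coded NA.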
These 3 new categories capture three different cross-sections of society, which can be broadly
simplified as “Young”, “Middle Aged” and “Nearing retirement/Pensioners”. The new
categories also contain a healthy number of respondents, which the “65+” category of
$AgeGroup did not, as it contained just 474 respondents across Great Britain. By contrast, the
“60+” category of the new $AgeCat variable contains some 1455 respondents nationally.
Table 2 is a 3-way cross-tabulation of the previous variables with the new $AgeCat variable.
For this cross-tabulation, weighting was applied because of the lower numbers of individuals
within each of the 3 age categories. This helps the under-represented 60+ category carry a
weight closer to its share of the population. This group makes up some 17.7% of society
according to 2016 ONS data, but only around 10% of this survey (1,455 of 14,786
respondents). This shortfall could be a result of selection bias when the initial QLFS was run,
or could be a product of the removal of “Non-working adults”, as a considerable proportion of
the 60+ age category will be retired.
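The under-representation described above can be checked with a short calculation. The group counts come from Table 2 and the 17.7% figure from the ONS data cited in the text; treating their ratio as a weight is an illustrative assumption, not necessarily the weighting scheme actually used.

```r
# Respondents per age category (group totals from Table 2)
counts <- c("16-34" = 4161, "35-59" = 9170, "60+" = 1455)

# Share of the cleaned sample in each age category (in %)
sample_share <- counts / sum(counts)
round(100 * sample_share, 1)

# Population share of the 60+ group quoted in the text (ONS, 2016)
pop_share_60 <- 0.177

# Implied weight: population share divided by sample share (about 1.8)
weight_60 <- unname(pop_share_60 / sample_share["60+"])
round(weight_60, 2)
```

A weight of about 1.8 means each 60+ respondent would count for nearly two people if the sample were rebalanced to the population share quoted above.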
When the data in Table 2 is graphed in a 3-way stacked bar chart, it produces the graph in
Figure 3. This shows an increase in car/van usage with each consecutive age group nationally
and in most regions, most prominently in London. This could suggest increased prosperity with
age given the costs of owning and running a vehicle, particularly in the capital. One can also
infer from Figure 3 that although public transport usage is higher in London than in all other
regions, walking and cycling are in most cases less popular modes of transport than in regions
outside the capital. Given that the London congestion charge was a policy option to promote
physical activity (Maibach et al, 2009), the scheme appears not to have been completely
effective. However, car/van usage in London is much lower overall when compared to other
regions.

Table 2 - Mode of transport taken to get to work across Great Britain by age group (% of individuals)

Age    Mode of                   North   North  Midlands    East  London   South   South   Wales  Scotland  National
group  transport                  East    West            Anglia            East    West                     average
16-34  Car/Van                   49.23   46.67     55.63   47.34   14.54   46.47   48.49   59.55     45.34     45.92
       Motorcycle                 0.79    0.74      1.65    0.60    0.66    0.99    1.92    1.98      0.67      1.11
       Bicycle                    2.90    2.09      4.26    6.63    4.63    5.45    2.84    0.70      2.12      3.51
       Bus                       10.21    8.86      7.03    8.41   17.73    6.26    5.56    5.68     12.81      9.17
       Other method               1.59    0.50      1.86    1.21    0.44    2.41    3.89    2.68      3.39      2.00
       Local Metro/Tram/Train     3.51    3.57      2.14    6.16   40.30    7.50    0.00    2.72      5.27      7.91
       Walk                      31.78   37.56     27.44   29.65   21.71   30.92   37.29   26.74     30.41     30.39
       Total group count           612     468       684     404     548     601     295     194       355      4161

35-59  Car/Van                   57.85   60.62     62.49   54.03   28.47   57.64   57.92   63.10     53.99     55.12
       Motorcycle                 0.52    0.64      1.13    0.81    0.53    1.68    1.80    0.65      0.48      0.92
       Bicycle                    2.29    3.09      4.09    5.81    7.43    4.44    6.19    2.05      4.39      4.42
       Bus                        7.11    6.18      4.61    3.09   16.39    2.62    4.36    2.76      8.85      6.22
       Other method               1.88    2.32      1.31    1.91    2.69    1.70    0.91    2.62      3.89      2.14
       Local Metro/Tram/Train     2.62    2.45      1.72    8.02   26.51    7.37    1.14    1.33      3.18      6.04
       Walk                      27.73   24.70     24.66   26.34   17.98   24.54   27.68   27.49     25.22     25.15
       Total group count          1233    1022      1625     944     932    1392     840     399       783      9170

60+    Car/Van                   67.27   60.01     63.58   67.28   33.08   60.39   65.53   50.63     54.43     58.02
       Motorcycle                 0.00    0.87      0.60    0.00    0.00    0.59    1.73    1.33      0.00      0.57
       Bicycle                    3.11    1.83      2.52    2.14    3.10    1.24    4.56    1.14      2.40      2.45
       Bus                       13.19    5.16      3.55    6.33   16.69    5.52    3.68    3.41     12.56      7.79
       Other method               1.49    1.75      2.42    1.37    3.97    2.39    1.75    8.12      9.19      3.61
       Local Metro/Tram/Train     1.52    2.38      0.82    5.10   13.45    4.85    1.78    0.00      0.78      3.41
       Walk                      13.42   28.01     26.52   17.77   29.71   25.02   20.97   35.10     20.64     24.13
       Total group count           192     148       218     207     117     222     156      85       110      1455

Grand total count                 2037    1638      2527    1555    1597    2215    1291     678      1248     14786

Source: Quarterly Labour Force Survey
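A 3-way cross-tabulation of this shape can be sketched in base R with xtabs() and ftable(). The toy data frame below is hypothetical and unweighted; it only illustrates how percentages within each region and age group can be laid out.

```r
# Hypothetical toy data: two regions, two age categories, 8 respondents
toy <- data.frame(
  AgeCat = rep(c("16-34", "60+"), each = 4),
  Region = rep(c("London", "London", "Wales", "Wales"), times = 2),
  Mode   = c("Bus", "Walk", "Car/Van", "Car/Van",
             "Car/Van", "Bus", "Car/Van", "Walk")
)

# Count respondents in every Mode x Region x AgeCat cell
tab3 <- xtabs(~ Mode + Region + AgeCat, data = toy)

# Percentages of Mode within each Region x AgeCat combination
pct3 <- 100 * prop.table(tab3, margin = c(2, 3))

# Flatten to a 2-D layout: age group and mode as rows, regions as columns
ftable(round(pct3, 1), row.vars = c("AgeCat", "Mode"))
```

Any Region and AgeCat combination with no respondents would produce NaN percentages, which is one reason small groups such as 60+ respondents in Wales need careful handling.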
Figure 3 - Mode of transport taken to get to work across Great Britain by age group
A surprising characteristic of the data in Figure 3 is that bus usage outside London appears
to be rather low, particularly within the 60+ age group, a large proportion of whom qualify for
concessionary travel passes for local public transport. This might be put down to selection
bias within the surveyed individuals, but recent Department for Transport data (October
2016) shows an overall decrease in bus usage nationally, with BBC News (2016) reporting that
bus usage has “fallen to the lowest level in a decade”.
Overall, the 3-way graph breaks down the initial data in Table 1 into a more coherent and
useful form. It shows more clearly how different demographics, split by regional location and
age group, interact with different transport modes to travel to work. Table 1 can only go so
far in displaying basic trends, but these trends can be investigated at a much higher
resolution when the data is expanded through a third dimension. However, in doing so, one
must always consider the effects that the size of the dataset has on the final output, due to
the biases, levels of non-response and invalid entries that the dataset may contain.
References
Cynthia, A. (2003) “A transition in Improving Maps: The ColorBrewer Example” in U.S.
Report to the International Cartographic Association, special issue of Cartography and
Geographic Information Science 30(2): 155-158.
BBC News (19 October 2016) “Bus use across England falls to lowest level in decade”.
Published on 19th October 2016, London, UK. Available at:
http://www.bbc.co.uk/news/uk-england-37691160 (Last accessed: 22nd October 2016)
Department for Transport (October 2016) “Journeys on buses per head by local authority”.
Department for Transport. Available at:
https://docs.google.com/spreadsheets/d/1sieTbQ7PEFLJZPBXNV69Hx8lYpCiwBl-6bT2b_z799k/edit#gid=0
(Last accessed: 22nd October 2016)
Department for Transport (2014) “Modal Comparisons (TSGB01)”, Statistical Datasets,
produced by Department for Transport. Available at:
https://www.gov.uk/government/statistical-data-sets/tsgb01-modal-comparisons
(Last accessed: 19th October 2016)
Maibach, E., Steg, L., Anable, J. (2009) Promoting physical activity and reducing climate
change: opportunities to replace short car trips with active transportation. Preventive
Medicine 49, 326-7
Office for National Statistics (2016) “Overview of the UK population: February 2016”,
Population Estimates, produced by the ONS. Available at:
http://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/articles/overviewoftheukpopulation/february2016
(Last accessed: 20th October 2016)