A modelling approach to establish whether or not there is a north-south divide in the UK in terms of home ownership. Data used included UK Census and UK Quarterly Labour Force Survey
Grade: 78%
1. ENVS450 – Assignment 3 200923027
Page 1
England’s North-South Divide: exploring the impact of socio-demographic
variables on the rate of home-ownership by geographic location
Introduction
The “north-south divide” is a widely debated
social phenomenon in the UK, and is often used
to describe the cultural, social, political and
economic differences between the two halves
of the country, with the south generally
determined to be “out-performing” the north.
This has been officially recognised by the
current Conservative government with the
instatement of the Northern Powerhouse
project, which aims to re-balance economic
disparity between the north and south.
However, house prices between the north and
south differ enormously, with the average
house price in the south far-exceeding the
wages of all but the most senior employees,
making the prospect of mortgages completely
unrealistic for most of the workforce.
Several factors were identified which would
realistically impact home ownership across the
whole of the UK. These were taken from a list
of census variables and include: rate of
professional employment, rate of households
without a car, rate of residents aged 65 plus,
rate of illness and location within the UK. Given
the nature of the census dataset, each variable
has 348 instances, existing once for each
district in England and Wales. These variables
were then used to determine whether there is in
fact a clear north-south divide in home
ownership in the UK.
Literature Review
Home-ownership in the United Kingdom is
somewhat hegemonic; there is a widely held
cultural expectation and desire to own a home.
This is despite the average house price
increasing by some 35% in the last 10 years to
over £216,000 (The Land Registry, 2017).
Because of this, home ownership has
decreased in the last 30 years as younger
people are “priced out” of the housing market
(Osborne, 2016). The Office for National
Statistics states that in 1991 36% of 16-24 year
olds owned their own home, falling to 9% in
2014. The 35-44 age group has also seen a
drastic fall from 1991 to 2014, from 78%
ownership to 59%. By contrast, home
ownership amongst older age groups has
increased. However, Osborne (2016) found that
overall, the proportion of ownership has fallen
across every part of the UK since the early
2000s and as of publication, England was
seeing the lowest levels of home ownership in
30 years.
Throughout 2016 there were multiple news
articles published highlighting the deepening
north-south divide in the UK as defined by
house prices (Fraser, 2016; Milligan, 2016;
Shaw, 2016; Lynch, 2015). Research
conducted by ‘e-moov’, an online estate agent,
defined the north-south divide based on a “clear
boundary” which snaked across the Midlands
from Bristol to Norfolk. Along this boundary, the
difference in average house prices is as much
as £160,000 between neighbouring counties (e-
moov, 2016). This disparity has led to the claim
that “house prices may permanently diverge
from earnings” causing increasingly
unaffordable houses (Gregoriou, et al., 2014).
However, given the complexity of the national
social demographic, and the additional
complexity of factors affecting home ownership
rates across the country, the research best
describes the factors as “heterogenic” as they
vary from one part of the country to another
depending on a web of other variables
(Montagnoli & Nagayasu, 2013).
Methodology
The census subset used was taken from the 2011
census – a survey of England and Wales which
determined a resident population of 56.1 million
people (Office for National Statistics, 2011).
The dataset is not raw data, but rather a rate of
variable occurrence within the population of
each of the 348 districts in England and Wales.
An Ordinary Least Squares Regression
analysis was used in alignment with the
standard demographic approach to analyse
only variables that were statistically significant
to the model. The explanatory variables chosen
at the start of the study are as follows:
- Rate of professional employment: $Professionals
- Rate of households which do not own a car: $No_Cars
- Rate of residents aged 65 or more: $Age_65plus
- Rate of illness: $illness
- Location within England and Wales: $NorthMidlandsSouth
The statistical significance of these variables
was not known when they were selected, so
some may be subject to dismissal during the
statistical analysis. These variables were
selected based on sparse literature surrounding
factors concerning home ownership
$Owner_occupied (Montagnoli & Nagayasu,
2013), as well as using empirical reasoning.
All the explanatory variables are continuous,
except for $NorthMidlandsSouth which is
categorical. This variable was created by
grouping districts based on their region in the
UK. The categories are: North, Midlands,
South; with Midlands encompassing the area
along the north-south boundary described by
‘e-moov’ which lacks some clarity. It is hoped
that by the end of the analysis, the Midlands
category will identify more with either North or
South, rather than existing as its own unique
region, as this would suggest that there is
indeed a “north-south divide” when it comes to
home ownership.
Results
Before the main regression analysis can begin,
it is important to gain some understanding of the
relationship between the outcome variable
$Owner_occupied and the continuous
explanatory variables. Each of the 4 continuous
explanatory variables was plotted against
$Owner_occupied, with a second graph
plotted to assess Skewness. Figure 1 shows
the graphs.
From Figure 1, it is evident that none of the
variables except $Age_65plus is normally
distributed. Significance tests based on
Pearson’s correlation assume approximately
normal data, so Spearman’s Rank correlation
coefficient is used instead to establish the
correlation between the variables. The
Spearman’s Rank coefficients (rs) are
included on each graph. The
results from Figure 1 are perhaps not
surprising, apart from that of
$Professionals, which shows weak
correlation between the explanatory and
outcome variables. However, this is not yet
cause for concern as this analysis does not
consider spatial distribution of the districts
within each region at this stage.
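As an illustration of why a rank-based coefficient suits skewed data, the sketch below compares Pearson's and Spearman's coefficients. It uses synthetic, right-skewed values generated for the example (not the census data), and all variable names are hypothetical.

```r
# Synthetic, right-skewed illustration data (NOT the census data).
set.seed(42)
x <- rexp(348, rate = 0.1)             # skewed explanatory variable
y <- 80 - 5 * log1p(x) + rnorm(348)    # monotonic but non-linear link

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Spearman's coefficient is Pearson's coefficient applied to the ranks,
# so it only depends on the ordering of the values.
spearman.by.hand <- cor(rank(x), rank(y))

print(c(pearson = pearson, spearman = spearman))
```

Because Spearman's coefficient works on ranks, a few extreme districts cannot dominate the result in the way they can with Pearson's coefficient.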
Figure 1 – the relationship of each explanatory
variable in relation to the outcome variable, and
their Skewness
Next, a multiple linear regression model was
fitted to establish how much of the variance in
$Owner_occupied is accounted for by the
explanatory variables. The following code was
run in R:
> lm(Owner_occupied ~ No_Cars + Professionals + illness + Age_65plus + NorthMidlandsSouth, data=census)
This model was developed by creating 5
progressively more complex models, each one
incorporating an additional variable from the 5
explanatory variables. Table 1 compares the
models using Akaike’s Information Criterion
(AIC), which balances goodness of fit against
model complexity. For AIC, the smaller the
value, the better the model.
Table 1 – multiple regression model output of
AIC results for each progressive model
Model   Additional Variable     AIC
1       $NorthMidlandsSouth     2582.018
2       $Age_65plus             2381.025
3       $No_Cars                1894.315
4       $Professionals          1889.316
5       $illness                1871.359
Table 1 shows the AIC reducing with each
additional variable, but the reduction in AIC gets
smaller and smaller, particularly between
models 3 and 4 where there is only a 5 point
reduction in AIC. However, the reduction still
contributes to the understanding and outcome
of the model, even if it does increase the
complexity by 20%. Therefore, model 5 will
become the model used for this study.
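The trade-off that AIC makes between fit and complexity can be sketched as follows. The data are synthetic and the helper aic.by.hand() is a hypothetical name introduced for this example. It reproduces the value returned by R's AIC() for a linear model, counting the residual variance as one extra parameter.

```r
# Synthetic illustration data (NOT the census data).
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 + rnorm(n)    # x2 has no real effect on y

small <- lm(y ~ x1)
large <- lm(y ~ x1 + x2)

# AIC = 2k - 2 * log-likelihood, where k counts the regression
# coefficients plus one for the residual variance.
aic.by.hand <- function(fit) {
  k <- length(coef(fit)) + 1
  2 * k - 2 * as.numeric(logLik(fit))
}

print(c(small = AIC(small), large = AIC(large)))
```

Adding a variable can never lower the log-likelihood, so the AIC of the larger model can exceed that of the smaller one by at most 2. A drop in AIC therefore means the extra variable improved the fit by more than its complexity penalty.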
From this model, the coefficients can be
examined to write the fitted model in readable
terms:
> coefficients(model)
The output of the above code is displayed in
Table 2, the data from which was used to write
the fitted model:
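% Home Owners = 67.37 – 0.82 × No_Cars
+ 0.37 × Professionals + 0.64 × illness
+ 0.13 × Age_65plus – 1.17 × [South]
+ 1.94 × [North]
where [South] and [North] equal 1 for districts
in that region and 0 otherwise, with the
Midlands as the reference category.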
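As a worked example of how the fitted model is read, the sketch below applies the rounded coefficients from Table 2 to a single hypothetical district. The input rates are invented for illustration, the function name predictOwnership() is not part of the original analysis, and the Midlands is assumed to be the reference category because Table 2 lists coefficients only for [S] and [N].

```r
# Rounded coefficients copied from Table 2.
coefs <- c(intercept = 67.37, No_Cars = -0.82, Professionals = 0.37,
           illness = 0.64, Age_65plus = 0.13, South = -1.17, North = 1.94)

# Hypothetical helper: predicted % of owner-occupied households.
predictOwnership <- function(No_Cars, Professionals, illness, Age_65plus,
                             region = c("Midlands", "South", "North")) {
  region <- match.arg(region)
  unname(coefs["intercept"] +
           coefs["No_Cars"] * No_Cars +
           coefs["Professionals"] * Professionals +
           coefs["illness"] * illness +
           coefs["Age_65plus"] * Age_65plus +
           (region == "South") * coefs["South"] +
           (region == "North") * coefs["North"])
}

# Invented district: 25% no-car households, 15% professionals,
# 18% illness, 17% aged 65 plus.
north.example    <- predictOwnership(25, 15, 18, 17, "North")
midlands.example <- predictOwnership(25, 15, 18, 17, "Midlands")
print(c(North = north.example, Midlands = midlands.example))
```

For identical socio-demographic rates, the only difference between the regional predictions is the regional coefficient itself (1.94 points between North and Midlands here).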
The R² of the model can also be obtained:
> summary(model)$r.squared
[1] 0.8767051
The R² value of 0.87 suggests that the model
has a good fit. However, a high R² value alone
does not guarantee that the fit is good.
The residuals must be considered to check how
they are distributed about the zero line.
> plot(resid(model))
This outputs the graph in Figure 2, which shows
that the residuals are fairly evenly distributed
throughout the plot, suggesting that there is in
fact a good fit within this model, and that the R²
value of 0.87 can be respected.
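A small constructed example shows why the residuals are worth checking even when R² is high. The data below are invented (a deliberately curved relationship fitted with a straight line) and are unrelated to the census model.

```r
# Invented data: a curved relationship fitted with a straight line.
x <- 1:10
y <- x^2
fit <- lm(y ~ x)

r.squared <- summary(fit)$r.squared
res <- resid(fit)

print(round(r.squared, 3))
print(round(res, 1))

# Plotting the residuals makes the pattern visible at a glance.
plot(res)
abline(h = 0)
```

Here R² is roughly 0.95, yet the residuals are positive at both ends and negative in the middle, which is a systematic pattern rather than an even scatter about zero. An even scatter, as reported for Figure 2, is what supports trusting the R² value.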
Table 2 – coefficients of the fitted linear
regression model of all predictor variables
Variable                    Coefficient
(Intercept)                 67.37
$No_Cars                    -0.82
$Professionals              0.37
$illness                    0.64
$Age_65plus                 0.13
$NorthMidlandsSouth[S]      -1.17
$NorthMidlandsSouth[N]      1.94
This now means that the fitted model is able to
explain 87% of the spatial variation in home
ownership, based on the explanatory variables.
With only one explanatory variable
($NorthMidlandsSouth) used, the model
can only explain about 2.8% of the spatial
variation in home ownership, meaning that the
remaining 4 explanatory variables increase the
variation explained by around 85 percentage
points.
> summary(lm(Owner_occupied ~ NorthMidlandsSouth, data=census))$r.squared
[1] 0.02767543
> 0.02767543*100
[1] 2.767543
The model can also be checked using AIC,
which penalises each additional variable and so
avoids the issue of the R² value automatically
increasing for each variable added to the model.
> AIC(model)
[1] 1871.359
> AIC(lm(Owner_occupied ~ NorthMidlandsSouth, data = census))
[1] 2582.018
This shows that increasing the complexity of the
model reduces the AIC score from 2582 to 1871,
indicating that the additional complexity
improves the model and is worthwhile.
Figure 2 – a residual plot of the fitted model
with R² value of 0.87
To validate the model thus far, the model
residuals were checked to ensure normal
distribution. Figure 3 shows the model residuals
plotted as a graph, displaying very slight
positive skewness. This can be calculated:
> skew(model$residuals)
[1] 0.04953776
The skewness value for this model is calculated
as 0.05, which is so slight that the distribution is
essentially symmetric.
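The skew() function used above is loaded from the author's functions.R, which is not reproduced here. As an assumption, a comparable moment-based skewness can be sketched in base R as below. The function name skewness() and the test vectors are hypothetical, and the exact formula in functions.R may differ slightly (for example in its small-sample correction).

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

symmetric  <- c(-2, -1, 0, 1, 2)     # mirrored about its mean
right.tail <- c(1, 1, 2, 2, 3, 10)   # one large value on the right

print(skewness(symmetric))
print(skewness(right.tail))
```

A value of zero indicates symmetry, positive values indicate a longer right tail and negative values a longer left tail, so a residual skewness close to zero is consistent with an approximately symmetric distribution.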
A ‘QQ’ plot was also generated to check the
skewness of the model. Rodríguez (2016)
states that a QQ plot showing curvature would
indicate skewed distributions. The QQ plot
generated for this model is shown in Figure 4.
The plot is slightly curved at each end, but
broadly follows a straight line across the
majority of the points within the dataset
suggesting only minimal skew. Given the equal
mirrored curvature at each end of the graph in
Figure 4, this largely cancels, resulting in only a
slight overall positive skew, which is what the
graph in Figure 3 and the skew calculations
previously discussed indicate. This gives
reasonable confidence to move on to the next
stage in validating the model which is to check
for constancy in error variance.
Figure 3 – model residuals plotted to show the
skewness of the model. The model is
symmetrically distributed
Figure 4 – a QQ plot of the model showing
minimal positive overall skew
The constancy was checked using a ‘spread-
level’ plot which is displayed in Figure 5.
Figure 5 shows a near-horizontal line of best fit
and no clear curvature in the scatter plot (the
few points in the bottom right are not significant
compared to the bulk of points above). Together,
these two properties indicate a constant error
variance.
Figure 5 – a spread-level plot of the model
residuals showing a near-horizontal line of best
fit suggesting constant error variance
Next, the multicollinearity of the model is tested.
This checks whether the explanatory variables
used in the model are strongly correlated in
combination. To check the multicollinearity of
the model, the following code was run:
> sqrt(mean(car::vif(model)))
[1] 1.417898
This value is well within the safe range
described by Kabacoff (2015), who describes
sqrt(VIF) values greater than 2.0 as concerning.
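To show what a variance inflation factor measures, the sketch below computes it by hand as 1 / (1 - R²), where R² comes from regressing each predictor on the others. The data are synthetic, the names are hypothetical, and only continuous predictors are covered. For a categorical predictor such as $NorthMidlandsSouth, car::vif() reports generalised VIFs instead.

```r
# Synthetic illustration data (NOT the census data).
set.seed(7)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # deliberately correlated with x1
x3 <- rnorm(n)                  # independent of the others
predictors <- data.frame(x1, x2, x3)

# VIF for each predictor: regress it on all the other predictors.
vif.by.hand <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

print(round(sqrt(vif.by.hand), 2))
```

Inspecting sqrt(VIF) for each predictor separately, rather than only an average, makes it easier to see which variable is responsible when collinearity is present.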
Next, the Ordinary Least Squares Regression
assumes that the relationship between each of
the explanatory variables and the outcome
variable is linear. Partial residual plots were
generated to check this, as shown in Figure 6.
Figure 6 – partial residual plots of the model
With the exception of the $Professionals
plot, each plot appears linear, accounting for
limited noise such as that in $Age_65plus.
However, on closer inspection the large
deviation from linearity in $Professionals is
caused by a single outlier (City of London
district) which has a much higher than average
rate of professional workers. This is hardly
surprising, and as it is only a single point, the
red line of best fit follows the expected trajectory
of the green line to the point of deviation. It can
therefore be said that there is no obvious
departure from linearity from any of the
explanatory variables in the model.
Given all of the checks made, the model
appears to be robust and statistically sound.
The model output was then used to conduct a
multivariate principal component analysis in
PAST, shown in Figure 7 and larger in Appendix 3.
Figure 7 – a multivariate principal component analysis of the model, conducted in PAST
Figure 7 shows each of the 348 districts within
the census dataset plotted according to their
residuals on axis 4 and 5 ($Owner_occupied
and $Professionals) of Model 5. It shows
how the districts are seated in relation to each
other and the variables of the model, colour-
coded by the region of the UK to which they
belong, as defined in the model design
($NorthMidlandsSouth).
It is clear from Figure 7 that there is a lot of
overlap amongst the districts from each region,
which is to be expected. The interesting regions
are those which extend away from the central
cluster, as these are the ones that move away
from the “average” and begin to define the wider
region.
Focussing first on the North, the districts are
pulled toward the right and down, suggesting a
greater influence from $illness and
$No_Cars than the other regions. Figure 8A
shows that the North does indeed have the
highest average rate of no-car ownership.
However, Figure 8B shows that the Midlands
has a higher rate of illness, with a peculiar
positive linear relationship between $illness
and home ownership in the South which defies
empirical thought and goes against the
negative relationship of the North and Midlands.
The North also experiences a fairly significant
pull upwards by $Age_65plus, which aligns
with Figure 8D which shows that the North has
the second-highest rate of people aged over 65,
though in Figure 7 a lot of those districts
influenced by a high rate of older people also
see strong influences from professional workers.
Figure 8 – separate regression models for each
explanatory variable against the outcome variable
(A: $No_Cars, B: $illness, C: $Professionals,
D: $Age_65plus)
Next, the South appears to have a much
broader spread than the other two regions, but
with 184 districts, it is exactly twice the size of
the North (largely owing to the higher
population and therefore larger number of
districts). Figure 7 shows that the South has a
large cluster positioned toward the left of the
graph, which appears to traverse the length of
the Y-axis, suggesting strong influences from
an aging population, home ownership and no-
car ownership, with little impact from the rate of
illness and rate of professionals in each district.
Figure 7 is backed up by Figure 8D which
shows that the South has the largest range in
the variable $Age_65plus of any of the
explanatory variables. Figure 8A shows that the
rate of no-car ownership has the largest impact
on the South, while Figure 8C shows – perhaps
counter-intuitively – that increased rate of
professional employees within a district leads to
lower rates of home ownership. However, this
is most likely a result of people living and
working within London districts, where house
prices are high and workers may rent
accommodation as they may be expecting to
move in line with work commitments.
Nevertheless, 8C supports Figure 7’s apparent
lack of influence from $Professionals.
Finally, Figure 7 shows that there is a pull to the
right side of the graph with the Midlands region.
This suggests high rates of illness and high
rates of professionals in the workforce. Figures
8B & 8C support this, showing that the Midlands
has the highest average rates of both illness
and professional workforce.
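The per-region comparison behind Figure 8 can be sketched by fitting a separate simple regression for each region and comparing the slopes. The data below are synthetic, with slopes chosen purely to mimic the pattern described above (a negative slope for the South only). The column names and values are hypothetical and do not come from the census.

```r
# Synthetic illustration data (NOT the census data).
set.seed(3)
regions <- rep(c("North", "Midlands", "South"), each = 50)
professionals <- runif(150, 5, 40)
true.slope <- c(North = 0.4, Midlands = 0.3, South = -0.5)
owner <- 65 + unname(true.slope[regions]) * professionals +
  rnorm(150, sd = 2)
demo <- data.frame(region = regions, professionals, owner)

# One simple regression per region; keep only the slope.
slopes <- sapply(split(demo, demo$region), function(d) {
  coef(lm(owner ~ professionals, data = d))[["professionals"]]
})

print(round(slopes, 2))
```

A difference in the sign of the slope between regions, as in this invented example, is the kind of evidence that suggests the regions respond differently to the same explanatory variable.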
Conclusion
This study has found a number of key points:
- The final model (Model 5 in Table 1) was
statistically significant and the model
validation steps show this.
- There are great disparities between the
North and South across all variables
except $No_Cars, where each of the 3
regions had very well correlated plots,
such as that in Figure 8A.
- The Midlands often aligns very closely with
the North, as Figure 8 shows well. Figure
7 displays a lot of overlap between the
Midlands and the North – much more so
than any region does with the South.
- Home ownership in the South is impacted
differently to what the literature suggests,
and what may be reasonably expected –
for example: higher rate of professionals in
a district = lower rate of home ownership.
This indicates that the South is subject to
different pressures and factors than the
North when it comes to home ownership.
Given that the Midlands aligns so closely with
the North rather than the South, it would make
sense to group the two regions together, as the
report by ‘e-moov’ did. This would not only
serve to increase the statistical significance of
the North (by giving this region a similar number
of districts to the South), but it would also make
logical sense given that this study has found
the districts classified as “Midlands” to actually
be statistically similar to those classified as
“North”.
What this essentially means is that there is
indeed a north-south property divide in England
and Wales. The boundary is clear: from Bristol
in the west, across Warwickshire and
Gloucestershire, across to Leicestershire and
Norfolk in the east – as stated by ‘e-moov’ in
their 2016 report.
Obviously, this study did not look at the spatial
economics of homeownership in a statistical
sense, though empirically it is understood and
respected that property costs more in the
South. This study instead focused on just a few
social variables from the 2011 census in order
to draw this conclusion. A future study would
benefit from greater complexity in the modelling
(increased number of social variables) as well
as incorporating some economic factors to
paint a much clearer picture of the wider issues
that we as a nation face when it comes to home
ownership and house prices.
References
e-moov, 2016. The North-South Property
Divide Defined, Brentwood: e-moov.
Fraser, I., 2016. This map shows just how stark
the north-south property divide is. The
Telegraph, 30 November.
Gregoriou, A., Kontonikas, A. & Montagnoli, A.,
2014. Aggregate and regional house price to
earnings ratio dynamics in the UK. Urban
Studies, 51(13), pp. 2916-2927.
Kabacoff, R., 2015. R in Action: Data Analysis
and Graphics with R. 2nd ed. Greenwich, CT:
Manning.
Land Registry, 2017. House Price Index for
United Kingdom; January 2006 to January
2016. [Online]
Available at:
http://landregistry.data.gov.uk/app/ukhpi/explore
[Accessed 1 January 2017].
Lynch, R., 2015. North-South divide in house
prices is highest ever. The Independent, 30
December.
Milligan, B., 2016. North-South house price
divide hits record high. BBC News Business, 1
April.
Montagnoli, A. & Nagayasu, J., 2013. An
Investigation of Housing Affordability in the UK
Regions, Glasgow: Scottish Institute for
Research in Economics.
Office for National Statistics, 2011. 2011
Census: Key Statistics for England and Wales,
March 2011. [Online]
Available at:
https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/bulletins/2011censuskeystatisticsforenglandandwales/2012-12-11#key-points
[Accessed 26 December 2016].
Office for National Statistics, 2016. UK
Perspectives 2016: Housing and
homeownership in the UK. [Online]
Available at: http://visual.ons.gov.uk/uk-perspectives-2016-housing-and-home-ownership-in-the-uk/
[Accessed 25 December 2016].
Osborne, H., 2016. Home ownership in
England at lowest level in 30 years as housing
crisis grows. The Guardian, 2 August.
Rodríguez, G., 2016. Generalized Linear
Models. [Online]
Available at:
http://data.princeton.edu/wws509/notes/c2s9.html
[Accessed 1 January 2017].
Shaw, V., 2016. Buyers vs sellers – the new
north-south divide on house prices. Mirror, 17
October.
Appendix 1 :: R Code
###########################################################
#### Assignment 3 ####
### "England’s North-South Divide: exploring the impact of
### socio-demographic variables on the rate of home-ownership
### by geographic location"
###########################################################
## Load Libraries & Data ##
source("functions.R")
load(file = "2011 Census.RData")
load(file = "QLFS.RData")
library(plyr)
load.package("mosaic")
load.package("reshape2")
load.package("ggplot2")
load.package("car")
load.package("scales")
load.package("MASS")
load.package("pls")
###########################################################
## Dataset to be 2011 Census ##
## Output variable chosen to be $Owner_occupied ##
## Explanatory vars to be $Professionals, $Age_65plus, $No_Cars ##
## and $illness ##
## (explanatory vars chosen from literature and logic) ##
# Insert the vars into an array for later use #
explan.vars <- c("Professionals","Age_65plus","illness","No_Cars")
###########################################################
### Explore the relationship between outcome and explan. vars ###
## Function to generate a scatter plot with best fit line ##
generateXbyY <- function(inputX, inputY){
  # Scatter of inputX against inputY with a fitted linear trend line #
  return(ggplot(data=census, aes_string(x=inputX, y=inputY)) +
    geom_point() +
    geom_smooth(method = "lm", fullrange=TRUE) +
    theme_bw() +
    geom_vline(xintercept = 0) +
    geom_hline(yintercept = 0) +
    theme(axis.line = element_line(colour = "black"),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      panel.border = element_blank(),
      panel.background = element_blank()))
}
# Output a scatter for each explan. var #
# (ggplot objects must be printed explicitly inside a loop) #
for (i in explan.vars) {
  print(generateXbyY("Owner_occupied", i))
}
## Function to generate skew plots for explan. vars ##
generateSkew <- function(inputY) {
  # inputY is a column name held as a string, so map it with aes_string #
  return(ggplot(data=census, aes_string(x=inputY)) +
    geom_histogram(binwidth=1))
}
# Output a skew graph for each explan. var #
for (i in explan.vars) {
  print(generateSkew(i))
}
###########################################################
## Up to now, all vars appear to be suitable, though the ##
## Skewness indicates that Spearman's Rank must be used ##
## in place of Pearson's correlation coefficient ##
###########################################################
### Build the model; assess statistical significance ###
## Start with 1 var ($NorthMidlandsSouth - spatial) ##
## and then add consecutive explan. vars ##
model1 <- lm(Owner_occupied ~ NorthMidlandsSouth, data=census)
model2 <- lm(Owner_occupied ~ NorthMidlandsSouth + Age_65plus, data=census)
model3 <- lm(Owner_occupied ~ NorthMidlandsSouth + Age_65plus + No_Cars,
data=census)
model4 <- lm(Owner_occupied ~ NorthMidlandsSouth + Age_65plus + No_Cars +
Professionals, data=census)
model5 <- lm(Owner_occupied ~ NorthMidlandsSouth + Age_65plus + No_Cars +
Professionals + illness, data=census)
# Check whether there is stat. sig. between each model #
anova(model1,model2,model3,model4,model5)
# Check that AIC is reducing from one model to the next #
AIC (model1,model2,model3,model4,model5)
###########################################################
## Everything looks good, so adopt the most complex model ##
## as --the-- model for the study ##
model <- model5
###########################################################
### Begin the validation of the model - check it is ###
### actually statistically significant ###
## Output the coefficents and obtain R-sq. value ##
coefficients(model)
summary(model)$r.squared
#-- R-sq. = 0.87 #
## Don't take R-sq. at face value - check residuals are ##
## distributed evenly! ##
plot(resid(model))
abline(0,0)
#-- Residuals appear evenly distributed about the horizon #
#-- all seems good so far #
summary(model)$r.squared*100
#-- 87.67 #
#-- this means that >87% of the variation is explained by the model #
summary(model1)$r.squared*100
#-- 2.77 #
#-- model1() only explains 2.77% of variation - model() is much better! #
AIC(model)
#-- 1871.359 #
AIC(model1)
#-- 2582.018 #
#-- This shows a great reduction from model1() to model() meaning #
#-- that the additional model complexity is worthwhile #
## Next, the skewness of the model can be checked. Given that the checks ##
## made up to now indicate a good model, skewness should be limited at most ##
skew(model$residuals)
#-- 0.049.. - basically negligible; points to symmetric distribution #
## Next generate a QQ-plot to check skewness ##
#-- expect to find good fit to line given skew value of 0.05 #
# stat_qq() and stat_qq_line() draw the points and the reference line #
qq.data <- data.frame(stdresid = rstandard(model))
p2 <- ggplot(qq.data, aes(sample = stdresid)) +
  stat_qq(na.rm = TRUE) +
  stat_qq_line(na.rm = TRUE) +
  xlab("Theoretical Quantiles") + ylab("Standardized Residuals")
p2 <- p2 + ggtitle("Normal Q-Q") + theme_bw()
p2
#-- Indeed, all but a few points appear to follow the line #
## Spread-level plot to show residual fit when studentized ##
car::spreadLevelPlot(model)
#-- the near-horizontal line of best fit is good as it suggests #
#-- constant error variance #
#-- also the lack of curvature in the scatter indicates good #
#-- distribution of residuals #
## Multicollinearity ##
# Check whether the explan. vars are strongly correlated in combination #
sqrt(mean(car::vif(model)))
#-- 1.41 ... this is good according to Kabacoff(2015 -- see References) #
#-- Kabacoff says >2.0 is concerning #
## OLS expects linear relationship of explan. vs. outcome ##
# Plot partial residual plots to check this #
car::crPlots(model)
#-- good, as no obvious departure from linearity in any explan. var #
###########################################################
## Model appears good; export data to PAST and plot fixed slopes ##
## to check each var in relation to each region simultaneously ##
###########################################################
## Function to generate fixed slope scatters of all regions for any ##
## explan. var ##
generateFixedSlope <- function(inputX) {
ggplot(data=census) +
geom_point( aes_string(x=inputX, y="Owner_occupied",
colour="NorthMidlandsSouth") ) +
geom_smooth(method = "lm", se = FALSE, aes_string(x=inputX,
y="Owner_occupied", colour="NorthMidlandsSouth")) +
theme_bw() +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 20) +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
}
# Output a fixed slope for each explan. var #
for (i in explan.vars) {
  # print() is needed for ggplot objects created inside a loop #
  print(generateFixedSlope(i))
}
###########################################################