Statistics for Geography and
Environmental Science:
an introductory lecture course
By Richard Harris, with material
by Claire Jarvis
USA: http://amzn.to/rNBWd5
UK: http://amzn.to/tZ7fVu
Based on the textbook
Copyright notice
Statistics for Geography and Environmental Science:
an introductory lecture course, © Richard
Harris, 2011.
This course is available at www.social-statistics.org
and contains extracts from the publication Statistics
for Geography and Environmental Science by
Richard Harris and Claire Jarvis (Prentice Hall, 2011)
You are free to modify these slides for the purpose of
non-commercial teaching only, subject to the
following restrictions:
– This work, or any derivative of it, may not be stored or
  redistributed in any form, paper or electronic, other than to
  be available to students for their learning and
  education, with access to the material restricted to the
  institution to which those students belong.
– Any derivative must retain this copyright in full and at the
  beginning of the work. The words 'Based on' may be
  inserted in the first paragraph.
– Permission to waive or modify these restrictions may be
  sought from the author (Richard Harris, School of
  Geographical Sciences, University of Bristol).
The modules

OVERVIEW
The modules

Module 1 makes the case for knowing
about statistics as a transferable skill
and for being equipped to take part in
social and political debate.
Module 2 is about using descriptive
statistics and simple graphical
techniques to explore and make
sense of data.
Module 3 discusses the Normal
curve, the properties of which
provide the basis for inferential
statistics.
The modules

Module 4 is about the principles of
research design and effective data
collection.
Module 5 moves from description to
inference, using samples to learn
about populations.
Module 6 discusses the role of
hypothesis testing.
Module 7 is about regression
analysis.
The modules

Module 8 moves to modelling point
patterns, "hotspot analysis" and ways
of measuring patterns of spatial
autocorrelation in data.
Module 9 looks at spatial regression
models, geographically weighted
regression and multilevel modelling.
Each module is explored more fully
in the accompanying
textbook, Statistics for Geography
and Environmental Science.
Module 1
(Extracts from Chapter 1 of Statistics for Geography
and Environmental Science)

DATA, STATISTICS AND
GEOGRAPHY
Module overview

To convince you that studying
statistics is a good idea!
Our argument is that data collection
and analysis are central to the
functioning of contemporary society,
so knowledge of quantitative
methods is a necessary skill to
contribute to social and scientific
debate.
About statistics

Statistics are a reflective practice: a
way of approaching research that
requires a clear and manageable
research question to be formulated, a
means to answer that question,
knowledge of the assumptions of
each test used, an understanding of
the consequences of violating those
assumptions, and awareness of the
researcher's own prejudices when
doing the research.
Some reasons to study statistics

Reasons for human geographers
 – Data collection and analysis are central
   to the functioning of society, to systems
   of governance and science.
 – Knowledge of statistics is an entry into
   debate, informed critique and the
   possibility of creating change.
Some reasons to study statistics

Reasons for GI scientists
 – To address the uncertainties and
   ambiguities of using data analytically.
 – Because of the increased integration of
   mapping capabilities, data visualizations
   and (geo-) statistical analysis.
Some reasons to study statistics

Reasons for all students
 – They provide a transferable skill set
   used in other areas of research, study
   and employment.
 – There is a recognised shortage of
   students with skills in quantitative
   methods, especially within the social
   sciences.
Types of statistic

Descriptive
– Used to provide a summary of a set of
  measurements, e.g. the average.
Inferential
– Use the data at hand to convey information
  about the population ('the greater
  something') from which the data are drawn.
Relational
– Consider whether greater or lesser values
  in one set of data are related to greater or
  lesser values in another.
Geographical data

These are records of what has
happened at some location on the
Earth's surface and where.
For many statistical tests the where
is largely ignored.
However, it is central to geostatistics
and to spatial statistics (as their
names suggest).
Some problems when analysing
      geographical data

Standard statistical tests assume that
each 'bit' of data (each observation)
has a value that is not influenced by
any other.
However, we may often expect there
to be geographical patterns in the
data.
– Spatial autocorrelation: geographical
  patterns in the measurements
Some problems when analysing
      geographical data

Determining what causes what in a
complex and dynamic natural or
social system is extremely tricky.
Two things may be associated (e.g.
greater income inequality and more
non-recycled waste) without the one
directly causing the other.
Some problems when analysing
      geographical data

Data and structured forms of enquiry
can only tell us so much and may not
be appropriate to some types of
research for which a more
qualitative, participatory or less
representational approach may be
better.
Further reading

Chapter 1 of Statistics for
Geography and Environmental
Science by Richard Harris and Claire
Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following
key concepts: types of statistics;
why error is unavoidable;
geographical data analysis; and
spatial autocorrelation and the first
law of geography.
Module 2
(Extracts from Chapter 2 of Statistics for Geography
and Environmental Science)

DESCRIPTIVE STATISTICS
Module overview

This module is about "everyday
statistics", the sort that summarise
data and describe them in simple
ways.
They include the number of home
runs this season, average male
earnings, numbers unemployed,
outside temperature, average cost of
a barrel of oil, regional variations in
crime rates, pollution statistics,
measures of the economy and other
"facts and figures".
Data and variables

Data
– A collection of observations:
  measurements made of something.
A variable
– Another name for a collection of data.
  Variable because it is unlikely that the
  data are all the same.
Data types
– These include
  discrete, continuous, and categorical
  data.
Simple ways of presenting data

Discrete data: frequency table; bar chart (below)
Continuous data: summary table; histogram (below, with a rug plot)
Frequency and summary tables
Information to include
         in a summary table

Measures of central tendency
("averages")
– The mean and/or median
   •   The "centre" of the data
Measures of spread and variation
– The range (minimum to maximum)
– The interquartile range (the 'mid-
  spread' of the data)
– The standard deviation, s
More about the standard deviation

 Essentially a measure of average
 variation around the mean.
 It is also the square root of the
 variance.
 The variance is the sum of squares
 divided by the degrees of freedom.
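As an illustration of those definitions (a sketch with made-up numbers, not from the textbook):

  data = [3.1, 4.7, 5.0, 6.2, 7.4]          # hypothetical measurements
  n = len(data)
  mean = sum(data) / n
  ss = sum((x - mean) ** 2 for x in data)   # sum of squares about the mean
  variance = ss / (n - 1)                   # sum of squares / degrees of freedom
  sd = variance ** 0.5                      # standard deviation = square root of the variance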
Boxplots

Are useful for showing the median,
interquartile range and range of a set
of data, for identifying outliers and
also for comparing variables.
Other ways of classifying numeric
              data

 Nominal, ordinal, interval and ratio
 Counts and rates
 Proportions and percentages
 Parametric and non-parametric
 Arithmetic and geometric
 Primary and secondary
Further reading

Chapter 2 of Statistics for Geography
and Environmental Science by Richard
Harris and Claire Jarvis (Prentice Hall /
Pearson, 2011)
Includes a review of the following key
concepts: data and variables; discrete
and continuous data; the range;
histograms, rug plots, and stem and
leaf plots; measures of central
tendency; why averages can be
misleading; quantiles; the sum of
squares; degrees of freedom; the
standard deviation and the variance;
box plots; and five and six number
summaries
Module 3
(Extracts from Chapter 3 of Statistics for Geography
and Environmental Science)

THE NORMAL CURVE
Module overview

This module introduces the normal
curve, so called because it describes
how many social and scientific data
appear to be distributed.
The normal curve

It is also known as
the Gaussian
distribution and is
often described as
'bell-shaped'.
It is a family of
distributions all of
which have the
same probability
density function
(the same formula
defining their
shape).
The central limit theorem

The central limit theorem states that
the sum (and therefore average) of a
large number of independent and
identically distributed random
variables will approach a normal
distribution as the sample size
increases, even if the variables are
not themselves normally distributed.
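A minimal simulation sketches the idea, using only Python's standard library and an arbitrary (non-normal) uniform population:

  import random

  random.seed(1)
  sample_means = []
  for _ in range(10_000):
      sample = [random.uniform(0, 1) for _ in range(50)]   # non-normal population
      sample_means.append(sum(sample) / len(sample))
  # A histogram of sample_means would look approximately normal,
  # centred near the population mean of 0.5.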
Properties of a normal curve

Ranges from
negative to positive
infinity
Is symmetrical
around its mean
95% of the area
under the curve is
within 1.96
standard
deviations of the
mean
99% of the area is
within 2.58
standard
deviations.
Properties of a normal curve

Consequently, if a
data set is
approximately
Normal, the
probability of
selecting, at random,
an observation
that is within 1.96
standard deviations
of the mean is p =
0.95, and the
probability it will be
within 2.58 standard
deviations is p
=0.99.
Standardising data (z values)

Data are
standardised if their
original
measurement units
are replaced with
units of standard
deviation from the
mean (z values).
It is a little like
converting a
proportion (0 to 1) to
a percentage (0 to
100): it doesn't
change the shape of
the data.
Standardising data (z values)

The z values are calculated by
subtracting the mean of the data from
each observation and then dividing by
the standard deviation.
Once data are standardised and
assuming they are approximately
normal then they can be compared
against the Standard Normal curve.
This is a special instance of a normal
curve that has a mean of zero and a
standard deviation of one.
It provides a model or benchmark for
the data.
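A sketch of the calculation, assuming some hypothetical data:

  data = [12.0, 15.5, 9.8, 14.2, 11.1]                       # hypothetical values
  mean = sum(data) / len(data)
  sd = (sum((x - mean) ** 2 for x in data) / (len(data) - 1)) ** 0.5
  z = [(x - mean) / sd for x in data]                        # standardised values
  # z now has mean 0 and standard deviation 1; only the units change,
  # not the shape of the distribution.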
Probability and the Standard
            Normal

The area between two z values
(under the Standard Normal) is the
probability of selecting an
observation randomly from the data
that will have a z value between
those two values.
That area can be determined using a
statistical table or equivalent.
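In place of a printed statistical table, a cumulative distribution function can be used; a sketch assuming SciPy is available:

  from scipy.stats import norm

  # Probability of a z value falling between -1.96 and +1.96
  p = norm.cdf(1.96) - norm.cdf(-1.96)
  print(round(p, 3))    # approximately 0.95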
Probability and the Standard
            Normal

See the worked
examples on pp.
62–70 of
Statistics for
Geography and
Environmental
Science
Some data are skewed but can
  often be transformed to
   approximate normality
The quantile plot

Useful (and better
than a histogram)
to check for non-
normality, such as
skew and the
presence of
outliers.
If the data were
normal they'd be
distributed along
the straight line.
Further reading

Chapter 3 of Statistics for Geography
and Environmental Science by Richard
Harris and Claire Jarvis (Prentice Hall /
Pearson, 2011)
Includes a review of the following key
concepts: properties of normal curves;
the central limit theorem; probability
and the normal curve; finding the area
under the normal curve; skewed data
and the 'ladder of transformation';
moments of a distribution; and the
quantile plot.
Module 4
(Extracts from Chapter 4 of Statistics for Geography
and Environmental Science)

SAMPLING
Module overview

It is rarely possible or necessary to
collect all possible data about
something that is being studied.
This module is about how to go
about collecting a sample of data that
is fit for a particular research task.
Sampling

It is common in geographical and
other research to gather a sample
(or subset) of data from a target
population.
The aim is for the sample to be
representative of that population.
Sampling bias occurs when the
sample favours some parts of the
target population more than
others, perhaps by sampling at an
unrepresentative time or place or
because of the data collection
method used.
The process of sampling

Define the research question
Review the related literature
Review the scope of the planned
study
Construct a sample frame
Select a sample design method
Review the design from
practical, ethical, safety and logistical
perspectives
Implement the design and collect the data.
Sampling methods

Non-probabilistic sampling methods
– Judgemental
– Convenience
– Quota
– Snowball
Probabilistic sampling methods
– Simple random
– Systematic
– Clustered random
– Stratified random
Sampling methods

The different methods are outlined on pp.
94-105 of Statistics for Geography and
Environmental Science.
In general, random sampling methods are
preferred because the errors in the data
should be random too.
However, a random sample won't
necessarily offer a wide enough coverage
of the target population.
Therefore stratified samples may be used
which may themselves target
specific, representative places to reduce
the cost and ease the logistics of the data
collection.
Sampling error and sample size

The impression that is formed of the target
population depends on the sample of data
taken to represent it.
It is possible that a random sample
accidentally misrepresents the population
if it happens only to observe its most
unusual occurrences: it is susceptible to
sampling error.
The larger the sample (the more
observations there are) the smaller the
error is expected to be, but with
'diminishing returns'
– the error is generally inversely proportional
  to the square root of the sample size
Sampling error and sample size

The error is also a function of how
much the target population varies
– If it were exactly the same, everywhere, it
  wouldn't matter where the samples were
  taken
A larger sample is costly and more time
consuming to collect.
However, a small sample of a highly
variable population is unlikely to
generate any statistically meaningful
analytical results.
Sampling methods: issues and
        practicalities

Personal safety, gaining permission
from an ethics committee, what to do
about missing data.
Practical considerations
– Weight and/or volume of the
  sample, import/export
  restrictions, analytical costs
Instrument accuracy and scale
Bottom line: if your sample is no
good, your analysis won't be any
good either.
Further reading

Chapter 4 of Statistics for Geography
and Environmental Science by Richard
Harris and Claire Jarvis (Prentice Hall /
Pearson, 2011)
Includes a review of the following key
concepts: the target population;
representative samples; sampling
frames; sampling bias; metadata;
fitness for purpose and use; sample
design; sampling error and sample size;
sample size and replicates; and
measurement accuracy.
Module 5
(Extracts from Chapter 5 of Statistics for Geography
and Environmental Science)

FROM DESCRIPTION TO
INFERENCE
Module overview

This module is about inference.
Inference is at the heart of how and
why statistics developed.
It moves beyond simply summarising
data (the sample) to using those
summaries to gain insights into the
underlying system, process or
structure that the data are
measurements of (the population).
A population

Its meaning isn't restricted to "everyone
who lives in a particular place" but can
be much more abstract.
– "Every possible object (or entity) from which
  the sample is selected."
– "The complete set of all possible
  measurements that might hypothetically be
  recorded."
Informally: the complete 'thing' that you
are interested in studying but which can't
be measured in its entirety.
Each sample changes our
impression of the population
The sample mean and
      the population mean

Assuming the sample is
representative (unbiased), it is
possible to estimate the true mean of
the population from the mean for the
sample.
– The population mean from the sample
  mean
But that estimate is sample
dependent
– Change the sample and you get a
  different estimate
Confidence intervals

It is improbable that the sample
mean is exactly equal to the
population mean
– And we wouldn't know even if it was.
   •   (unless we sampled the population in its
       entirety, in which case there'd be no need to
       make an estimate!)
However, we can place a confidence
interval around the sample mean and
estimate the probability that the
confidence interval contains the
population mean.
Confidence intervals
The width of a confidence interval

 The confidence interval is wider
  – The greater the probability you want it to
    contain the unknown population mean
  – The more variable the data are (the
    greater their variance / standard
    deviation)
  – The less data you have
The standard error (of the mean)

The standard deviation of the data
divided by the square root of the
number of observations gives an
estimate of the standard error (of
the mean) and is a measure of
uncertainty in the data
 – The greater the standard error, the
   greater the uncertainty
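A sketch of that calculation with hypothetical data:

  data = [4.2, 5.1, 3.8, 4.9, 5.6, 4.4]                       # hypothetical sample
  n = len(data)
  mean = sum(data) / n
  sd = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
  se = sd / n ** 0.5      # standard error of the mean: sd divided by the square root of n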
Why confidence intervals ‘work’

In principle, if a
population were
sampled a very large
number of times, the
sample mean
calculated in each
case and those means
then collected together
to form a new
variable, we'd find that
variable to be normally
distributed, centred on
the population mean
and with a standard
deviation equal to the
standard error of the
mean.
Small samples

For small samples the
confidence interval will
be underestimated if it is
calculated with reference
to a normal distribution.
A t-distribution is used
instead.
This is 'fatter' than the
normal.
Intuitively: we are more
cautious with small
samples that contain little
information. The
confidence intervals are
widened to reflect that
caution.
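A sketch of a 95% confidence interval for a small sample, using the t-distribution rather than the normal (SciPy assumed available, data hypothetical):

  from scipy.stats import t

  data = [4.2, 5.1, 3.8, 4.9, 5.6, 4.4]
  n = len(data)
  mean = sum(data) / n
  sd = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
  se = sd / n ** 0.5
  t_crit = t.ppf(0.975, df=n - 1)           # wider than 1.96 for small n
  ci = (mean - t_crit * se, mean + t_crit * se)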
Summary

Mean of the sample: known.
Standard deviation of the sample: known.
Standard deviation of the population: unknown but approximated by
the standard deviation of the sample.
Standard error of the mean: estimated as the standard deviation of
the sample divided by the square root of the sample size.
Mean of the population: unknown but we can estimate the probability
that it has a value that lies within a given number of standard errors
either side of the sample mean (within a given confidence interval).
Further reading

Chapter 5 of Statistics for
Geography and Environmental
Science by Richard Harris and Claire
Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following
key concepts: inference, samples
and populations; the distribution of
the sample means; standard error of
the mean; confidence intervals; the t-
distribution and confidence intervals
for 'small samples'.
Module 6
(Extracts from Chapter 6 of Statistics for Geography
and Environmental Science)

HYPOTHESIS TESTING
Module overview

This module introduces hypothesis
testing as a way of formally
questioning whether a population
mean could plausibly be equal to a
hypothesised value, and of
considering whether two or more
samples of data were most probably
drawn from the same population.
The process of hypothesis testing

 Define the null hypothesis
 Define the alternative hypothesis
 Specify an alpha value
  – The maximum probability of rejecting the
    null hypothesis when it is, in fact, correct.
 Calculate the test statistic
 Compare the test statistic with a critical
 value
 Reject the null hypothesis if the test
 statistic has greater magnitude than the
 critical value
One-sample t test

The one-sample t test measures the
number of standard errors the sample
mean is from a hypothesised value.
The further it is, the less probable it is
that the sample is drawn from a population
with a mean equal to that hypothesised value.
The p value records that probability.
A p value of 0.05 or less means we can
be (at least) "95% confident" that the "true
mean" (the population mean) for whatever
has been measured is not the
hypothesised value.
A p value of 0.01 or less gives 99%
confidence.
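A sketch of a one-sample t test against a hypothesised mean of 5 (SciPy assumed, data hypothetical):

  from scipy.stats import ttest_1samp

  data = [4.2, 5.1, 3.8, 4.9, 5.6, 4.4]
  result = ttest_1samp(data, popmean=5.0)
  # result.statistic: how many standard errors the sample mean is from 5
  # result.pvalue: if it is 0.05 or less, reject the null hypothesis at 95% confidence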
Two-sample t test

Considers the probability that two samples
of data do not have the same population
mean.
– If they don't, it suggests the samples measure
  categorically different things.
It works by measuring the difference
between the sample means relative to the
variance of the samples.
There are different versions of the t
test, for example for paired data and for
whether the two samples have
approximately equal variance or not.
An F test is used to compare the sample
variances and see if any difference could
be due to chance.
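A sketch of a two-sample t test with two hypothetical samples (SciPy assumed); equal_var=False gives Welch's version for unequal variances:

  from scipy.stats import ttest_ind

  a = [4.2, 5.1, 3.8, 4.9, 5.6, 4.4]
  b = [6.0, 5.7, 6.3, 5.9, 6.8, 6.1]
  result = ttest_ind(a, b, equal_var=False)   # Welch's t test
  # A small p value suggests the two samples do not share a population mean.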
Analysis of variance (ANOVA)

Used to test whether
three or more groups
of data have the
same population
mean.
Considers the
variations between
groups relative to the
variation within
groups.
Contrasts can be
used to specifically
compare one or more
of the groups with
one or more of the
others.
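A sketch of a one-way ANOVA for three hypothetical groups (SciPy assumed):

  from scipy.stats import f_oneway

  g1 = [4.2, 5.1, 3.8, 4.9]
  g2 = [6.0, 5.7, 6.3, 5.9]
  g3 = [5.0, 4.8, 5.2, 5.1]
  f_stat, p_value = f_oneway(g1, g2, g3)
  # A small p value suggests that not all groups share the same population mean.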
Two- and one-tailed tests

A two-tailed test is
non-directional
whereas a one-tailed
test is directional.
Consider a one-
sample t test
– The alternative
  hypothesis for a two-
  tailed test is only that
  the population mean
  is not equal to the
  hypothesised value.
– A one-tailed test
  specifies which is
  the greater.
Non-parametric tests

Non-parametric tests do not begin
with fixed assumptions about how
the data and the population are
distributed
– E.g. a normal distribution
However, if the assumptions are
met, it is better to use the parametric
test.

Parametric          Non-parametric
Two-sample t test   Wilcoxon rank sum test (aka Mann-Whitney test)
ANOVA               Kruskal-Wallis test
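Hedged sketches of the non-parametric equivalents listed above (SciPy assumed, hypothetical data):

  from scipy.stats import mannwhitneyu, kruskal

  a = [4.2, 5.1, 3.8, 4.9, 5.6, 4.4]
  b = [6.0, 5.7, 6.3, 5.9, 6.8, 6.1]
  u_stat, p_two_sample = mannwhitneyu(a, b)                     # Wilcoxon rank sum / Mann-Whitney
  h_stat, p_three_plus = kruskal(a, b, [5.0, 4.8, 5.2, 5.1])    # Kruskal-Wallis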
Possible outcomes of a statistical
              test
Power

We worry about limiting the probability of
rejecting the null hypothesis when it is
correct (of making a wrong decision)
– Of having a low p value
But we could avoid the error by never
rejecting the null hypothesis
Except, that's daft because the null
hypothesis could be wrong.
So we also need to think about the
probability of rejecting the null hypothesis
when it is indeed wrong
– The probability of making this, the right
  decision, is the power of the test.
Further reading

Chapter 6 of Statistics for Geography
and Environmental Science by Richard
Harris and Claire Jarvis (Prentice Hall /
Pearson, 2011)
Includes a review of the following key
concepts: type I errors; the one-
sample t test; hypothesis testing; two-
and one-tailed tests; type II errors and
statistical power;
homoscedasticity, heteroscedasticity
and the F test; analysis of variance;
measuring effects; and parametric and
non-parametric tests.
Module 7
(Extracts from Chapter 7 of Statistics for Geography
and Environmental Science)

RELATIONSHIPS AND
EXPLANATIONS
Module overview

This module looks at relational
statistics, exploring whether higher
values in one variable are associated
with higher values in another (a
positive relationship) or whether
higher values in the one are
associated with lesser values in the
other (a negative relationship).
It also looks at trying to explain the
variation found in one variable using
others.
Scatter plots

Scatter plots are
an effective way
of seeing if there
is any relationship
between two
variables, whether
it is a straight line
relationship, and
to help detect
errors in the data.
A positive relationship is when the line
of best fit is upwards sloping.
A negative relationship is when it is
downwards sloping.
The X variable (horizontal axis) is the
independent variable.
The Y variable (vertical axis) is the
dependent variable.
It is assumed that the X variable leads
to, possibly even causes, the Y
variable.
Correlation coefficients

A correlation
coefficient
describes the
degree of
association
between two sets
of paired values.
The Pearson
correlation
measures the
strength of the
straight line
relationship of two
variables.
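A sketch of Pearson's correlation for two hypothetical paired variables (SciPy assumed):

  from scipy.stats import pearsonr

  x = [1.0, 2.0, 3.0, 4.0, 5.0]
  y = [2.1, 3.9, 6.2, 8.1, 9.8]
  r, p = pearsonr(x, y)
  # r close to +1: strong positive straight-line relationship;
  # close to -1: strong negative; close to 0: little linear association.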
Uses of regression

To summarise data
To make predictions
To explain what causes what
Bivariate regression

Bivariate
regression finds a
line of best fit to
summarise the
relationship
between two
variables.
That line can be
used to make
predictions for
what the Y value
would be for a
given value for X.
It is a line of best
fit, rarely perfect fit.
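A sketch of fitting and using a line of best fit with NumPy (assumed available), for hypothetical data:

  import numpy as np

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
  slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line of best fit
  y_predicted = slope * 3.5 + intercept        # prediction for a new X value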
Regression tables
The strength of the effect of the X variable
on the Y is measured by the gradient of
the line of best fit
– It measures whether a change in X will lead to
  a change in Y and by how much.
The better the line fits the data (i.e. the
less the residual variation around it), the
greater our confidence that the effect is
genuine and not a chance property of the
sample.
Regression tables report various measures
and diagnostics including the measured
gradient of the line, the residual error, the
probability the gradient could actually be
zero (no relationship) and goodness-of-fit
measures
Assumptions of regression
           analysis
There are various
types of regression
analysis but the most
common, Ordinary
Least Squares
regression, assumes
that the two variables
are linearly related (or
could be transformed
to be so) and that the
residual errors are
random with no
unexplained patterns.
Visual checks can
easily be made.
Watch out for
leverage points and
extreme outliers.
Look out for spatial patterns!
Multiple regression

When two or more X variables are
used to explain the Y variable.
In addition to the usual checks (of
linearity and of random errors) we also
need to check for multicollinearity.
It is often helpful to standardise the
variables so their effects can be
compared.
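A sketch of a multiple regression on standardised variables using NumPy's least-squares solver (all data hypothetical); the standardised slopes can then be compared directly:

  import numpy as np

  rng = np.random.default_rng(0)
  x1 = rng.normal(size=100)
  x2 = rng.normal(size=100)
  y = 2.0 * x1 + 0.5 * x2 + rng.normal(size=100)

  X = np.column_stack([x1, x2])
  Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # standardise the predictors
  yz = (y - y.mean()) / y.std(ddof=1)                  # and the response
  A = np.column_stack([np.ones(len(yz)), Xz])          # add an intercept column
  coef, *_ = np.linalg.lstsq(A, yz, rcond=None)
  # coef[1:] are the standardised slopes, directly comparable in size.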
A strategy for multiple regression

 Crawley (2005; 2007) describes the aim
 of statistical modelling as finding a
 minimal adequate model. The process
 involves going from a "maximal model"
 containing all the variables of interest to
 a simpler model that fits the data almost
 as well by deleting the least significant
 variables one at a time (and checking
 the impact on the model at each stage
 of doing so). As part of the
 process, consideration also needs to be
 given to outliers and to other checks
 that the regression assumptions are
 being met.
Further reading

Chapter 7 of Statistics for Geography
and Environmental Science by Richard
Harris and Claire Jarvis (Prentice Hall /
Pearson, 2011)
Includes a review of the following key
concepts: scatter plots; independent
and dependent variables; Pearson's
correlation coefficient; the equation of a
straight line; residuals; bivariate
regression; outliers and leverage
points; multiple regression; goodness-
of-fit measures; assumptions of OLS
regression; and Occam's Razor and the
minimal adequate model.
Module 8
(Extracts from Chapter 8 of Statistics for Geography
and Environmental Science)

DETECTING & MANAGING
SPATIAL DEPENDENCY
Module overview

This module looks at some of the
specifically geographical issues of
analysing data.
The Modifiable Areal Unit Problem
The ecological fallacy

In a general sense
– Means that statistical relationships found
  at one scale may not apply at another
  scale, for example:

  Scale    n      r
  Region   9      -0.95
  LA       376    -0.77
  Ward     8868   -0.55

A more specific meaning
– When inappropriate assumptions are
  made about individuals from using
  grouped (aggregate) data.
Spatial autocorrelation

Standard statistics assume the
observations / errors are
independent of each other
But spatial data tend to be more
similar in value at nearby locations
than those further away
– This is positive spatial autocorrelation
Negative spatial autocorrelation is
when nearby measurements are
'opposite' to each other
Spatial autocorrelation
Detecting spatial autocorrelation

The semi-
variogram is used
to explore a data
set visually and to
estimate how far
you need to move
away from a
particular data
point before data
points at that
distance can be
considered
unassociated with
the first.
Other measures of global
       autocorrelation

Moran's I
Getis' G statistic
Geary's C
Join counts method
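As one illustration, a minimal NumPy sketch of global Moran's I for values x at n locations and an n-by-n spatial weights matrix w (the standard formula, not code from the textbook):

  import numpy as np

  def morans_i(x, w):
      x = np.asarray(x, dtype=float)
      w = np.asarray(w, dtype=float)
      z = x - x.mean()                       # deviations from the mean
      s0 = w.sum()                           # sum of all the weights
      return (len(x) / s0) * (z @ w @ z) / (z @ z)
  # Values well above the expected value of -1/(n-1) suggest positive
  # spatial autocorrelation; values well below it suggest negative.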
Global vs local measures

A global measure of spatial autocorrelation
gives a single summary measure of the
patterns of association for the whole study
region.
This can conceal more localised patterns
within the region.
Global measures can often be 'broken
down' into local measures where the
patterns of association are measured and
compared for sub-regions
– E.g. Local Moran's I, Local Getis G.
Can be used to identify 'hotspots' and 'cold
spots' of something (e.g. crime)
Further reading

Chapter 8 of Statistics for
Geography and Environmental
Science by Richard Harris and Claire
Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following
key concepts: spatial
autocorrelation; the MAUP; the
ecological fallacy; semi-variance;
semi-variogram; common structures
used to model the semi-variogram;
and hotspots.
Module 9
(Extracts from Chapter 9 of Statistics for Geography
and Environmental Science)

EXPLORING SPATIAL
RELATIONSHIPS
Module overview

This module is about treating where
something happens as useful
information that may help explain
what is happening. The central idea
is that when we find geographical
patterns in data and there is
evidence to suggest they did not
arise by chance then it would be
better to explore and model the
cause of the patterns than to treat
them as an inconvenience.
Spatial regression

The spatial error model and the
spatially lagged y model are
examples of spatial regression
models that allow for and measure
the interdependencies between
neighbouring or proximate data.
Neighbourhoods are defined by a
weights matrix indicating, for
example, if places share a boundary.
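A minimal sketch of a contiguity-based weights matrix for four hypothetical zones, with the common row-standardisation so each row sums to one (NumPy assumed):

  import numpy as np

  # 1 where two zones share a boundary, 0 otherwise (hypothetical adjacency)
  w = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
  w = w / w.sum(axis=1, keepdims=True)   # row-standardise: each row sums to 1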
Example of a weights matrix
Geographically Weighted
  Regression (GWR)
Geographically Weighted
  Regression (GWR)
Multilevel modelling

Multilevel modelling can be used to
model at multiple scales simultaneously
and to explore how individual
behaviours and characteristics are
shaped by the places in which they live
or by the organisations they attend.
Because multilevel models can
consider people in places they are
sometimes used to generate evidence
of a neighbourhood effect.
Also useful for longitudinal analysis
(analysis over time)
Geography, computation and
         statistics

The development of spatial analysis
has been made possible by
advances in computation.
But techniques like GWR are
characterised by repeat fitting and
remain demanding computationally.
There is increasing integration
between geographical information
science, computer science and
statistics.
Further reading

Chapter 9 of Statistics for
Geography and Environmental
Science by Richard Harris and Claire
Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following
key concepts: cartograms; spatial
analysis; weights matrices; spatial
econometrics; geographically
weighted regression; local indicators
of spatial association; and multilevel
modelling.
Thank you for your interest.
