GIS in Public Health Research: Understanding Spatial Analysis and Interpreting Outcomes 1-31-14

GIS in Public Health Research:
Understanding Spatial Analysis &
Interpreting Outcomes
Kristin Osiecki PhD

Houston Aerosol Characterization &
Health Experiment (HACHE)

• UT Health Science Center School of
Biomedical Informatics
• University of Houston Department
of Earth and Atmospheric Sciences
• Rice University Department of
Sociology and Department of Civil &
Environmental Engineering

Applications in Public Health Research
• Space matters
– communities, census tracts, counties, states

• Multidisciplinary and Interdisciplinary
• Collaborative
• Simple and Complex Models

What research questions are we
trying to answer?
• Do we need visualizations or maps? OR
• Are we interested in investigating possible
spatial relationships within the data?

ArcGIS Toolbox
Handyman’s Dream
or
Do-it-yourself nightmare?

Objectives
•
•
•
•

Traditional Statistics & Spatial Analysis
Permutations
Spatial Weights
EDA & ESDA

"Spatial Statistics" does not mean
applying traditional (non-spatial)
statistical methods to data that just
happens to be spatial (has X and Y
coordinates).
Source: ESRI
http://resources.esri.com/help/9.3/arcgisengine/java/gp_toolref/spatial_statistics_tools/how_generate_spatial_weights_matrix_spatial_statistics_works.htm

Spatial Analysis

[Diagram: Traditional Statistical Methodology, Spatial Methodology]

Global & Local
[Diagram: Global Model, EDA / ESDA, Global autocorrelation, Local autocorrelation, Local Model]

The most crucial step in the process

Exploring the Data: EDA & ESDA

Scatter Plot Matrix
[Figure: scatter plot matrix of pct_pov, p_FHH, and p_blck, axes from 0 to 1; one panel is labeled p_blck x p_FHH]

Exploratory Spatial Data Analysis
• Interactively visualize and explore data
where space matters
• Detect patterns
• Generate hypotheses
• Spatial modeling is then needed to test those
hypotheses
• Works on point and polygon features
(e.g., census, epidemiology,
demographic layers)

What is Spatial Randomness?
• The observed spatial pattern of values is as
likely as any other spatial pattern
• The value at one location does not depend on
values at neighboring locations
• Under spatial randomness, the location of values
may be altered without affecting the information
content of the data
• Random permutation or reshuffling of values
Dr. Luc Anselin 2012

Spatial Randomness
• Spatial Randomness Null Hypothesis
– Spatial randomness is the absence of any pattern
– If rejected, evidence of spatial structure

Dr. Luc Anselin 2012

ArcGIS Spatial Autocorrelation
• The Randomization Null Hypothesis: Where appropriate, the tools in the
Spatial Statistics toolbox use the randomization null hypothesis as the
basis for statistical significance testing. The randomization null hypothesis
postulates that the observed spatial pattern of your data represents one
of many (n!) possible spatial arrangements. If you could pick up your data
values and throw them down onto the features in your study area, you
would have one possible spatial arrangement of those values. (Note that
picking up your data values and throwing them down arbitrarily is an
example of a random spatial process). The randomization null hypothesis
states that if you could do this exercise (pick them up, throw them down)
infinite times, most of the time you would produce a pattern that would
not be markedly different from the observed pattern (your real data).
Once in a while you might accidentally throw all the highest values into
the same corner of your study area, but the probability of doing that is
small. The randomization null hypothesis states that your data is one of
many, many, many possible versions of complete spatial randomness. The
data values are fixed; only their spatial arrangement could vary.
http://resources.arcgis.com/en/help/main/10.1/index.html#//005p00000006000000

Permutations
• A numerical approach to testing for statistical
significance (in contrast to analytical
approaches)
• It is data-driven and makes no assumptions
(such as normality) about the data

Permutations in GeoDa
• Permutation inference is shuffling values around
and re-computing statistics each time with a
different set of random numbers to construct a
reference distribution.
• Permutations are used to determine how likely it
would be to observe the Moran’s I value of an
actual distribution under conditions of spatial
randomness.
• P-values are dependent on the number of
permutations so they are “pseudo p-values”
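
The permutation idea above can be sketched in a few lines of plain Python. This is an illustrative sketch, not GeoDa's implementation: the function names, the six example values, and the "adjacent cells are neighbors" weights are all assumptions made for the example.

```python
import random

def morans_i(values, weights):
    """Global Moran's I for a list of values and a dense weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)           # sum of all weights
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)

def pseudo_p_value(values, weights, permutations=999, seed=0):
    """One-sided pseudo p-value from a permutation reference distribution."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)                       # reassign values to locations
        if morans_i(shuffled, weights) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (permutations + 1)

# Hypothetical data: six locations on a line; adjacent cells are neighbors.
values = [1, 2, 3, 7, 8, 9]
weights = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
observed, p = pseudo_p_value(values, weights, permutations=999)
print(round(observed, 3), p)
```

Because the smallest possible result is 1/(permutations + 1), running more permutations allows a smaller pseudo p-value, which is why the p-value depends on the number of permutations.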

Spatial Weights
The first step in the analysis of spatial
autocorrelation is to construct a spatial weights
file that contains information on the
“neighborhood” structure for each location
(Luc Anselin)

Generation of Spatial Weights (ESRI)
• For binary strategies (fixed distance, K nearest
neighbors, or contiguity) a feature is either a
neighbor (1) or it is not (0).
• For weighted strategies (inverse distance or
zone of indifference) neighboring features
have a varying amount of impact (or
influence) and weights are computed to
reflect that variation.

Row Standardization
• Adjusts the weights in a spatial weights matrix
• Each weight is divided by its row sum
• The row sum is the sum of weights for a
feature’s neighbors.
• A weights matrix is row-standardized when
the values of each of its rows sum to one.

Binary vs. row-standardized
• A binary weights matrix looks like:
0    1    0    0
0    0    1    1
1    1    0    0
0    1    1    1

• The corresponding row-standardized matrix looks like:
0    1    0    0
0    0    .5   .5
.5   .5   0    0
0    .33  .33  .33
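
Row standardization is simple enough to check by hand or in code. The sketch below uses the slide's binary example matrix; the function name is hypothetical, and leaving rows with no neighbors as all zeros is an assumption of this sketch.

```python
def row_standardize(weights):
    """Divide each weight by its row sum; rows with no neighbors stay all zero."""
    standardized = []
    for row in weights:
        total = sum(row)
        standardized.append([w / total if total else 0.0 for w in row])
    return standardized

# Binary weights matrix from the slide's example.
binary = [
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
standardized = row_standardize(binary)
for row in standardized:
    print([round(w, 2) for w in row])
```

Each printed row sums to one, so a feature with three neighbors gives each of them a weight of about .33 while a feature with one neighbor gives it a weight of 1.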

Spatial Weights

• Formal expression of locational similarity

Distance Models
• Inverse distance – all features influence all
other features, but the closer something is,
the more influence it has
• Distance band – features outside a specified
distance do not influence the features within
the area
• Zone of indifference – combines inverse
distance and distance band

Inverse Distance (impedance) (ArcGIS)
• features impact/influence all other features
– farther away something is, the smaller the impact

• specify a Distance Band/Threshold Distance value
to reduce the number of required computations
– especially with large datasets.
– If not specified, a default threshold
value is computed for you

• Choosing an appropriate distance is important
– Some spatial statistics require each feature to have at
least one neighbor for the analysis to be reliable.
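
As a rough illustration of inverse distance weighting with a threshold, the sketch below builds a dense weights matrix from point coordinates. It is not the ArcGIS implementation; the function name, the `power` parameter, and the three example points are assumptions.

```python
import math

def inverse_distance_weights(points, threshold=None, power=1):
    """Dense inverse-distance weights; pairs beyond the threshold get weight 0."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                  # a feature is not its own neighbor
            d = math.dist(points[i], points[j])
            if d == 0:
                continue                  # coincident points are skipped in this sketch
            if threshold is None or d <= threshold:
                weights[i][j] = 1.0 / d ** power
    return weights

# Hypothetical points on a line at x = 0, 1, and 4.
points = [(0, 0), (1, 0), (4, 0)]
w = inverse_distance_weights(points, threshold=3.5)
for row in w:
    print([round(v, 3) for v in row])
```

With the threshold at 3.5, the points at x = 0 and x = 4 no longer influence each other, which is how a threshold cuts down the number of computations on large datasets.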

Distance band (sphere of influence)
• impose a sphere of influence, or moving window
conceptual model of spatial interactions onto the data
• Neighbors within the specified distance are weighted
equally. Features outside have no influence (weight = 0)
• Evaluate the statistical properties of your data at a
particular (fixed) spatial scale
• Every feature should have at least one neighbor, or results will not be valid
• If the input data is skewed, make sure that your distance
band is neither too small (only one or two neighbors) nor
too large (includes all other features as neighbors)
– either extreme makes the resultant z-scores less reliable.
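
One way to see the effect of the distance band is to count neighbors per feature for several candidate bands. The sketch below is illustrative only: the function names and example points are hypothetical, and it treats a feature exactly at the band distance as a neighbor, which is an assumption.

```python
import math

def distance_band_weights(points, band):
    """Binary weights: 1 for other features within the band, 0 otherwise."""
    n = len(points)
    return [[1 if i != j and math.dist(points[i], points[j]) <= band else 0
             for j in range(n)] for i in range(n)]

def neighbor_counts(weights):
    """Number of neighbors per feature, to spot features with none."""
    return [sum(row) for row in weights]

# Hypothetical points: three close together and one far away.
points = [(0, 0), (1, 0), (2, 0), (10, 0)]
counts_by_band = {band: neighbor_counts(distance_band_weights(points, band))
                  for band in (1.5, 9.0, 20.0)}
for band, counts in counts_by_band.items():
    print(band, counts)
```

The smallest band leaves the distant feature with no neighbors, and the largest makes every feature a neighbor of every other, the two extremes the slide warns against.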

Adjacency Models
• K Nearest Neighbors – a specified number of
neighboring features are included in
calculations
• Polygon Contiguity – polygons that share an
edge or node influence each other

K-nearest neighbors

• Each feature is assessed in the spatial context of a
specified number of its closest neighbors. If K (the number
of neighbors) is 8, then the eight closest neighbors to the
target feature will be included.
• If feature density is high, the spatial context of the
analysis will be smaller.
• If feature density is sparse, the spatial context for
the analysis will be larger.
• This method is available using the Generate Spatial
Weights Matrix tool
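
A minimal sketch of K nearest neighbors selection follows, assuming point coordinates (for polygons, centroids could be used). The function name and example points are hypothetical, and breaking distance ties by index order is just one arbitrary choice.

```python
import math

def knn_neighbors(points, k):
    """For each point, the indices of its k closest other points."""
    neighbors = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Ties in distance are broken by index order, one arbitrary choice.
        others.sort(key=lambda j: (math.dist(p, points[j]), j))
        neighbors.append(others[:k])
    return neighbors

# Hypothetical points: three close together and one far away.
points = [(0, 0), (1, 0), (2, 0), (10, 0)]
knn = knn_neighbors(points, k=2)
print(knn)
```

Every feature gets exactly two neighbors here, but the relation is not symmetric: the distant point counts the point at x = 2 as a neighbor, while that point does not count the distant one.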

Polygon contiguity (first order)
• polygons that share an edge (that have
coincident boundaries) are included in
computations for the target polygon
• Use when modeling some type of contagious process or
when dealing with continuous data represented
as polygons.
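
Contiguity can be sketched by comparing polygon vertices. The example below is a simplification, not how ArcGIS or GeoDa detect shared boundaries: it assumes neighboring polygons share identical vertex coordinates, treats two shared vertices as a shared edge ("rook"), and treats one shared vertex as enough for "queen" contiguity. All names are hypothetical.

```python
def contiguity_neighbors(polygons, rule="queen"):
    """Neighbors by shared vertices: queen needs 1 shared vertex, rook needs 2."""
    needed = 1 if rule == "queen" else 2
    vertex_sets = [set(poly) for poly in polygons]
    n = len(polygons)
    return [[j for j in range(n)
             if j != i and len(vertex_sets[i] & vertex_sets[j]) >= needed]
            for i in range(n)]

def square(x, y):
    """Unit square with its lower-left corner at (x, y), as a vertex list."""
    return [(x, y), (x + 1, y), (x + 1, y + 1), (x, y + 1)]

# Hypothetical 2 x 2 grid of unit squares.
polygons = [square(0, 0), square(1, 0), square(0, 1), square(1, 1)]
rook = contiguity_neighbors(polygons, "rook")
queen = contiguity_neighbors(polygons, "queen")
print(rook)
print(queen)
```

In this grid the diagonal squares touch only at a corner, so they are neighbors under the queen rule but not under the rook rule.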

Binary Contiguity Weights
• contiguity = common border
• if i and j share a border, then wij = 1
• if i and j are not neighbors, then wij = 0
• weights are 0 or 1, hence binary

Distance-Based Weights
• distance between points
• distance between polygon
centroids or central points
• distance-band weights:
– wij nonzero for dij < d, that is, less than a critical distance d
• k-nearest neighbor weights:
– same number of neighbors for all observations
– potential problems with ties

Global vs. Local Statistics
• Global statistics (Clustering) – identify and
measure the pattern of the entire study area
– Do not indicate where specific patterns occur

• Local Statistics (Clusters) – identify variation
across the study area, focusing on individual
features and their relationships to nearby
features (i.e. specific areas of clustering)

Spatial Autocorrelation (Moran’s I)
• Global statistic
• Measures whether the pattern of feature values is clustered,
dispersed, or random.
• Compares the difference between the value of the target
feature and the mean for all features to the difference
between the value of each neighbor and the mean for all
features.
[Diagram: target feature, each neighbor, mean of all features]
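
For reference, the comparison described above corresponds to the standard textbook form of global Moran's I, written here with x_i for the value at feature i, \bar{x} for the mean of all features, and w_{ij} for the spatial weight between features i and j. The notation is an assumption of this note rather than something taken from the slides.

```latex
\[
I = \frac{n}{S_0}\,
    \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
S_0 = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij},
\qquad
E[I] = -\frac{1}{n-1}
\]
```

Values of I above the expected value E[I] point toward clustering of similar values, and values below it point toward dispersion.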

Z-Score & P-value (ArcGIS)
• Very high or very low (negative) z-scores,
associated with very small p-values, are found in
the tails of the normal distribution
• When this happens, it is unlikely that the observed spatial pattern
reflects the theoretical random pattern
represented by your null hypothesis (CSR)
• The null hypothesis for the pattern analysis tools
is Complete Spatial Randomness (CSR), either of
the features themselves or of the values
associated with those features.
http://resources.arcgis.com/en/help/main/10.1/index.html#//005p00000006000000
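
The link between a z-score and its p-value can be checked with the standard normal distribution. The short sketch below assumes a two-sided test and uses only the Python standard library; the function name is hypothetical.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

for z in (0.5, 1.96, 2.58, -3.0):
    print(z, round(two_sided_p_value(z), 4))
```

A z-score near zero gives a large p-value, so the null hypothesis is not rejected, while z-scores beyond about ±1.96 fall in the tails with p-values below 0.05.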

Pseudo P-Value
• significance levels are dependent on the
number of permutations
• One-sided significance test
• For instance, if an observed Moran's I value is
higher than any of the randomly generated
Moran's I values, the pseudo p-value would be
1/100=0.01 for 99 permutations or
1/1,000=0.001 for 999 permutations
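
The arithmetic in this example follows the usual pseudo p-value formula, where R is the number of permutations and M is the number of permuted statistics at least as large as the observed one (the symbols are chosen here for illustration):

```latex
\[
p_{\text{pseudo}} = \frac{M + 1}{R + 1},
\qquad
M = 0,\ R = 99 \;\Rightarrow\; p = \tfrac{1}{100} = 0.01,
\qquad
M = 0,\ R = 999 \;\Rightarrow\; p = \tfrac{1}{1000} = 0.001
\]
```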

Polygon Contiguity (first order)
Percent Black Population, Cook County, IL

Generate Spatial Weights Matrix
K-Nearest Neighbor

K-Nearest Neighbor

Spatial Autocorrelation (Getis-Ord General G High/Low Clustering)
Polygon Contiguity

If the z-score value is positive, the observed General G index is larger than the expected
General G index, indicating high values for the attribute are clustered in the study area
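
For context, the General G statistic and its z-score are commonly written as below, with x_i the attribute value at feature i and w_{ij} the spatial weight between features i and j. The notation is an assumption of this note, and the variance term V[G] is left unexpanded.

```latex
\[
G = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\, x_i x_j}
         {\sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j},
\quad j \neq i,
\qquad
E[G] = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}}{n(n-1)},
\quad j \neq i,
\qquad
z_G = \frac{G - E[G]}{\sqrt{V[G]}}
\]
```

A positive z_G therefore means the observed G exceeds its expected value, which is the high-value clustering case described above.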

GeoDa Spatial Autocorrelation (Moran’s I)

Queen Contiguity Weight (1st order)

K-Nearest Neighbor (eight)

K-Nearest Neighbor (four)

Anselin Local Moran’s I
• Local statistic
• Measures the strength of patterns for
each specific feature.
• Compares the value of each feature in a
pair to the mean value for all features in
the study area.

• Positive I value:
– Feature is surrounded by features with similar values, either high or low.
– Feature is part of a cluster.
– Statistically significant clusters can consist of high values (HH) or low
values (LL)

• Negative I value:
– Feature is surrounded by features with dissimilar values.
– Feature is an outlier.
– Statistically significant outliers can be a feature with a high value
surrounded by features with low values (HL) or a feature with a low
value surrounded by features with high values (LH).
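
The sign logic above can be sketched with a small local Moran's I calculation. This uses one common scaling (deviation times the weighted neighbor deviation, divided by the variance), which may differ from the exact ArcGIS formula. The function name and six example values are hypothetical, and the quadrant labels say nothing about significance, which still requires z-scores or permutations.

```python
def local_morans_i(values, weights):
    """Local Moran's I and quadrant label (HH, LL, HL, LH) for each feature."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    m2 = sum(d * d for d in dev) / n                # variance of the values
    results = []
    for i in range(n):
        # Weighted deviation of the neighbors (the spatial lag).
        lag = sum(weights[i][j] * dev[j] for j in range(n))
        local_i = dev[i] * lag / m2
        # Deviations of exactly zero are labeled "L" in this sketch.
        quadrant = ("H" if dev[i] > 0 else "L") + ("H" if lag > 0 else "L")
        results.append((local_i, quadrant))
    return results

# Hypothetical data: six locations on a line with row-standardized weights.
values = [9, 8, 9, 1, 2, 1]
adjacency = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
weights = [[w / sum(row) for w in row] for row in adjacency]
results = local_morans_i(values, weights)
for local_i, quadrant in results:
    print(round(local_i, 3), quadrant)
```

The two locations at the boundary between the high and low runs get negative values and mixed labels (HL and LH), while the locations inside each run get positive values.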


• The z-scores and p-values are measures of statistical
significance which tell you whether or not to reject the
null hypothesis, feature by feature.
• They indicate whether the apparent similarity (or
dissimilarity) in values for a feature and its neighbors is
greater than one would expect in a random distribution.
http://resources.esri.com/help/9.3/arcgisengine/java/gp_toolref/spatial_statistics_tools/cluster_and_outlier_analysis_colon_anselin_local_moran_s_i_spatial_statistics_.htm

Anselin’s Local Moran’s I
Polygon Contiguity Weight
Percent Black Population, Cook County, IL
[Map panels: index, z-score, p-value; legend: HH, LH]

GeoDa Univariate LISA
Queen Contiguity Weight
[Map panels: p-values with 499 permutations; p-values with 999 permutations]

GeoDa Univariate LISA
[Map: HH and HL, 999 permutations]

Comparison of ArcGIS & GeoDa Results
[Map panels: p-values]

Comparison of ArcGIS & GeoDa Univariate LISA
[Map panels: HH and HL; HH and HL with 999 permutations]

Bivariate LISA Scatterplot
[Figure: scatterplot with quadrants High-High, Low-Low, High-Low, and Low-High; axes: Percent Poverty and Non-point Source Cancer Risk]

Chow test for selected/unselected regression subsets: distribution F(2,1339), ratio=214.6, p-value=0

                           INTERCEPT                                  SLOPE
# of Observations  R^2     Constant  Std Error  t-statistic  p-value  Slope  Std Error  t-statistic  p-value
1343               0.209   0.00442   0.0176     0.251        0.802    0.332  0.0176     18.8         0
80                 0.1116  1.58      0.0797     19.8         0        0.045  0.0475     0.957        0.342
1263               0.118   -0.0794   0.0161     -4.92        0        0.223  0.0172     13           0
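
The degrees of freedom reported for the Chow test are consistent with the standard form of the statistic, assuming GeoDa follows it. Here RSS_all is the residual sum of squares for all 1343 observations, RSS_1 and RSS_2 are those of the two subsets (80 and 1263 observations), and k = 2 is the number of estimated coefficients (intercept and slope). The symbols are chosen for this note.

```latex
\[
F = \frac{\bigl(RSS_{\text{all}} - (RSS_1 + RSS_2)\bigr)/k}
         {(RSS_1 + RSS_2)/(n - 2k)}
\;\sim\; F(k,\; n - 2k),
\qquad
k = 2,\ n = 1343 \;\Rightarrow\; F(2,\ 1339)
\]
```

A large ratio, such as the 214.6 reported here, indicates that the regression lines fitted to the two subsets differ.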

[Diagram: Global Model, EDA / ESDA, Local Model]
