Geographic information systems (GIS) allow us to visualize data to better understand public health issues in our communities. Maps help recognize patterns for hypothesis generation; however, spatial analysis is necessary to substantiate relationships and produce meaningful outcomes. In this presentation we will discuss a few of the basic questions related to spatial analysis:
4. • UT Health Science Center School of
Biomedical Informatics
• University of Houston Department
of Earth and Atmospheric Sciences
• Rice University Department of
Sociology and Department of Civil &
Environmental Engineering
7. Applications in Public Health Research
• Space matters
– communities, census tracts, counties, states
• Multidisciplinary and Interdisciplinary
• Collaborative
• Simple and Complex Models
8. What research questions are we
trying to answer?
• Do we need visualizations or maps? OR
• Are we interested in investigating possible
spatial relationships within the data?
11. "Spatial Statistics" does not mean
applying traditional (non-spatial)
statistical methods to data that just
happens to be spatial (has X and Y
coordinates).
Source: ESRI
http://resources.esri.com/help/9.3/arcgisengine/java/gp_toolref/spatial_statistics_tools/how_generate_spatial_weights_matrix_spatial_statistics_works.htm
22. Exploratory Spatial Data Analysis
• Interactively visualize and explore data
where space matters
• Detect patterns
• Hypothesis generation
• Spatial modeling is needed to test
hypotheses
• Works on point features and polygon
features (e.g. census, epidemiology,
demographic layers)
23. What is Spatial Randomness?
• The observed spatial pattern of values is as
likely as any other spatial pattern
• The value at one location does not depend on
values at neighboring locations
• Under spatial randomness, the locations of values
may be altered without affecting the information
content of the data
• Random permutation or reshuffling of values
Dr. Luc Anselin 2012
24. Spatial Randomness
• Spatial Randomness Null Hypothesis
– Spatial randomness is the absence of any pattern
– If rejected, evidence of spatial structure
Dr. Luc Anselin 2012
25. ArcGIS Spatial Autocorrelation
• The Randomization Null Hypothesis: Where appropriate, the tools in the
Spatial Statistics toolbox use the randomization null hypothesis as the
basis for statistical significance testing. The randomization null hypothesis
postulates that the observed spatial pattern of your data represents one
of many (n!) possible spatial arrangements. If you could pick up your data
values and throw them down onto the features in your study area, you
would have one possible spatial arrangement of those values. (Note that
picking up your data values and throwing them down arbitrarily is an
example of a random spatial process). The randomization null hypothesis
states that if you could do this exercise (pick them up, throw them down)
infinite times, most of the time you would produce a pattern that would
not be markedly different from the observed pattern (your real data).
Once in a while you might accidentally throw all the highest values into
the same corner of your study area, but the probability of doing that is
small. The randomization null hypothesis states that your data is one of
many, many, many possible versions of complete spatial randomness. The
data values are fixed; only their spatial arrangement could vary.
http://resources.arcgis.com/en/help/main/10.1/index.html#//005p00000006000000
26. Permutations
• A numerical approach to testing for statistical
significance (in contrast to analytical
approaches)
• It is data-driven and makes no assumptions
(such as normality) about the data
27. Permutations in GeoDa
• Permutation inference shuffles the observed values
across locations and re-computes the statistic each
time, using a different random shuffle, to construct a
reference distribution.
• Permutations are used to determine how likely it
would be to observe the Moran’s I value of an
actual distribution under conditions of spatial
randomness.
• P-values are dependent on the number of
permutations so they are “pseudo p-values”
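The shuffling procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not GeoDa's implementation: the function names, the six hypothetical values, and the "adjacent locations on a line" weights are all assumptions made for the example.

```python
import random

def morans_i(values, weights):
    """Global Moran's I for a list of values and a dense weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)

def permutation_test(values, weights, permutations=999, seed=0):
    """Shuffle values over locations to build a reference distribution."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    reference = []
    for _ in range(permutations):
        rng.shuffle(shuffled)  # the values stay fixed, only locations change
        reference.append(morans_i(shuffled, weights))
    # One-sided pseudo p-value: share of shuffles at least as extreme
    extreme = sum(1 for r in reference if r >= observed)
    return observed, reference, (extreme + 1) / (permutations + 1)

# Hypothetical data: six locations on a line, neighbors are adjacent locations
values = [1, 2, 3, 10, 11, 12]
weights = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
observed, reference, pseudo_p = permutation_test(values, weights)
print(round(observed, 3), pseudo_p)
```

Because the low values sit together and the high values sit together, few random shuffles produce a Moran's I as large as the observed one, so the pseudo p-value is small.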
29. Spatial Weights
The first step in the analysis of spatial
autocorrelation is to construct a spatial weights
file that contains information on the
“neighborhood” structure for each location
(Luc Anselin)
30. Generation of Spatial Weights (ESRI)
• For binary strategies (fixed distance, K nearest
neighbors, or contiguity) a feature is either a
neighbor (1) or it is not (0).
• For weighted strategies (inverse distance or
zone of indifference) neighboring features
have a varying amount of impact (or
influence) and weights are computed to
reflect that variation.
31. Row Standardization
• Adjusts the weights in a spatial weights matrix
• Each weight is divided by its row sum
• The row sum is the sum of weights for a
feature’s neighbors.
• A weights matrix is row-standardized when
the values of each of its rows sum to one.
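Row standardization as described above can be sketched as follows. The function name and the four-feature binary matrix are hypothetical, and leaving a feature with no neighbors as a row of zeros is an assumption of this sketch.

```python
def row_standardize(weights):
    """Divide each weight by its row sum; rows without neighbors stay zero."""
    standardized = []
    for row in weights:
        total = sum(row)
        standardized.append([w / total if total else 0.0 for w in row])
    return standardized

# Hypothetical binary weights for four features (1 = neighbor, 0 = not)
binary = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]
w = row_standardize(binary)
for row in w:
    print([round(x, 2) for x in row])
```

A feature with two neighbors gives each a weight of 0.5, while a feature with a single neighbor gives it a weight of 1, so every row sums to one.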
34. Distance Models
• Inverse distance – all features influence all
other features, but the closer something is,
the more influence it has
• Distance band – features outside a specified
distance do not influence the features within
the area
• Zone of indifference – combines inverse
distance and distance band
35. Inverse Distance (impedance) (ArcGIS)
• features impact/influence all other features
– farther away something is, the smaller the impact
• specify a Distance Band/Threshold Distance value
to reduce the number of required computations
– especially with large datasets.
– If not specified, a default threshold
value is computed for you
• Choosing an appropriate distance is important
– Some spatial statistics require each feature to have at
least one neighbor for the analysis to be reliable.
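A minimal sketch of inverse distance weights with an optional threshold distance. It is not the ArcGIS implementation: the function name, the `power` parameter, the handling of coincident points, and the sample coordinates are assumptions for illustration.

```python
import math

def inverse_distance_weights(points, threshold=None, power=1):
    """Weight = 1 / distance**power; pairs beyond the threshold get 0."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            if d == 0:
                continue  # coincident points are skipped (assumption)
            if threshold is None or d <= threshold:
                weights[i][j] = 1.0 / d ** power
    return weights

# Hypothetical point locations; the last one is far from the others
points = [(0, 0), (1, 0), (4, 0), (10, 0)]
w = inverse_distance_weights(points, threshold=5)
isolated = [i for i, row in enumerate(w) if sum(row) == 0]
print(w[0], isolated)
```

With a threshold of 5 the last point has no neighbors at all, which is the situation the slide warns about when choosing a distance.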
36. Distance band (sphere of influence)
• impose a sphere of influence, or moving window
conceptual model of spatial interactions onto the data
• Neighbors within the specified distance are weighted
equally. Features outside have no influence (weight = 0)
• Evaluate the statistical properties of your data at a
particular (fixed) spatial scale
• Every feature should have at least one neighbor, or results will not be valid
• If the input data is skewed, make sure that your distance
band is neither too small (only one or two neighbors) nor
too large (all other features included as neighbors)
– otherwise the resultant z-scores are less reliable.
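One way to check a candidate distance band is to count the neighbors each feature receives. The sketch below uses hypothetical coordinates and function names, and treats "within the band" as strictly less than the band distance (an assumption).

```python
import math

def distance_band_weights(points, band):
    """Binary weights: 1 if another feature lies within the band, else 0."""
    n = len(points)
    return [[1 if i != j and math.dist(points[i], points[j]) < band else 0
             for j in range(n)] for i in range(n)]

def neighbor_counts(weights):
    """Number of neighbors per feature."""
    return [sum(row) for row in weights]

# Hypothetical point locations
points = [(0, 0), (1, 0), (2, 0), (8, 0)]
too_small = neighbor_counts(distance_band_weights(points, 0.5))
moderate = neighbor_counts(distance_band_weights(points, 1.5))
too_large = neighbor_counts(distance_band_weights(points, 10))
print(too_small, moderate, too_large)
```

A band of 0.5 leaves every feature without neighbors, a band of 10 makes every feature a neighbor of every other, and a band of 1.5 still leaves the outlying point isolated.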
37. Adjacency Models
• K Nearest Neighbors – a specified number of
neighboring features are included in
calculations
• Polygon Contiguity – polygons that share an
edge or node influence each other
38. K-nearest neighbors
• Each feature is assessed in the spatial context of a
specified number of its closest neighbors. If K is
8, then the eight closest neighbors to the target
feature will be included.
• If feature density is high, the spatial context of the
analysis will be smaller.
• If feature density is sparse, the spatial context for
the analysis will be larger.
• method is available using the Generate Spatial
Weights Matrix tool
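A minimal sketch of K nearest neighbor weights, independent of the Generate Spatial Weights Matrix tool. The function name and coordinates are hypothetical, and ties in distance are broken by input order, which is an assumption of this sketch.

```python
import math

def knn_weights(points, k):
    """Binary weights: each feature's k closest features are its neighbors."""
    n = len(points)
    weights = [[0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:  # ties are resolved by input order
            weights[i][j] = 1
    return weights

# Hypothetical locations: dense on the left, sparse on the right
points = [(0, 0), (1, 0), (2, 0), (10, 0), (20, 0)]
w = knn_weights(points, k=2)
# Distance to the farthest of the k neighbors (the "spatial context")
reach = [max(math.dist(points[i], points[j])
             for j in range(len(points)) if w[i][j]) for i in range(len(points))]
print(w, reach)
```

Every feature gets exactly two neighbors, but the distance needed to find them is 2 in the dense area and 18 for the most remote point. The resulting matrix is also not symmetric: the remote point counts the point at x = 10 as a neighbor, but not the other way around.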
39. Polygon contiguity (first order)
• polygons that share an edge (that have
coincident boundaries) are included in
computations for the target polygon
• Appropriate when modeling some type of contagious
process or when dealing with continuous data
represented as polygons.
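A minimal sketch of contiguity detection, assuming each polygon is an ordered list of (x, y) vertices and that shared boundaries are stored with identical vertices in both polygons. The function names and the `rule` parameter are hypothetical. The "edge" rule corresponds to shared edges, and the "node" rule also counts polygons that only touch at a corner.

```python
def edges(polygon):
    """Undirected edges of a polygon given as an ordered vertex list."""
    return {frozenset((polygon[i], polygon[(i + 1) % len(polygon)]))
            for i in range(len(polygon))}

def contiguity_weights(polygons, rule="edge"):
    """Binary weights: 1 if two polygons share an edge (or a node)."""
    n = len(polygons)
    weights = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rule == "edge":
                shared = edges(polygons[i]) & edges(polygons[j])
            else:
                shared = set(polygons[i]) & set(polygons[j])
            if shared:
                weights[i][j] = weights[j][i] = 1
    return weights

# Hypothetical 2 x 2 grid of unit squares
squares = [
    [(0, 0), (1, 0), (1, 1), (0, 1)],  # bottom-left
    [(1, 0), (2, 0), (2, 1), (1, 1)],  # bottom-right
    [(0, 1), (1, 1), (1, 2), (0, 2)],  # top-left
    [(1, 1), (2, 1), (2, 2), (1, 2)],  # top-right
]
edge_w = contiguity_weights(squares, rule="edge")
node_w = contiguity_weights(squares, rule="node")
print(edge_w)
print(node_w)
```

Under the edge rule, diagonal squares are not neighbors. Under the node rule, all four squares neighbor each other because they meet at the center point.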
40. Binary Contiguity Weights
• contiguity = common border
• i and j share a border, then wij = 1
• i and j are not neighbors, then wij = 0
• weights are 0 or 1, hence binary
Distance-Based Weights
• distance between points
• distance between polygon
centroids or central points
• distance-band weights:
wij nonzero for dij < d, where d is a critical distance
• k-nearest neighbor weights:
same number of neighbors for all observations;
potential problems with ties
41. Global vs. Local Statistics
• Global statistics (Clustering) – identify and
measure the pattern of the entire study area
– Do not indicate where specific patterns occur
• Local Statistics (Clusters) – identify variation
across the study area, focusing on individual
features and their relationships to nearby
features (i.e. specific areas of clustering)
42. Spatial Autocorrelation (Moran’s I)
• Global statistic
• Measures whether the pattern of feature values is clustered,
dispersed, or random.
• Compares the difference between the value of the target
feature and the mean for all features to the difference
between the value for each neighbor and the mean for all
features.
[Figure labels: Mean of Target Feature; Mean of each neighbor; Mean of all features]
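The comparison described above corresponds to the usual form of global Moran's I: the cross-products of deviations from the mean for neighboring features, scaled by the total variation and the sum of the weights. The sketch below is a plain-Python illustration with hypothetical values on a line of six locations; it is not the ArcGIS or GeoDa code.

```python
def morans_i(values, weights):
    """I = (n / S0) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)

n = 6
# Hypothetical weights: neighbors are adjacent locations on a line
line = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
clustered = morans_i([1, 1, 1, 9, 9, 9], line)  # similar values together
dispersed = morans_i([1, 9, 1, 9, 1, 9], line)  # alternating values
expected = -1 / (n - 1)                         # expected I under randomness
print(clustered, dispersed, expected)
```

The clustered arrangement gives a positive I (about 0.6), the alternating arrangement a negative I (about -1.0), and both are judged against the expected value under spatial randomness, -1/(n - 1).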
43. Z-Score & P-value (ArcGIS)
• Very high or very low (negative) z-scores,
associated with very small p-values, are found in
the tails of the normal distribution
• With such values, it is unlikely that the observed
spatial pattern reflects the theoretical random pattern
represented by your null hypothesis (CSR)
• The null hypothesis for the pattern analysis tools
is Complete Spatial Randomness (CSR), either of
the features themselves or of the values
associated with those features.
http://resources.arcgis.com/en/help/main/10.1/index.html#//005p00000006000000
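The link between a z-score and its p-value can be sketched with the standard normal distribution. The function name is hypothetical, and the two-sided form is an assumption made for the example.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

for z in (0.5, 1.65, 1.96, 2.58):
    print(z, round(two_sided_p_value(z), 4))
```

A z-score near zero gives a large p-value (no evidence against CSR), while z-scores beyond roughly 1.96 in either direction give p-values below 0.05.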
44. Pseudo P-Value
• significance levels are dependent on the
number of permutations
• One-sided significance test
• For instance, if an observed Moran's I value is
higher than any of the randomly generated
Moran's I values, the pseudo p-value would be
1/100=0.01 for 99 permutations or
1/1,000=0.001 for 999 permutations
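The arithmetic in the example above follows the common pseudo p-value formula (M + 1) / (R + 1), where R is the number of permutations and M is the number of permuted statistics at least as large as the observed one. A minimal sketch, with hypothetical reference values standing in for permuted Moran's I values:

```python
def pseudo_p_value(observed, reference):
    """One-sided pseudo p-value: (M + 1) / (R + 1)."""
    extreme = sum(1 for r in reference if r >= observed)
    return (extreme + 1) / (len(reference) + 1)

# Hypothetical reference distributions, all below the observed value of 0.5
p_99 = pseudo_p_value(0.5, [i / 10000 for i in range(99)])
p_999 = pseudo_p_value(0.5, [i / 10000 for i in range(999)])
print(p_99, p_999)
```

The smallest pseudo p-value that can be reported is therefore 1 / (R + 1), which is why the significance level depends on the number of permutations.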
50. Spatial Autocorrelation (Getis-Ord General G High/Low Clustering)
Polygon Contiguity
Percent Black Population, Cook County, IL
If the z-score value is positive, the observed General G index is larger than the expected
General G index, indicating high values for the attribute are clustered in the study area
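The comparison of observed and expected General G can be sketched as below, using the standard definitions G = Σ w_ij x_i x_j / Σ x_i x_j (for i ≠ j) and E[G] = Σ w_ij / (n(n - 1)). The values are hypothetical, not the Cook County data, and the z-score itself (which also needs the variance of G) is not computed here.

```python
def general_g(values, weights):
    """Observed General G: weighted cross-products over all cross-products."""
    n = len(values)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    num = sum(weights[i][j] * values[i] * values[j] for i, j in pairs)
    den = sum(values[i] * values[j] for i, j in pairs)
    return num / den

def expected_g(weights):
    """Expected General G under the null hypothesis."""
    n = len(weights)
    total = sum(weights[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

# Hypothetical weights: neighbors are adjacent locations on a line
line = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
g_high = general_g([9, 8, 1, 1, 1, 1], line)   # high values side by side
g_split = general_g([9, 1, 1, 8, 1, 1], line)  # high values apart
e_g = expected_g(line)
print(round(g_high, 3), round(g_split, 3), round(e_g, 3))
```

When the two high values are neighbors, the observed G exceeds the expected G, which is the situation that produces a positive z-score.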
55. Anselin Local Moran’s I
• Local statistic
• Measures the strength of patterns for
each specific feature.
• Compares the value of each feature in a
pair to the mean value for all features in
the study area.
56. Anselin Local Moran’s I
• Positive I value:
– Feature is surrounded by features with similar values, either high or low.
– Feature is part of a cluster.
– Statistically significant clusters can consist of high values (HH) or low
values (LL)
• Negative I value:
– Feature is surrounded by features with dissimilar values.
– Feature is an outlier.
– Statistically significant outliers can be a feature with a high value
surrounded by features with low values (HL) or a feature with a low
value surrounded by features with high values (LH).
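The sign rules above can be sketched with one common form of the local statistic, I_i = (x_i - mean) × (weighted average of neighbor deviations) / m2, where m2 is the mean squared deviation. This is an illustration with hypothetical values, row-standardized adjacency on a line, and the assumption that every feature has at least one neighbor. It only labels the cluster/outlier type; statistical significance would still require z-scores or permutations.

```python
def local_morans_i(values, weights):
    """Return (local I, type label) per feature; labels are HH, LL, HL, LH."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    m2 = sum(d * d for d in dev) / n
    results = []
    for i in range(n):
        # Average deviation of the neighbors (row-standardized spatial lag)
        lag = sum(weights[i][j] * dev[j] for j in range(n)) / sum(weights[i])
        label = ("H" if dev[i] > 0 else "L") + ("H" if lag > 0 else "L")
        results.append((dev[i] * lag / m2, label))
    return results

# Hypothetical values on a line; neighbors are adjacent locations
values = [10, 9, 10, 1, 2, 10, 1, 2]
n = len(values)
line = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
results = local_morans_i(values, line)
labels = [label for _, label in results]
print(labels)
```

The isolated high value at position 5 is labeled HL with a negative local I, while the high values at the start of the line are labeled HH with a positive local I.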
57. Anselin Local Moran’s I
• The z-scores and p-values are measures of statistical
significance that tell you whether or not to reject the
null hypothesis, feature by feature.
• Indicate whether the apparent similarity (or
dissimilarity) in values for a feature and its neighbors is
greater than one would expect in a random distribution.
http://resources.esri.com/help/9.3/arcgisengine/java/gp_toolref/spatial_statistics_tools/cluster_and_outlier_analysis_colon_anselin_local_moran_s_i_spatial_statistics_.htm