SlideShare a Scribd company logo
1 of 84
An Introduction to Spatial
Analysis in Socio-Economic and
Health Research
Corey S. Sparks, PhD
Department of Demography
The University ofTexas at San Antonio
Outline
 1)What’s special about spatial?
 Good news
 Bad news
 2) Modes of thinking about spatial analysis
 Macro and Micro
 3) Concepts of space
 4) Spatial analysis is NOT statistics, but spatial statistics needs
spatial analysis
 5) Using this for something meaningful
What’s special about spatial?
 Spatial data have more information than ordinary data
 Think of them as a triplet
 Y, X, and Z, whereY is the variable of interest, X is some other
information that influencesY and Z is the geographic location
whereY occurred
 If our data aren’t spatial, we don’t have Z
 Spatial information is a key attribute of behavioral data
 This adds a potentially interesting attribute to any data we
collect
What’s special about spatial?
 Spatial data monkey with models
 Most analytical models have assumptions, spatial structure can
violate these models
 We typically want to jump into modeling, but without
acknowledging or handling directly, spatial data can make our
models meaningless
 Some of the problems are..
The ecological fallacy
 The tendency for aggregate data on a concept to show
correlations when individual data on a concept do not.
 In general the effect of aggregation bias, whereby those
studying macro-level data try to make conclusions or
statements about individual-level behavior
 This also is felt when you analyze data at a specific level, say
counties, your results are only generalizeable at that level,
not at the level of congressional districts, MSA’s or states.
 The often-arbitrary nature of aggregate units also needs to
be considered in such analysis.
The modifiable area unit problem
(MAUP)
 This is akin to the ecological fallacy and the notion of aggregation bias.
 The MAUP occurs when inferences about data change when the spatial
scale of observation is modified.
 i.e. at a county level there may be a significant association between income
and health, but at the state or national level this may become insignificant,
likewise at the individual level we may see the relationship disappear.
 This problem also exists when we suspect that a characteristic of an
aggregate unit is influencing an individual behavior, but because the
level at which aggregate data are available, we are unable to properly
measure the variable at the aggregate level.
 E.g. we suspect that neighborhood crime rates will the recidivism hazard
for a parolee, but we can only get crime rates at the census tract or county
level, so we cannot really measure the effect we want.
Spatial structure
 Structure is the idea that your data have an organization to
them that has a specific spatial dimension
 Think of a square grid
 Each cell in the grid can be though of as being neighbors of
other cells base on their proximity, distance, direction, etc.
 This structure generally influences data by making them non-
independent of one another
 At best, you can have a correlation with your neighbor
 At worst, your characteristics are a linear or nonlinear function
of your neighbors
Spatial heterogeneity
 Spatial heterogeneity is the idea that characteristics of a
population or a sample vary by location
 This can manifest itself by generating clusters of like
observations
 Statistically, this is bad because many models assume
constant variance, but if like observations are spatially co-
incident, then variance is not constant
Modes of thinking about spatial analysis
 Macro
 Observations are areas, or aggregates of individuals
 Processes generally involve interaction among these areas
 We can’t really get at variation within these areas
 Can’t see the people within the county
 All we have are counts of some variable of interest
 Think of you stereotypicalCensus tract, or county
 In social science, we often refer to such analyses as “ecological”
 Analyzing places
 Looking for trends an associations at the population level
Macro models
 An example
 I think the infant mortality rate in US counties is a function of
the socioeconomic status of the county residents
 Furthermore, I expect that characteristics of the built environment
in counties will further influence the infant mortality rate
 I might hypothesize that areas that are more “built up” may have
higher rates of infant mortality than less “built up” places
 My outcome is a rate
 My predictors are other rates
 Everything is measured at the county level
Modes of thinking in spatial analysis
 Micro
 Observations are individuals
 Processes may still involve interactions between
observations, but we often go beyond that
 Look at behaviors of people
 Why do we do what we do?
Multi-level models
 The concept that individuals are nested within a hierarchy and the behavior of
interest is determined by both the characteristics of the individual and the level
of nesting.
 This can be though of broadly as individuals within any hierarchy, such as a city
block, an organization, a villages, or a community.
 The spatial idea of multi-level analysis is that the individual is nested within a
distinct geographic unit.
 Substantively, something about that area is thought to influence our behavior of
interest.
 Sometimes, we have a specific idea how that happens, other times we have a
weak conceptual model
 Statistically these methods correct for the common occurrence of dependence
within aggregate units (i.e. people from the same area may be more alike than
expected by random chance alone) which violates the assumptions of most
linear regression models (iid residuals)
Concepts of space
 To start, let’s think about how to define space as a concept
 i.e. what levels of space are important for human behavior?
 Neighborhoods
 Villages/towns
 Social networks>social “space”
 Households
 Climatic zones
 Political/administrative boundaries
What we really want to get at
 What concept of space is important in my work?
 That’s for you to decide
 More often than not, as social scientists, we will undertake a macro
analysis or some form of nested micro analysis
 At the macro level, we have to conceptualize the space of our units
 Where are counties relative to one another
 Is the spatial heterogeneity in my outcome at the county level
 At the micro level, we have to justify why we believe the level of
nesting we can measure is in fact the right level in terms of our
outcome
 Just because I know what MSA a child lives in, doesn’t tell me much about
the neighborhood in which they grew up
Spheres of social integration
 Individuals exist within formal or more often informal
spheres of interaction.
 You can think of these as areas where common ideas unite
people together, or the range (meaning distance) of social
norms.
 Common ideology often unites individuals into groups and
groups into larger aggregates, look at the concept of
nationalism for example.
 They can likewise have a strictly a-spatial representation, such
as a belief network.
Some parting thoughts about space
 Realization that human behavior is a dynamic process, not a static
phenomena
 Use of the cross-section as opposed to the cohort (longitudinal
data) or with respect to changes in the neighborhood.
 Defining “meaningful” neighborhoods, not just available sampling
areas, how do individuals interact within these areas?
 How do individuals interact with their neighborhood?
 Ideas of agency and how an individual is an agent of change
 Do people choose neighborhoods?Or are some chosen by the
neighborhoods they are born into, or only the ones they can afford?
Context or Composition?
 Is it the places where people live that is important? (Context)
 Or
 Is it the nature of the people that live in the place?
(Composition)
 Again, when justifying a spatially oriented analysis, these
things need to be addressed
Spatial analysis is NOT statistics
 Enter the GIS
 GIS is the manager of spatial information
 Goes beyond the realm of relational databases by incorporating the
spatial context of all data
 Uses this as the overarching infrastructure for organizing
information
 The use of space as a tool
 The GIS allows us to edit, merge, split, add, delete all types of
information based purely on the spatial location of the data
 But, it is not statistics!
GIS operations as spatial analysis
 We use the GIS to help us manage and visualize spatial data
 There are thousands of specific tools offered in a standard GIS
 We can organize the tools of spatial analysis by what type of data
they use:
 Point patterns
 Areas (polygons)
 Images (raster)
 Interactions
 Networks
 Sure I’m missing something
Spatial statistics needs spatial analysis
 1)Without spatial analysis, we would miss out on the
interactions between our spatial information
 Unable to construct important variables for statistical analysis
 2) Without spatial analysis, we would be limited in our
visualization of our data, and our results
 From maps to 3-D visualizations
Wrap Up
 Spatial data is special
 For better or for worse
 Ignoring spatial structure can invalidate models
Exploratory Spatial Data
Analysis
Corey S. Sparks, PhD
Department of Demography
The University ofTexas at San Antonio
Exploratory Spatial DataAnalysis
What is ESDA?
 In exploratory data analysis, we are looking for trends and patterns
in data
 We are working under the assumption that the more one knows
about the data, the more effectively it may be used to develop,
test and refine theory
 This generally requires we follow two principles: skepticism and
openness
 One should be skeptical of measures that summarize data, since
they can conceal or misrepresent the most informative aspects of
data,
 and..
 We must be open to patterns in the data that we did not expect to
find, because these can often be the most revealing outcomes of
the analysis.
 We must avoid the temptation to automatically jump to the
confirmatory model of data analysis.
 The confirmatory model says: Do my data confirm the hypothesis
that X causesY.
 We would normally fit a linear model (or something close to it) and
use summary measures (means and variances) to test if the
pattern we observe in the data is real.
 The exploratory model says,
 To the contrary, “What do the data I have tell me about the
relationship between X andY. This lends to a much more open
range of alternative explanations
 The principle of EDA is explained best in the simple model:
 Data = smooth + rough
 The smooth bit is the underlying simplified structure of a set of observations.
This is the straight line that we expect a relationship to follow in linear
regression for example, or the smooth curve describing the distribution of
data: the pattern or regularity in the data.
 You can also think of the smooth as a central parameter of a distribution, the
mean or median for example
 The rough is the difference between our actual data, and the smooth. This
can be measured as the deviation of each point from the mean.
 Outliers, for example are very rough data, they have a large difference from
the central tendency in a distribution, where all the other data points tend to
clump
Doing some ESDA
 We will use new tools (namelyGeoDa) to do some Spatial EDA.
 The way GeoDa differs from traditional ways of viewing data (on paper,
or with summary statistics), is that it goes beyond presenting the data, to
visualizing the data.
 Data visualization is a dynamic process that incorporates multiple views
of the same information,
 i.e. we can examine a histogram that is linked to a scatter plot that is
linked to a map to visualize the distribution of a single variable, how it is
related to another and how it is patterned over space.
 We can also do what is called brushing the data, so we can select subsets
of the information and see if, there are distinct subsets of the data that
have different relational or spatial properties. This allows us to really
visualize how the processes we study unfolds over space and allows us to
see locations of potentially influential observations.
Some examples of ESDA items
 Histograms
 Boxplots
 Scatter plots
 Parallel coordinate plots
 Area statistics/Local statistics
Histograms
 Histograms are useful
tools for representing the
distribution of data
 They can be used to judge
central tendency (mean,
median, mode), variation
(standard deviation,
variance), modality
(unimodal or multi-
modal), and graphical
assessment of
distributional
assumptions (Normality)
Box Plots
 Box plots (or box and whisker
plots) are another useful tool for
examining the distribution of a
variable
 You can visualize the 5 number
summary of a variable
 Miniumum, Maximum, lower
quartile, Median, and upper
quartile
Upper
Quartile
Median
Lower
Quartile
Maximum
Minimum
IQR
Scatter Plots
 Scatterplots show
bivariate relationships
 This can give you a visual
indication of an
association between two
variables
 Positive association
(positive correlation)
 Negative association
(negative correlation)
 Also allows you to see
potential outliers
(abnormal observations)
in the data
Slight
positive
association
Potential Outlier
Parallel Coordinate
Plots
 Parallel coordinate
plots allow for
visualization of the
association between
multiple variables
(multivariate)
 Each variable is
plotted according to
its “coordinates” or
values
Local Statistics
 Local statistics allow
you to see how the
mean, or variation, of
a variable varies over
space
 You can generate a
local statistic by
weighting an ordinary
statistic by some kind
of spatial weight
 Example ->
 Property crime in San
Antonio,TX
Global vs. Local Statistics
 By global we imply that one statistic is used to adequately summarize
the data
 i.e. the mean or median
 Or, a regression model that is suitable for all areas in the data
 Local statistics are useful when the process you are studying varies over
space,
 i.e. different areas have different local values that might cluster together to
form a local deviation from the overall mean
 Or a regression model that accounts for the same level of variation in the
outcome in all locations
Autocorrelation
 This can occur in either space or time
 Really boils down to the non-independence between neighboring
values
 The values of our independent variable (or our dependent
variables) may be similar because
 Our values occur
 closely in time (temporal autocorrelation)
 closely in space (spatial autocorrelation)
Preliminaries to assessing autocorrelation
 Basic Assessment of Spatial Dependency
 Before we can model the dependency in spatial data, we must
first cover the ideas of creating and modeling neighborhoods in
our data.
 By neighborhoods, I mean the clustering or connectedness of
observations
 The exploratory methods we will cover depend on us knowing
how our data are arranged in space, who is next to who.
 This is important (as we will see later) because most correlation
in spatial data tends to die out as we get further away from a
specific location (Tobler’s law)
Tobler's first law of geography
 WaldoTobler (1970) suggested the “jokingly” first law of
geography
 “Everything is related to everything else, but near things are more
related than distant things”
 We can see this better in graphical form: We expect the
correlation between the attributes of two points to diminish as the
distance between them grows
One thing about autocorrelation
 Autocorrelation is
typically a local process
 Meaning it typically
dies out as distance
between observations
increase
 plot(exp(-.05*seq(1:100)), type="l", lwd=2,
ylab="Correlation", xlab="Distance")
W
 To identify which observations are close to one another, we useW
 W is the spatial weight matrix
 Square n by n matrix with 1/0 entries identifying if any two
observations are spatial neighbors
 How is this constructed?
 There are two typical ways in which we measure spatial
relationships
 Distance and contiguity
 In a distance based connectivity method, features (generally
points) are considered to be contiguous if they are within a given
radius of another point.The radius is really left up to the researcher
to decide.
 Likewise, we can calculate the distance matrix between a set of
points
 This is usually measured using the standard Euclidean distance
 Where x and y are coordinates (lat/long) of the point or polygon in
question, this is the as the crow flies distance
Distance based neighbors
More on distances
 There are lots of distance measures
 Manhattan distances
 d = |x1 – x2| + |y1- y2|
 These are city-block distances
 We can use these distances to create a neighborhood of points
using certain criteria
Who is who's neighbor
 There are many different criteria for deciding if two observations
are neighbors
 Generally two observations must be within a critical distance, d, to
be considered neighbors.
 This is the Minimum distance criteria, and is very popular.
 This will generate a matrix of binary variables describing the
neighborhood.
 We can also describe the neighborhoods in a continuous weighting
scheme based on the distance between them
Inverse Distance Weight wij=
1
dij
or
Inverse Squared Distance Weight wij=
1
dij
2
Contiguity/PolygonAdjacency
 Polygons are contiguous if they share common topology, like
an edge (line segment) or a vertex (point)
 Neighborhoods are created based on which observations are
judged “contiguous”
 This is generally the best way to treat polygon features
 Distances aren’t typically used for polygons for several
reasons
 Spacing of observations
 What centroid? Is it a good measure?
• Rook adjacency
• Neighbors must share
a line segment
• Queen adjacency
• Neighbors must share
a vertex or a line
segment
• If polygons share these
boundaries (based on the
specific definition: rook or
queen), they are given a
weight of 1 (adjacent), else
they are given a value 0,
(nonadjacent)
Order of adjacency
Observation of interest
First order neighbors (Rook)
Second Order neighbors (Rook)
Observation of interest
First order neighbors (Queen)
Second Order neighbors (Queen)
First and second order
Rook adjacency
First and second order
Queen adjacency
What does a spatial weight matrix look like?
Set of polygons
SpatialWeight Matrix,W
StandardizedW
 Most times, in analytical work,W is standardized
 Typically this is done by dividing each 1/0 element in the row
by the sum of the row, which would yield this matrix:
Polygon 1 2 3 4
1 0 0.5 0.5 0
2 0.5 0 0 0.5
3 0.5 0 0 0.5
4 0 0.5 0.5 0
Forms of autocorrelation
 Positive autocorrelation
 This means that a feature is positively associated with the values
of the surrounding area (as defined by the spatial weight matrix),
high values occur with high values, low with low
 Negative autocorrelation
 This means that a feature is negatively associated with the values
of the surrounding area (as defined by the spatial weight matrix),
high with low, low with high
 The (probably) most popular global autocorrelation statistic is
Moran’s I
Measuring spatial autocorrelation:
Common measures
 Moran’s I
 Geary’s C
 Getis-Ord G
 All typically give similar results
Moran's I
 Measure of standardized spatial autocovariance
 with xi being the value of the attribute at location i, xj being the value of the
attribute at location j,
 S0 is the sum of all spatial weights
 wij is the weight for location ij (0 if they are not neighbors, 1 otherwise)
 Approximately bound on 0,1 with a similar interpretation as a Pearson or
Spearman correlation
Geary’s C
 Measure of spatial covariance
 with xi being the value of the attribute at location i, xj being the
value of the attribute at location j,
 S0 is the sum of all spatial weights
 wij is the weight for location ij (0 if they are not neighbors, 1
otherwise)
 Bound on 0,2
 0 to 1 = positive autocorrelation
 1= no autocorrelation
 1 to 2=negative autocorrelation
Getis – Ord G
 Unlike I and C, the interpretation of G focuses on whether
high values of x tend to cluster with other high values of x
 The measure is directional
 Likewise, low with low
Local Moran’s I
 We can also describe the local trends in autocorrelation in our
data by constructing local versions of our autocorrelation
statistics
 Each has a local version
 This is useful to see where the autocorrelation is high or low
within our data
 Goes beyond the global statistics to help us visualize where in
our data associations are strong
How do you determine if the data
are in a cluster?
 This is done via the Moran scatterplot.
 If an observation is in the top right quadrant of the plot =
high-high
 Bottom left quadrant = low – low
 Top left quadrant = high – low
 Bottom right quadrant = low-high
Moran scatterplot (univariate)
 It is sometimes useful to visualize the relationship between the actual
values of the dependent variable and their lagged values. This is the so
called
 Moran scatterplot
 Lagged values are the average value of the surrounding neighborhood
around location I
 lag(x) = wij*x = Wx in matrix terms
 The Moran scatterplot shows the association between the observation of
interest and its neighborhood's average value
 The variables are generally plotted as z-scores, to avoid scaling issues
 Moran
scatterplot for
San Antonio
neighborhood
deprivation
 I = .688
 High Positive
autocorrelation
 Local Moran
cluster map
 Significant high
neighborhood
deprivation on
the west, east
and south sides
 Significant low
neighborhood
deprivation on
the north side
Spatially lagged values
 If we have a value xi at location i and a spatial weight matrix wij
describing the spatial neighborhood around location i, we can find the
lagged value of the variable by:
 Wxi = xi * wij
 This calculates what is effectively, the neighborhood average value in
locations around location i, often stated x-i
Moran scatterplot (Bivariate)
 We can also compare the lagged value of one variable verses the
value of another variable
 Wy versus x
 This is the so-called “multivariate Moran scatterplot”
 It is often useful in so-called space-time analysis, when we
compare the value of one variable measured at two different time
points. This will show the correlation between space over time.
 Positive autocorrelation
between a tract’s
deprivation and the
average minority
concentration in the
neighboring tracts
 I = .537
 What these methods tell you
 Moran's I is a descriptive statistic
 It simply indicates if there is spatial association/autocorrelation
in a variable
 Local Moran's I tells you if there is significant localized
clustering of the variable
 Where spatial clusters and what type of cluster is present
Two examples of spatial
analysis and statistics
Goals
 To show how you can use R, software that is free and
available in the RDC to:
 1) Perform interesting spatially-oriented analysis
 2) Merge spatial data from multiple sources
 Spatial join
 3)Visualize results from a spatial analysis much as you would in
a GIS environment
Without a GIS!
Residential Segregation and Crime in San
Antonio,TX
 Four data sources:
 American Community Survey 5 year summary file
 Census tract level
 Census 2000 Summary File 1
 Block level
 SanAntonio Police Department call data
 Point data based on geocoded addresses
 All calls received by the SAPD in a year
 CensusTIGER file for SanAntonio tracts
Methods
 1)Merge ACS calculated fields to shapefile based on tract FIPS code
 2) Perform spatial join of crimes to tracts
 Count # of crimes in each tract
 3) Create a thematic map of the crime rate using quantile breaks
 4) Create tract-level indices of residential segregation by
aggregating up from the block level
 Merge these to the shapefile
 5) Perform ESDA using Local Moran statistics on the segregation
index and crime rate
Thematic Map/Histogram
Local Moran Cluster Maps
Spatial Disease Mapping
 Application to crime data, but principles remain the same
 Trying to locate areas of excess disease risk
 Finding spatial clusters of disease
 Kulldorff and Nagarwalla’s Spatial Scan Statistic
 DCluster library
 Modeling spatial relative risk
 Bayesian Hierarchical Regression
 Posterior model summaries
 INLA library
 Approximate Bayesian inference by Integrated Nested Laplace
Approximation
 Great for Gaussian random field models
Methods
 Spatial Scan Statistic
 Constructs a grid over the area, centered on each observation
 Using a set radius (defined based on a proportion of the
population size at risk), a circle is created on each point
 The likelihood function
 is used, and the area with the maximum value of the function is
the most likely cluster of cases
 Allows for ranked clusters (primary, secondary, etc)
Scan Statistic
You know where this is?
The Riverwalk
Take home message:
Watch your wallet!
Basics of Bayesian Modeling
 Some statistical ideas:
 Likelihood
 Usually we have some data that we assume follow some distribution
 For counts, maybeY ~ Poisson(λ)
 L (y, λ) is the joint probability distribution of the data and the
parameters
 The model likelihood is the product of this distribution for all
observations
 We get the estimate of λ that is the most likely to have generated our
data (y)
 Maximum likelihood estimate of λ
 This is traditional statistics
Bayesian statistics
 Now, lets consider the case where we have some knowledge
about λ, say we believe it comes from a certain distribution,
so :
 λ ~ Gamma(α,β)
 And we can either assume some value for these parameters
(α=β=1) or allow them to also come from some distribution
 α~Exp(u) ,β~Exp(p)
 Why a Gamma distn? Because Gamma is >0, and we know θ is
>0 because it is a relative risk
Combining information
 When we combine the likelihood and the prior, we form what
is called the posterior distribution
 Thus we have BayesTheorem which in the continuous case
is:
 Which states, the posterior distribution of θ, conditional on y
is the product of the likelihood and the prior distribution of θ
 The denominator in Bayes theorem is a constant, and this is
generally written as:
 Which says the posterior is proportional to the likelihood
times the prior
A simple Poisson – Gamma model
Simple Models
 Let’s now consider a simple realistic model
 yi~Poisson(ei θi)
 Assume the usual log link function
 log (θi) = C1+C2
 C1=a0+x’β
 Linear Poisson regression with intercept
 C2=other terms
 Could add random effect - > this would be like an individual frailty model, or a
group frailty model or a random intercept model
 Generally this is called a Generalized Linear Mixed Model
 Now we just need to code it up!
Mapping and spatial modeling
 We just achieved a better, more accurate SMR map with a
simple Poisson model with a random effect
 This model “drew strength” from counties with good data to
help smooth counties with poor data, but it didn’t account for
correlation between neighboring counties
 We can also build a model that includes a spatially correlated
random effect between counties
 Spatially smoothed rates
Building a Spatial Model
 Take a Poisson-Lognormal model
 y~Pois(ei θi)
 θi=α+ x’β + ui
 What if we introduce more interesting structure on u?
 A common form of spatial correlation structure is the CAR model, or
also called the Besag model
 This is:
More models
 Another model that is commonly used in practice is the
convolution model, or the Besag,York and Mollie model
 y~Pois(ei θi)
 θi=α+ x’β + ui + vi
 Where now ui is a correlated heterogeneity (CH) term and vi
is an Uncorrelated Heterogeneity (UH) term
Posterior Parameter Estimates
Why Use R?
 R is free (So are GeoDA and QGIS)
 R is available in the RDC (So are GeoDA and QGIS)
 R is extremely capable and flexible (So is QGIS)
 R is a scripting and statistical language (Neither GeoDA or
QGIS are)
 R is one stop shopping for many geospatial techniques
 Spatial joins, projections, data merging, raster analysis, vector
operations

More Related Content

What's hot

Social network analysis & Big Data - Telecommunications and more
Social network analysis & Big Data - Telecommunications and moreSocial network analysis & Big Data - Telecommunications and more
Social network analysis & Big Data - Telecommunications and more
Wael Elrifai
 

What's hot (19)

09 Ego Network Analysis
09 Ego Network Analysis09 Ego Network Analysis
09 Ego Network Analysis
 
07 Whole Network Descriptive Statistics
07 Whole Network Descriptive Statistics07 Whole Network Descriptive Statistics
07 Whole Network Descriptive Statistics
 
Hybrid sentiment and network analysis of social opinion polarization icoict
Hybrid sentiment and network analysis of social opinion polarization   icoictHybrid sentiment and network analysis of social opinion polarization   icoict
Hybrid sentiment and network analysis of social opinion polarization icoict
 
15 Network Visualization and Communities
15 Network Visualization and Communities15 Network Visualization and Communities
15 Network Visualization and Communities
 
03 Ego Network Analysis
03 Ego Network Analysis03 Ego Network Analysis
03 Ego Network Analysis
 
Ties strength, similarities and surprise to disrupt the echo chamber effect
Ties strength, similarities and surprise to disrupt the echo chamber effectTies strength, similarities and surprise to disrupt the echo chamber effect
Ties strength, similarities and surprise to disrupt the echo chamber effect
 
Social network analysis & Big Data - Telecommunications and more
Social network analysis & Big Data - Telecommunications and moreSocial network analysis & Big Data - Telecommunications and more
Social network analysis & Big Data - Telecommunications and more
 
10 Strength Of Weak Ties
10 Strength Of Weak Ties10 Strength Of Weak Ties
10 Strength Of Weak Ties
 
CSE509 Lecture 6
CSE509 Lecture 6CSE509 Lecture 6
CSE509 Lecture 6
 
Mining and analyzing social media part 2 - hicss47 tutorial - dave king
Mining and analyzing social media   part 2 - hicss47 tutorial - dave kingMining and analyzing social media   part 2 - hicss47 tutorial - dave king
Mining and analyzing social media part 2 - hicss47 tutorial - dave king
 
The strength of weak ties
The strength of weak tiesThe strength of weak ties
The strength of weak ties
 
The Effect of Correlation Coefficients on Communities of Recommenders
The Effect of Correlation Coefficients on Communities of RecommendersThe Effect of Correlation Coefficients on Communities of Recommenders
The Effect of Correlation Coefficients on Communities of Recommenders
 
Visualizing Big Data - Social Network Analysis
Visualizing Big Data - Social Network AnalysisVisualizing Big Data - Social Network Analysis
Visualizing Big Data - Social Network Analysis
 
Enhanced ability of information gathering may intensify disagreement among gr...
Enhanced ability of information gathering may intensify disagreement among gr...Enhanced ability of information gathering may intensify disagreement among gr...
Enhanced ability of information gathering may intensify disagreement among gr...
 
Social Network Analysis: An Overview
Social Network Analysis: An OverviewSocial Network Analysis: An Overview
Social Network Analysis: An Overview
 
06 Regression with Networks – EGO Networks and Randomization (2017)
06 Regression with Networks – EGO Networks and Randomization (2017)06 Regression with Networks – EGO Networks and Randomization (2017)
06 Regression with Networks – EGO Networks and Randomization (2017)
 
Current trends of opinion mining and sentiment analysis in social networks
Current trends of opinion mining and sentiment analysis in social networksCurrent trends of opinion mining and sentiment analysis in social networks
Current trends of opinion mining and sentiment analysis in social networks
 
Complex Adaptive Systems and Communities
Complex Adaptive Systems and CommunitiesComplex Adaptive Systems and Communities
Complex Adaptive Systems and Communities
 
Social Network, Metrics and Computational Problem
Social Network, Metrics and Computational ProblemSocial Network, Metrics and Computational Problem
Social Network, Metrics and Computational Problem
 

Viewers also liked (7)

Rotterdam 2013
Rotterdam 2013Rotterdam 2013
Rotterdam 2013
 
Dissenys d estudis epidemiologics
Dissenys d estudis epidemiologicsDissenys d estudis epidemiologics
Dissenys d estudis epidemiologics
 
Logic of social inquiry
Logic of social inquiryLogic of social inquiry
Logic of social inquiry
 
Ecological study
Ecological studyEcological study
Ecological study
 
Ecological Study
Ecological StudyEcological Study
Ecological Study
 
Bias in health research
Bias in health researchBias in health research
Bias in health research
 
Bias, confounding and fallacies in epidemiology
Bias, confounding and fallacies in epidemiologyBias, confounding and fallacies in epidemiology
Bias, confounding and fallacies in epidemiology
 

Similar to Spatial statistics presentation Texas A&M Census RDC

Process_Method_versus_The_Hypothesis_Method
Process_Method_versus_The_Hypothesis_MethodProcess_Method_versus_The_Hypothesis_Method
Process_Method_versus_The_Hypothesis_Method
Richard Wilkie
 
organic information design
organic information designorganic information design
organic information design
rosa qin
 
2013.05.04 deel 1.1 overtoom
2013.05.04 deel 1.1 overtoom2013.05.04 deel 1.1 overtoom
2013.05.04 deel 1.1 overtoom
Praktijkleerstoel
 
Karenclassslides
KarenclassslidesKarenclassslides
Karenclassslides
kcarter14
 
Patterns based information systems organization (@ InFuture 2015)
Patterns based information systems organization (@ InFuture 2015) Patterns based information systems organization (@ InFuture 2015)
Patterns based information systems organization (@ InFuture 2015)
Sergej Lugovic
 

Similar to Spatial statistics presentation Texas A&M Census RDC (20)

Theory of Mind: A Neural Prediction Problem
Theory of Mind: A Neural Prediction ProblemTheory of Mind: A Neural Prediction Problem
Theory of Mind: A Neural Prediction Problem
 
Spatial Essay
Spatial EssaySpatial Essay
Spatial Essay
 
Spatial data analysis 2
Spatial data analysis 2Spatial data analysis 2
Spatial data analysis 2
 
Introduction to Computational Social Science
Introduction to Computational Social ScienceIntroduction to Computational Social Science
Introduction to Computational Social Science
 
Process_Method_versus_The_Hypothesis_Method
Process_Method_versus_The_Hypothesis_MethodProcess_Method_versus_The_Hypothesis_Method
Process_Method_versus_The_Hypothesis_Method
 
SN- Lecture 3
SN- Lecture 3SN- Lecture 3
SN- Lecture 3
 
organic information design
organic information designorganic information design
organic information design
 
Spatial Zones And Body Language
Spatial Zones And Body LanguageSpatial Zones And Body Language
Spatial Zones And Body Language
 
Geographic Information: Aspects of Phenomenology and Cognition
Geographic Information: Aspects of Phenomenology and CognitionGeographic Information: Aspects of Phenomenology and Cognition
Geographic Information: Aspects of Phenomenology and Cognition
 
Sociological Methodology
Sociological MethodologySociological Methodology
Sociological Methodology
 
Building Simulations from Expert Knowledge: Understanding Needle Sharing Beha...
Building Simulations from Expert Knowledge: Understanding Needle Sharing Beha...Building Simulations from Expert Knowledge: Understanding Needle Sharing Beha...
Building Simulations from Expert Knowledge: Understanding Needle Sharing Beha...
 
Making our mark: the important role of social scientists in the ‘era of big d...
Making our mark: the important role of social scientists in the ‘era of big d...Making our mark: the important role of social scientists in the ‘era of big d...
Making our mark: the important role of social scientists in the ‘era of big d...
 
APLIC 2014 - Social Observatories Coordinating Network
APLIC 2014 - Social Observatories Coordinating NetworkAPLIC 2014 - Social Observatories Coordinating Network
APLIC 2014 - Social Observatories Coordinating Network
 
Essay Writing Skill
Essay Writing SkillEssay Writing Skill
Essay Writing Skill
 
2013.05.04 deel 1.1 overtoom
2013.05.04 deel 1.1 overtoom2013.05.04 deel 1.1 overtoom
2013.05.04 deel 1.1 overtoom
 
Spatial thinking in Geography.pptx
Spatial thinking in Geography.pptxSpatial thinking in Geography.pptx
Spatial thinking in Geography.pptx
 
Mathematics and Social Networks
Mathematics and Social NetworksMathematics and Social Networks
Mathematics and Social Networks
 
Karenclassslides
KarenclassslidesKarenclassslides
Karenclassslides
 
Patterns based information systems organization (@ InFuture 2015)
Patterns based information systems organization (@ InFuture 2015) Patterns based information systems organization (@ InFuture 2015)
Patterns based information systems organization (@ InFuture 2015)
 
The Complexity of Data: Computer Simulation and “Everyday” Social Science
The Complexity of Data: Computer Simulation and “Everyday” Social ScienceThe Complexity of Data: Computer Simulation and “Everyday” Social Science
The Complexity of Data: Computer Simulation and “Everyday” Social Science
 

More from Corey Sparks

More from Corey Sparks (10)

Dem 7263 fall 2015 Spatial GLM's
Dem 7263 fall 2015 Spatial GLM'sDem 7263 fall 2015 Spatial GLM's
Dem 7263 fall 2015 Spatial GLM's
 
Dem 7263 fall 2015 spatially autoregressive models 1
Dem 7263 fall 2015   spatially autoregressive models 1Dem 7263 fall 2015   spatially autoregressive models 1
Dem 7263 fall 2015 spatially autoregressive models 1
 
Demography 7263 fall 2015 spatially autoregressive models 2
Demography 7263 fall 2015 spatially autoregressive models 2Demography 7263 fall 2015 spatially autoregressive models 2
Demography 7263 fall 2015 spatially autoregressive models 2
 
Sparks & Sparks Spatiotemporal persistence of residential segregation SSSA 2013
Sparks & Sparks Spatiotemporal persistence of residential segregation SSSA 2013Sparks & Sparks Spatiotemporal persistence of residential segregation SSSA 2013
Sparks & Sparks Spatiotemporal persistence of residential segregation SSSA 2013
 
Sparks and Valencia PAA 2014 session107
Sparks and Valencia PAA 2014 session107Sparks and Valencia PAA 2014 session107
Sparks and Valencia PAA 2014 session107
 
Infant mortality paper
Infant mortality paperInfant mortality paper
Infant mortality paper
 
Campbell sparkspaa12
Campbell sparkspaa12Campbell sparkspaa12
Campbell sparkspaa12
 
R meet up slides.pptx
R meet up slides.pptxR meet up slides.pptx
R meet up slides.pptx
 
San Antonio Food Insecurity Assessment
San Antonio Food Insecurity AssessmentSan Antonio Food Insecurity Assessment
San Antonio Food Insecurity Assessment
 
A socio ecological model of injury mortality in Texas using Bayesian modesl
A socio ecological model of injury mortality in Texas using Bayesian modeslA socio ecological model of injury mortality in Texas using Bayesian modesl
A socio ecological model of injury mortality in Texas using Bayesian modesl
 

Recently uploaded

development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
NazaninKarimi6
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
Scintica Instrumentation
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
Silpa
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
 

Recently uploaded (20)

GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
Stages in the normal growth curve
Stages in the normal growth curveStages in the normal growth curve
Stages in the normal growth curve
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdf
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 

Spatial statistics presentation Texas A&M Census RDC

  • 1. An Introduction to Spatial Analysis in Socio-Economic and Health Research Corey S. Sparks, PhD Department of Demography The University ofTexas at San Antonio
  • 2. Outline  1)What’s special about spatial?  Good news  Bad news  2) Modes of thinking about spatial analysis  Macro and Micro  3) Concepts of space  4) Spatial analysis is NOT statistics, but spatial statistics needs spatial analysis  5) Using this for something meaningful
  • 3. What’s special about spatial?  Spatial data have more information than ordinary data  Think of them as a triplet  Y, X, and Z, whereY is the variable of interest, X is some other information that influencesY and Z is the geographic location whereY occurred  If our data aren’t spatial, we don’t have Z  Spatial information is a key attribute of behavioral data  This adds a potentially interesting attribute to any data we collect
  • 4. What’s special about spatial?  Spatial data monkey with models  Most analytical models have assumptions, spatial structure can violate these models  We typically want to jump into modeling, but without acknowledging or handling directly, spatial data can make our models meaningless  Some of the problems are..
  • 5. The ecological fallacy  The tendency for aggregate data on a concept to show correlations when individual data on a concept do not.  In general the effect of aggregation bias, whereby those studying macro-level data try to make conclusions or statements about individual-level behavior  This also is felt when you analyze data at a specific level, say counties, your results are only generalizeable at that level, not at the level of congressional districts, MSA’s or states.  The often-arbitrary nature of aggregate units also needs to be considered in such analysis.
  • 6. The modifiable area unit problem (MAUP)  This is akin to the ecological fallacy and the notion of aggregation bias.  The MAUP occurs when inferences about data change when the spatial scale of observation is modified.  i.e. at a county level there may be a significant association between income and health, but at the state or national level this may become insignificant, likewise at the individual level we may see the relationship disappear.  This problem also exists when we suspect that a characteristic of an aggregate unit is influencing an individual behavior, but because the level at which aggregate data are available, we are unable to properly measure the variable at the aggregate level.  E.g. we suspect that neighborhood crime rates will the recidivism hazard for a parolee, but we can only get crime rates at the census tract or county level, so we cannot really measure the effect we want.
  • 7. Spatial structure  Structure is the idea that your data have an organization to them that has a specific spatial dimension  Think of a square grid  Each cell in the grid can be though of as being neighbors of other cells base on their proximity, distance, direction, etc.  This structure generally influences data by making them non- independent of one another  At best, you can have a correlation with your neighbor  At worst, your characteristics are a linear or nonlinear function of your neighbors
  • 8. Spatial heterogeneity  Spatial heterogeneity is the idea that characteristics of a population or a sample vary by location  This can manifest itself by generating clusters of like observations  Statistically, this is bad because many models assume constant variance, but if like observations are spatially co- incident, then variance is not constant
  • 9. Modes of thinking about spatial analysis  Macro  Observations are areas, or aggregates of individuals  Processes generally involve interaction among these areas  We can’t really get at variation within these areas  Can’t see the people within the county  All we have are counts of some variable of interest  Think of you stereotypicalCensus tract, or county  In social science, we often refer to such analyses as “ecological”  Analyzing places  Looking for trends an associations at the population level
  • 10. Macro models  An example  I think the infant mortality rate in US counties is a function of the socioeconomic status of the county residents  Furthermore, I expect that characteristics of the built environment in counties will further influence the infant mortality rate  I might hypothesize that areas that are more “built up” may have higher rates of infant mortality than less “built up” places  My outcome is a rate  My predictors are other rates  Everything is measured at the county level
  • 11. Modes of thinking in spatial analysis  Micro  Observations are individuals  Processes may still involve interactions between observations, but we often go beyond that  Look at behaviors of people  Why do we do what we do?
  • 12. Multi-level models  The concept that individuals are nested within a hierarchy and the behavior of interest is determined by both the characteristics of the individual and the level of nesting.  This can be though of broadly as individuals within any hierarchy, such as a city block, an organization, a villages, or a community.  The spatial idea of multi-level analysis is that the individual is nested within a distinct geographic unit.  Substantively, something about that area is thought to influence our behavior of interest.  Sometimes, we have a specific idea how that happens, other times we have a weak conceptual model  Statistically these methods correct for the common occurrence of dependence within aggregate units (i.e. people from the same area may be more alike than expected by random chance alone) which violates the assumptions of most linear regression models (iid residuals)
  • 13. Concepts of space  To start, let’s think about how to define space as a concept  i.e. what levels of space are important for human behavior?  Neighborhoods  Villages/towns  Social networks>social “space”  Households  Climatic zones  Political/administrative boundaries
  • 14. What we really want to get at  What concept of space is important in my work?  That’s for you to decide  More often than not, as social scientists, we will undertake a macro analysis or some form of nested micro analysis  At the macro level, we have to conceptualize the space of our units  Where are counties relative to one another  Is the spatial heterogeneity in my outcome at the county level  At the micro level, we have to justify why we believe the level of nesting we can measure is in fact the right level in terms of our outcome  Just because I know what MSA a child lives in, doesn’t tell me much about the neighborhood in which they grew up
  • 15. Spheres of social integration  Individuals exist within formal or more often informal spheres of interaction.  You can think of these as areas where common ideas unite people together, or the range (meaning distance) of social norms.  Common ideology often unites individuals into groups and groups into larger aggregates, look at the concept of nationalism for example.  They can likewise have a strictly a-spatial representation, such as a belief network.
  • 16. Some parting thoughts about space  Realization that human behavior is a dynamic process, not a static phenomena  Use of the cross-section as opposed to the cohort (longitudinal data) or with respect to changes in the neighborhood.  Defining “meaningful” neighborhoods, not just available sampling areas, how do individuals interact within these areas?  How do individuals interact with their neighborhood?  Ideas of agency and how an individual is an agent of change  Do people choose neighborhoods?Or are some chosen by the neighborhoods they are born into, or only the ones they can afford?
  • 17. Context or Composition?  Is it the places where people live that is important? (Context)  Or  Is it the nature of the people that live in the place? (Composition)  Again, when justifying a spatially oriented analysis, these things need to be addressed
  • 18. Spatial analysis is NOT statistics  Enter the GIS  GIS is the manager of spatial information  Goes beyond the realm of relational databases by incorporating the spatial context of all data  Uses this as the overarching infrastructure for organizing information  The use of space as a tool  The GIS allows us to edit, merge, split, add, delete all types of information based purely on the spatial location of the data  But, it is not statistics!
  • 19. GIS operations as spatial analysis  We use the GIS to help us manage and visualize spatial data  There are thousands of specific tools offered in a standard GIS  We can organize the tools of spatial analysis by what type of data they use:  Point patterns  Areas (polygons)  Images (raster)  Interactions  Networks  Sure I’m missing something
  • 20. Spatial statistics needs spatial analysis  1)Without spatial analysis, we would miss out on the interactions between our spatial information  Unable to construct important variables for statistical analysis  2) Without spatial analysis, we would be limited in our visualization of our data, and our results  From maps to 3-D visualizations
  • 21. Wrap Up  Spatial data is special  For better or for worse  Ignoring spatial structure can invalidate models
  • 22.
  • 23. Exploratory Spatial Data Analysis Corey S. Sparks, PhD Department of Demography The University ofTexas at San Antonio
  • 24. Exploratory Spatial DataAnalysis What is ESDA?  In exploratory data analysis, we are looking for trends and patterns in data  We are working under the assumption that the more one knows about the data, the more effectively it may be used to develop, test and refine theory  This generally requires we follow two principles: skepticism and openness
  • 25.  One should be skeptical of measures that summarize data, since they can conceal or misrepresent the most informative aspects of data,  and..  We must be open to patterns in the data that we did not expect to find, because these can often be the most revealing outcomes of the analysis.  We must avoid the temptation to automatically jump to the confirmatory model of data analysis.
  • 26.  The confirmatory model says: Do my data confirm the hypothesis that X causesY.  We would normally fit a linear model (or something close to it) and use summary measures (means and variances) to test if the pattern we observe in the data is real.  The exploratory model says,  To the contrary, “What do the data I have tell me about the relationship between X andY. This lends to a much more open range of alternative explanations
  • 27.  The principle of EDA is explained best in the simple model:  Data = smooth + rough  The smooth bit is the underlying simplified structure of a set of observations. This is the straight line that we expect a relationship to follow in linear regression for example, or the smooth curve describing the distribution of data: the pattern or regularity in the data.  You can also think of the smooth as a central parameter of a distribution, the mean or median for example  The rough is the difference between our actual data, and the smooth. This can be measured as the deviation of each point from the mean.  Outliers, for example are very rough data, they have a large difference from the central tendency in a distribution, where all the other data points tend to clump
  • 28. Doing some ESDA  We will use new tools (namelyGeoDa) to do some Spatial EDA.  The way GeoDa differs from traditional ways of viewing data (on paper, or with summary statistics), is that it goes beyond presenting the data, to visualizing the data.  Data visualization is a dynamic process that incorporates multiple views of the same information,  i.e. we can examine a histogram that is linked to a scatter plot that is linked to a map to visualize the distribution of a single variable, how it is related to another and how it is patterned over space.  We can also do what is called brushing the data, so we can select subsets of the information and see if, there are distinct subsets of the data that have different relational or spatial properties. This allows us to really visualize how the processes we study unfolds over space and allows us to see locations of potentially influential observations.
  • 29. Some examples of ESDA items  Histograms  Boxplots  Scatter plots  Parallel coordinate plots  Area statistics/Local statistics
  • 30. Histograms  Histograms are useful tools for representing the distribution of data  They can be used to judge central tendency (mean, median, mode), variation (standard deviation, variance), modality (unimodal or multi- modal), and graphical assessment of distributional assumptions (Normality)
  • 31. Box Plots  Box plots (or box and whisker plots) are another useful tool for examining the distribution of a variable  You can visualize the 5 number summary of a variable  Miniumum, Maximum, lower quartile, Median, and upper quartile Upper Quartile Median Lower Quartile Maximum Minimum IQR
  • 32. Scatter Plots  Scatterplots show bivariate relationships  This can give you a visual indication of an association between two variables  Positive association (positive correlation)  Negative association (negative correlation)  Also allows you to see potential outliers (abnormal observations) in the data Slight positive association Potential Outlier
  • 33. Parallel Coordinate Plots  Parallel coordinate plots allow for visualization of the association between multiple variables (multivariate)  Each variable is plotted according to its “coordinates” or values
  • 34. Local Statistics  Local statistics allow you to see how the mean, or variation, of a variable varies over space  You can generate a local statistic by weighting an ordinary statistic by some kind of spatial weight  Example ->  Property crime in San Antonio,TX
  • 35. Global vs. Local Statistics  By global we imply that one statistic is used to adequately summarize the data  i.e. the mean or median  Or, a regression model that is suitable for all areas in the data  Local statistics are useful when the process you are studying varies over space,  i.e. different areas have different local values that might cluster together to form a local deviation from the overall mean  Or a regression model that accounts for the same level of variation in the outcome in all locations
  • 36. Autocorrelation  This can occur in either space or time  Really boils down to the non-independence between neighboring values  The values of our independent variable (or our dependent variables) may be similar because  Our values occur  closely in time (temporal autocorrelation)  closely in space (spatial autocorrelation)
  • 37. Preliminaries to assessing autocorrelation  Basic Assessment of Spatial Dependency  Before we can model the dependency in spatial data, we must first cover the ideas of creating and modeling neighborhoods in our data.  By neighborhoods, I mean the clustering or connectedness of observations  The exploratory methods we will cover depend on us knowing how our data are arranged in space, who is next to who.  This is important (as we will see later) because most correlation in spatial data tends to die out as we get further away from a specific location (Tobler’s law)
  • 38. Tobler's first law of geography  WaldoTobler (1970) suggested the “jokingly” first law of geography  “Everything is related to everything else, but near things are more related than distant things”  We can see this better in graphical form: We expect the correlation between the attributes of two points to diminish as the distance between them grows
  • 39. One thing about autocorrelation  Autocorrelation is typically a local process  Meaning it typically dies out as distance between observations increase  plot(exp(-.05*seq(1:100)), type="l", lwd=2, ylab="Correlation", xlab="Distance")
  • 40. W  To identify which observations are close to one another, we useW  W is the spatial weight matrix  Square n by n matrix with 1/0 entries identifying if any two observations are spatial neighbors  How is this constructed?  There are two typical ways in which we measure spatial relationships  Distance and contiguity
  • 41.  In a distance based connectivity method, features (generally points) are considered to be contiguous if they are within a given radius of another point.The radius is really left up to the researcher to decide.  Likewise, we can calculate the distance matrix between a set of points  This is usually measured using the standard Euclidean distance  Where x and y are coordinates (lat/long) of the point or polygon in question, this is the as the crow flies distance Distance based neighbors
  • 42. More on distances  There are lots of distance measures  Manhattan distances  d = |x1 – x2| + |y1- y2|  These are city-block distances  We can use these distances to create a neighborhood of points using certain criteria
  • 43. Who is who's neighbor  There are many different criteria for deciding if two observations are neighbors  Generally two observations must be within a critical distance, d, to be considered neighbors.  This is the Minimum distance criteria, and is very popular.  This will generate a matrix of binary variables describing the neighborhood.  We can also describe the neighborhoods in a continuous weighting scheme based on the distance between them Inverse Distance Weight wij= 1 dij or Inverse Squared Distance Weight wij= 1 dij 2
  • 44. Contiguity/PolygonAdjacency  Polygons are contiguous if they share common topology, like an edge (line segment) or a vertex (point)  Neighborhoods are created based on which observations are judged “contiguous”  This is generally the best way to treat polygon features  Distances aren’t typically used for polygons for several reasons  Spacing of observations  What centroid? Is it a good measure?
  • 45. • Rook adjacency • Neighbors must share a line segment • Queen adjacency • Neighbors must share a vertex or a line segment • If polygons share these boundaries (based on the specific definition: rook or queen), they are given a weight of 1 (adjacent), else they are given a value 0, (nonadjacent)
  • 46. Order of adjacency Observation of interest First order neighbors (Rook) Second Order neighbors (Rook) Observation of interest First order neighbors (Queen) Second Order neighbors (Queen) First and second order Rook adjacency First and second order Queen adjacency
  • 47. What does a spatial weight matrix look like? Set of polygons SpatialWeight Matrix,W
  • 48. StandardizedW  Most times, in analytical work,W is standardized  Typically this is done by dividing each 1/0 element in the row by the sum of the row, which would yield this matrix: Polygon 1 2 3 4 1 0 0.5 0.5 0 2 0.5 0 0 0.5 3 0.5 0 0 0.5 4 0 0.5 0.5 0
  • 49. Forms of autocorrelation  Positive autocorrelation  This means that a feature is positively associated with the values of the surrounding area (as defined by the spatial weight matrix), high values occur with high values, low with low  Negative autocorrelation  This means that a feature is negatively associated with the values of the surrounding area (as defined by the spatial weight matrix), high with low, low with high  The (probably) most popular global autocorrelation statistic is Moran’s I
  • 50. Measuring spatial autocorrelation: Common measures  Moran’s I  Geary’s C  Getis-Ord G  All typically give similar results
  • 51. Moran's I  Measure of standardized spatial autocovariance  with xi being the value of the attribute at location i, xj being the value of the attribute at location j,  S0 is the sum of all spatial weights  wij is the weight for location ij (0 if they are not neighbors, 1 otherwise)  Approximately bound on 0,1 with a similar interpretation as a Pearson or Spearman correlation
  • 52. Geary’s C  Measure of spatial covariance  with xi being the value of the attribute at location i, xj being the value of the attribute at location j,  S0 is the sum of all spatial weights  wij is the weight for location ij (0 if they are not neighbors, 1 otherwise)  Bound on 0,2  0 to 1 = positive autocorrelation  1= no autocorrelation  1 to 2=negative autocorrelation
  • 53. Getis – Ord G  Unlike I and C, the interpretation of G focuses on whether high values of x tend to cluster with other high values of x  The measure is directional  Likewise, low with low
  • 54. Local Moran’s I  We can also describe the local trends in autocorrelation in our data by constructing local versions of our autocorrelation statistics  Each has a local version  This is useful to see where the autocorrelation is high or low within our data  Goes beyond the global statistics to help us visualize where in our data associations are strong
  • 55. How do you determine if the data are in a cluster?  This is done via the Moran scatterplot.  If an observation is in the top right quadrant of the plot = high-high  Bottom left quadrant = low – low  Top left quadrant = high – low  Bottom right quadrant = low-high
  • 56. Moran scatterplot (univariate)  It is sometimes useful to visualize the relationship between the actual values of the dependent variable and their lagged values. This is the so called  Moran scatterplot  Lagged values are the average value of the surrounding neighborhood around location I  lag(x) = wij*x = Wx in matrix terms  The Moran scatterplot shows the association between the observation of interest and its neighborhood's average value  The variables are generally plotted as z-scores, to avoid scaling issues
  • 57.  Moran scatterplot for San Antonio neighborhood deprivation  I = .688  High Positive autocorrelation
  • 58.  Local Moran cluster map  Significant high neighborhood deprivation on the west, east and south sides  Significant low neighborhood deprivation on the north side
  • 59. Spatially lagged values  If we have a value xi at location i and a spatial weight matrix wij describing the spatial neighborhood around location i, we can find the lagged value of the variable by:  Wxi = xi * wij  This calculates what is effectively, the neighborhood average value in locations around location i, often stated x-i
  • 60.
  • 61. Moran scatterplot (Bivariate)  We can also compare the lagged value of one variable verses the value of another variable  Wy versus x  This is the so-called “multivariate Moran scatterplot”  It is often useful in so-called space-time analysis, when we compare the value of one variable measured at two different time points. This will show the correlation between space over time.
  • 62.  Positive autocorrelation between a tract’s deprivation and the average minority concentration in the neighboring tracts  I = .537
  • 63.  What these methods tell you  Moran's I is a descriptive statistic  It simply indicates if there is spatial association/autocorrelation in a variable  Local Moran's I tells you if there is significant localized clustering of the variable  Where spatial clusters and what type of cluster is present
  • 64.
  • 65. Two examples of spatial analysis and statistics
  • 66. Goals  To show how you can use R, software that is free and available in the RDC to:  1) Perform interesting spatially-oriented analysis  2) Merge spatial data from multiple sources  Spatial join  3)Visualize results from a spatial analysis much as you would in a GIS environment Without a GIS!
  • 67. Residential Segregation and Crime in San Antonio,TX  Four data sources:  American Community Survey 5 year summary file  Census tract level  Census 2000 Summary File 1  Block level  SanAntonio Police Department call data  Point data based on geocoded addresses  All calls received by the SAPD in a year  CensusTIGER file for SanAntonio tracts
  • 68. Methods  1)Merge ACS calculated fields to shapefile based on tract FIPS code  2) Perform spatial join of crimes to tracts  Count # of crimes in each tract  3) Create a thematic map of the crime rate using quantile breaks  4) Create tract-level indices of residential segregation by aggregating up from the block level  Merge these to the shapefile  5) Perform ESDA using Local Moran statistics on the segregation index and crime rate
  • 71. Spatial Disease Mapping  Application to crime data, but principles remain the same  Trying to locate areas of excess disease risk  Finding spatial clusters of disease  Kulldorff and Nagarwalla’s Spatial Scan Statistic  DCluster library  Modeling spatial relative risk  Bayesian Hierarchical Regression  Posterior model summaries  INLA library  Approximate Bayesian inference by Integrated Nested Laplace Approximation  Great for Gaussian random field models
  • 72. Methods  Spatial Scan Statistic  Constructs a grid over the area, centered on each observation  Using a set radius (defined based on a proportion of the population size at risk), a circle is created on each point  The likelihood function  is used, and the area with the maximum value of the function is the most likely cluster of cases  Allows for ranked clusters (primary, secondary, etc)
  • 73. Scan Statistic You know where this is? The Riverwalk Take home message: Watch your wallet!
  • 74. Basics of Bayesian Modeling  Some statistical ideas:  Likelihood  Usually we have some data that we assume follow some distribution  For counts, maybeY ~ Poisson(λ)  L (y, λ) is the joint probability distribution of the data and the parameters  The model likelihood is the product of this distribution for all observations  We get the estimate of λ that is the most likely to have generated our data (y)  Maximum likelihood estimate of λ  This is traditional statistics
  • 75. Bayesian statistics  Now, lets consider the case where we have some knowledge about λ, say we believe it comes from a certain distribution, so :  λ ~ Gamma(α,β)  And we can either assume some value for these parameters (α=β=1) or allow them to also come from some distribution  α~Exp(u) ,β~Exp(p)  Why a Gamma distn? Because Gamma is >0, and we know θ is >0 because it is a relative risk
  • 76. Combining information  When we combine the likelihood and the prior, we form what is called the posterior distribution  Thus we have BayesTheorem which in the continuous case is:  Which states, the posterior distribution of θ, conditional on y is the product of the likelihood and the prior distribution of θ
  • 77.  The denominator in Bayes theorem is a constant, and this is generally written as:  Which says the posterior is proportional to the likelihood times the prior
  • 78. A simple Poisson – Gamma model
  • 79. Simple Models  Let’s now consider a simple realistic model  yi~Poisson(ei θi)  Assume the usual log link function  log (θi) = C1+C2  C1=a0+x’β  Linear Poisson regression with intercept  C2=other terms  Could add random effect - > this would be like an individual frailty model, or a group frailty model or a random intercept model  Generally this is called a Generalized Linear Mixed Model  Now we just need to code it up!
  • 80. Mapping and spatial modeling  We just achieved a better, more accurate SMR map with a simple Poisson model with a random effect  This model “drew strength” from counties with good data to help smooth counties with poor data, but it didn’t account for correlation between neighboring counties  We can also build a model that includes a spatially correlated random effect between counties  Spatially smoothed rates
  • 81. Building a Spatial Model  Take a Poisson-Lognormal model  y~Pois(ei θi)  θi=α+ x’β + ui  What if we introduce more interesting structure on u?  A common form of spatial correlation structure is the CAR model, or also called the Besag model  This is:
  • 82. More models  Another model that is commonly used in practice is the convolution model, or the Besag,York and Mollie model  y~Pois(ei θi)  θi=α+ x’β + ui + vi  Where now ui is a correlated heterogeneity (CH) term and vi is an Uncorrelated Heterogeneity (UH) term
  • 84. Why Use R?  R is free (So are GeoDA and QGIS)  R is available in the RDC (So are GeoDA and QGIS)  R is extremely capable and flexible (So is QGIS)  R is a scripting and statistical language (Neither GeoDA or QGIS are)  R is one stop shopping for many geospatial techniques  Spatial joins, projections, data merging, raster analysis, vector operations