Rarely do we have prepared data. It often requires some processing or analysis.
Is it a hammer to crack a walnut?
Many traditional approaches are known as EDA (exploratory data analysis) methods, but when does exploration turn into discovery? Perhaps they are better termed exploratory and discovery analysis.
- Easy to understand and explain
- Tend to represent central values rather than extremes
- Tend to be aspatial, but there are also a number of spatial approaches that have valid uses
There are a number of methods we can use to describe our data and a number of different tools we can use in ArcGIS. We can also calculate statistical values on an attribute table for total (or sum), min, max and mean. This also gives us further valuable information, such as the number of NULL values in our data. Remember that zero values are included in numerical calculations, so zero shouldn’t be used to stand in for, for example, missing data.
The last three methods are measures of central tendency, but these are often not adequate to fully describe data. Two data sets can have the same mean yet be entirely different. We can better understand data by looking at the extent of variability, which is given by the measures of dispersion. Range, interquartile range and standard deviation are the three commonly used measures of dispersion.
We also have the harmonic mean for averaging rates such as speed. If you have x units per y and the x's are known, the harmonic mean is appropriate. For example, if we want to find the average speed in kilometres per hour and we know the kilometres (distance travelled), the x, then use the harmonic mean. If the y's, e.g. hours (time of journey), are given, use the arithmetic mean.
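The harmonic versus arithmetic mean rule above can be checked with a small sketch. The journey below is hypothetical (two legs driven at 40 km/h and 60 km/h), and the variable names are illustrative only.

```python
from statistics import harmonic_mean, mean

# Hypothetical speeds for the two legs of a journey.
speeds_kmh = [40.0, 60.0]

# Case 1: the distances (the x's) are known and equal, 60 km per leg.
leg_distance_km = 60.0
total_time_h = sum(leg_distance_km / s for s in speeds_kmh)
true_avg_equal_distance = leg_distance_km * len(speeds_kmh) / total_time_h

# Case 2: the times (the y's) are known and equal, 1 hour per leg.
leg_time_h = 1.0
total_distance_km = sum(s * leg_time_h for s in speeds_kmh)
true_avg_equal_time = total_distance_km / (leg_time_h * len(speeds_kmh))

print(round(true_avg_equal_distance, 2), round(harmonic_mean(speeds_kmh), 2))  # 48.0 48.0
print(round(true_avg_equal_time, 2), round(mean(speeds_kmh), 2))               # 50.0 50.0
```

With equal distances the true average speed matches the harmonic mean; with equal times it matches the arithmetic mean.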
Range: the difference between the largest and smallest values. It is the simplest of all dispersion measures.
Standard deviation: the most commonly used measure of dispersion, a measure of the average deviation from the mean. If the observations are from a normal distribution it is much simpler to understand, as 68% of observations lie within mean ± 1 SD, 95% within mean ± 2 SD and 99.7% within mean ± 3 SD. If your data is skewed or you have ordinal data, then the median and interquartile range should be used instead. Another way of looking at standard deviation is by plotting the distribution as a histogram of responses. A distribution with a low SD would display as a tall narrow shape, while a large SD would be indicated by a wider shape.
Quantiles: divide the sample data into equal-sized subgroups of adjacent values, or a probability distribution into intervals of equal probability. Five equal groups give quintiles, so 20% of the data lie below the 1st quintile, 40% below the 2nd quintile and so on. The median, as the central value, is the 50th percentile. Quartiles divide the data into 4 parts, often summarised as the interquartile range. So, for example, the 75th percentile is the upper quartile. This measure gets around the issue of one or two really high or low values distorting the range; however, both range and quartiles only take into account a few values in the dataset.
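As a rough illustration of the dispersion measures described above, the sketch below uses Python's standard `statistics` module on a made-up list of values containing one extreme value. The data are hypothetical, and the quantile method is the module's default, which may differ slightly from what other software reports.

```python
from statistics import median, pstdev, quantiles

# Hypothetical values with one extreme observation (60).
values = [2, 4, 4, 5, 6, 7, 8, 9, 11, 60]

value_range = max(values) - min(values)   # driven entirely by the two extremes
sd = pstdev(values)                       # population standard deviation
q1, q2, q3 = quantiles(values, n=4)       # quartile cut points
iqr = q3 - q1                             # interquartile range
quintile_cuts = quantiles(values, n=5)    # 4 cut points giving 5 equal groups

print("range:", value_range)
print("standard deviation:", round(sd, 2))
print("median (2nd quartile):", q2)
print("interquartile range:", iqr)
print("quintile cut points:", quintile_cuts)
```

The single extreme value inflates the range and the standard deviation, while the interquartile range stays close to the spread of the bulk of the data.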
Sometimes we want to describe the spatial distribution as opposed to the values at locations, and there are a number of tools in ArcGIS that allow you to describe your data spatially. The median center in this case represents the centre of minimum travel, as opposed to the median in the aspatial sense. The name for this spatial statistic differs by country, as some countries, e.g. the UK, define the median center as being analogous to the median of a set of data. So, let’s have a look at some examples of basic descriptions before we move on…
In aspatial terms: adjusting values measured on different scales to a notionally common scale allows you to draw comparisons between variables that are not the same, e.g. different units. This can be a valuable approach in geographical analysis, when we often use proxy or indicator variables to effectively ‘fill in gaps’ where we do not have data.
The standard score is the number of standard deviations an observation is above or below the mean. A positive standard score represents a value above the mean, while a negative standard score represents a value below the mean. It is found by subtracting the population mean from each value and then dividing the difference by the standard deviation of the whole dataset. Normalization is not to be confused with the concept of the ‘normal distribution’.
We must also consider these same statistical principles when we work spatially and visualise our results:
- Commonly use rates and ratios
- Choropleth maps should show normalized values, not counts collected over unequal areas or populations
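A minimal sketch of the two kinds of normalization mentioned above: standard scores, and converting counts into rates. The area names, case counts and populations are invented for illustration, and `standard_scores` is a hypothetical helper rather than an ArcGIS function.

```python
from statistics import mean, pstdev

def standard_scores(values):
    """Return the z value of each observation: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical areas: (number of cases, population).
areas = {"A": (50, 10000), "B": (50, 2000), "C": (10, 500)}

# Counts (magnitude) turned into rates per 1,000 people (intensity).
rates_per_1000 = {name: 1000 * cases / pop for name, (cases, pop) in areas.items()}
z_values = standard_scores(list(rates_per_1000.values()))

print(rates_per_1000)   # {'A': 5.0, 'B': 25.0, 'C': 20.0}
print([round(z, 2) for z in z_values])
```

Areas A and B have the same count, but once population is taken into account B has five times the rate of A, which is the value a choropleth map should show.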
We can instinctively convert visual images into information about quantity and intensity. We can instinctively see that the glass on the right has half the amount of drink… we can make comparisons and gain information. What if the glass is different… it is no longer easy to judge.
Summary statistics, such as averages, medians, or percentages are already measures of intensity and should not be normalized.
Measures the intensity of a spatial point pattern based on sample observations. It creates a continuous surface showing the density of features or values irrespective of arbitrary administrative boundaries. Useful for estimating the intensity of one type of event relative to another e.g. disease cases compared to a control group.
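To make the idea of a density surface concrete, here is a small sketch that estimates intensity on a coarse grid from a handful of made-up event locations. It assumes a Gaussian kernel and a hand-picked bandwidth purely for simplicity; it is not the kernel or search radius used by any particular ArcGIS tool.

```python
import math

def kernel_density(points, grid_x, grid_y, bandwidth):
    """Gaussian kernel density (events per unit area) at each grid node.

    Returns a list of rows, one per y value, each with one value per x value.
    """
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2)
    surface = []
    for gy in grid_y:
        row = []
        for gx in grid_x:
            total = 0.0
            for px, py in points:
                d2 = (gx - px) ** 2 + (gy - py) ** 2
                total += norm * math.exp(-d2 / (2.0 * bandwidth ** 2))
            row.append(total)
        surface.append(row)
    return surface

# Hypothetical events: a cluster of three near (1, 1) and a single event at (4, 4).
events = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (4.0, 4.0)]
grid = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
surface = kernel_density(events, grid, grid, bandwidth=1.0)

for row in surface:
    print(" ".join(f"{value:.3f}" for value in row))
```

The surface peaks at the cluster, is lower at the isolated event and falls towards zero elsewhere, independent of any administrative boundary.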
Using test statistics we can show the strength of relationships between samples.
Parametric:
- Observations should be independent and drawn from a population with a normal distribution
- Population is homoscedastic, i.e. equal variances
Results of all inferential statistics are only valid if they are applied to random samples.
From your data it may seem, or be apparent, that there are differences, but when analyzing data we must be objective. Part of this approach means investigating a theory, termed hypothesis testing. Statistical tests can’t show us what is true, but they can show us what is not true. So, when defining a hypothesis we are trying to disprove it. If I were to take the example of heights of men and women, my null hypothesis would be that men and women are, on average, the same height. I would then use a statistical test to try to reject that null hypothesis.
If I sampled 3 men and 3 women I could find that women are taller than men, but if I increased my sample size to a more representative number then I would likely find the opposite and so be able to reject my hypothesis. The implication is that any relationship seen in a sample dataset could be a result of chance, and a more complete set of samples may show a different relationship. Statisticians always start from the assumption that sample results are not representative of the whole.
The significance level is the probability of rejecting the null hypothesis when it is actually true. It should be defined before the test is carried out. The smaller the sample size, the harder it is to know if it is representative of the whole population (i.e. the real situation). Degrees of freedom, in some way, represent the size of the sample. Significance is a statistical term that tells how sure you are that a difference or relationship exists.
“Statistical significance” does not mean “significant” in the sense of “important”. Statistical significance tells you if the relationship you observed in your sample is likely to hold up in the population. It therefore tells you if you can generalize your finding from your sample to your population. How much taller do men have to be than women to say that they are taller? This depends on the average heights and the number of samples. This is what the p-value quantifies: how likely a difference at least as large as the one observed would be if men and women were really the same height on average.
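The heights example can be sketched in code. Note the swap: instead of a parametric test, this uses a simple permutation test, chosen because it needs only the standard library and makes the meaning of the p-value explicit. The heights, the function name `permutation_p_value` and the 0.05 significance level are all assumptions for illustration.

```python
import random
from statistics import mean

def permutation_p_value(sample_a, sample_b, n_permutations=5000, seed=0):
    """Share of random relabellings whose mean difference is at least as
    large as the observed one, assuming the null hypothesis of no difference."""
    rng = random.Random(seed)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

alpha = 0.05  # significance level, chosen before running the test

# Hypothetical heights in cm.
men = [178, 182, 175, 180, 185, 177, 181, 179, 183, 176]
women = [165, 168, 162, 170, 166, 164, 169, 163, 167, 171]
p_large = permutation_p_value(men, women)

# With only 3 + 3 people, even a visible difference cannot reach significance.
p_small = permutation_p_value([170, 180, 175], [172, 168, 178])

print("10 vs 10:", p_large, "reject null" if p_large < alpha else "cannot reject null")
print("3 vs 3:  ", p_small, "reject null" if p_small < alpha else "cannot reject null")
```

The small sample illustrates the point made above: with very few observations, an apparent difference could easily be a result of chance.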
Demos: Export to Excel and resources center?
Test statistics give a single value, so they are not something we would want to map, but they may be something that you can use to support your analysis. A number of statistical tests are available in Excel, and we can use the new export to Excel tool (at 10.2). There are also tools available for download on the resources center, written by the analysis team. A more extensive suite of tests is available in the Python library SciPy, which you can then use from the Python command line or in your scripts.
Land cover by watershed: change in forested area between 2001 and 2006
The problem with the presence of spatial autocorrelation is that it corrupts standard statistical tests, so there is a real need for true spatial statistical tests. Spatial autocorrelation is determined both by similarities in position and by similarities in attributes. Spatial autocorrelation that is more positive than expected at random indicates the clustering of similar values across geographic space, while significant negative spatial autocorrelation indicates that neighboring values are more dissimilar than expected by chance. In ArcGIS, for statistical hypothesis testing, Moran's I values are transformed to z-scores, in which values greater than 1.96 or smaller than −1.96 indicate spatial autocorrelation that is significant at the 5% level.
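A minimal sketch of how Moran's I responds to clustered versus dissimilar neighbors, assuming eight cells in a row with binary contiguity weights. The values and helper names are hypothetical; the conversion to a z-score (which needs the variance under the null hypothesis) is not shown here.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square spatial weights matrix."""
    n = len(values)
    mean_value = sum(values) / n
    deviations = [v - mean_value for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    variance_sum = sum(d * d for d in deviations)
    return (n / s0) * cross / variance_sum

def chain_weights(n):
    """Binary weights for n cells in a row: 1 for adjacent cells, otherwise 0."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]

w = chain_weights(8)
clustered = [1, 1, 1, 1, 9, 9, 9, 9]     # similar values sit next to each other
alternating = [1, 9, 1, 9, 1, 9, 1, 9]   # neighbors are always dissimilar
expected_i = -1.0 / (8 - 1)              # expected value under no autocorrelation

print("clustered:  ", round(morans_i(clustered, w), 3))
print("alternating:", round(morans_i(alternating, w), 3))
print("expected:   ", round(expected_i, 3))
```

The clustered pattern gives a value well above the expected value (positive autocorrelation), and the alternating pattern gives a value well below it (negative autocorrelation).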
So what is spatial interpolation? It is based on the idea that closer points should have less difference in value than points farther apart.
There are 8 tools in Spatial Analyst (SA) and 15 in Geostatistical Analyst (GA). We won’t be covering them all, but hopefully I will cover enough of the key points that you can explore the others on your own.
Input must represent the highs and lows of values
Finds the closest subset of input samples to a query point and applies weights to them based on proportionate areas (from Voronoi/Thiessen polygons) to interpolate a value.
Local interpolator: uses a subset of samples that surround a query point, and interpolated values are always within the range of the samples used. The surface passes through the input samples and is smooth everywhere except at locations of the input samples.
The proportion of overlap between the Thiessen polygons (proximal solution) and an overlaid Voronoi polygon defines the weight.
Not affected by data distribution, unlike distance-based methods such as IDW.
Combines the ideas of proximity in Thiessen polygons and gradual change in trend surface.
Exact interpolator: input must represent the highs and lows of values.
Assumes spatial autocorrelation in the data.
Power parameter: controls the weighting by distance. A higher value gives the nearest points more emphasis (the surface will be less smooth). The optimal value is where the mean absolute error is at its lowest.
Same result from each extension given the same inputs.
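The behaviour described above (exact at the samples, output limited to the input range, stronger emphasis on near points as the power increases) can be seen in a bare-bones sketch. It uses every sample rather than a search neighborhood, and the sample coordinates and values are invented; it is not the ArcGIS implementation.

```python
import math

def idw(samples, x, y, power=2.0):
    """Inverse distance weighted estimate at (x, y) from (x, y, value) samples."""
    numerator = 0.0
    denominator = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return value  # exact interpolator: honour the sample value
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator

# Hypothetical samples as (x, y, value).
samples = [(0.0, 0.0, 10.0), (10.0, 0.0, 20.0), (0.0, 10.0, 30.0)]

at_sample = idw(samples, 0.0, 0.0)
low_power = idw(samples, 1.0, 1.0, power=2.0)
high_power = idw(samples, 1.0, 1.0, power=6.0)

print(at_sample)              # 10.0
print(round(low_power, 3))
print(round(high_power, 3))
```

At (1, 1) the nearest sample has the value 10, and the estimate moves closer to 10 as the power is raised.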
Placing the model through the points (i.e. finding the curve of best fit) gives us a measure of the spatially correlated random component. Semivariance is the measure of interdependence between the values, based on how close they are to each other.
With kriging in Spatial Analyst the data cannot be transformed, so it must be normally distributed. In Geostatistical Analyst it can be transformed. Calculations use the transformed data and the back transformation is then done automatically.
Normal score transformation: fits a mixture of normal distributions to the data.
Another assumption of many geostatistical techniques is that your data is stationary: its statistical properties, i.e. mean and variance, are independent of absolute location. Covariance depends only on the relative locations of the sites, the distance and direction between them, and not on their exact location. In a spatial or temporal context, such dependence is called autocorrelation.
A stationary process has the property that the mean, variance and autocorrelation structure do not change over space. Stationarity can be defined in precise mathematical terms, but for our purpose we mean a flat-looking series, without trend, with constant variance, a constant autocorrelation structure and no periodic fluctuations.
EBK can be effectively used with non-stationary data.
Histogram: values in bars far removed to the left or right may indicate outliers.
QQ plot: values at the tails of a normal QQ plot can also be outliers.
Semivariogram cloud: shows the relationship between pairs of points. Points that are close together but have high differences in their values may be outliers.
To reduce the number of points in the empirical semivariogram, the pairs of locations are grouped based on their distance from one another. This grouping process is known as binning. The lag size is the size of the distance class into which pairs of points are grouped.
The default is selected using a reasonable rule of thumb based on the data extent, but a more robust method is to use the Average Nearest Neighbor tool (in the Spatial Statistics toolbox). This will help you find the average distance between points and their nearest neighbors. If your data is clustered, you might need to use a smaller lag size than the average nearest neighbor distance to obtain a more accurate measure for the nugget in the semivariogram. The nugget represents the variation at distances shorter than the smallest spacing between points in the data, which is the shortest distance for which you can understand a relationship between distance and value.
An easier approach is to use the optimise button. It will help fit the semivariogram model, primarily focussing on the range parameter, and is based on minimising the mean square error.
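The binning step can be sketched as follows: every pair of samples contributes half its squared value difference to the bin for its separation distance, and each bin is then averaged. The function name, the simple equal-width bins and the sample data are assumptions for illustration; fitting a model to the binned values is not shown.

```python
import math
from collections import defaultdict

def empirical_semivariogram(samples, lag_size, n_lags):
    """Average 0.5 * (value_i - value_j)^2 for sample pairs, grouped into distance bins.

    samples: list of (x, y, value). Bin b covers distances from b * lag_size
    up to (but not including) (b + 1) * lag_size. Returns {bin index: semivariance}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in range(len(samples)):
        xi, yi, vi = samples[i]
        for j in range(i + 1, len(samples)):
            xj, yj, vj = samples[j]
            distance = math.hypot(xi - xj, yi - yj)
            bin_index = int(distance // lag_size)
            if bin_index < n_lags:
                sums[bin_index] += (vi - vj) ** 2
                counts[bin_index] += 1
    return {b: 0.5 * sums[b] / counts[b] for b in sorted(counts)}

# Hypothetical transect: six samples one unit apart, value rising steadily with x.
transect = [(float(x), 0.0, float(x)) for x in range(6)]
semivariogram = empirical_semivariogram(transect, lag_size=1.0, n_lags=5)

for bin_index, gamma in semivariogram.items():
    print(f"lag bin {bin_index}: semivariance {gamma}")
```

Semivariance grows with separation distance here because nearby samples have similar values and distant samples do not.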
EBK makes multiple simulations of the semivariogram, and we are looking for the median to fall within the 25th and 75th percentiles. Each location uses a weighted sum of the distributions. It creates different, local models across the area. You can overlap these models to create a smooth surface.
Advantages
- Requires minimal interactive modeling
- Standard errors of prediction are more accurate than other kriging methods
- More accurate than other kriging methods for small or nonstationary datasets
Disadvantages
- Processing is slower than other kriging methods
- Limited customization
Predictions should be unbiased and centered on the true values. If the prediction errors are unbiased, the MEAN PREDICTION ERROR should be near zero. But this value depends on the scale of your data, so the standardised mean (MEAN STANDARDIZED) is also reported, and this should also be near zero. If the root mean squared standardized error is > 1 you are underestimating the variability in your predictions and, if it is < 1, you are overestimating it.
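The diagnostics above can be sketched as plain arithmetic on cross-validation output. The formulas below are the commonly used definitions and are an assumption about how these statistics are computed; the function name, dictionary keys and numbers are hypothetical.

```python
import math

def cross_validation_summary(measured, predicted, standard_errors):
    """Summary statistics for comparing interpolation models."""
    errors = [p - m for p, m in zip(predicted, measured)]
    standardized = [e / se for e, se in zip(errors, standard_errors)]
    n = len(errors)
    return {
        "mean_error": sum(errors) / n,
        "root_mean_square": math.sqrt(sum(e * e for e in errors) / n),
        "mean_standardized": sum(standardized) / n,
        "root_mean_square_standardized": math.sqrt(sum(s * s for s in standardized) / n),
        "average_standard_error": math.sqrt(sum(se * se for se in standard_errors) / n),
    }

# Hypothetical cross-validation results.
measured = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 11.0, 15.0, 15.0]

well_assessed = cross_validation_summary(measured, predicted, [1.0, 1.0, 1.0, 1.0])
too_confident = cross_validation_summary(measured, predicted, [0.5, 0.5, 0.5, 0.5])

print(well_assessed["root_mean_square_standardized"])   # 1.0
print(too_confident["root_mean_square_standardized"])   # 2.0
```

In the second case the reported standard errors are half the size of the actual errors, so the standardized value is above 1 and the variability is being underestimated.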
Demos:
When does it make sense: think about what the data shows (totals with sample data, interpolation with emissions).
How can you use descriptive statistics – comparisons with other (reference) areas.
Spatial linear mean > anisotropy
Building maps with analysis
NACIS Annual Meeting
Oct 9-11, 2013 | Greenville, South Carolina
More than just colouring in: building
maps with a solid analytical foundation
Linda Beale PhD
Based on clear need and purpose
Who is the audience?
What is the intended purpose?
What medium is to be used?
These goals cannot all be accomplished by visual tricks
What does GIS offer cartography?
Is it a hammer to crack a walnut?
Combining different data to get new information
Bringing data together from disparate sources
Part of the process of making informative and different maps
Is statistical analysis really needed?
The development and application of methods to
collect, analyze and interpret data
The science of learning from data
Spatial analysis is about solving problems
What is inside an area?
What is nearby?
Where are the events concentrated?
Where do things move over time?
Why do things occur where they do?
How can we estimate values for a whole area?
What is a suitable location for …?
Maps are needed to communicate the result
Help understand the data as part of analysis or to quantify data
Commonly aspatial i.e. result is not dependent on location
Some spatial methods
[Table: descriptive measures and the ArcGIS tools that calculate them]
Descriptions: Count or sum of …; Average deviation about …; Value that is nth way through a sorted list
Tools: Summary Statistics; Spatial Join; Frequency; Tabulate Intersection; Neighborhood & Zonal Statistics (Spatial Analyst); Get & set raster properties (Spatial Analyst)
Summarizing by area
Linear Directional Mean
Normalization is to transform a set of measurements so that
they may be compared in a meaningful way
Examples: Standard score (z values), coefficient of variation
Normalization transforms measures of magnitude (counts or
weights) into measures of intensity
Using normalization we can take into account the differences
between the areas (e.g. size of area, population size etc)
We see quantity related to size
Distributions and patterns
Density surfaces of count per unit area
Looking at concentrations of features
Seeing patterns of features
Hotspots, Heat maps
A density surface reflects the likelihood of an event
occurring in each cell (bivariate probability density function)
Maps can lie…so can statistics
Assumptions must be met for example, statistical tests are
Parametric: Data distribution assumptions must be met
Analysis often concerned with explaining differences
Statistical significance does not mean ‘important’
“Everything is related to everything else, but near things
are more related than distant things."
Spatial autocorrelation statistics evaluate the degree of
spatial dependency among observations
from Latin interpolates
Meaning: to estimate a value that lies between two other values
Interpolation is required when:
We have samples from something that is continuous
A discrete surface has a different resolution (or cell size) to that
Spatial interpolation is based on the notion that points
which are close together in space tend to have similar
attributes (Tobler’s First Law of Geography)
If the relationship between points and their values is
distance between points = isotropy
distance and direction = anisotropy
Interpolated values are reliable only to the extent that the
spatial dependence of the phenomenon can be assumed
Interpolation in ArcGIS
Spline with Barriers
Topo to Raster
Topo to Raster by File
Inverse distance weighted
Radial basis functions
Gaussian geostatistical simulation
Empirical Bayesian kriging
The data contains the full range of possible values
Things close to one another are more alike than those
The outcome is exactly known and based on the input
Weighted average technique based Voronoi
(in dotted lines) on top
Delaunay triangulation: The geometric dual of
Voronoi i.e. natural bisection between Voronoi polygons, which
reverses the face inclusion
IDW (inverse distance weighting)
Output is limited to the range of the values used to
Based on the assumption that the interpolating surface
should be influenced most by the nearby points and less
by the more distant points
Assumes the surface is driven by local variation
Weights assigned diminish with distance from the
Sample points should have an even distribution
Natural Neighbor and IDW
Uses the relationships between your data locations and
their values, assuming:
Data is normally distributed
- Data exhibits stationarity (no local variation)
- Data has spatial autocorrelation
- Data is not clustered
simple kriging has declustering options
Data has no local trends
- local trends can be removed during interpolation
(and these trends are accounted for in the
Assumes that spatial variation can be decomposed into 3 main
Deterministic variation or trend/drift
Trend analysed by trend surface analysis techniques
Spatially correlated, random variation
Spatially correlated variation analysed by computing the
Spatially uncorrelated variation (noise)
Provides measures of the certainty or accuracy of the
A normal QQ plot (probability plot)
Mean ≈ Median
Skewness ≈ 0
Kurtosis ≈ 3
Transformations can be used to bring data
to a normal distribution
e.g. logarithms, box-cox, square root
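The checks listed above (mean close to median, skewness near 0, kurtosis near 3) can be computed directly, and a transformation tried when they fail. The data are made up, and the moment-based `skewness` and `kurtosis` helpers are simple illustrative definitions that may differ slightly from what a given software package reports.

```python
import math
from statistics import mean, median, pstdev

def skewness(values):
    """Third standardized moment: about 0 for a symmetric distribution."""
    mu, sigma = mean(values), pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

def kurtosis(values):
    """Fourth standardized moment: about 3 for a normal distribution."""
    mu, sigma = mean(values), pstdev(values)
    return sum(((v - mu) / sigma) ** 4 for v in values) / len(values)

# Hypothetical right-skewed measurements (all positive, so a log transform is valid).
raw = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 55]
logged = [math.log(v) for v in raw]

for label, data in (("raw", raw), ("log-transformed", logged)):
    print(
        f"{label}: mean={mean(data):.2f} median={median(data):.2f} "
        f"skewness={skewness(data):.2f} kurtosis={kurtosis(data):.2f}"
    )
```

For this sample the log transform pulls the mean towards the median and reduces the skewness, although the result is still not perfectly normal.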
Statistical properties of data (e.g. mean, variance) are
independent of absolute location
Covariance depends only on the relative locations of
the sites (e.g. the distance and direction between them)
and not their exact location
Create a Voronoi map symbolized by:
Systematic changes in the mean of the data values
across the area of interest
Can be difficult to distinguish from autocorrelation and
Trend removal options
Dealing with outliers
Outliers statistically affect your data
They may be real and important or may be errors
(such as input errors)
Remove outliers from the modeling step
Use the full dataset for prediction
Shows the spatial autocorrelation of the measured sample
0.5 * average[(value_i – value_j)^2]
Empirical Bayesian Kriging
Spatial relationships are modeled automatically
Results often better than interactive modeling
Uses local models to capture small scale effects
Doesn’t assume one model fits the entire data
Requires minimal interactive modeling
- Standard errors of prediction are more accurate
than other kriging methods
- More accurate than other kriging methods for small
or nonstationary datasets
Processing is slower than other kriging methods
- Limited customization
Selecting the best model
Predictions should be unbiased
Mean prediction error should be near zero (depends on
the scale of the data) so,
- standardised mean nearest to 0
Predictions should be close to known values
Small root mean prediction errors
Correctly assessing the variability:
average standard-error nearest the root-mean-square
- standardised root-mean-square prediction error nearest
Empirical Bayes Kriging
Take away points…
Good analysis is an important part of cartography
Even basic statistics can be powerful
Spatial data is more complex…
but often reveals so much more