Building maps with analysis


Presentation given at NACIS 2013 by Linda Beale

  • We rarely have prepared data; it often requires some processing/analysis.
  • Is it a hammer to crack a walnut?
  • Many of these approaches are traditionally known as EDA (exploratory data analysis) methods, but when does exploration turn into discovery? Perhaps better termed exploratory and discovery analysis. Easy to understand and explain. Tend to represent central values rather than extremes. Tend to be aspatial, but there are also a number of spatial approaches that have valid uses.
  • There are a number of methods we can use to describe our data and a number of different tools we can use in ArcGIS. We can also calculate statistical values on an attribute table for total (or sum), min, max and mean. It also gives us further valuable information, such as the number of NULL values in our data. Remember that zero values are included in numerical calculations, so zero shouldn’t be used in cases where, for example, there are missing data. The last three methods (mode, median and mean) are measures of central tendency, but these are often not adequate to fully describe data. Two data sets can have the same mean but be entirely different. We can better understand data by also looking at the extent of variability. This is given by the measures of dispersion. Range, interquartile range and standard deviation are the three commonly used measures of dispersion. There is also the harmonic mean for averaging rates such as speed: if you have x units per y and the x's are known (fixed), the harmonic mean is appropriate. For example, if we want to find the average speed in kilometres per hour and we know the kilometres (distance travelled, the x), use the harmonic mean. If the y's, e.g. hours (time of journey), are given, use the arithmetic mean.
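The harmonic versus arithmetic mean distinction in the note above can be checked with a small worked example. This is a minimal sketch using Python's standard `statistics` module; the journey (two 60 km legs at 40 and 60 km/h) is made up for illustration and is not from the presentation.

```python
from statistics import harmonic_mean, mean

# Hypothetical journey: two legs at different speeds (km/h).
speeds = [40.0, 60.0]

# Case 1: each leg covers the SAME DISTANCE (the x's are fixed),
# so the harmonic mean gives the true average speed.
avg_speed_equal_distance = harmonic_mean(speeds)

# Cross-check from first principles: total distance / total time.
leg_km = 60.0
total_time_h = sum(leg_km / s for s in speeds)      # 1.5 h + 1.0 h
check = (len(speeds) * leg_km) / total_time_h       # 120 km / 2.5 h

# Case 2: each leg lasts the SAME TIME (the y's are fixed),
# so the arithmetic mean is the right average.
avg_speed_equal_time = mean(speeds)

print(avg_speed_equal_distance, check, avg_speed_equal_time)
```

With equal distances the slower leg takes longer, so it pulls the average down to 48 km/h rather than the 50 km/h an arithmetic mean would suggest.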
  • Range: the difference between the largest and smallest values. Simplest of all dispersion measures. Standard deviation: the most commonly used measure of dispersion, roughly the average deviation from the mean. If the observations are from a normal distribution it is much simpler to understand, as about 68% of observations lie between mean ± 1 SD, 95% between mean ± 2 SD and 99.7% between mean ± 3 SD. If your data is skewed or you have ordinal data, then the median and interquartile range should be used to measure dispersion. Another way of looking at standard deviation is by plotting the distribution as a histogram of responses. A distribution with a low SD would display as a tall narrow shape, while a large SD would be indicated by a wider shape. Quantiles: divide the sample data into equal-sized subgroups of adjacent values, or a probability distribution into intervals of equal probability. Quantiles that divide the data into five parts are called quintiles, so 20% of the data lies below the 1st quintile, 40% below the 2nd quintile and so on. The median, as the central value, is the 50th percentile. Quartiles divide the data into four parts, often summarised as the interquartile range. So, for example, the 75th percentile is the upper quartile. This measure gets around the issue of one or two really high/low values distorting the range; however, both range and quartiles only take into account a few values in the dataset.
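The dispersion measures described above can be computed directly. The following is an illustrative sketch with invented values, using Python's standard `statistics` module; note that `statistics.quantiles` defaults to the "exclusive" method, so other software (including ArcGIS) may report slightly different quartiles for the same data.

```python
from statistics import mean, median, pstdev, quantiles

# Hypothetical attribute values (not from the presentation).
values = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(values) - min(values)   # largest minus smallest
sd = pstdev(values)                       # population standard deviation

# Quartiles: three cut points that split the sorted data into four parts.
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1                             # interquartile range

# Share of observations within one SD of the mean (about 68% if normal).
mu = mean(values)
within_1sd = sum(1 for v in values if abs(v - mu) <= sd) / len(values)

print(value_range, sd, (q1, q2, q3), iqr, within_1sd)
```

The second quartile equals the median, and the interquartile range ignores the extreme values 2 and 9 that determine the range.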
  • Sometimes we want to describe the spatial distribution as opposed to the values at locations, and there are a number of tools in ArcGIS that allow you to describe your data spatially. The median center in this case represents the centre of minimum travel, as opposed to the median in the aspatial sense. The name for this spatial statistic differs by country, as some countries, e.g. the UK, define the median center as being analogous to the median of a set of data. So, let’s have a look at some examples of basic descriptions before we move on…
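To make the idea of spatial descriptors concrete, here is a minimal sketch of three of them: a mean center, a standard distance and a mean direction. It uses textbook formulas on made-up coordinates and is not the ArcGIS implementation; the function names are hypothetical, and any consistent angle convention is assumed for the directions.

```python
import math

def mean_center(points):
    """Average x and average y of a set of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def standard_distance(points):
    """Root-mean-square distance of the points from their mean center."""
    cx, cy = mean_center(points)
    n = len(points)
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)

def mean_direction(angles_deg):
    """Circular mean of directions in degrees, returned in [0, 360)."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360

# Hypothetical points at the corners of a 2 x 2 square.
pts = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
center = mean_center(pts)
spread = standard_distance(pts)

# Two lines heading 350 and 10 degrees: the circular mean is close to
# 0/360 degrees, whereas a plain arithmetic mean would give 180.
direction = mean_direction([350.0, 10.0])
```

The direction example shows why angles need their own averaging method: values either side of the 0/360 wrap-around cannot be averaged arithmetically.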
  • In aspatial terms: adjusting values measured on different scales to a notionally common scale allows you to draw comparisons between variables that are not the same, e.g. that have different units. This can be a valuable approach in geographical analysis, where we often use proxy or indicator variables to effectively ‘fill in gaps’ when we do not have data. The standard score is the number of standard deviations an observation is above or below the mean. A positive standard score represents a value above the mean, while a negative standard score represents a value below the mean. It is found by subtracting the population mean from each value and then dividing the difference by the standard deviation of the whole dataset. Normalization is not to be confused with the concept of a ‘normal distribution’. We must also consider these same statistical principles when we work spatially and visualise our results: we commonly use rates and ratios, and choropleth maps should show normalized values, not counts collected over unequal areas or populations.
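Both kinds of normalization mentioned above are short calculations. This sketch shows a standard score and a rate per 1,000 population; the district names, counts and populations are invented for illustration, and the helper names are hypothetical.

```python
from statistics import mean, pstdev

def standard_scores(values):
    """Number of standard deviations each value lies above/below the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def rate_per(count, base, per=1000):
    """Turn a count (magnitude) into a rate (intensity), e.g. per 1,000 people."""
    return per * count / base

# Aspatial: standard scores for a hypothetical variable.
z = standard_scores([2, 4, 4, 4, 5, 5, 7, 9])

# Spatial: two hypothetical districts with different populations.
districts = {
    "A": {"cases": 50, "population": 10_000},
    "B": {"cases": 80, "population": 40_000},
}
rates = {name: rate_per(d["cases"], d["population"]) for name, d in districts.items()}
```

District B has more cases, but district A has the higher rate, which is the pattern a choropleth map of raw counts would hide.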
  • We can instinctively convert visual images into information about quantity and intensity. We can instinctively see that the glass on the right has half the amount of drink… we can make comparisons and gain information. What if the glass is different? It is no longer easy to judge.
  • Summary statistics, such as averages, medians, or percentages are already measures of intensity and should not be normalized.
  • Measures the intensity of a spatial point pattern based on sample observations. It creates a continuous surface showing the density of features or values irrespective of arbitrary administrative boundaries. Useful for estimating the intensity of one type of event relative to another e.g. disease cases compared to a control group.
  • Using test statistics we can show the strength of relationships between samples. Parametric tests assume that observations are independent and drawn from a population with a normal distribution, and that the population is homoscedastic, i.e. has equal variances. Results of all inferential statistics are only valid if they are applied to random samples. From your data it may seem, or be apparent, that there are differences, but when analyzing data we must be objective. Part of this approach means investigating a theory, termed hypothesis testing. Statistical tests can’t show us what is true, but they can show us what is not true. So, when defining a hypothesis, we are trying to disprove it. If I take the example of the heights of men and women, my null hypothesis would be that men and women are, on average, the same height. I would then use a statistical test to try to reject that null hypothesis. If I sampled 3 men and 3 women I could find that the women are taller than the men, but if I increased my sample size to a more representative number I would likely find the opposite and so be able to reject the null hypothesis. The implication is that any relationship seen in a sample dataset could be a result of chance, and a more complete set of samples may show a different relationship. Statisticians always start from the assumption that sample results are not representative of the whole. The significance level is the probability threshold at which the null hypothesis is rejected, i.e. the accepted risk of rejecting it when it is in fact true. This should be defined before the test is carried out. The smaller the sample size, the harder it is to know if it is representative of the whole population (i.e. the real situation). Degrees of freedom, in some way, represent the size of the sample. Significance is a statistical term that tells how sure you are that a difference or relationship exists.
“Statistical significance” does not mean “significant” in the sense of “important”. Statistical significance tells you if the relationship you observed in your sample is likely to hold up in the population. It therefore tells you if you can generalize your finding from your sample to your population. How much taller do men have to be than women for us to say that they are taller? This depends on the average heights and the number of samples. This is what the p-value helps quantify: how likely a difference at least this large would be if men and women were really the same height on average.
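The effect of sample size on the heights example can be illustrated with an exact permutation test. This is a swapped-in technique chosen because it needs only the standard library; it is not the test used in the presentation, and the heights below are invented. The function name `exact_permutation_p` is hypothetical.

```python
from itertools import combinations
from statistics import mean

def exact_permutation_p(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means.

    Null hypothesis: group labels are irrelevant, so every way of splitting
    the pooled values into groups of these sizes is equally likely.
    """
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        # Count splits at least as extreme as the one actually observed.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total

# Hypothetical heights in cm; every man here is taller than every woman.
men = [178.0, 180.0, 175.0, 182.0, 177.0]
women = [165.0, 163.0, 168.0, 162.0, 166.0]

alpha = 0.05                                       # significance level, set in advance
p_small = exact_permutation_p(men[:3], women[:3])  # 3 men vs 3 women
p_large = exact_permutation_p(men, women)          # 5 men vs 5 women
```

With only 3 men and 3 women there are just 20 possible splits, so even perfectly separated groups cannot give a two-sided p-value below 0.1 and the null hypothesis cannot be rejected at the 5% level. With 5 and 5 the same pattern gives 2/252, which is below 0.05.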
  • Demos: Export to Excel and resources center? Test statistics give a single value, so they are not something we would want to map, but they may be something that you can use to support your analysis. A number of statistical tests are available in Excel and we can use the new export to Excel tool (at 10.2). There are also tools available for download on the resources center, written by the analysis team. A more extensive suite of tests is available in the Python library SciPy, which you can then use from the Python command line or in your scripts. Land cover by watershed: change in forested area between 2001 and 2006.
  • The problem with the presence of spatial autocorrelation is that it corrupts standard statistical tests, so there is a real need for true spatial statistical tests. Spatial autocorrelation is determined by both similarities in position and similarities in attributes. Spatial autocorrelation that is more positive than expected at random indicates the clustering of similar values across geographic space, while significant negative spatial autocorrelation indicates that neighboring values are more dissimilar than expected by chance. In ArcGIS, for statistical hypothesis testing, Moran's I values are transformed to z-scores, in which values greater than 1.96 or smaller than −1.96 indicate spatial autocorrelation that is significant at the 5% level.
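A toy calculation shows how Moran's I separates clustered from dispersed values. This sketch uses the standard global Moran's I formula with binary neighbour weights on a made-up row of six cells; it does not reproduce the ArcGIS tool or its z-score transformation, and the function and variable names are hypothetical.

```python
def morans_i(values, neighbours):
    """Global Moran's I with binary weights (1 if neighbours, else 0).

    `neighbours` maps each index to the list of indices adjacent to it.
    """
    n = len(values)
    m = sum(values) / n
    dev = [v - m for v in values]
    w_sum = sum(len(nbs) for nbs in neighbours.values())
    cross = sum(dev[i] * dev[j] for i, nbs in neighbours.items() for j in nbs)
    denom = sum(d * d for d in dev)
    return (n / w_sum) * (cross / denom)

# Six cells in a row; each cell neighbours the cells directly beside it.
n_cells = 6
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < n_cells] for i in range(n_cells)}

clustered = [1, 1, 1, 5, 5, 5]      # similar values sit together
alternating = [1, 5, 1, 5, 1, 5]    # neighbours are always dissimilar

i_clustered = morans_i(clustered, chain)
i_alternating = morans_i(alternating, chain)
expected_if_random = -1 / (n_cells - 1)   # expected I under no autocorrelation
```

The clustered row gives a positive I (above the value expected at random) and the alternating row gives a strongly negative I. Whether either is statistically significant depends on the z-score, which this sketch does not compute.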
  • So what is spatial interpolation: Closer points should have less difference in value than points farther apart
  • There are 8 interpolation tools in Spatial Analyst (SA) and 15 in Geostatistical Analyst (GA). We won’t be covering them all but hopefully I will cover enough of the key points that you can explore others on your own.
  • Input must represent the highs and lows of values
  • Finds the closest subset of input samples to a query point and applies weights to them based on proportionate areas (from Voronoi/Thiessen polygons) to interpolate a value. Local interpolator: uses a subset of samples that surround a query point, and interpolated values are always within the range of the samples used. The surface passes through the input samples and is smooth everywhere except at the locations of the input samples. The proportion of overlap between the Thiessen polygons (proximal solution) and an overlaid Voronoi polygon defines the weight. Unlike distance-based methods such as IDW, it is not affected by the distribution of the data.
  • Combines the ideas of proximity in Thiessen polygons and gradual change in a trend surface. Exact interpolator: input must represent the highs and lows of values. Assumes spatial autocorrelation in the data. Power parameter: controls the weighting by distance. A higher value gives the nearest points more emphasis (the surface will be less smooth). The optimal value is where the mean absolute error is at its lowest. The same result is obtained from each extension given the same inputs.
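The role of the power parameter is easy to see in a bare-bones version of inverse distance weighting. This is a minimal sketch of the general technique with two invented sample points; it uses all samples rather than a search neighbourhood and is not the ArcGIS implementation. The function name `idw` is hypothetical.

```python
import math

def idw(samples, query, power=2.0):
    """Inverse distance weighted estimate at `query`.

    `samples` is a list of (x, y, value) tuples. Weights are 1 / distance**power,
    so a higher power gives the nearest samples more influence.
    """
    num = 0.0
    den = 0.0
    for x, y, value in samples:
        d = math.dist((x, y), query)
        if d == 0.0:
            return value          # exact interpolator: honour the sample value
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Two hypothetical samples 4 units apart with values 10 and 20.
samples = [(0.0, 0.0, 10.0), (4.0, 0.0, 20.0)]
query = (1.0, 0.0)                # 1 unit from the first, 3 from the second

estimate_p1 = idw(samples, query, power=1.0)
estimate_p2 = idw(samples, query, power=2.0)
at_sample = idw(samples, (0.0, 0.0))
```

Raising the power from 1 to 2 moves the estimate from 12.5 to 11.0, closer to the value of the nearer sample. In both cases the estimate stays within the range of the input values, as the note above states.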
  • Placing the model through the points (i.e. finding the curve of best fit) gives us a measure of the spatially correlated random component. Semivariance is the measure of interdependence between the values, based on how close they are to each other.
  • With kriging in Spatial Analyst the data cannot be transformed, so it must be normally distributed. In Geostatistical Analyst it can be transformed. Calculations use the transformed data and the back transformation is then done automatically. Normal score transformation: fits a mixture of normal distributions to the data.
  • Another assumption of many geostatistical techniques is that your data is stationary: its statistical properties, i.e. mean and variance, are independent of absolute location. Covariance depends only on the relative locations of the sites, the distance and direction between them, and not on their exact location. In a spatial or temporal context, such dependence is called autocorrelation. A stationary process has the property that the mean, variance and autocorrelation structure do not change over space. Stationarity can be defined in precise mathematical terms, but for our purpose we mean a flat-looking series without trend, with constant variance over time, a constant autocorrelation structure over time and no periodic fluctuations. EBK can be effectively used with non-stationary data.
  • Histogram: values in far-removed bars to the left or right may indicate outliers. QQ plot: values at the tails of a normal QQ plot can also be outliers. Semivariogram cloud: shows the relationship between pairs of points. Points that are close together but have high differences in their values may be outliers.
  • To reduce the number of points in the empirical semivariogram, the pairs of locations are grouped based on their distance from one another. This grouping process is known as binning. The lag size is the width of the distance interval used for each bin. The default is selected using a reasonable rule of thumb based on the data extent, but a more robust method is to use the Average Nearest Neighbor tool (in the Spatial Statistics toolbox). This will help you find the average distance between points and their nearest neighbors. If your data is clustered, you might need to use a smaller lag size than the average nearest neighbor distance to obtain a more accurate measure for the nugget in the semivariogram. The nugget represents the variation at distances shorter than the smallest distance between points in the data, below which you cannot understand the relationship between distance and value. An easier approach is to use the optimise button. It will help fit the semivariogram model, primarily focussing on the range parameter, and is based on minimising the mean square error.
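Binning can be sketched in a few lines. The example below computes an empirical semivariogram as half the average squared difference between pairs of values, grouped into distance bins of width `lag_size`. The sample points are invented, the function name is hypothetical, and the binning rule (distances in (0, lag] fall in the first bin) is an assumption rather than the Geostatistical Analyst behaviour.

```python
import math
from itertools import combinations

def empirical_semivariogram(samples, lag_size, n_lags):
    """Return a list of (bin upper distance, pair count, semivariance).

    `samples` is a list of (x, y, value) tuples. Each pair of samples is
    assigned to a distance bin and contributes 0.5 * (value_i - value_j)**2.
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for (x1, y1, z1), (x2, y2, z2) in combinations(samples, 2):
        d = math.dist((x1, y1), (x2, y2))
        k = math.ceil(d / lag_size) - 1    # distances in (0, lag_size] -> bin 0
        if 0 <= k < n_lags:                # skips coincident and too-distant pairs
            sums[k] += 0.5 * (z1 - z2) ** 2
            counts[k] += 1
    return [((k + 1) * lag_size, counts[k], sums[k] / counts[k])
            for k in range(n_lags) if counts[k] > 0]

# Four hypothetical samples along a transect, one unit apart.
samples = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (2.0, 0.0, 4.0), (3.0, 0.0, 7.0)]
bins = empirical_semivariogram(samples, lag_size=1.0, n_lags=3)

for distance, pairs, gamma in bins:
    print(distance, pairs, round(gamma, 3))
```

Semivariance rises with distance here (closer points are more alike), and the number of pairs per bin falls, which is why the longest lags are usually the least reliable.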
  • EBK makes multiple simulations of the semivariogram, and we are looking for the median to fall within the 25th and 75th percentiles. Each location uses a weighted sum of the distributions. It creates different, local models across the area. You can overlap these models to create a smooth surface. Advantages: requires minimal interactive modeling; standard errors of prediction are more accurate than other kriging methods; more accurate than other kriging methods for small or nonstationary datasets. Disadvantages: processing is slower than other kriging methods; limited customization.
  • Predictions should be unbiased and centered on the true values. If the prediction errors are unbiased, the mean prediction error should be near zero. But this value depends on the scale of your data, so the standardised mean (Mean Standardized) is also reported, and this should also be near zero. If the root-mean-square standardized error is > 1 you are underestimating the variability in your predictions and, if it is < 1, you are overestimating it.
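The cross-validation diagnostics described above can be written out explicitly. This is a sketch under stated assumptions: the measured values, predictions and standard errors are invented, the function name is hypothetical, and the average standard error is taken as the root mean square of the prediction standard errors, which is one common definition.

```python
import math

def cross_validation_summary(measured, predicted, standard_errors):
    """Summary diagnostics from leave-one-out style cross-validation results."""
    errors = [p - m for p, m in zip(predicted, measured)]
    standardized = [e / se for e, se in zip(errors, standard_errors)]
    n = len(errors)
    return {
        "mean_error": sum(errors) / n,                                   # want near 0
        "rmse": math.sqrt(sum(e * e for e in errors) / n),               # want small
        "mean_standardized": sum(standardized) / n,                      # want near 0
        "rms_standardized": math.sqrt(sum(s * s for s in standardized) / n),  # want near 1
        "average_standard_error": math.sqrt(sum(se * se for se in standard_errors) / n),
    }

# Hypothetical cross-validation results for four sample points.
measured = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 11.0, 15.0, 15.0]
standard_errors = [0.5, 0.5, 0.5, 0.5]

summary = cross_validation_summary(measured, predicted, standard_errors)
```

In this made-up case the predictions are unbiased (mean error 0), but the root-mean-square standardized error is 2, well above 1. The reported standard errors (0.5) are too small compared with the actual errors (1.0), so the model is underestimating the variability in its predictions.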
  • Demos: When does it make sense? Think about what the data shows (totals with sample data, interpolation with emissions). How can you use descriptive statistics? Comparisons with other (reference) areas. Spatial linear mean > anisotropy.

    1. 1. NACIS Annual Meeting Oct 9-11, 2013 | Greenville, South Carolina Workshop More than just colouring in: building maps with a solid analytical foundation Linda Beale PhD
    2. 2. Visualization process • Based on clear need and purpose - Who is the audience? - What is the intended purpose? - What medium is to be used? • These goals cannot all be accomplished by visual tricks
    3. 3. What does GIS offer cartography? • Is it a hammer to crack a walnut? - Combining different data to get new information - Bringing data together from disparate sources - Part of the process of making informative and different maps • Is statistical analysis really needed? - The development and application of methods to collect, analyze and interpret data - The science of learning from data
    4. 4. Spatial analysis is about solving problems • What is inside an area? • What is nearby? • Where are the events concentrated? • Where do things move over time? • Why do things occur where they do? • How can we estimate values for a whole area? • What is a suitable location for …? • Maps are needed to communicate the result
    5. 5. Getting clarity • Descriptive statistics - Help understand the data as part of analysis or to quantify data - Commonly aspatial i.e. result is not dependent on location - Some spatial methods
    6. 6. Basic descriptors
       Method | Use | ArcGIS tools
       Total | Count or sum of values | Summary Statistics / Spatial Join / Frequency / Tabulate Intersection; Neighborhood & Zonal Statistics (Spatial Analyst)
       Minimum, Maximum | Smallest and largest values | Statistics / Summary Statistics / Spatial Join; Histogram (Geostatistics); Neighborhood & Zonal Statistics / Get & set raster properties (Spatial Analyst)
       Mode | Most commonly occurring | Spatial Join; Neighborhood & Zonal Statistics (Spatial Analyst)
       Median | Central value | Spatial Join; Histogram (Geostatistics); Neighborhood & Zonal Statistics (Spatial Analyst)
       Mean | Average value | Statistics / Summary Statistics / Spatial Join; Neighborhood & Zonal Statistics (Spatial Analyst)
    7. 7. Data distributions
       Method | Use | ArcGIS tools
       Range | Max-min | Summary Statistics / Spatial Join; Neighborhood & Zonal Statistics (Spatial Analyst)
       Standard deviation | Average deviation about the mean | Summary Statistics; Neighborhood & Zonal Statistics (Spatial Analyst)
       nth Quantile | Value that is nth way through a sorted list | Display Properties; Histogram (Geostatistics)
    8. 8. Demo Finding quantity by area Summarizing by area
    9. 9. Demo Finding percentage area Tabulate Intersection
    10. 10. Spatial descriptors
        Method | ArcGIS tools
        Mean | Mean Center; Linear Directional Mean
        Central value | Central Feature; Median Center
        Distribution | Standard Distance; Directional Distribution
    11. 11. Demo Finding direction Linear directional mean
    12. 12. Normalization • Aspatially: - Normalization is to transform a set of measurements so that they may be compared in a meaningful way - Examples: standard score (z values), coefficient of variation • Spatially: - Normalization transforms measures of magnitude (counts or weights) into measures of intensity - Using normalization we can take into account the differences between the areas (e.g. size of area, population size etc.)
    13. 13. Understanding quantity • We see quantity related to size
    14. 14. As we often see it….
    15. 15. Or… • So, we must map ‘like’ with ‘like’
    16. 16. Demo Normalization
    17. 17. Distributions and patterns • Density surfaces of count per unit area - Looking at concentrations of features - Seeing patterns of features - Hotspots, heat maps • A density surface reflects the likelihood of an event occurring in each cell (bivariate probability density function)
    18. 18. Demo Showing distribution Density analysis
    19. 19. Maps can lie…so can statistics • Assumptions must be met; for example, statistical tests are either: - Parametric: data distribution assumptions must be met - Non-parametric: distribution-free • Analysis is often concerned with explaining differences - Hypothesis testing - Statistical significance does not mean ‘important’
    20. 20. Demo Identifying Clusters Hotspot analysis
    21. 21. Demo Temporal patterns Coxcombs or Rose diagrams
    22. 22. Demo Comparisons Percentage Difference
    23. 23. Spatial autocorrelation • “Everything is related to everything else, but near things are more related than distant things." Tobler (1970) • Spatial autocorrelation statistics evaluate the degree of spatial dependency among observations
    24. 24. Interpolation • From Latin interpolates - Meaning: to estimate a value that lies between two other values • Interpolation is required when: - We have samples from something that is continuous - A discrete surface has a different resolution (or cell size) to that required
    25. 25. Spatial interpolation • Spatial interpolation is based on the notion that points which are close together in space tend to have similar attributes (Tobler’s First Law of Geography) • If the relationship between points and their values is determined by: - distance between points = isotropy - distance and direction = anisotropy • Interpolated values are reliable only to the extent that the spatial dependence of the phenomenon can be assumed
    26. 26. Interpolation in ArcGIS • IDW • Kriging • Natural Neighbor • Spline • Spline with Barriers • Topo to Raster • Topo to Raster by File • Trend • Global polynomial • Local polynomial • Inverse distance weighted • Radial basis functions • Diffusion kernel • Kernel smoothing • Ordinary kriging • Simple kriging • Universal kriging • Indicator kriging • Probability kriging • Disjunctive kriging • Gaussian geostatistical simulation • Areal interpolation • Empirical Bayesian kriging
    27. 27. Deterministic methods • The data contains the full range of possible values • Things close to one another are more alike than those farther apart • The outcome is exactly known and based on the input
    28. 28. Natural neighbor • Weighted average technique based on Voronoi polygons • Delaunay triangulation (in dotted lines) on top of Voronoi polygons • Delaunay triangulation: the geometric dual of Voronoi, i.e. natural bisection between Voronoi polygons which reverses the face inclusion
    29. 29. IDW (inverse distance weighting) • Output is limited to the range of the values used to interpolate • Based on the assumption that the interpolating surface should be influenced most by the nearby points and less by the more distant points - Assumes the surface is driven by local variation • Weights assigned diminish with distance from the interpolation point - Sample points should have an even distribution
    30. 30. Demo Interpolation to area Natural Neighbor and IDW
    31. 31. Geostatistical methods • Uses the relationships between your data locations and their values, assuming: - Data is normally distributed - Data exhibits stationarity (no local variation) - Data has spatial autocorrelation - Data is not clustered (simple kriging has declustering options) - Data has no local trends (local trends can be removed during interpolation, and these trends are accounted for in the prediction calculations)
    32. 32. Kriging • Assumes that spatial variation can be decomposed into 3 main components: 1. Deterministic variation or trend/drift (trend analysed by trend surface analysis techniques) 2. Spatially correlated, random variation (analysed by computing the semivariance) 3. Spatially uncorrelated variation (noise) • Provides measures of the certainty or accuracy of the predictions
    33. 33. Normal distribution - Histogram - A normal QQ plot (probability plot) - Bell-shaped - No outliers - Mean ≈ Median - Skewness ≈ 0 - Kurtosis ≈ 3
    34. 34. Transformations • Transformations can be used to bring data to a normal distribution e.g. logarithms, box-cox, square root
    35. 35. Data stationarity • Statistical properties of data (e.g. mean, variance) are independent of absolute location • Covariance depends only on the relative locations of the sites (e.g. the distance and direction between them) and not their exact location • Create a Voronoi map symbolized by: • Entropy • Standard Deviation
    36. 36. Trend • Systematic changes in the mean of the data values across the area of interest - Can be difficult to distinguish from autocorrelation and anisotropy - Trend removal options
    37. 37. Dealing with outliers Outliers statistically affect your data • They may be real and important or may be errors (such as input errors) Possible solution: • Remove outliers from the modeling step (semivariogram) • Use the full dataset for prediction
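    38. 38. The semivariogram • Shows the spatial autocorrelation of the measured sample points • [Figure: semivariance plotted against lag distance, labelled with sill, partial sill, range and nugget] • Semivariogram(distance h) = 0.5 * average[(value_i - value_j)^2], averaged over all pairs of points i and j separated by distance h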
    39. 39. Empirical Bayesian Kriging • Spatial relationships are modeled automatically • Results often better than interactive modeling • Uses local models to capture small scale effects - Doesn’t assume one model fits the entire data
    40. 40. Using EBK • Advantages - Requires minimal interactive modeling - Standard errors of prediction are more accurate than other kriging methods - More accurate than other kriging methods for small or nonstationary datasets • Disadvantages - Processing is slower than other kriging methods - Limited customization
    41. 41. Selecting the best model • Predictions should be unbiased - Mean prediction error should be near zero (depends on the scale of the data), so: standardised mean nearest to 0 • Predictions should be close to known values - Small root-mean-square prediction errors • Correctly assessing the variability - Average standard error nearest the root-mean-square prediction error - Standardised root-mean-square prediction error nearest to 1
    42. 42. Demo Interpolation to area Empirical Bayes Kriging
    43. 43. Take away points… • Good analysis is an important part of cartography • Even basic statistics can be powerful • Spatial data is more complex… but often reveals so much more
    44. 44. Demo Think before you map…