The modules
Module 1 makes the case for knowing about statistics as a transferable skill and to be equipped for social and political debate.
Module 2 is about using descriptive statistics and simple graphical techniques to explore and make sense of data.
Module 3 discusses the Normal curve, the properties of which provide the basis for inferential statistics.
The modules
Module 4 is about the principles of research design and effective data collection.
Module 5 moves from description to inference: using samples of data to learn about the populations they are drawn from.
Module 6 discusses the role of hypothesis testing.
Module 7 is about regression analysis.
The modules
Module 8 moves to modelling point patterns, 'hotspot analysis' and ways of measuring patterns of spatial autocorrelation in data.
Module 9 looks at spatial regression models, geographically weighted regression and multilevel modelling.
Each module is explored more fully in the accompanying textbook, Statistics for Geography and Environmental Science.
Module 1
(Extracts from Chapter 1 of Statistics for Geography and Environmental Science)
DATA, STATISTICS AND GEOGRAPHY
Module overview
To convince you that studying statistics is a good idea!
Our argument is that data collection and analysis are central to the functioning of contemporary society, so knowledge of quantitative methods is a necessary skill to contribute to social and scientific debate.
About statistics
Statistics are a reflective practice: a way of approaching research that requires a clear and manageable research question to be formulated, a means to answer that question, knowledge of the assumptions of each test used, an understanding of the consequences of violating those assumptions, and awareness of the researcher's own prejudices when doing the research.
Some reasons to study statistics
Reasons for human geographers
– Data collection and analysis are central to the functioning of society, to systems of governance and science.
– Knowledge of statistics is an entry into debate, informed critique and the possibility of creating change.
Some reasons to study statistics
Reasons for GI scientists
– To address the uncertainties and ambiguities of data analysis.
– Because of the increased integration of mapping capabilities, data visualizations and (geo-)statistical analysis.
Some reasons to study statistics
Reasons for all students
– They provide a transferable skill set used in other areas of research, study and employment.
– There is a recognised shortage of students with skills in quantitative methods, especially within the social sciences.
Types of statistic
Descriptive
– Used to provide a summary of a set of measurements, e.g. the average.
Inferential
– Use the data at hand to convey information about the population ('the greater something') from which the data are drawn.
Relational
– Consider whether greater or lesser values in one set of data are related to greater or lesser values in another.
Geographical data
These are records of what has happened at some location on the Earth's surface and where.
For many statistical tests the where is largely ignored.
However, it is central to geostatistics and to spatial statistics (as their names suggest).
Some problems when analysing geographical data
Standard statistical tests assume that each 'bit' of data (each observation) has a value that is not influenced by any other.
However, we may often expect there to be geographical patterns in the data.
– Spatial autocorrelation: geographical patterns in the measurements
Some problems when analysing geographical data
Determining what causes what in a complex and dynamic natural or social system is extremely tricky.
Two things may be associated (e.g. greater income inequality and more non-recycled waste) without the one directly causing the other.
Some problems when analysing geographical data
Data and structured forms of enquiry can only tell us so much and may not be appropriate to some types of research, for which a more qualitative, participatory or less representational approach may be better.
Further reading
Chapter 1 of Statistics for Geography and Environmental Science by Richard Harris and Claire Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following key concepts: types of statistics; why error is unavoidable; geographical data analysis; and spatial autocorrelation and the first law of geography.
Module 2
(Extracts from Chapter 2 of Statistics for Geography and Environmental Science)
DESCRIPTIVE STATISTICS
Module overview
This module is about "everyday statistics", the sort that summarise data and describe them in simple ways.
They include the number of home runs this season, average male earnings, numbers unemployed, outside temperature, average cost of a barrel of oil, regional variations in crime rates, pollution statistics, measures of the economy and other "facts and figures".
Data and variables
Data
– A collection of observations: measurements made of something.
A variable
– Another name for a collection of data. Variable because it is unlikely that the data are all the same.
Data types
– These include discrete, continuous, and categorical data.
Simple ways of presenting data
Discrete data: a frequency table; a bar chart.
Continuous data: a summary table; a histogram (with a rug plot).
Information to include in a summary table
Measures of central tendency ("averages")
– The mean and/or median: the "centre" of the data
Measures of spread and variation
– The range (minimum to maximum)
– The interquartile range (the 'mid-spread' of the data)
– The standard deviation, s
More about the standard deviation
Essentially a measure of average variation around the mean.
It is also the square root of the variance.
The variance is the sum of squares divided by the degrees of freedom.
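As a minimal sketch (the data are hypothetical), the chain from sum of squares to degrees of freedom to variance to standard deviation can be traced with Python's standard library:

```python
import math
from statistics import mean, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical measurements
n = len(data)
xbar = mean(data)
sum_of_squares = sum((x - xbar) ** 2 for x in data)
var = sum_of_squares / (n - 1)           # divide by the degrees of freedom
sd = math.sqrt(var)                      # the square root of the variance
```

The result agrees with `statistics.variance`, which makes the same n − 1 correction.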
Boxplots
Are useful for showing the median, interquartile range and range of a set of data, for identifying outliers and also for comparing variables.
Other ways of classifying numeric data
Nominal, ordinal, interval and ratio
Counts and rates
Proportions and percentages
Parametric and non-parametric
Arithmetic and geometric
Primary and secondary
Further reading
Chapter 2 of Statistics for Geography and Environmental Science by Richard Harris and Claire Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following key concepts: data and variables; discrete and continuous data; the range; histograms, rug plots, and stem and leaf plots; measures of central tendency; why averages can be misleading; quantiles; the sum of squares; degrees of freedom; the standard deviation and the variance; box plots; and five and six number summaries.
Module 3
(Extracts from Chapter 3 of Statistics for Geography and Environmental Science)
THE NORMAL CURVE
Module overview
This module introduces the normal curve, so called because many social and scientific data appear to be distributed this way.
The normal curve
It is also known as the Gaussian distribution and is often described as 'bell-shaped'.
It is a family of distributions, all of which have the same probability density function (the same formula defining their shape).
The central limit theorem
The central limit theorem states that the sum (and therefore average) of a large number of independent and identically distributed random variables will approach a normal distribution as the sample size increases, even if the variables are not themselves normally distributed.
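A quick simulation (illustrative only, with an arbitrary seed) shows the theorem in action: means of draws from a uniform distribution — which is decidedly not bell-shaped — pile up around the population mean in an approximately normal fashion.

```python
import random
from statistics import mean, stdev

random.seed(1)
# take the mean of 50 draws from a uniform(0, 1) distribution, 5000 times over
sample_means = [mean(random.random() for _ in range(50)) for _ in range(5000)]
# the means cluster around the population mean of 0.5, with a spread
# close to the theoretical sqrt(1/12) / sqrt(50), roughly 0.041
```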
Properties of a normal curve
Ranges from negative to positive infinity.
Is symmetrical around its mean.
95% of the area under the curve is within 1.96 standard deviations of the mean.
99% of the area is within 2.58 standard deviations.
Properties of a normal curve
Consequently, if a data set is approximately Normal, the probability of selecting, at random, an observation that is within 1.96 standard deviations of the mean is p = 0.95, and the probability it will be within 2.58 standard deviations is p = 0.99.
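These two areas can be checked directly with Python's `statistics.NormalDist` (a minimal sketch):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)
# area under the curve within 1.96 and 2.58 standard deviations of the mean
within_196 = std_normal.cdf(1.96) - std_normal.cdf(-1.96)   # about 0.95
within_258 = std_normal.cdf(2.58) - std_normal.cdf(-2.58)   # about 0.99
```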
Standardising data (z values)
Data are standardised if their original measurement units are replaced with units of standard deviation from the mean (z values).
It is a little like converting a proportion (0 to 1) to a percentage (0 to 100): it doesn't change the shape of the data.
Standardising data (z values)
The z values are calculated by subtracting the mean of the data from each observation and then dividing by the standard deviation.
Once data are standardised, and assuming they are approximately normal, they can be compared against the Standard Normal curve.
This is a special instance of a normal curve that has a mean of zero and a standard deviation of one.
It provides a model or benchmark for the data.
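The calculation is a one-liner per observation (the data below are hypothetical):

```python
from statistics import mean, stdev

data = [12.0, 15.0, 18.0, 21.0, 24.0]    # hypothetical measurements
m, s = mean(data), stdev(data)
z = [(x - m) / s for x in data]
# standardised data have mean 0 and standard deviation 1,
# but the shape of the distribution is unchanged
```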
Probability and the Standard Normal
The area between two z values (under the Standard Normal) is the probability of selecting an observation randomly from the data that will have a z value between those two values.
That area can be determined using a statistical table or equivalent.
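For example, the area between z = −0.5 and z = 1.0 (an illustrative pair of values) can be read from a table or computed in place of one:

```python
from statistics import NormalDist

z = NormalDist()                  # the Standard Normal: mean 0, sd 1
p = z.cdf(1.0) - z.cdf(-0.5)      # P(-0.5 < Z < 1.0), about 0.533
```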
Probability and the Standard Normal
See the worked examples on pp. 62–70 of Statistics for Geography and Environmental Science.
Some data are skewed but can often be transformed to approximate normality
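A log transform (one rung on the 'ladder of transformation' reviewed in the chapter) is a common choice for positively skewed data. A hedged sketch with hypothetical values:

```python
import math
from statistics import mean, median

skewed = [1.0, 2.0, 2.5, 3.0, 4.0, 9.0, 30.0, 110.0]  # hypothetical, positively skewed
logged = [math.log10(v) for v in skewed]
# in the raw data the mean is dragged far above the median by the long tail;
# after the log transform the mean and median are much closer together
```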
The quantile plot
Useful (and better than a histogram) to check for non-normality, such as skew and the presence of outliers.
If the data were normal they'd be distributed along the straight line.
Further reading
Chapter 3 of Statistics for Geography and Environmental Science by Richard Harris and Claire Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following key concepts: properties of normal curves; the central limit theorem; probability and the normal curve; finding the area under the normal curve; skewed data and the 'ladder of transformation'; moments of a distribution; and the quantile plot.
Module 4
(Extracts from Chapter 4 of Statistics for Geography and Environmental Science)
SAMPLING
Module overview
It is rarely possible or necessary to collect all possible data about something that is being studied.
This module is about how to go about collecting a sample of data that is fit for a particular research task.
Sampling
It is common in geographical and other research to gather a sample (or subset) of data from a target population.
The aim is for the sample to be representative of that population.
Sampling bias occurs when the sample favours some parts of the target population more than others, perhaps by sampling at an unrepresentative time or place or because of the data collection method used.
The process of sampling
Define the research question
Review the related literature
Review the scope of the planned study
Construct a sample frame
Select a sample design method
Review the design from practical, ethical, safety and logistical perspectives
Implement the design and collect the data
Sampling methods
Non-probabilistic sampling methods
– Judgemental, convenience, quota, snowball
Probabilistic sampling methods
– Simple random, systematic random, stratified random, clustered random
Sampling methods
The different methods are outlined on pp. 94–105 of Statistics for Geography and Environmental Science.
In general, random sampling methods are preferred because the errors in the data should be random too.
However, a random sample won't necessarily offer a wide enough coverage of the target population.
Therefore stratified samples may be used, which may themselves target specific, representative places to reduce the cost and ease the logistics of the data collection.
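The idea of a stratified random sample — draw a simple random sample of the same fraction from every stratum — can be sketched in a few lines. The population, the `"north"`/`"south"` strata and the 10% fraction are all hypothetical:

```python
import random

random.seed(42)
# hypothetical population of records tagged by region (the stratifying variable)
population = [("north", i) for i in range(100)] + [("south", i) for i in range(300)]

def stratified_sample(pop, key, frac):
    """Draw a simple random sample of the same fraction from every stratum."""
    strata = {}
    for record in pop:
        strata.setdefault(key(record), []).append(record)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, round(frac * len(members))))
    return sample

sample = stratified_sample(population, key=lambda r: r[0], frac=0.1)
# each region is represented in proportion to its size in the population
```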
Sampling error and sample size
The impression that is formed of the target population depends on the sample of data taken to represent it.
It is possible that a random sample accidentally misrepresents the population if it happens only to observe its most unusual occurrences: it is susceptible to sampling error.
The larger the sample (the more observations there are), the smaller the error is expected to be, but with 'diminishing returns'
– the error is generally inversely proportional to the square root of the sample size
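The 'diminishing returns' can be demonstrated by simulation (illustrative, with an arbitrary seed and a uniform(0, 1) stand-in for the population): quadrupling the sample size roughly halves the spread of the sample means.

```python
import random
from statistics import mean, stdev

random.seed(7)

def spread_of_sample_means(n, trials=2000):
    """Standard deviation of many size-n sample means from a uniform(0, 1) population."""
    return stdev(mean(random.random() for _ in range(n)) for _ in range(trials))

e25 = spread_of_sample_means(25)
e100 = spread_of_sample_means(100)
# four times the sample size gives roughly half the sampling error:
# the error is proportional to 1 / sqrt(n)
```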
Sampling error and sample size
The error is also a function of how much the target population varies.
– If it were exactly the same, everywhere, it wouldn't matter where the samples were taken.
A larger sample is costly and more time consuming to collect.
However, a small sample of a highly variable population is unlikely to generate any statistically meaningful analytical results.
Sampling methods: issues and practicalities
Personal safety, gaining permission from an ethics committee, what to do about missing data.
Practical considerations
– Weight and/or volume of the sample, import/export restrictions, analytical costs
Instrument accuracy and scale
Bottom line: if your sample is no good, your analysis won't be any good either.
Further reading
Chapter 4 of Statistics for Geography and Environmental Science by Richard Harris and Claire Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following key concepts: the target population; representative samples; sampling frames; sampling bias; metadata; fitness for purpose and use; sample design; sampling error and sample size; sample size and replicates; and measurement accuracy.
Module 5
(Extracts from Chapter 5 of Statistics for Geography and Environmental Science)
FROM DESCRIPTION TO INFERENCE
Module overview
This module is about inference.
Inference is at the heart of how and why statistics developed.
It moves beyond simply summarising data (the sample) to using those summaries to gain insights into the underlying system, process or structure that the data are measurements of (the population).
A population
Its meaning isn't restricted to "everyone who lives in a particular place" but can be much more abstract.
– "Every possible object (or entity) from which the sample is selected."
– "The complete set of all possible measurements that might hypothetically be recorded."
Informally: the complete 'thing' that you are interested in studying but which can't be measured in its entirety.
Each sample changes our impression of the population
The sample mean and the population mean
Assuming the sample is representative (unbiased), it is possible to estimate the true mean of the population from the mean for the sample.
– The population mean from the sample mean
But that estimate is sample dependent.
– Change the sample and you get a different estimate
Confidence intervals
It is improbable that the sample mean is exactly equal to the population mean.
– And we wouldn't know even if it was (unless we sampled the population in its entirety, in which case there'd be no need to make an estimate!)
However, we can place a confidence interval around the sample mean and estimate the probability that the confidence interval contains the population mean.
The width of a confidence interval
The confidence interval is wider
– The greater the probability you want it to contain the unknown population mean
– The more variable the data are (the greater their variance / standard deviation)
– The less data you have
The standard error (of the mean)
The standard deviation of the data divided by the square root of the number of observations gives an estimate of the standard error (of the mean) and is a measure of uncertainty in the data.
– The greater the standard error, the greater the uncertainty
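Putting the standard error and the confidence interval together (a minimal sketch with hypothetical data; the 1.96 multiplier is the normal-based one, appropriate for larger samples):

```python
import math
from statistics import mean, stdev

sample = [4.1, 5.2, 6.0, 5.5, 4.8, 5.9, 5.1, 4.7, 5.3, 5.6]  # hypothetical
n = len(sample)
se = stdev(sample) / math.sqrt(n)          # standard error of the mean
# an approximate 95% confidence interval around the sample mean
ci = (mean(sample) - 1.96 * se, mean(sample) + 1.96 * se)
```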
Why confidence intervals 'work'
In principle, if a population were sampled a very large number of times, the sample mean calculated in each case and those means then collected together to form a new variable, we'd find that variable to be normally distributed, centred on the population mean and with a standard deviation equal to the standard error of the mean.
Small samples
For small samples the confidence interval will be underestimated if it is calculated with reference to a normal distribution.
A t-distribution is used instead.
This is 'fatter' than the normal.
Intuitively: we are more cautious with small samples that contain little information. The confidence intervals are widened to reflect that caution.
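The widening can be seen by comparing multipliers (the data are hypothetical; 2.262 is the standard tabulated t value for 9 degrees of freedom at the 95% level):

```python
import math
from statistics import mean, stdev

sample = [12.1, 11.8, 12.6, 12.0, 11.5, 12.4, 12.2, 11.9, 12.3, 11.7]  # hypothetical, n = 10
se = stdev(sample) / math.sqrt(len(sample))
z_crit = 1.96      # normal-based multiplier
t_crit = 2.262     # t-distribution multiplier for df = 9 (from tables)
normal_ci_width = 2 * z_crit * se
t_ci_width = 2 * t_crit * se   # wider, reflecting the caution due to the small sample
```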
Summary
Mean of the sample: known.
Standard deviation of the sample: known.
Standard deviation of the population: unknown but approximated by the standard deviation of the sample.
Standard error of the mean: estimated as the standard deviation of the sample divided by the square root of the sample size.
Mean of the population: unknown, but we can estimate the probability that it has a value that lies within a given number of standard errors either side of the sample mean (within a given confidence interval).
Further reading
Chapter 5 of Statistics for Geography and Environmental Science by Richard Harris and Claire Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following key concepts: inference, samples and populations; the distribution of the sample means; standard error of the mean; confidence intervals; and the t-distribution and confidence intervals for 'small samples'.
Module 6
(Extracts from Chapter 6 of Statistics for Geography and Environmental Science)
HYPOTHESIS TESTING
Module overview
This module introduces hypothesis testing as a way of formally questioning whether a population mean could plausibly be equal to a hypothesised value, and to consider whether two or more samples of data were most probably drawn from the same population.
The process of hypothesis testing
Define the null hypothesis
Define the alternative hypothesis
Specify an alpha value
– The maximum probability of rejecting the null hypothesis when it is, in fact, correct.
Calculate the test statistic
Compare the test statistic with a critical value
Reject the null hypothesis if the test statistic has greater magnitude than the critical value
One-sample t test
The one-sample t test measures the number of standard errors the sample mean is from an hypothesised value.
The further it is, the less probable it is that the sample is drawn from a population with a mean equal to that hypothesised value.
The p value records that probability.
A p value of 0.05 or less means we can be (at least) "95% confident" that the "true mean" (the population mean) for whatever has been measured is not the hypothesised value.
A p value of 0.01 or less gives 99% confidence.
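The test statistic itself is simple to compute (a sketch with hypothetical data and a hypothetical target value; 2.365 is the standard tabulated two-tailed critical value for 7 degrees of freedom at alpha = 0.05):

```python
import math
from statistics import mean, stdev

sample = [5.1, 4.9, 5.6, 5.2, 4.8, 5.4, 5.0, 5.3]   # hypothetical, n = 8
mu0 = 4.5                                            # hypothesised population mean
se = stdev(sample) / math.sqrt(len(sample))
t = (mean(sample) - mu0) / se   # sample mean is this many standard errors from mu0
# compare |t| with the tabulated critical value 2.365 (df = 7, alpha = 0.05,
# two-tailed); a larger |t| means reject the null hypothesis
```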
Two-sample t test
Considers the probability that two samples of data do not have the same population mean.
– If they don't, it suggests the samples measure categorically different things.
It works by measuring the difference between the sample means relative to the variance of the samples.
There are different versions of the t test, for example for paired data and for whether the two samples have approximately equal variance or not.
An F test is used to compare the sample variances and see if any difference could be due to chance.
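A sketch of the pooled-variance version (the one that assumes approximately equal variances; both samples are hypothetical, and 2.228 is the tabulated two-tailed critical value for 10 degrees of freedom at alpha = 0.05):

```python
import math
from statistics import mean, variance

a = [5.1, 4.9, 5.6, 5.2, 4.8, 5.4]   # hypothetical sample A
b = [4.2, 4.5, 4.1, 4.7, 4.4, 4.0]   # hypothetical sample B
na, nb = len(a), len(b)
# pooled variance: assumes the two samples have roughly equal variances
sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
se = math.sqrt(sp2 * (1 / na + 1 / nb))
t = (mean(a) - mean(b)) / se
# compare |t| with the critical value 2.228 (df = 10, alpha = 0.05, two-tailed)
```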
Analysis of variance (ANOVA)
Used to test whether three or more groups of data have the same population mean.
Considers the variation between groups relative to the variation within groups.
Contrasts can be used to specifically contrast one or more of the groups with one or more of the others.
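The between-versus-within comparison is an F ratio, sketched here from first principles with three hypothetical groups (5.14 is the tabulated 5% critical value for 2 and 6 degrees of freedom):

```python
from statistics import mean

# three hypothetical groups of measurements
groups = [[3.1, 3.5, 3.2], [4.0, 4.4, 4.1], [5.2, 5.0, 5.3]]
k = len(groups)
n = sum(len(g) for g in groups)
grand = mean(x for g in groups for x in g)
ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
F = (ss_between / (k - 1)) / (ss_within / (n - k))
# a large F (well above the tabulated 5% critical value of 5.14 for 2 and 6
# degrees of freedom) casts doubt on the groups sharing a population mean
```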
Two- and one-tailed tests
A two-tailed test is non-directional whereas a one-tailed test is directional.
Consider a one-sample t test:
– The alternative hypothesis for a two-tailed test is only that the population mean is not equal to the hypothesised value.
– For a one-tailed test it also specifies which is the greater.
Non-parametric tests
Non-parametric tests do not begin with fixed assumptions about how the data and the population are distributed.
– E.g. a normal distribution
However, if the assumptions are met, it is better to use the parametric test.
Parametric test and its non-parametric equivalent:
– Two-sample t test: Wilcoxon rank sum test (aka Mann-Whitney test)
– ANOVA: Kruskal-Wallis test
Power
We worry about limiting the probability of rejecting the null hypothesis when it is correct (of making a wrong decision)
– hence requiring a low p value before rejecting it.
But we could avoid that error by never rejecting the null hypothesis.
Except that's daft, because the null hypothesis could be wrong.
So we also need to think about the probability of rejecting the null hypothesis when it is indeed wrong.
– The probability of making this, the right decision, is the power of the test.
Further reading
Chapter 6 of Statistics for Geography and Environmental Science by Richard Harris and Claire Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following key concepts: type I errors; the one-sample t test; hypothesis testing; two- and one-tailed tests; type II errors and statistical power; homoscedasticity, heteroscedasticity and the F test; analysis of variance; measuring effects; and parametric and non-parametric tests.
Module 7
(Extracts from Chapter 7 of Statistics for Geography and Environmental Science)
RELATIONSHIPS AND EXPLANATIONS
Module overview
This module looks at relational statistics, exploring whether higher values in one variable are associated with higher values in another (a positive relationship) or whether higher values in the one are associated with lesser values in the other (a negative relationship).
It also looks at trying to explain the variation found in one variable using others.
Scatter plots
Scatter plots are an effective way of seeing if there is any relationship between two variables, whether it is a straight-line relationship, and to help detect errors in the data.
A positive relationship is when the line of best fit is upwards sloping.
A negative relationship is when it is downwards sloping.
The X variable (horizontal axis) is the independent variable.
The Y variable (vertical axis) is the dependent variable.
It is assumed that the X variable leads to, possibly even causes, the Y variable.
Correlation coefficients
A correlation coefficient describes the degree of association between two sets of paired values.
The Pearson correlation measures the strength of the straight-line relationship of two variables.
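The Pearson coefficient can be computed from first principles (hypothetical paired data):

```python
import math
from statistics import mean

x = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical paired data
y = [2.1, 3.9, 6.2, 8.0, 9.8]
mx, my = mean(x), mean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)         # Pearson correlation coefficient
# r close to +1 here: a strong, positive straight-line relationship
```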
Uses of regression
To summarise data
To make predictions
To explain what causes what
Bivariate regression
Bivariate regression finds a line of best fit to summarise the relationship between two variables.
That line can be used to make predictions for what the Y value would be for a given value of X.
It is a line of best fit, rarely perfect fit.
Regression tables
The strength of the effect of the X variable on the Y is measured by the gradient of the line of best fit.
– It measures whether a change in X will lead to a change in Y and by how much.
We have greater confidence that the effect is genuine and not a chance property of the sample the better the line fits the data (i.e. the less the residual variation around it).
Regression tables report various measures and diagnostics including the measured gradient of the line, the residual error, the probability the gradient could actually be zero (no relationship) and goodness-of-fit measures.
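The gradient and intercept of the least-squares line of best fit come straight from the data (a sketch with hypothetical values; note the fitted line always passes through the point of means):

```python
from statistics import mean

x = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical data
y = [2.1, 3.9, 6.2, 8.0, 9.8]
mx, my = mean(x), mean(y)
# gradient: covariation of x and y relative to the variation in x
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx                      # intercept: line passes through the means

def predict(x_new):
    return b0 + b1 * x_new
```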
Assumptions of regression analysis
There are various types of regression analysis but the most common, Ordinary Least Squares regression, assumes that the two variables are linearly related (or could be transformed to be so) and that the residual errors are random with no unexplained patterns.
Visual checks can easily be made.
Watch out for leverage points, extreme outliers and other violations of these assumptions.
Multiple regression
When two or more X variables are used to explain the Y variable.
In addition to the usual checks (of linearity and of random errors), need to check also for multicollinearity.
It is often helpful to standardize the variables so their effects can be compared.
A strategy for multiple regression
Crawley (2005; 2007) describes the aim of statistical modelling as finding a minimal adequate model.
The process involves going from a "maximal model" containing all the variables of interest to a simpler model that fits the data almost as well, by deleting the least significant variables one at a time (and checking the impact on the model at each stage of doing so).
As part of the process, consideration also needs to be given to outliers and to other checks that the regression assumptions are being met.
Further reading
Chapter 7 of Statistics for Geography and Environmental Science by Richard Harris and Claire Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following key concepts: scatter plots; independent and dependent variables; Pearson's correlation coefficient; the equation of a straight line; residuals; bivariate regression; outliers and leverage points; multiple regression; goodness-of-fit measures; assumptions of OLS regression; and Occam's Razor and the minimal adequate model.
Module 8
(Extracts from Chapter 8 of Statistics for Geography and Environmental Science)
DETECTING & MANAGING SPATIAL DEPENDENCY
Module overview
This module looks at some of the specifically geographical issues of analysing data.
The ecological fallacy
In a general sense
– Means that statistical relationships found at one scale may not apply at another scale, e.g. a correlation, r, that weakens as the scale becomes finer:
    Scale    n      r
    Region   9      -0.95
    LA       376    -0.77
    Ward     8868   -0.55
A more specific meaning
– When inappropriate assumptions are made about individuals from using grouped data
Spatial autocorrelation
Standard statistics assume the observations / errors are independent of each other.
But spatial data tend to be more similar in value at nearby locations than those further away.
– This is positive spatial autocorrelation
Negative spatial autocorrelation is when nearby measurements are 'opposite' to each other.
Detecting spatial autocorrelation
The semi-variogram is used to explore a data set visually and to estimate how far you need to move away from a particular data point before data points at that distance can be considered unassociated with the first.
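An experimental semi-variance can be sketched for hypothetical values sampled at regular spacing along a transect: average half the squared difference between all pairs of points at each separation distance (lag). For spatially autocorrelated data it rises with the lag.

```python
from collections import defaultdict

# hypothetical values sampled at 1-unit spacing along a transect
z = [5.0, 5.2, 5.1, 6.0, 6.4, 6.3, 7.1, 7.0]
squared_diffs = defaultdict(list)
for i in range(len(z)):
    for j in range(i + 1, len(z)):
        squared_diffs[j - i].append((z[i] - z[j]) ** 2)
# experimental semi-variance at each lag (separation distance)
gamma = {h: sum(d) / (2 * len(d)) for h, d in squared_diffs.items()}
# gamma rises with the lag: nearby points are more alike than distant ones
```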
Other measures of global autocorrelation
Moran's I
Getis' G statistic
Geary's C
Join counts method
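As an illustration of the first of these, Moran's I for a tiny hypothetical arrangement of four locations in a row, using a simple contiguity weights matrix:

```python
from statistics import mean

x = [10.0, 12.0, 30.0, 32.0]           # hypothetical values at 4 locations in a row
# contiguity weights: w[i][j] = 1 if locations i and j are neighbours
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
n = len(x)
xb = mean(x)
W = sum(map(sum, w))                   # total weight
num = sum(w[i][j] * (x[i] - xb) * (x[j] - xb) for i in range(n) for j in range(n))
den = sum((xi - xb) ** 2 for xi in x)
I = (n / W) * num / den
# I > 0 here: similar values sit next to each other (positive spatial autocorrelation)
```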
Global vs local measures
A global measure of spatial autocorrelation gives a single summary measure of the patterns of association for the whole study region.
This can conceal more localised patterns within the region.
Global measures can often be 'broken down' into local measures where the patterns of association are measured and compared for sub-regions.
– E.g. Local Moran's I, Local Getis G
Can be used to identify 'hotspots' and 'cold spots' of something (e.g. crime).
Further reading
Chapter 8 of Statistics for Geography and Environmental Science by Richard Harris and Claire Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following key concepts: spatial autocorrelation; the MAUP; the ecological fallacy; semi-variance; the semi-variogram; common structures used to model the semi-variogram; and hotspots.
Module 9
(Extracts from Chapter 9 of Statistics for Geography and Environmental Science)
EXPLORING SPATIAL RELATIONSHIPS
Module overview
This module is about treating where something happens as useful information that may help explain what is happening.
The central idea is that when we find geographical patterns in data, and there is evidence to suggest they did not arise by chance, it would be better to explore and model the cause of the patterns than to treat them as an inconvenience.
Spatial regression
The spatial error model and the spatially lagged y model are examples of spatial regression models that allow for and measure the interdependencies between neighbouring or proximate data.
Neighbourhoods are defined by a weights matrix indicating, for example, if places share a boundary.
Multilevel modelling
Multilevel modelling can be used to model at multiple scales simultaneously and to explore how individual behaviours and characteristics are shaped by the places in which they live or by the organisations they attend.
Because multilevel models can consider people in places, they are sometimes used to generate evidence of a neighbourhood effect.
Also useful for longitudinal analysis (analysis over time).
Geography, computation and statistics
The development of spatial analysis has been made possible by advances in computation.
But techniques like geographically weighted regression (GWR) are characterised by repeat fitting and remain computationally demanding.
There is increasing integration between geographical information science, computer science and statistics.
Further reading
Chapter 9 of Statistics for Geography and Environmental Science by Richard Harris and Claire Jarvis (Prentice Hall / Pearson, 2011)
Includes a review of the following key concepts: cartograms; spatial analysis; weights matrices; spatial econometrics; geographically weighted regression; local indicators of spatial association; and multilevel modelling.