Chapter 9
Statistical Data Analysis
An Introduction to Scientific
Research Methods in Geography
Montello and Sutton
Data Analysis
 Data Analysis
 Helps us achieve the four scientific goals of
description, prediction, explanation, and
control
 Statistical Data Analysis
 Three primary reasons geographers treat data
in a statistical fashion
http://rlv.zcache.com/knowledge_is_power_do_statistics_stats_humor_flyer-p2440846222778564182dwj5_400.jpg
Statistical Description
 Descriptive Statistics
 Parameters
 Central Tendency
 Mode
 Median
 Mean
 Arithmetic mean
 When would you use the median or the mode
instead of the mean?
 Symbols: $\bar{X}$ (sample mean), $\mu$ (population mean)
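One way to explore the question of when the median or mode is preferable to the mean is a small sketch like the following. The income and land-use values are hypothetical and chosen only to show a skewed distribution and a nominal variable.

```python
import statistics

# Hypothetical household incomes (in $1000s); one very large value skews the data.
incomes = [32, 35, 35, 38, 41, 44, 250]

mean_income = statistics.mean(incomes)      # pulled upward by the single large value
median_income = statistics.median(incomes)  # middle value, barely affected by it
mode_income = statistics.mode(incomes)      # most frequent value

print(mean_income, median_income, mode_income)

# For nominal (category) data, the mean and median are undefined,
# but the mode still works.
land_use = ["forest", "urban", "forest", "farm", "forest", "urban"]
land_use_mode = statistics.mode(land_use)
print(land_use_mode)
```

With skewed data the mean lands well above most of the observations, while the median stays near the bulk of the values.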
Descriptive Statistics
 Variability
 Range
 = largest value – smallest value
 Variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
 Standard Deviation
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
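The variability measures on this slide can be computed directly from their definitions. This is a minimal sketch using hypothetical values treated as a complete population, so the sums are divided by N rather than N - 1.

```python
import math
import statistics

# Hypothetical population of N = 5 values (e.g., daily rainfall in mm).
values = [2, 4, 4, 6, 9]

n = len(values)
mu = sum(values) / n                                # population mean
value_range = max(values) - min(values)             # largest value - smallest value
variance = sum((x - mu) ** 2 for x in values) / n   # mean squared deviation from mu
std_dev = math.sqrt(variance)                       # square root of the variance

# Cross-check against the standard library's population versions.
assert abs(variance - statistics.pvariance(values)) < 1e-12
assert abs(std_dev - statistics.pstdev(values)) < 1e-12

print(value_range, variance, std_dev)
```

The standard deviation is often easier to interpret than the variance because it is in the same units as the data.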
Descriptive Statistics
 Form
 Modality
 Skewness
 Positive
 Negative
 Symmetry
 Unimodal – Bell-shaped
 Normal Distribution
http://people.eku.edu/falkenbergs/images/skewness.jpg
Descriptive Statistics
 Derived Scores
 Percentile Rank
 Highest – 99th percentile
 Where is the median?
 Z-score
 Standard deviation units above or below the mean
$$z = \frac{x - \mu}{\sigma}$$
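A z-score can be converted to a percentile rank if the scores are assumed to follow a normal distribution. The sketch below makes that assumption; the score, mean, and standard deviation are hypothetical, and `z_score` is an illustrative helper name.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of standard deviation units that x lies above or below the mean."""
    return (x - mu) / sigma


# Hypothetical example: a score of 130 where the mean is 100 and the sd is 15.
z = z_score(130, 100, 15)

# Assuming normally distributed scores, the cumulative probability at z
# gives the percentile rank.
percentile = NormalDist().cdf(z) * 100

# A score exactly at the mean has z = 0; for a normal distribution that is
# also the median, i.e. the 50th percentile.
median_percentile = NormalDist().cdf(z_score(100, 100, 15)) * 100

print(z, round(percentile, 1), median_percentile)
```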
Descriptive Statistics
 Relationship
 Linear Relationship
 Positive
 Negative
 Relationship Strength
 Weak, strong, no relationship
 Correlation Coefficient
 Between -1 and 1
 0 – no relationship
 Regression Analysis
 Criterion variables (Y)
 Predictor variables (X)
http://hosting.soonet.ca/eliris/remotesensing/LectureImages/correlation.gif
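The correlation coefficient and a simple regression line can both be built from the same sums of deviations. The sketch below uses Pearson's r and ordinary least squares with one predictor, which is one common choice and not necessarily the only method the chapter has in mind; the data are hypothetical.

```python
import math

# Hypothetical paired observations: predictor X and criterion Y.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squared deviations and cross-products.
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Pearson correlation coefficient: always between -1 and 1.
r = s_xy / math.sqrt(s_xx * s_yy)

# Least-squares regression line: predicted Y = intercept + slope * X.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

print(round(r, 3), slope, intercept)
```

A positive r and a positive slope both indicate a positive linear relationship; r additionally expresses how strong it is.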
Correlation – Causation?
“Correlation doesn’t imply causation, but it does
waggle its eyebrows suggestively and gesture
furtively while mouthing ‘look over there’.” - XKCD
http://xkcd.com/552/
Statistical Inference
 Inferential Statistics
 Statistics
 Sampling error
 Given our sample statistics, we infer our
parameters
 Assign probabilities to our guesses
 The power and difficulty of inferential statistics
come from deriving probabilities about how
likely it is that sample patterns reflect
population patterns
Inferential Statistics
 Sampling distribution
 Ex: sampling distribution of means – shows the
probability that a single sample would have a
mean within some given RANGE of values
 Central limit theorem – for sufficiently large samples, the
sampling distribution of sample means will be approximately
normal, with a mean equal to the population mean and a standard
deviation equal to the population standard
deviation divided by the square root of the
sample size
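The central limit theorem can be checked by simulation. The sketch below draws many samples from a strongly skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen here only for illustration) and compares the spread of the sample means with the predicted value $\sigma / \sqrt{n}$. The seed, sample size, and number of samples are arbitrary.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 50              # size of each sample
num_samples = 5000  # number of samples drawn

# Skewed population: exponential with mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)  # should be close to the population mean
sd_of_means = statistics.pstdev(sample_means)   # empirical standard error
predicted_se = 1.0 / math.sqrt(n)               # sigma / sqrt(n)

print(round(mean_of_means, 3), round(sd_of_means, 3), round(predicted_se, 3))
```

Even though individual values are far from normally distributed, a histogram of `sample_means` would look roughly bell-shaped.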
Inferential Statistics
 Generation of sampling distributions
 Assumptions
 Distributional assumptions
 Nonparametric
 Parametric
 Normality
 Homogeneity of variance
 Independence of scores
 Correct specification of models
Estimation and Hypothesis Testing
 Estimation
 Point estimation
 Confidence Interval
 Usually 95%
 Hypothesis Testing
 Null hypothesis
 A hypothesis about the exact (point) value of a
parameter or set of parameters
 Use sample statistics to make an inference about
the probable truth of our null hypothesis
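A 95% confidence interval for a mean can be sketched as the point estimate plus or minus a margin of error. The example below borrows the Denver television figures from the worked example in this deck (sample mean 22, population standard deviation 5.6, n = 50) and assumes a normal sampling distribution.

```python
import math
from statistics import NormalDist

x_bar, sigma, n = 22, 5.6, 50  # values from the Denver example in this deck
confidence = 0.95

# Critical z value that leaves (1 - confidence) / 2 in each tail; about 1.96.
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Margin of error = critical value * standard error of the mean.
margin = z_crit * sigma / math.sqrt(n)
ci = (x_bar - margin, x_bar + margin)

print(round(ci[0], 2), round(ci[1], 2))
```

The point estimate is the sample mean itself; the interval expresses how far the population mean could plausibly be from it at the chosen confidence level.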
Hypothesis Testing
 Alternative Hypothesis
 Hypothesis that the parameter does not equal the exact value hypothesized in the null
 A range rather than an exact value
 Modus Tollens
 Useful for disconfirming
 Not confirming!
If A is true, then B is true.
 B is not true. Therefore, A is not true.
 B is true. Therefore, ???
Example
 From a recent nationwide study it is known that the
typical American watches 25 hours of television per
week, with a population standard deviation of 5.6 hours.
Suppose 50 Denver residents are randomly sampled
with an average viewing time of 22 hours per week and a
standard deviation of 4.8. Are Denver television viewing
habits different from nationwide viewing habits?
 Step 1: State your null and alternative hypotheses
$$H_0: \bar{X} = 25 \qquad H_A: \bar{X} \neq 25$$
 What is this saying?
Example
 Step 2: Determine your appropriate test statistic and its sampling
distribution assuming the null is true
 We are testing a sample mean where n > 30 and the population standard
deviation is known, so a z distribution can be used
 Step 3: Calculate the test statistic from your sample data
 Sample: $\bar{X} = 22$, $s = 4.8$, $n = 50$
 Population: $\mu = 25$, $\sigma = 5.6$
$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{22 - 25}{5.6 / \sqrt{50}} = -3.79$$
 Step 4: Compare the empirically obtained test statistic to the null
sampling distribution
 P value: $p \approx .0001$ in the lower tail (about $.0002$ two-tailed)
 OR Critical value at .05 significance level: z = ±1.96
 Decision: Reject the null hypothesis
 -3.79 is less than -1.96: reject
 The p value is very small, less than .05 and even .01: reject
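The four steps of the Denver example can be reproduced in a few lines. This is a sketch of a two-tailed one-sample z test using only the numbers given in the example; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Values from the example: Denver sample and nationwide population.
x_bar, n = 22, 50
mu, sigma = 25, 5.6
alpha = 0.05

# Test statistic: how many standard errors the sample mean is from mu.
z = (x_bar - mu) / (sigma / math.sqrt(n))

# Two-tailed p value: probability of a z at least this extreme in either tail.
p_two_tailed = 2 * NormalDist().cdf(-abs(z))

# Critical value for a two-tailed test at the chosen significance level.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject_null = abs(z) > z_crit

print(round(z, 2), p_two_tailed, round(z_crit, 2), reject_null)
```

Both decision rules agree: the test statistic falls beyond the critical value, and the p value is far below .05.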
Error
 You have made either a correct inference
or a mistake
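 Type I error: rejecting a null hypothesis that is actually true; its probability is the rejection level, α
 Type II error: failing to reject a null hypothesis that is actually false; its probability is β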
http://www.mirrorservice.org/sites/home.ubalt.edu/ntsbarsh/Business-stat/error.gif
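One way to see what a Type I error rate means is to simulate many studies in which the null hypothesis is true and count how often it is rejected anyway. The sketch below reuses the population values from the television example (mean 25, standard deviation 5.6) and assumes normally distributed viewing times, which the source does not state. The seed and number of trials are arbitrary.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed so the run is repeatable

mu, sigma, n, alpha = 25, 5.6, 50, 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

trials = 4000
rejections = 0
for _ in range(trials):
    # The null hypothesis is TRUE here: every sample comes from the
    # population with mean mu.
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    z = (fmean(sample) - mu) / (sigma / math.sqrt(n))
    if abs(z) > z_crit:
        rejections += 1  # rejecting a true null is a Type I error

type_i_rate = rejections / trials
print(type_i_rate)
```

The observed rate should sit close to the chosen rejection level of .05, because that level is exactly the long-run share of true nulls that get rejected.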
Data in Space and Place
 Spatiality is a focus in geography, unlike other disciplines
 Spatial autocorrelation
 First Law of Geography: Everything is related to everything else,
but near things are more related than distant things
 Positive vs. negative spatial autocorrelation
 A violation of the important statistical assumption of
independence
 Ex: If it's raining in my backyard, I can say with a high degree of
confidence it's raining in my neighbor’s backyard, but my level of
confidence that it is raining across town is lower, and 300 miles
away even lower
 Variogram
http://www.innovativegis.com/basis/Papers/Other/ASPRSchapter/Default_files/image023.png
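A variogram summarizes how dissimilar values become as the distance between locations grows. The sketch below computes a simple empirical semivariogram, half the average squared difference between values a given lag apart, for hypothetical rainfall readings along an evenly spaced transect. Exact lag matching is a simplification; real analyses usually group pairs into distance bins.

```python
def empirical_semivariogram(positions, values, lag):
    """Half the mean squared difference between values exactly `lag` apart."""
    sq_diffs = [
        (values[i] - values[j]) ** 2
        for i in range(len(values))
        for j in range(i + 1, len(values))
        if abs(positions[i] - positions[j]) == lag
    ]
    return sum(sq_diffs) / (2 * len(sq_diffs))


# Hypothetical rainfall readings (mm) at stations 1 km apart along a transect.
positions = [0, 1, 2, 3, 4, 5]
rainfall = [10, 11, 13, 12, 16, 20]

gammas = {lag: empirical_semivariogram(positions, rainfall, lag) for lag in (1, 2, 3)}
print(gammas)
```

Semivariance that rises with distance is the signature of positive spatial autocorrelation: nearby stations are more alike than distant ones.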
Data in Space and Place
 “Spatial data are special” – a special difficulty
 Which areal units should be used to analyze
geographic data
 Modifiable Areal Unit Problem
 Gerrymandering
 Geographic phenomena are often scale
dependent
 Must identify the scale of a phenomenon and collect
and organize data in units of that size
 Data aggregation issues
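The modifiable areal unit problem can be illustrated with a toy gerrymandering example: the same individual-level data give different aggregate results depending on how the areal units are drawn. The votes and both zoning schemes below are hypothetical, and the sketch ignores the requirement that real districts be contiguous.

```python
# Hypothetical votes in 12 precincts: 1 = party A, 0 = party B (6 votes each).
votes = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]


def districts_won_by_a(zoning):
    """Count districts in which party A holds a strict majority."""
    won = 0
    for district in zoning:
        a_votes = sum(votes[i] for i in district)
        if a_votes > len(district) / 2:
            won += 1
    return won


# Two ways of grouping the same 12 precincts into 4 districts of 3.
zoning_1 = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
zoning_2 = [[0, 1, 3], [2, 5, 4], [8, 11, 6], [7, 9, 10]]

print(districts_won_by_a(zoning_1), districts_won_by_a(zoning_2))
```

The underlying votes never change, yet party A wins one district under the first scheme and three under the second.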
Discussion Questions
 What measure of central tendency is best for nominal
data?
 When pollsters tell you that a candidate is favored by
44% of likely voters, plus or minus 3 percent, what is the
44% and what is the plus/minus 3%?
 A survey of all users of a park in 1980 found the average
number of people per party to be 3.5. In a random
sample of 35 parties in 2000 the average was 2.9. If you
wanted to test if the number of persons per party in 2000
was different from the number in 1980, what would your
null and alternative hypotheses be?
 In the United States, we presume that someone is
innocent. If a guilty person were found to be not guilty,
what type of error would this be?
 A researcher finds that a particular learning software has
an effect on students’ test scores, when actually it does
not. What type of error is this?
