This document discusses the importance of statistics in geography. It notes that geography seeks to classify phenomena, compare locations, generalize patterns, and determine the influence of natural laws on humans. This makes geography a science that uses statistical techniques. Statistics help geographers understand spatial patterns and relationships between natural, physical, and social characteristics of places. Geography deals with spatial data that has locational attributes, so statistical analysis in geography considers the spatial aspects of data. Statistical techniques allow geographers to describe, summarize, analyze, and interpret spatial patterns in geographical data.
Statistical Methods Workshop
1. Prof Ashis Sarkar, W.B.S.E.S.
Managing Editor & Publisher: Indian Journal of Spatial Science
www.indiansss.org
Sampling and Probability in Geography
7-Day Workshop (11 – 17 July, 2017)
On
“Statistical Methods and Techniques of Spatial Analysis”
organized by
Women’s Christian College, Kolkata
in collaboration with
Department of Geography, University of Calcutta
3. 1. Geography is rooted in an ancient practice concerned with the
characteristics of places, in particular the interrelations between natural
habitat, economy and society.
2. Its unique identity was given by Eratosthenes (276–194 BC): geo+graphein
= “earth description.” It seeks to answer the questions of — why things are
a) as they are, and
b) where they are.
Thus, it is concerned with the characteristics of places:
1. Location and
2. Attributes at Locations
3. Currently, Geography seeks —
a) to classify phenomena,
b) to compare,
c) to generalize,
d) to ascend from effects to causes, and in doing so,
e) to trace out the laws of nature and to mark their influences upon man,
thereby simply making it 'a description of the world’.
4. Hence, it is a SCIENCE — of argument and reason, and of course, of
cause and effect.
4. 1. It helps us to understand the principles of living with nature….
2. It teaches us to put things in perspective….
3. It helps us to understand, live and exist in the current world peacefully
with others…
4. It helps us to ensure social order, social justice and sustainable smart
living….
5. It builds a bridge between the natural / physical and social sciences..
The spatial order on the earth’s surface can be defined, identified and
analysed (measured, monitored, mapped and modelled) with a scientific
geographical mind in space–time frame.
It forms the “philosophical foundation” of the discipline of Geography, which is
both inter-disciplinary and multi-disciplinary.
The effects of the Quantitative Revolution (QR, late 1950s–80s) are manifested in the following ways —
1. Statistical techniques remain a key and virtually universal element in
the training of a new breed of quantitative geographers working within the
positivist approach.
2. The RS/GIS revolution renewed interest in Spatial
Statistics (e.g., TFL and TSL).
3. Besides, statistical techniques have helped Marxist, Structuralist,
Political-Economic and even Behaviouralist geographers
5. 1. Geographers study —
a) how and why elements differ from place to place, as well as
b) how spatial patterns change through time.
2. Thus, geographers begin with the question ‘where?',
exploring how features are distributed on a physical or
cultural landscape,
observing the spatial patterns and their trends.
3. Contemporary geographical analysis has shifted from ‘where’
to ‘why’—
a) why a specific spatial pattern exists,
b) what spatial processes have produced such pattern, and
c) why such processes operate.
4. Only via these 'why' questions do geographers begin to
understand the mechanisms of change, which are, in fact,
infinite in their complexity.
6. Geography needs statistics because ―
1. They help summarize the findings
of studies (e.g., total rainfall during a
period in a state),
2. They help in understanding the
phenomenon under study (e.g.,
rainfall is higher in the southern
districts),
3. They help forecast the state of
variables (e.g., drought is likely to
occur next year),
4. They help evaluate the performance of
certain activities (e.g., more rainfall
means more rice production),
5. They help in decision making (e.g.,
finding out the best location),
6. They also help establish whether
relationships between the
‘characteristics’ of a set of
observations are genuine or not, and
thus make a valuable contribution to
geographical knowledge.
Geographers use Statistical
Methods —
1. To describe and summarize
spatial data.
2. To make generalizations
about complex spatial
patterns.
3. To estimate the probability of
outcomes for an event at a
given location.
4. To use sample data to infer
characteristics for a larger set
of geographic data.
5. To determine if the magnitude
or frequency of some
phenomenon differs from one
location to another.
6. To learn whether an actual
spatial pattern matches some
expected pattern.
7. 1. Statistical analysis in Geography
is unique as it concerns ‘spatial’ or
‘geographically referenced’ data
(with co-ordinates).
2. The variety of techniques being
almost infinite, the GDA has to
pick and choose the best suited
one for his specific job.
3. Today user-friendly software
packages are easily available,
e.g., SPSS, Statistica, R, etc.
4. Processing of geographical data
involves the ‘application of
suitable’ techniques of spatial
statistics.
5. Its presentation requires the
‘application of the most suited’
cartographic techniques.
6. Its interpretation needs the ‘wisest
use of geographical principles’
leading ultimately to the scientific
geographical explanation.
Spatial data have difficulties associated with
their analysis: boundary delineation,
modifiable areal units, and the level of
spatial aggregation or scale. That is why
methods of statistical analysis vary:
1. A study of per capita income within a city, if
confined to the inner core, shows that
income levels are lower; but if the whole city
is taken, they become higher. The same
holds for the determination of internal
boundaries. Thus interpretations are valid
only for the particular area and sub-area
configuration of a region.
2. In Census, the available information is
hierarchically arranged. The GDA must
use the same level for comparison.
3. Socio-economic data are available at a
variety of scales, e.g., Ward, Municipality,
Mouza, GP, Block, District, and State:
When it is aggregated at different scales,
the resulting descriptive statistics may
exhibit variations, either in a systematic,
predictable way, or in a more uncertain
fashion.
9. Geographical data has two important properties —
1. the reference to geographic space, which means the data are
registered to a geographical co-ordinate system so that data from
different sources can be cross – referenced and integrated, and
2. the representation at geographical scale; data are normally recorded at
relatively small scales and are often generalised and symbolised.
Therefore, geographical data / spatial data / map data are essentially scaled
spatial data. They fall into three categories —
1. the geodetic control network (GCN) {Reference Map},
2. the topographic base (tBase) {Contour Map}, and
3. the geographic overlays (GrO) {Thematic Map}
Geographical data are recorded using four different means —
1. Alphanumeric characters or texts (Document Form: document data),
2. Numerical values (Numeric Form: numeric data),
3. Symbols or signs (Graphical Form: graph/diagram/map data), and
4. Signals (Digital Form: digital data)
10. Geographical data concern facts about the geographic reality and
geographical facts are always inter-subjective and reliable.
Inter-subjective ⇒ repeated observations of the same phenomena by
different people yield the same factual statement
Reliable ⇒ an observer repeatedly recording the same
phenomena produces the same factual statement
A geographer’s data matrix (GDM) is a matrix of individuals (objects or events)
against attributes (various observations made on the attributes of these
individuals).
Places   Co-ordinates (x, y)   a1 … aN
1        ..  ..                ..  ..
2        ..  ..                ..  ..
3        ..  ..                ..  ..
N        ..  ..                ..  ..
1. A single data set typically contains
information on many variables and
many objects.
2. Conventionally, variables are entered
in columns and objects in rows.
3. Thus, each measurement on a row
genuinely concerns the same object
and all data in a single column relate
to the same variable.
4. The entire contents of a column are
normally analysed together such that
the analysis essentially concerns
comparable data (GDM).
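As a sketch, the GDM convention above can be mirrored in code: each row is one object (a place with its co-ordinates) and each column one variable, so a whole column can be analysed together. The places, co-ordinates and rainfall figures below are invented purely for illustration.

```python
import statistics

# A minimal sketch of a geographer's data matrix (GDM).
# Rows = objects (places, with x-y co-ordinates); columns = variables.
# All values here are illustrative, not real data.
gdm = [
    {"place": 1, "x": 88.36, "y": 22.57, "rainfall_mm": 1650.0},
    {"place": 2, "x": 88.41, "y": 22.65, "rainfall_mm": 1580.0},
    {"place": 3, "x": 88.30, "y": 22.50, "rainfall_mm": 1710.0},
]

# Every entry in one column relates to the same variable, so the
# entire column can be summarized together, as the text describes.
rain = [row["rainfall_mm"] for row in gdm]
print(round(statistics.mean(rain), 1))  # prints 1646.7
```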
11. Real world features occur in two fundamental forms –
1. Objects (these are discrete and definite: e.g., hills, rivers, forests,
mines, highways, cities, bio-reserves) and
2. Phenomena (these are continuously distributed over a geographical
space: e.g., terrain, slope, rainfall, temperature, pollution level and
other environmental indices).
A spatial object has three essential characteristics –
1. It can be delineated by identifiable boundaries,
2. it can be described by one or more attributes, and
3. it is relevant to some intended application.
Spatial objects may be:
1. “exact objects”, when they have well defined boundaries (e.g.,
landholding, building, etc) and
2. “inexact objects” or “fuzzy entities”, when such boundaries are not well
defined (e.g., landform features and natural resource features).
Graphically, spatial objects are represented as three geometric elements –
points (0D), lines (1D), and polygons (2D).
12. There are two distinct ways of the real world representation in geographic
database –
1. the object-based model (it views geographic space as populated by sets of
objects, for which data are obtained both by field surveying and laboratory
analysis. In such databases, attributes are arranged against locations
defined by co-ordinate lists or vector lines. Hence, it is called the “vector
data structure”), and
2. the field-based model (it views geographic space as populated by one or
more phenomena, that are basically represented as “surfaces”; these are
often conceptualised as being made up of a number of spatial data units.
Data for these are normally arranged against locations defined by
elemental tessellation in the form of “square” or “rectangular” cells. Hence,
such data structures are called “raster data models”).
The field-based spatial data can be obtained either
1. Directly from aerial photography, satellite imagery, map scanning and field
measurements at selected or sampled locations (e.g., data for triangulated
irregular networks or TIN) or
2. Indirectly generated applying mathematical functions, e.g., contours and
digital elevation models (DEM).
13. Geographical data are often collected on the basis of Areal Units—
1. These may be Natural, e.g., plain, plateau, mountains, etc.
2. It can well be Artificial, e.g.,
• Singular units (e.g., individual households or landholding) or
• Collective / Administrative divisions (i.e., regions made up of many
such landholdings).
These units are often arranged into some sort of hierarchical structure—
India = {States}  State = {Districts}  District = {CD Blocks}
CDB = {Gram Panchayats}  GP = {Landholdings/Households}
At each level these have different but nested hierarchic identification code that
makes them unique (Census).
In such a perfect set-hierarchy, comparisons can only be made between similar
individuals occupying the same level in the hierarchy:
GP vs. GP, CDB vs. CDB, District vs. District, State vs. State
Thus, GP level inferences may not hold good for the CDB / District / State level.
1. A “high level” to “low level” analysis yields a contextual relationship and
2. A “low level” to “high level” analysis yields an aggregate relationship.
These are unique and require their own particular modes of thought about data
collection, storage, and procedures for manipulation and analysis.
14. Based on “simple scalar systems” for measuring objects or their attributes,
Stevens (1959) describes four kinds of measurement models or scales with one
or two hybrid variants ―
1. Nominal or Categorical Data: It provides a device for labelling or classifying
objects rather than measuring its attributes. These are qualitative data,
often presented in the form of names. Special “statistical techniques” are
used for this sort of data.
2. Ordinal Data: In this, a set of objects is ordered from the “most” to the
“least”, but there is actually no information regarding the value of the
measured attributes.
Thus, it produces poor quality data and is not valid for algebraic operations
(i.e., addition, subtraction, multiplication or division) and requires the use of
non-parametric statistical methods for analysis.
3. Interval Data: In this, one can not only rank order objects with respect to a
measured attribute but is also able to specify how far apart the magnitudes
are from each other.
However, it has no natural origin: lack of “absolute zero”.
15. 4. Ratio Data: It incorporates equivalence of ratios and starts from an absolute
zero. It is most precise as it uses the real number system and allows
continuous measurement. Units can be converted directly from one system
to another. It can be divided in various ways, as follows ―
a) Continuous and Discrete data.
b) Directional and Non-directional data.
c) Closed and Open data.
d) Positive and Negative data.
e) Point (place) and Period (time) data.
f) Spatial and Non-spatial/ Attribute data.
Scale of Measurement is very important, because it determines the Techniques
and Measures of —
1. Data Description (…)
2. Correlation and Association between and among Variables (…)
3. Regression and Estimation (….)
4. Comparison and Tests of Significance (…)
5. Randomness, Order and Regularity (…)
6. Forecasting and Decision Making (…)
7. Understanding the Phenomena (…)
8. Cartographic Presentation (…)
16. 1. There is no such thing as an exact measurement.
2. All measurements contain error, the magnitude depending on the
instruments used and also on the ability of the observer to use them.
3. As the true value is never known, the true error is never determined.
4. The degree of accuracy, or precision, can therefore only be stated as a
relative accuracy.
5. The estimated error is shown as a fraction of the measured quantity. Thus
100 ft measured with an estimated error of 1 inch represents a relative
accuracy of 1 / 1200. An error of 1 cm in 100 m = 1 / 10,000.
6. Where readings are taken on a graduated scale to the nearest subdivision,
the maximum error in estimation will be ± 1/2 division.
7. Repeated measurement increases the accuracy by √n, where n = number
of repetitions. (N.B. This cannot be applied indefinitely.)
8. Agreement between repeated measurements does not imply accuracy but
only consistency.
To Note…..
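The √n rule in point 7 can be sketched in a few lines: averaging n repeated measurements reduces the error of the mean by a factor of √n. The single-measurement error of 0.5 units is an assumed, illustrative value.

```python
import math

# Illustration of point 7: the error of the mean of n repeated
# measurements is sigma / sqrt(n), where sigma is the error of a
# single measurement (0.5 here is an invented value).
sigma = 0.5
for n in (1, 4, 16, 64):
    se = sigma / math.sqrt(n)
    print(n, se)  # 1 0.5, 4 0.25, 16 0.125, 64 0.0625
```

Quadrupling the number of repetitions only halves the error, which is why, as the slide notes, the rule cannot be applied indefinitely.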
17. 1-Component Case
Ratio, Interval, Ordinal, Nominal Data: f, Mo
Ratio, Interval, Ordinal Data: Me, Px
Ratio, Interval Data: Mean, Variance
Ratio Data: GM, HM, CV
2-Component Case
Nominal Scale Data: Chi-square, Contingency Coefficient
Ordinal Scale Data: U Test, Spearman’s rs, Kendall’s τ
Interval + Ratio Scale Data: Comparison of Means, Comparison of Variances,
Pearson’s r, Linear and Non-Linear Regression
Multi-Component Case
Interval + Ratio Scale Data: Multiple Variance Analysis, Co-Variance Analysis,
Multiple Correlation, Multiple Regression
Statistical Tests / Methods and Measurement Scale
18. Nominal Scale
Defining relation: Equivalence
Possible transformations: y = f(x), where f(x) means any one-to-one substitution
Central tendency: Mode; Dispersion: % in the Mode
Tests: Non-parametric — Chi-square, Contingency coefficient,
Goodman-Kruskal’s Lambda, Phi-coefficient
Ordinal Scale
Defining relations: Equivalence; Greater than
Possible transformations: y = f(x), where f(x) means any increasing monotonic transformation
Central tendency: Median; Dispersion: Percentiles
Tests: Non-parametric — Spearman’s Rho, Kendall’s Tau, Kolmogorov-Smirnov,
Goodman-Kruskal’s Gamma, Phi-coefficient
Interval Scale
Defining relations: Equivalence; Greater than; Known ratio of any two intervals
Possible transformations: any linear transformation y = a·x + b (a > 0)
Central tendency: Mean; Dispersion: Standard Deviation
Tests: Parametric and non-parametric — t-test, F-test, Pearson’s r, Point Biserial, etc.
Ratio Scale
Defining relations: Equivalence; Greater than; Known ratio of any two intervals;
Known ratio of any two scale values
Possible transformations: y = c·x (c > 0)
Central tendency: Geometric Mean; Dispersion: Coefficient of Variation
Tests: Parametric and non-parametric — t-test, F-test, Pearson’s r, Point Biserial, etc.
Scale of Measurement and Statistical Measures and Tests
19. Nominal Scale: Chorochromatic Map; Choroschematic Map
Ordinal Scale: Flowline Map; Ray Map; Qualitative Dot Map
Interval and Ratio Scale: Isarithmic Maps; Quantitative Dot Map;
Choropleth Map; Animated Map; Landform Map; Cartograms;
Single-Component Map; Bi-Component Map; Multi-Component Maps
Mapping Techniques and Measurement Scale
21. Sources of Geographical Information
1. Field Observations
a) Quantitative Measurements
b) Qualitative Observations
2. Archival Sources
a) Areal Archives: Maps, Air-photo, Satellite Image
b) Non-areal Sources: Census Data
3. Theoretical Works
a) Mathematical Models
b) Analogue Models
i. Physical Simulation Models
ii. Monte Carlo Simulation Models
1. Locational Data are collected mainly for non-geographical purposes.
2. Geographers depend on the original accuracy of the Survey.
3. Data are released in ‘bundles’, i.e., administrative areas which are
inconvenient and anachronistic and pose extremely acute problems in
mapping and interpretations.
22. 1. Sampling is the process of selecting units (e.g., people,
organizations) from a population of interest so that by studying
the sample we may fairly generalize our results back to the
population from which they were chosen.
2. It is necessary because we usually cannot gather data from
the entire population, due to a large or inaccessible population or
a lack of resources.
3. It is used to draw inferences about the population as a whole.
4. The subset of units that are selected is called a sample.
Sampling
The purpose of geography is ‘… to provide accurate, orderly, and rational
description and interpretation of the variable character of the earth’s surface’
(Hartshorne, 1959).
This can only be achieved in two particular ways ―
1. first, by increasing the database to an optimum level, and
2. second by taking representative samples.
23. Principles of Sampling
1. A statistical population (universe) is defined as the total set of measurements
which could be taken from the entity being studied.
2. A geographical population is usually too large for one to have either the time
or money to examine it in full.
3. Therefore, we select from the population some sub-set of individuals for
study. These are called ‘samples’, and the selection procedure, ‘sampling’.
For example, all students of a college are considered as a population, but
students of a particular class may be taken as samples.
A sample of entities should faithfully represent its parent population, so that
an estimate from the sample can be an accurate estimate of the corresponding
figure in the whole population. The equivalent actual figure in the population is
known as a parameter.
The fundamental principles of sampling are ―
1. Sample units must be chosen in a systematic and objective manner
2. They must be clearly defined and easily identifiable
3. They must be independent of one another
4. Same units of samples should be used throughout the study
5. Selection process should be based on sound criteria and should avoid
error, bias and distortions.
24. 1. There are several measures of inferential statistics to test whether there
is any significant difference between a sample statistic and the population
parameter. If the difference is nil, the samples are representative.
2. However, inaccuracy may arise from errors in measurement,
computation, and data processing.
3. Therefore, the answer to the question of which sampling method to use is:
choose the method that gives the most representative results.
4. The number of items to be included in a sample yields the concept of the
sampling fraction, which can be estimated using techniques available for
each type of sampling.
Basically, the following methods are there ―
A. Non-probability Sampling: based on researcher’s subjective mind
1. Judgmental or Purposive (when only minimal information about the
parent population is available, or only certain sites are accessible)
2. Convenience or Accessibility (based on the ease of access to the sites
or members of population)
3. Quota (a combination of the above two: first the no. of observations are
decided and then samples are finally identified from these)
4. Volunteer (when certain members of the population / respondents
volunteer information)
25. B. Probability Sampling: based on probability, i.e., an element of chance exists
as to whether a particular observation is included in a sample.
1. Random: Samples are selected by chance methods or random numbers.
Random numbers can be generated on a calculator or a PC.
2. Systematic: By numbering each subject and then selecting every kth
subject.
3. Stratified: The population is first divided into groups (strata) depending
on their importance to the study. Then samples are selected randomly within
each stratum.
4. Cluster: The population is divided into groups (clusters) by some means
(e.g., geographic area). Then some clusters are randomly selected.
For Example:
1. geomorphologists collect slope data often based on quota sampling;
2. meteorologists capture weather patterns by sampling at measuring stations,
knowing that conditions between these sample locations will vary
systematically and smoothly.
3. Census data are published in spatially aggregated form and the TFL ensures
that the variance within each aggregate unit is not so large as to render the
aggregated data meaningless.
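The four probability designs above can be sketched on a toy population of 20 numbered subjects; the strata and clusters below are invented purely for demonstration.

```python
import random

# Toy illustration of the four probability sampling designs.
random.seed(42)
population = list(range(1, 21))

# 1. Simple random: pick 5 subjects entirely by chance.
simple = random.sample(population, 5)

# 2. Systematic: every k-th subject after a random start.
k = 4
start = random.randrange(k)
systematic = population[start::k]

# 3. Stratified: split into (invented) strata, sample randomly in each.
strata = {"north": population[:10], "south": population[10:]}
stratified = {name: random.sample(units, 3) for name, units in strata.items()}

# 4. Cluster: split into clusters of 5, then randomly pick whole clusters.
clusters = [population[i:i + 5] for i in range(0, 20, 5)]
chosen_cluster = random.choice(clusters)

print(len(simple), len(systematic), len(chosen_cluster))  # prints 5 5 5
```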
26. 1. The ability to represent geographical features as points is certainly
scale-dependent:
a school may be a point feature on a 1:50,000 scale map, but not
on a 1:2,500 scale map.
2. In geography, point distribution is basic and the most fundamental: as lines
are simply ordered set of points and, areas are polygons / closed traverses
formed by lines. Hence, the sampling unit in most geographic studies is
either a point or a quadrat.
3. The basic purpose of analysis is to identify and distinguish homogeneity vs.
heterogeneity, isotropy vs. anisotropy, and randomness.
There are five types of Spatial Sampling ―
1. Random: By using a grid overlay on a map, a population of co-ordinate
points can be generated, each one of which can be identified by its x and y
– coordinates, and random numbers (from Tables, Calculators, PCs, etc)
can then be used for the selection of random samples.
2. Regular or Gridded: (also called, systematic area sampling planned on
perfectly square, or rectangular or triangular grids) The area is first divided
into uniform grids and samples are selected in a regular manner from each
grid.
3. Clustered: It is usually not planned, but often forced by patchy distribution
of objects.
27. 4. Uniform or Systematic Random: It is initiated by the random selection of a
point or quadrat; then, following a predetermined plan, the remainder of
the sample is selected (e.g., planned by randomization within grid squares).
1. There are two basic designs for this ― one has points aligned in a
checkerboard fashion, while in the other, the points are non-aligned.
2. In the first case, the area is divided into a set of grids all containing the
same number of points. One point is chosen randomly in the first cell;
the corresponding point locations in other cells are then selected as the
remaining sample members.
3. In the second case, the first point is chosen as above. Its x – coordinate
is then held constant for all the remaining grids across the top row and a
point in each of these squares is then chosen by randomly selecting new
y – coordinates. Similarly, for other cells in the first column, the y –
coordinate of the point in the first cell is now held constant, while the x –
coordinate is randomized.
5. Traverse: The use of cross-sectional lines and traverses has always been
favoured in geographic approaches and found more efficient in estimating
spatial pattern. These are also common practice, especially when
constrained by access and exposure, viz., along contours, rivers, coasts,
lakes, volcanoes, roads, etc.
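One reading of the non-aligned (stratified systematic unaligned) design in point 4 can be sketched on an illustrative unit-square study area: the fractional x-position is held constant along each row of cells and the fractional y-position down each column, so one point falls in every cell yet no two points share a grid line. The 4 × 4 grid size is an assumption for demonstration.

```python
import random

# Sketch of a stratified systematic unaligned design on a unit square.
random.seed(1)
rows, cols = 4, 4
cell_w, cell_h = 1.0 / cols, 1.0 / rows

x_off = [random.random() for _ in range(rows)]  # x-offset held along each row
y_off = [random.random() for _ in range(cols)]  # y-offset held down each column

# Cell (i, j) gets its x from row i and its y from column j.
sample = [((j + x_off[i]) * cell_w, (i + y_off[j]) * cell_h)
          for i in range(rows) for j in range(cols)]

print(len(sample))  # one point per cell: 16
print(all(0.0 <= x < 1.0 and 0.0 <= y < 1.0 for x, y in sample))  # True
```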
28. As no single technique can satisfy a geographer’s objective, he is
free to make his own sampling design, e.g., quota, multi-phase,
multi-stage, etc (Haggett, 1968), as follows ―
Data Collecting System
1. Purposive or Hunch Sampling
2. Controlled Sampling
a) Random Designs — With Purposive Stratification; Nested or Hierarchical Designs
b) Systematic Designs — With Systematic Stratification; Stratified Systematic Unaligned Designs
c) Unaligned Designs
3. Complete Survey
29. 1. Any geographical population certainly has some internal variation, which
may either be random or systematic.
2. However, we may wish to compute statistics relating to just the overall
characteristics of the population.
3. If heterogeneity is apparent at the outset, the whole should be broken into
sub-populations and sampled and analysed separately.
4. In the special case of gradients or trends of variation, this must be defined
by traverses parallel to the gradient.
5. When variation is not so distinctive, all parts of the area need to be explored
equally. Hence, a uniform or gridded design is preferable as random
distributions may be too uneven.
6. As a general rule of thumb, a sample size of 30 is usually sufficient for the
sampling distribution of the means to be closely approximated by a normal
distribution.
Thus, acquisition of data comprises the following sequence:
define the population → construct a sample frame → select
a sampling design → specify the characteristics to be
measured for each element of the sample → collect the data
30. 1. There are many methods of sampling when doing research.
2. Simple random sampling is the most ideal one.
3. But researchers seldom have the luxury of time or money to access the
whole population.
4. Hence, many compromises often have to be made.
Method Best when
Simple Random Sampling Whole population is available.
Stratified Sampling
(random within target
groups)
There are specific sub-groups to investigate
(e.g. demographic groupings).
Systematic Sampling
(every nth person)
When a stream of representative people is
available (e.g. in the street).
Cluster Sampling When population groups are separated and
access to all is difficult, e.g. in many distant cities.
Probability Methods
This is the best overall group of methods to use as you can subsequently use
the most powerful statistical analyses on the results.
Summary
31. Quota Methods
1. For a particular analysis and valid results, you can determine the number of
people you need to sample.
2. In particular, when you are studying a number of groups and when
sub-groups are small, you will need equivalent numbers to enable
equivalent analysis and conclusions.
Method Best when
Quota Sampling
(get only as many as you need)
You have access to a wide
population, including sub-groups.
Proportionate Quota Sampling
(in proportion to population sub-groups)
You know the population distribution
across groups, and when normal
sampling may not give enough in
minority groups.
Non-Proportionate Quota Sampling
(minimum number from each sub-group)
There is likely to be a wide variation in
the characteristic within minority
groups.
32. Selective Methods
Sometimes your study leads you to target particular groups.
Method Best when
Purposive Sampling
(based on intent)
You are studying particular groups
Expert Sampling
(seeking ‘experts’)
You want expert opinion
Snowball Sampling
(ask for recommendations)
You seek similar subjects (eg. young
drinkers)
Modal Instance Sampling
(focus on typical people)
When the sought ‘typical’ opinion may get
lost in a wider study, and when you are
able to identify the ‘typical’ group
Diversity Sampling
(deliberately seeking variation)
You are specifically seeking
differences, eg. to identify sub-groups
or potential conflicts
33. Convenience Methods
1. Good sampling is time-consuming and expensive.
2. Not all experimenters have the time or funds to use more accurate
methods.
3. There is a price, of course, in the potential limited validity of results.
Method Best when
Snowball Sampling
(ask for recommendations)
You are ethically and socially able to ask and
seek similar subjects.
Convenience Sampling
(use who's available)
You cannot proactively seek out subjects.
Judgement Sampling
(guess a good-enough sample)
You are expert and there is no other choice.
34. Ethnographic Methods
1. When doing field-based observations, it is often impossible to intrude into
the lives of people you are studying.
2. Samples must thus be surreptitious.
3. It may be based more on who is available and willing to participate in any
interviews or studies.
Method Best when
Selective Sampling
(gut feel)
Focus is needed in particular group, location,
subject, etc.
Theoretical Sampling
(testing a theory)
Theories are emerging and focused sampling
may help clarify these.
Convenience Sampling
(use who's available)
You cannot proactively seek out subjects.
Judgement Sampling
(guess a good-enough sample)
You are expert and there is no other choice.
35. Estimates from Samples
1) Population Mean
For this, the Standard Error of Mean (SEm ) is first calculated using the
following equation — SEm = s / √n, where, s = sample standard deviation,
and n = size of sample.
The equation has been contrived in such a way that —
1. Population Mean = Sample Mean ± SEm with 0.682 probability
2. Population Mean = Sample Mean ± 2SEm with 0.954 probability
3. Population Mean = Sample Mean ± 3SEm with 0.997 probability
Thus with a certain probability, the range of population mean can be easily
estimated.
Example:
Let for a data set of 100, Sample Mean = 12.34 and s = 2.56
Therefore, SEm = s / √n = 2.56 / √100 = 0.256
Thus,
Population Mean lies between 12.084 and 12.596 at 0.682 probability
Population Mean lies between 11.828 and 12.852 at 0.954 probability
Population Mean lies between 11.572 and 13.108 at 0.997 probability
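The worked example can be checked with a few lines; the values are taken directly from the example above (n = 100, sample mean 12.34, s = 2.56).

```python
import math

# Reproducing the worked example for the population-mean estimate.
n, mean, s = 100, 12.34, 2.56
se_m = s / math.sqrt(n)  # standard error of the mean = 0.256

for k, p in ((1, 0.682), (2, 0.954), (3, 0.997)):
    low, high = mean - k * se_m, mean + k * se_m
    print(f"{p}: {low:.3f} to {high:.3f}")
# 0.682: 12.084 to 12.596
# 0.954: 11.828 to 12.852
# 0.997: 11.572 to 13.108
```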
36. 2) Population Standard Deviation
The Standard Error of the Standard Deviation (SES ) is first calculated from the
following Equation —
SES = s / √(2n)
where, s = sample standard deviation, and n = size of the sample.
The equation has been contrived in such a way that —
1. σ = s ± SES with 0.682 probability
2. σ = s ± 2SES with 0.954 probability
3. σ = s ± 3SES with 0.997 probability
Thus with a certain chosen probability, the range of population standard
deviation (σ) can be easily estimated.
Example
For population standard deviation, SES = s / √(2n)= 2.56 / √(2x100)= 0.181
Thus,
Population SD lies between 2.379 and 2.741 at 0.682 probability
Population SD lies between 2.198 and 2.922 at 0.954 probability
Population SD lies between 2.017 and 3.103 at 0.997 probability
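The same sample (n = 100, s = 2.56) can be used to verify the bracketing of the population standard deviation via SES = s / √(2n):

```python
import math

# Reproducing the worked example for the population-SD estimate.
n, s = 100, 2.56
se_s = s / math.sqrt(2 * n)
print(round(se_s, 3))                           # 0.181
print(round(s - se_s, 3), round(s + se_s, 3))   # 2.379 2.741 (0.682 prob.)
print(round(s - 3 * se_s, 3), round(s + 3 * se_s, 3))  # 2.017 3.103
```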
37. 3) Proportion of Population
The standard error of a percentage is first estimated, as follows —
SE% = √(p.q / n)
where p is the percentage of a sample possessing the particular attribute, q is
the percentage of the sample not possessing that attribute, and n is the number
of individuals in the sample.
The Population % can be easily estimated by using equations, as follows —
1. Population % = Sample % ± SE% with 0.682 probability
2. Population % = Sample % ± 2SE% with 0.954 probability
3. Population % = Sample % ± 3SE% with 0.997 probability
Example
Let a household survey with a sample of 100 produce an estimate of 75% for
the percentage of households having a broadband connection (p). The
estimated percentage of households not having a broadband connection is
therefore 25% (q = 100 – 75).
SE% = √(p.q / n) = √(75x25 /100) = 4.33
Hence,
Proportion of HH in the Population with a BB connection lies between 70.67%
and 79.33% at 0.682 probability, between 66.34% and 83.66% at 0.954
probability and between 62.01% and 87.99% at 0.997 probability.
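The broadband example (n = 100, p = 75%, q = 25%) can likewise be reproduced directly:

```python
import math

# Reproducing the worked example for the population-percentage estimate.
n, p = 100, 75.0
q = 100.0 - p
se_pct = math.sqrt(p * q / n)
print(round(se_pct, 2))  # 4.33
for k in (1, 2, 3):
    print(round(p - k * se_pct, 2), round(p + k * se_pct, 2))
# 70.67 79.33
# 66.34 83.66
# 62.01 87.99
```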
38. 4) Sample Size
Sample size (n) can also be estimated from a known s (or p and q) and a
desired SEm, SES or SE% at the chosen confidence level, as follows —
n = (s/SEm)² at 0.682,
= (2s/SEm)² at 0.954, and
= (3s/SEm)² at 0.997 probability;
n = (s/SES)²/2 at 0.682,
= (2s/SES)²/2 at 0.954, and
= (3s/SES)²/2 at 0.997 probability;
n = p.q/(SE%)² at 0.682,
= 4p.q/(SE%)² at 0.954, and
= 9p.q/(SE%)² at 0.997 probability.
(These follow from inverting SEm = s/√n, SES = s/√(2n) and SE% = √(p.q/n)
for the half-widths SEm, 2SEm, 3SEm, etc.)
Example: For s = 3.45 and SEm = 0.25
at 0.682 probability, n = (s/SEm)² = (3.45/0.25)² ≈ 190
at 0.954 probability, n = (2s/SEm)² = (2×3.45/0.25)² ≈ 762
at 0.997 probability, n = (3s/SEm)² = (3×3.45/0.25)² ≈ 1714
Note:
1. Sometimes n may turn out to be larger than the population itself; this is a
major drawback of the method.
2. It should therefore be used with judgement; it is most appropriate in
situations where the population size is effectively unlimited.
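The sample-size arithmetic for the mean can be sketched as follows (values from the example above; `sample_size_mean` is an illustrative name, and results are rounded to the nearest whole number, matching the slide):

```python
# Required sample size n = (k*s/SEm)^2 for k = 1, 2, 3
# (probabilities 0.682, 0.954, 0.997), with s = 3.45 and SEm = 0.25.
def sample_size_mean(s, se_m, k):
    return round((k * s / se_m) ** 2)

for k in (1, 2, 3):
    print(k, sample_size_mean(3.45, 0.25, k))   # 190, 762, 1714
```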
40. One thing about the “future” of which we can be “certain” is
that it will be “utterly fantastic”.
1. Probability is the fundamental building block of inferential statistics.
2. It uses theory to make confidence statements about the
characteristics of populations based on sample information, or to
test hypotheses.
3. It provides a quantitative description of the likely occurrence of a
particular event, expressed on a scale from 0 to 1 (or 0 to
100%).
4. A rare event has a probability of occurrence close to 0, while a very
common event has a probability of occurrence close to 1.
5. We encounter probability statements on an almost daily basis, e.g., the
chance of rain, or the chance of a major earthquake, etc.
6. It can be obtained in different ways. Some are purely subjective,
based on 'gut feeling' or 'best guess’, while others are either based
on observation or derived from theory.
41. A common way of obtaining probabilities is from data. The probability of
an event occurring, written as P(E), is defined as the proportion of times
the event occurs in a series of trials.
Example: to assess the probability of rain on a given day in Kolkata
between November and February:
Total No. of Days: 4 months = 120 days
Observed Rainy Days = 88 days
Probability of Rain (November to February): P(rain) = (88 / 120)
= 0.73
Conclusion:
1. The probability of rain on any day during winter in Kolkata is 0.73
or 73%, and
2. The probability of a sunny day during this period in Kolkata is (1 –
0.73) = 0.27 or 27%.
Applications: Rain, Snow, Monsoon, El Niño, Soil Fertility, Cyclone, Earthquake,
Slope Failure, Landslide, High Tide, Flood, River Stage, Species Diversity, ……
Crop Production, Industrial Production, Population Growth, Migration, …….
Trend: both temporal and spatial, … etc.
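The relative-frequency calculation behind the Kolkata example is a two-line computation:

```python
# Empirical probability of rain: proportion of rainy days in the record
# (88 rainy days out of 120 winter days, as in the example above).
rainy_days, total_days = 88, 120
p_rain = rainy_days / total_days
p_sunny = 1 - p_rain                 # complement: a non-rainy day
print(round(p_rain, 2), round(p_sunny, 2))   # 0.73 0.27
```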
42. Probability Theory and Normal Distribution
Concept of Probability
1. The post-quantitative-revolution period viewed the organisation of
geographic elements over the earth’s surface, and of spaces, as a matter
of chance.
2. Geographical objects are true representations of multivariate situations;
hence the application of probability statistics. To understand this, the
following basic concepts need to be explored first —
a) Random Experiment — an experiment whose result depends on chance.
When a coin is tossed, either head or tail appears; but the result
cannot be predicted in advance as it depends on chance.
b) Outcome — the result of a random experiment is called an outcome.
c) Event — denotes what occurs in a random experiment.
Events are of two types –
1. Elementary (i.e., cannot be decomposed into simpler ones) and
2. Composite (i.e., aggregates of several elementary events).
43. d) Mutually Exclusive Events: two or more events that cannot occur
simultaneously; such events can occur only one at a time.
e) Exhaustive Events: the complete group of all possible events of a
random experiment. For example, in coin tossing, the two events
‘head’ and ‘tail’ comprise an exhaustive set.
f) Trial: any particular performance of the random experiment.
g) Cases Favourable to an Event: in dice throwing, there are 6 possible
outcomes. Of these, 3 cases (1, 3, 5) are favourable to ‘odd number of
points’ and 3 cases (2, 4, 6) are favourable to ‘even number of points’.
h) Equally Likely: if all outcomes occur with equal certainty, i.e., none of
the outcomes is expected in preference to another.
[Tree diagram: elementary events for the sexes of the 1st, 2nd and 3rd child
in a family — each branch B or G, giving outcomes BBB, BBG, BGB, …]
44. Definition of Probability
(1) Classical Definition
If there are n possible outcomes (that are mutually exclusive, exhaustive and
equally likely) in a random experiment, and m of these are favourable to an
event A, then the probability of the event,
P(A) = m / n
1. When, P(A) = 0, the event is impossible, i.e., m = 0; it occurs when none of
the outcomes is favourable to the event (RARE EVENT).
2. When P(A) = 1, the event is certain, i.e., m = n; it occurs when all the
outcomes are favourable to the event (ABSOLUTELY CERTAIN EVENT).
The classical definition fails when the number of possible outcomes is infinitely
large. It has only limited applications in coin tossing, dice-throwing and similar
games.
(2) Empirical Definition
In N trials of a random experiment, if an event occurs f times, its relative
frequency is f/N. If f/N tends to a limiting value p as N → ∞, then p is
called the ‘probability’ of the event. Thus,
p = lim (f/N) as N → ∞
45. (3) Axiomatic Definition
a) It introduces ‘probability’ simply as a number associated with each
event, based on certain axioms.
b) Thus, it embraces all situations. The classical theory is simply a special
case of axiomatic theory.
c) Let a random experiment have a ‘finite’ number (n) of possible
‘elementary outcomes’, e1, e2, ……. , en.
i. the set S = {e1, e2, ……. , en} is called its ‘sample space’ and its
elements ei are called sample points.
ii. The sets {ei} consisting of single elements are called ‘elementary
events’, while those containing more than one are called
‘composite events’.
iii. The ‘null set’ Ф is called the impossible event, and
iv. The ‘universal set’ S is called the sure event.
d) Let the real numbers p1, p2, ……. , pn correspond to the elementary events
{e1}, {e2}, ……. , {en} respectively, such that each pi ≥ 0 and ∑pi = 1.
e) The numbers pi are called the probabilities assigned to the elementary
events {ei}.
f) The probability of any event A is then given by the sum of the
probabilities associated with the outcomes belonging to it. Hence, for
A = {ei, ej, …. , ek}, P(A) = pi + pj + …. + pk
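A minimal sketch of this machinery, using a fair die (my own illustrative choice): probabilities are just numbers attached to the elementary outcomes, and an event's probability is the sum over the outcomes it contains.

```python
# Sample space of a fair die, with probabilities assigned to elementary events.
sample_space = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}
assert all(p >= 0 for p in sample_space.values())    # axiom: each p_i >= 0
assert abs(sum(sample_space.values()) - 1) < 1e-12   # axiom: sum of p_i = 1

odd = {1, 3, 5}                      # composite event 'odd number of points'
p_odd = sum(sample_space[e] for e in odd)
print(p_odd)                         # 0.5
```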
46. Probability Distribution
The probability distribution of a random variable is a statement that specifies
the set of its possible values together with their respective probabilities.
1) Discrete Probability Distribution
Let a discrete random variable X assume the values x1, x2, ….., xn with
probabilities p1, p2, ……. , pn respectively, where ∑pi = 1. The specification
of the set of values xi together with their probabilities pi (i = 1, 2, 3, …., n)
defines the discrete probability distribution of X.
Mathematically, f(x) = probability that X assumes the value x
= P (X = x)
The function f(x) is called the probability mass function (p.m.f.). It satisfies
two conditions —
a) f(x) ≥ 0 and
b) ∑ f(x) = 1
47. a) Uniform Distribution — When a discrete random variable assumes n
possible values with equal probabilities, then the probability becomes a
constant value 1/n.
The p.m.f is given by f(x) = 1/n
b) Binomial Distribution —
It is defined by the p.m.f.
f(x) = nCx p^x q^(n−x) (x = 0, 1, 2, .., n),
where p and q are positive fractions (p + q = 1)
c) Poisson Distribution —
It is defined by the p.m.f., f(x) = e^(−m) m^x / x! (x = 0, 1, 2, ……), where m is
the ‘parameter’ of the Poisson distribution (always +ve) and e is a mathematical
constant (approx. 2.718) given by the infinite series, e = 1 + (1/1!) + (1/2!) + ……
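Both p.m.f.s are directly computable. A sketch with illustrative parameters of my choosing (n = 10, p = 0.3 for the binomial; m = 1.5 for the Poisson), using only the standard library (`math.comb` requires Python 3.8+):

```python
import math

# Binomial p.m.f.: f(x) = nCx * p^x * q^(n-x), with q = 1 - p.
def binomial_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# Poisson p.m.f.: f(x) = e^(-m) * m^x / x!
def poisson_pmf(x, m):
    return math.exp(-m) * m**x / math.factorial(x)

# The binomial p.m.f. sums to 1 over its support x = 0..n, as every p.m.f. must.
print(round(sum(binomial_pmf(x, 10, 0.3) for x in range(11)), 6))  # 1.0
```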
(2) Continuous Probability Distribution
Let x be a continuous random variable that can take any value in the
interval (a, b) i.e., a ≤ x ≤ b.
As the number of possible values of x is infinite, probabilities cannot be
assigned to each individual value; they are assigned to intervals. For a
continuous probability distribution, let f(x) be a non-negative function such
that — P(c ≤ x ≤ d) = ∫[c, d] f(x) dx
48. The function f (x) is called the probability density function (p.d.f.) or simply
density function of x. It satisfies two conditions —
a) f(x) ≥ 0 and
b) ∫[a, b] f(x) dx = 1.
The curve represented by the equation, y = f (x) is known as the probability
curve. Geometrically, the integral of the p.d.f. represents the area under the
probability curve between interval (c, d) in the range (a, b).
49. a) Uniform Distribution — If the probabilities associated with intervals of
equal length are equal at any part of the range, it is called a uniform
distribution.
It is defined by the p.d.f.
f(x) = 1/(b – a): a ≤ x ≤ b.
It is also called a rectangular distribution as the distribution looks like
rectangle of height 1/(b – a) over the range a ≤ x ≤ b.
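A quick numerical check of the rectangular p.d.f. (the interval a = 2, b = 7 and the midpoint-rule integrator are my own illustrative choices): the density integrates to 1 over its range, and an interval's probability equals its length divided by (b – a).

```python
# Midpoint-rule integration of the uniform p.d.f. f(x) = 1/(b - a) on [a, b].
a, b = 2.0, 7.0
f = lambda x: 1.0 / (b - a)

def integrate(g, lo, hi, steps=100_000):
    h = (hi - lo) / steps
    return sum(g(lo + (i + 0.5) * h) for i in range(steps)) * h

print(round(integrate(f, a, b), 6))      # 1.0  (total probability)
print(round(integrate(f, 3.0, 5.0), 6))  # 0.4  == (5 - 3)/(7 - 2)
```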
b) Normal Distribution — It is also called the Gaussian distribution and is
defined by the p.d.f.,
f(x) = [1 / (σ√(2π))] e^(−(x − μ)²/(2σ²)) : −∞ < x < ∞
where, μ = mean, σ = standard deviation, and π and e are mathematical
constants.
1) The probability curve of normal distribution is known as normal curve
which is bell-shaped and symmetrical about the line x = μ, and the two
tails extend indefinitely on either side.
2) The maximum ordinate lies at x = μ and is given by
y = 1 / (σ√(2π))
50. 1. In Normal Distributions, mean = median = mode
2. Fractiles (e.g., quartiles, deciles, etc.) are
equidistant from the mean, i.e.,
quartile deviation = 0.67σ
mean deviation = 0.80σ
3. Skewness = 0, (excess) Kurtosis = 0
4. The points of inflection of the Normal Curve lie at
x = (μ ± σ)
5. The standard score of x is given by —
z = (x – μ)/σ
It has the p.d.f. p(z) = [1/√(2π)] e^(−z²/2),
where –∞ < z < ∞
1. In geography, large data sets are likely to follow
the normal distribution.
2. In sampling theory, any statistic based on a large
sample approximately follows the normal distribution,
thereby simplifying the testing of statistical
hypotheses and the identification of the confidence
limits of parameters.
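The p.d.f. and the 0.682/0.954/0.997 areas quoted throughout these slides can be recovered from the error function, since P(|z| ≤ k) = erf(k/√2). A minimal sketch (the helper name `normal_pdf` is mine):

```python
import math

# Normal p.d.f.: f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi)).
def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Area under the normal curve within k standard deviations of the mean.
for k in (1, 2, 3):
    print(k, round(math.erf(k / math.sqrt(2)), 3))   # 0.683, 0.954, 0.997
```

(The slides round the first area, 0.6827, down to 0.682.)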
[Figure: Area under the Normal Curve]
51. 1. Barber, G. M. (1988): Elementary Statistics for Geographers, The Guilford Press, London
2. Berry, B. J. L. and Marble, D. F. (ed. 1968): Spatial Analysis - a reader in statistical geography,
Englewood Cliff, NJ
3. Clark, W. A. V. and P. Hosking (1986): Statistical Methods for Geographers, John Wiley, NY
4. Cressie, N. A. C. (1993): Statistics for Spatial Data, John Wiley & Sons, NY
5. Ebdon, D. (1977): Statistics in Geography - a practical approach, Basil Blackwell, Oxford
6. Gregory, S. (1963): Statistical Methods and the Geographer, Longman, London
7. Haggett, P., A. W. Cliff and A. Frey (1977): Locational Methods, Vol-I & II, Edward Arnold, London
8. Haggett, P. and R. J. Chorley (1969): Network Analysis in Geography, Edward Arnold, London
9. Hammond, R. and McCullagh, P. S. (1974): Quantitative Techniques in Geography, Clarendon
Press, Oxford
10.Harvey, D. H. (1969): Explanation in Geography, Edward Arnold Pub., London
11. Johnston, R. J. (1978): Multivariate Statistical Analysis in Geography, Longman, London
12.King, L. J. (1969): Statistical Analysis in Geography, Englewood Cliffs, NJ : Prentice Hall
13.Kitanidis, P. K. (1997): Introduction to Geostatistics, Cambridge University Press
List of Further Reading
52. 14. Limb, M. and Dwyer, C (ed. 2001): Qualitative Methodologies for Geographers, London: Arnold
15. Lindsay, J. M. (1997): Techniques in Human Geography, Routledge
16. Matthews. M. H. and Foster, I. D. L. (1989) : Geographical Data - sources, presentation and
analysis, OUP
17. O’Brien, L (1992): Introducing Quantitative Geography, Routledge, London
18. Ripley, B. D. (1981): Spatial Statistics, Wiley, NY
19. Robinson, G. (1998): Methods and Techniques in Human Geography, Wiley, NY
20. Rogerson, P. A. (2001): Statistical Methods for Geography, Sage, London
21. Shaw, R. L. and Wheeler, D. (1985): Statistical Techniques in Geographical Analysis, John
Wiley & Sons, NY
22. Streich, T. A. (1986): Geographic Data Processing – an overview, California Univ. Press, Santa
Barbara
23. Taylor, P. J. (1977): Quantitative Methods in Geography – an introduction to spatial analysis,
Houghton Mifflin, Boston
24. Unwin, D. (1981): Introductory Spatial Analysis, New York : Methuen
25. Walford, N. (2002): Geographical Data – characteristics and sources, Wiley, NJ
26. Worthington, B. D. R. and R. Gont (1975): Techniques in Map Analysis, Macmillan, London
27. Wrigley, N. and Bennett, R. J. (ed.1981): Quantitative Geography, Methuen, London
54. Thank You
Ethics in Statistical Geography
1. Be Honest with Data Enumeration, Measurement and Collection
2. Be Wise while selecting the Statistical Technique(s) for your intended Purpose
3. Explore the Results, observe the Geographical Associations and go for the Statistical Inferences
4. Be Precise and very Simple while translating the “Language of Statistics” into the “Language of Geography”
Looking for a Publication in a Peer
Reviewed Journal?
Visit: www.indiansss.org
Indian Journal of Spatial Science
Contact:
editorijss2012@gmail.com