A Course Work on R Programming for Basic to Advanced Statistics and GIS
SEEMAB AKHTAR
PREFACE
R has been around since 1995 and has today become the most popular programming language
among data scientists around the world. It includes many data packages and functions, which
make it an attractive programming language for data scientists. R provides a wonderful platform
for data analysis, data wrangling, data visualization and machine learning, and it is open source.
This course covers everything from traditional statistics to advanced statistics and GIS
applications of R, such as models, descriptive statistics and graphs, mathematical trend modeling
and spatial plots. R is designed to be a tool that helps scientists analyze data, and it has many
excellent functions for making plots and fitting models to data. Because of this, many
statisticians learn to use R as if it were a piece of software: they discover which functions
accomplish what they need and ignore the rest.
Speaker & Instructor
SEEMAB AKHTAR
M.Tech (Mineral Exploration)
IIT (ISM) Dhanbad
M.Sc. (Applied Geology)
University of Allahabad
Email: akhtariitdhn@gmail.com
Social site: https://www.linkedin.com/in/seemab-akhtar-3b7856139/
YouTube: https://www.youtube.com/c/KnowledgeEducationHub
Specialization: Geostatistics, GIS & Groundwater resource management
Experience: Six years’ experience in the field of Geostatistics, GIS and Groundwater resource
management
A course work on R programming for basic to advanced statistics and GIS
Serial No.  Contents                                                    Time

1   R and RStudio installation, Packages                                10:00 AM–10:30 AM

2   Part 1 Basic Statistics by R Programming                            10:30 AM–12:00 PM
    1. The Fundamentals of Descriptive Statistics
    2. Box Plot, Bar Plot, Histogram Plot, QQ Plot
    3. Measures of Central Tendency (Mean, Median & Mode)
    4.1 Skewed Plot
    4.2 Normal Distribution
    4.3 Standard Normal Distribution
    4.4 Central Limit Theorem
    4.5 Different Statistical Errors (Introduction)
    5. P Value (Introduction)
    6. Regression Analysis

3   Part 2 Advanced Statistics by R Programming                         12:30 PM–3:30 PM
    1.1 Mathematical Polynomial Trend Surface Identification
    1.2 Trend Removal
    2.1 Mann-Kendall Test
    2.2 Sen's Slope

4   Part 3 GIS with R Programming                                       4:00 PM–5:30 PM
    1. Polygon Shapefile, Line Shapefile, Point Shapefile
    2. Clipping and Mask
    3. Spatial Plot, Level Plot
    4. Contouring
    5. Image Stacking and Regression, Pixel-to-Pixel, Box Plots over Raster Surfaces
R and RStudio Installation
First install R (version 4.1.0 or above), then install RStudio, from the websites
(https://www.r-project.org/ & https://www.rstudio.com/products/rstudio/download/)
Code for package installation:
#install.packages("package name")
Or open RStudio, click on Install, and type the package name in the browser (figure 1)
(IDE interface of RStudio)
Preliminary requirements for R programming
Laptop or PC (4 GB RAM)
Good internet connection, e.g. Jio 4G VoLTE
Code (this will be sent to all participants before the course starts)
Make a folder on your desktop and name it R
After installing R and RStudio, open the RStudio (Integrated Development
Environment) interface.
Set the working directory to the path of your folder (R):
#setwd("C:/Users/dell/Desktop/R")
Install the following packages
raster
rasterVis
zoo
xts
ppcor
rts
rgdal
spatialEco
Kendall
readr
readxl
gstat
sp
lattice
ggplot2
rgeos
spacetime
RColorBrewer
latticeExtra
map
if(!require(psych)){install.packages("psych")}
if(!require(DescTools)){install.packages("DescTools")}
if(!require(Rmisc)){install.packages("Rmisc")}
if(!require(FSA)){install.packages("FSA")}
if(!require(plyr)){install.packages("plyr")}
if(!require(boot)){install.packages("boot")}
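The repeated `if(!require(...)){install.packages(...)}` pattern above can be collapsed into a short loop over the whole package list. A sketch, assuming all the names listed are CRAN packages:

```r
# Install any of the listed packages that are missing, then load them.
pkgs <- c("raster", "rasterVis", "zoo", "xts", "ppcor", "rts", "rgdal",
          "spatialEco", "Kendall", "readr", "readxl", "gstat", "sp",
          "lattice", "ggplot2", "rgeos", "spacetime", "RColorBrewer",
          "latticeExtra", "psych", "DescTools", "Rmisc", "FSA",
          "plyr", "boot")
for (p in pkgs) {
  if (!require(p, character.only = TRUE, quietly = TRUE)) {
    install.packages(p)                 # fetch from CRAN if not installed
    library(p, character.only = TRUE)   # then attach it
  }
}
```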
Part 1 Basic Statistics by R Programming
Descriptive Statistics
(Source: Sandeep Kumar, https://www.udemy.com/user/sandeepkumar1/)
Kurtosis
Kurtosis is a measure of the tailedness of a distribution. Tailedness is how
often outliers occur. Excess kurtosis is the tailedness of a distribution relative to a normal
distribution.
Distributions with medium kurtosis (medium tails) are mesokurtic.
Distributions with low kurtosis (thin tails) are platykurtic.
Distributions with high kurtosis (fat tails) are leptokurtic.
Tails are the tapering ends on either side of a distribution. They represent the probability or
frequency of values that are extremely high or low compared to the mean. In other words, tails
represent how often outliers occur.
(Source https://www.scribbr.com/statistics/kurtosis/)
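Kurtosis is easy to compute with the `psych` package listed earlier (its `kurtosi()` function reports excess kurtosis, so a normal sample comes out near 0). A sketch comparing a mesokurtic and a leptokurtic sample:

```r
# Compare the excess kurtosis of a normal sample with a heavy-tailed one.
library(psych)
set.seed(42)
x_norm  <- rnorm(10000)        # mesokurtic: excess kurtosis near 0
x_heavy <- rt(10000, df = 3)   # t with few df: fat tails, leptokurtic
kurtosi(x_norm)                # close to 0
kurtosi(x_heavy)               # clearly positive (fat tails)
```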
Box and Whisker Plots
A Box and Whisker plot shows the five-number summary of a set of data: the
minimum, first quartile, median, third quartile, and maximum.
QQ Plot
Q-Q (quantile-quantile) plots are used in statistics to graphically analyze and compare two
probability distributions by plotting their quantiles against each other. If the two distributions
under consideration are exactly equal, the points on the Q-Q plot will lie perfectly on the straight
line y = x. As a data scientist or statistician, you need to know whether a distribution is normal
or not in order to apply various statistical measures to the data and interpret it in a much more
human-understandable visual representation, which is where the Q-Q plot comes in. Q-Q plots
are used to determine the type of distribution of a random variable, such as a Gaussian (normal),
uniform, or exponential distribution. (Source:
https://towardsdatascience.com/q-q-plots-explained-5aa8495426c0)
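Both plots need only base R. A short sketch on a simulated sample:

```r
# Five-number summary, box plot, and Q-Q plot for a simulated sample.
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)
summary(x)        # min, 1st quartile, median, 3rd quartile, max (plus mean)
boxplot(x, main = "Box and Whisker Plot")
qqnorm(x)         # sample quantiles vs theoretical normal quantiles
qqline(x)         # points near this line suggest normality
```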
Normal Distribution
The normal distribution, also known as the Gaussian distribution, is a probability
distribution that is symmetric about the mean, showing that data near the mean are more frequent
in occurrence than data far from the mean. In graphical form, the normal distribution appears as a
"bell curve". (Source: https://www.investopedia.com/terms/n/normaldistribution.asp)
Normality Tests in R: A data set is said to be normally distributed if its skewness is zero and its
kurtosis is 3. There are several methods in R for testing the normality of a data set; these are:
1. (Visual Method) Create a histogram.
If the histogram is roughly “bell-shaped”, then the data is assumed to be normally
distributed.
2. (Visual Method) Create a Q-Q plot.
If the points in the plot roughly fall along a straight diagonal line, then the data is assumed
to be normally distributed.
3. (Formal Statistical Test) Perform a Shapiro-Wilk Test.
If the p-value of the test is greater than α = .05, then the data is assumed to be normally
distributed.
4. (Formal Statistical Test) Perform a Kolmogorov-Smirnov Test.
If the p-value of the test is greater than α = .05, then the data is assumed to be normally
distributed.
5. (Formal Statistical Test) Perform a Jarque-Bera Normality Test.
If the p-value of the test is greater than α = .05, then the data is assumed to be normally
distributed.
6. (Formal Statistical Test) Perform an Anderson-Darling Test.
An Anderson-Darling Test is a goodness of fit test that measures how well your data fit a
specified distribution. The null hypothesis for the A-D test is that the data does follow a
normal distribution. Thus, if our p-value for the test is below our significance level
(common choices are 0.10, 0.05, and 0.01), then we can reject the null hypothesis and
conclude that we have sufficient evidence to say our data does not follow a normal
distribution.
7. (Formal Statistical Test) Perform a Chi-Square Goodness of Fit Test.
The Chi-Square Test for Normality allows us to check whether or not data follow an
approximately normal distribution. To apply it to any data set, let your null hypothesis be
that your data are sampled from a normal distribution and apply the Chi-Square Goodness
of Fit Test. Given your mean and standard deviation, you will need to calculate the
expected values under the normal distribution for every data point. Then use the formula
χ² = Σ (Oi − Ei)² / Ei
(where Oi are the observed and Ei the expected counts) to find the chi-square statistic.
Compare this to the critical chi-square value from a chi-square table, given your degrees
of freedom and desired alpha level. If your chi-square statistic is larger than the table
value, you may conclude your data are not normal.
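Most of these checks are one-liners in R. A sketch on a single sample, using base `stats` for the histogram, Q-Q plot, Shapiro-Wilk and Kolmogorov-Smirnov tests; the Jarque-Bera and Anderson-Darling versions shown here are the ones in the `DescTools` package listed earlier (an assumption; other packages such as `nortest` provide equivalents):

```r
# Visual and formal normality checks on one simulated sample.
library(DescTools)
set.seed(7)
x <- rnorm(100)
hist(x)                               # 1. roughly bell-shaped?
qqnorm(x); qqline(x)                  # 2. points near the diagonal?
shapiro.test(x)                       # 3. p > 0.05: no evidence against normality
ks.test(x, "pnorm", mean(x), sd(x))   # 4. Kolmogorov-Smirnov
JarqueBeraTest(x)                     # 5. Jarque-Bera
AndersonDarlingTest(x, "pnorm", mean = mean(x), sd = sd(x))  # 6. Anderson-Darling
```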
Reasons for a Non-Normal Distribution
1. Outliers can cause your data to become skewed. The mean is especially sensitive to outliers.
Try removing any extreme high or low values and testing your data again.
2. Multiple distributions may be combined in your data, giving the appearance of
a bimodal or multimodal distribution. For example, two sets of normally distributed test
results combined together can give the appearance of bimodal data.
3. Insufficient data can cause a normal distribution to look completely scattered.
Dealing with Non-Normal Distributions
You have several options for handling non-normal data. Many tests, including the one-sample
Z test, the t test and ANOVA, assume normality. You may still be able to run these tests if
your sample size is large enough (usually over 20 items). You can also choose to transform the
data with a function, forcing it to fit a normal model. However, if you have a very small sample,
a sample that is skewed, or one that naturally fits another distribution type, you may want to run
a non-parametric test. A non-parametric test is one that doesn't assume the data fit a specific
distribution type. Non-parametric tests include the Wilcoxon signed-rank test, the
Mann-Whitney U test and the Kruskal-Wallis test.
(Source: https://www.statisticshowto.com/probability-and-statistics/non-normal-distributions/)
Standard Normal Distribution
The standard normal distribution, also called the z-distribution, is a special normal
distribution where the mean is 0 and the standard deviation is 1. Any normal distribution can be
standardized by converting its values into z-scores. Z-scores tell you how many standard
deviations from the mean each value lies. (Source:
https://www.scribbr.com/statistics/standard-normal-distribution/)
The probability density function for the normal distribution with mean μ and standard
deviation σ is given by the function
f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))
If we let the mean μ = 0 and the standard deviation σ = 1 in this probability density function,
we get the probability density function for the standard normal distribution:
f(z) = (1 / √(2π)) e^(−z² / 2)
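This density is available directly in R: `dnorm()` with its default arguments (mean 0, sd 1) evaluates exactly the standard normal density, and a z-score is computed by the usual (x − μ)/σ standardization. The N(50, 10) example below is purely illustrative:

```r
# dnorm() with default mean 0 and sd 1 is the standard normal density.
dnorm(0)                                       # 1/sqrt(2*pi) ≈ 0.3989
all.equal(dnorm(1), exp(-0.5) / sqrt(2 * pi))  # matches the formula: TRUE
# z-scores: standardize a value by (x - mu) / sigma
z <- (75 - 50) / 10   # x = 75 from a N(mean = 50, sd = 10) distribution
z                     # 2.5 standard deviations above the mean
```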
68%-95%-99.7% Rule
The 68%-95%-99.7% rule is a rule of thumb that allows practitioners of statistics to estimate
the probability that a randomly selected number from the standard normal distribution occurs
within 1, 2, and 3 standard deviations of the mean of zero.
(Source: https://mse.redwoods.edu/darnold/math15/UsingRInStatistics/StandardNormal.php)
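The rule is easy to verify with `pnorm()`, the cumulative distribution function of the standard normal:

```r
# Check the 68-95-99.7 rule with pnorm(), the standard normal CDF.
pnorm(1) - pnorm(-1)   # ≈ 0.6827
pnorm(2) - pnorm(-2)   # ≈ 0.9545
pnorm(3) - pnorm(-3)   # ≈ 0.9973
```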
Central Limit Theorem
The Central Limit Theorem tells us that the distribution of sample means x̄, of samples of
size n taken from any given population:
1. becomes more "normal" in shape as n increases;
2. has a mean that agrees with the population mean, μ; and
3. has a standard deviation equal to σ/√n, where σ is the standard deviation of the population.
(Source: http://mathcenter.oxford.emory.edu/site/home/futurePages/rProjectCentralLimitTheorem)
In this project, we will construct a population and then approximate the distributions of sample
means for various sample sizes through repeated sampling, so that we can "see" this theorem in
action through a sequence of histograms.
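A minimal version of that project in base R: draw repeated samples from a skewed (exponential) population and watch the histogram of sample means become more bell-shaped as n grows.

```r
# CLT in action: sample means from a skewed (exponential, mean 1) population
# look increasingly normal as the sample size n grows.
set.seed(123)
par(mfrow = c(1, 3))
for (n in c(2, 10, 50)) {
  means <- replicate(5000, mean(rexp(n, rate = 1)))
  hist(means, breaks = 40, main = paste("n =", n), xlab = "sample mean")
}
par(mfrow = c(1, 1))
```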
Types of Errors in Statistics
There are two types of error in statistics: Type I and Type II. In a statistical test, a
Type I error is the rejection of a true null hypothesis. In contrast, a Type II error is the failure
to reject a false null hypothesis. Much of statistical methodology revolves around reducing one
or both kinds of error, although completely eliminating either of them is impossible. But by
choosing a low threshold value and adjusting the alpha level, the quality of the hypothesis test
can be improved. Information on Type I and Type II errors is used in biometrics, medical
science, and computer science. (Source: https://statanalytica.com/blog/types-of-error-
in-statistics/)
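A quick simulation makes the Type I error rate concrete: when the null hypothesis is true, a test run at α = 0.05 should reject it about 5% of the time.

```r
# Simulated Type I error rate: t-testing samples whose true mean really is 0
# at alpha = 0.05 should reject roughly 5% of the time.
set.seed(11)
pvals <- replicate(2000, t.test(rnorm(20))$p.value)
mean(pvals < 0.05)   # close to the nominal 0.05
```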
Part 2 Advanced Statistics by R Programming
Trend Surface Analysis
Trend surface analysis attempts to decompose each observation on a spatially
distributed variable into a component associated with any regional trends present in the data and
a component associated with purely local effects. This separation into two components is
accomplished by fitting a best-fit surface using standard regression techniques. The values
predicted by this trend surface are assigned to the regional effects, whereas the local departures
of the observed data from it, or residuals, are assigned to the local effects. In order to plot values
on a map the geographer needs three pieces of information: the x, y spatial co-ordinates of each
point, together with the height above some datum, the z co-ordinate. The z values might relate to
any variable, but the whole operation defines a spatial series in which the z observations are
ordered with respect to the two spatial co-ordinates, x and y. The map would be completed by
drawing lines of equal z value (contours or isolines) through the points. The resulting
contour-type map defines a complex surface which in most cases will reveal a spatial structure or
form. A trend surface analysis assumes that each mapped value can be decomposed into two
components that arise from two scales of process:
A) According to Krumbein and Graybill (1965) this trend is 'associated with 'large scale' systematic
changes that extend from one map edge to the other'. Similarly, Grant (1961) defines trend as '. . .
that part of the data that varies smoothly. In other words, it is the function that behaves predictably'.
B) The combined result of two processes that operate over an area substantially smaller than the
study area, random fluctuations and errors of measurement. This forms an assumed error, local
component, or residual defined by Krumbein and Graybill as '. . . apparently non-systematic
fluctuations that are superimposed on the large scale patterns'. It is important to notice that at the
scale of observation, these residuals appear to be spatially random; they may prove to be
systematically related to a spatial process but at this scale they do not vary systematically over the
mapped area.
Mathematically:
observed value of the surface at a point = trend component at that point + residual at that point
If component (A), the trend, varies smoothly over space, its value (height) at any particular point
can be expressed in terms of the spatial co-ordinates of that point, so that the basic equation of
any trend analysis becomes
Zi = f(Xi, Yi) + Ui
Zi = the observed value of the surface at the ith point.
Xi = the co-ordinate on the x-axis (easting) of the ith data point.
Yi = the co-ordinate on the y-axis (northing) of the ith data point.
Ui = the residual at the ith data point.
f denotes "some function", and thus the term f(Xi, Yi) indicates the trend component. By function
we simply mean that if we know the location of any point i as a pair of spatial co-ordinates Xi and
Yi, then the height of the trend component at this point can be found by simply plugging these Xi
and Yi into a known equation (or function). It follows that we can calculate a trend component
for any combination of X and Y, so that f(Xi, Yi) denotes a complete surface of trend components
called the trend surface. Trend surface analysis is very important to a fundamental concept of
geostatistics when dealing mathematically with spatial data that exhibit some areas of large
values and other areas of smaller values. A notion of geographical trend then arises, related to
the position of the spatial data, because geostatistical estimation assumes stationarity, and far
away from the spatial data it would estimate with the global mean (simple kriging) (Vieira et al.,
2010). The order of the stationarity hypothesis depends on the order of the statistical moments
required to be stationary. Thus, when second-order stationarity is needed, at least the
second-order moments (mean and variance) should be stationary. A collection of n values of
Z(Xi) is intrinsically stationary if the mean E exists and does not depend on the geographic
location Xi, i.e. if it follows equations 1 and 2:
E{Z(Xi)} = m ------------- (1)
The increment [Z(Xi) − Z(Xi + h)] has finite variance and does not depend on the geographic
location Xi. This condition can be written in the form of the equation:
VAR[Z(Xi) − Z(Xi + h)] = E[Z(Xi) − Z(Xi + h)]² ---------------- (2)
where m is the mean and h is a small increment in position Xi.
Order of the model
The order of the polynomial model is kept as low as possible. Some transformations can be used
to keep the model to first order. If this is not satisfactory, then a second-order polynomial is
tried. Arbitrary fitting of higher-order polynomials can be a serious abuse of regression analysis.
A model which is consistent with the knowledge of the data and its environment should be taken
into account. It is always possible for a polynomial of order n − 1 to pass through n points, so
that a polynomial of sufficiently high degree can always be found that provides a "good" fit to
the data. Such models neither enhance the understanding of the unknown function nor serve as a
good predictor.
First Degree Polynomial Function
First degree polynomials have terms with a maximum degree of 1. In other words, you
wouldn’t usually find any exponents in the terms of a first degree polynomial. For example,
the following are first degree polynomials:
2x + 1,
xyz + 50,
10a + 4b + 20.
Second Degree Polynomial Function
Second degree polynomials have at least one second degree term in the expression (e.g. 2x²,
a², xyz²). There are no higher terms (like x³ or abc⁵). The quadratic function
f(x) = ax² + bx + c is an example of a second degree polynomial.
General Form of Nth Degree Polynomial
The general form used to represent an nth degree polynomial is
P(x) = a_n x^n + a_(n-1) x^(n-1) + a_(n-2) x^(n-2) + ... + a_1 x + a_0
Here, a_0, a_1, a_2, ..., a_n are the coefficients, which take numerical values, x is the
variable, and n is the degree of the polynomial, which is a whole number.
Mathematical Polynomial Surface Trend Identification
(Source: David J. Unwin, 1978)
Third Degree Polynomial Function
A cubic function (or third-degree polynomial) can be written as
f(x) = ax³ + bx² + cx + d
where a, b, c, and d are constant terms, and a is nonzero.
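Polynomial trend surfaces of these orders can be fitted with ordinary least squares via `lm()`. A sketch on synthetic spatial data (the coordinates and surface below are simulated for illustration): the fitted values are the trend component and the residuals are the local component described earlier.

```r
# Fit first- and second-order polynomial trend surfaces to synthetic spatial
# data with lm(); residuals form the detrended (local) component.
set.seed(9)
n <- 100
x <- runif(n)
y <- runif(n)
z <- 5 + 2 * x - 3 * y + 1.5 * x * y + rnorm(n, sd = 0.2)  # synthetic surface
ts1 <- lm(z ~ x + y)                 # first-order (planar) trend surface
ts2 <- lm(z ~ x + y + I(x^2) + I(y^2) + I(x * y))  # second-order surface
local1 <- residuals(ts1)             # local component after trend removal
summary(ts1)$r.squared               # share of variance explained by the trend
```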
Mann Kendall Test on Time Series Data
The Mann-Kendall Trend Test (sometimes called the M-K test) is used to analyze data
collected over time for consistently increasing or decreasing (monotonic) trends in Y values. It is
a non-parametric test, which means it works for all distributions (i.e. your data don't have to
meet the assumption of normality), but your data should have no serial correlation. Sen's slope is
a procedure for estimating the slope of the trend in a sample of n pairs of data. It is a linear
model Y(t) and can be described as Y(t) = Mt + C, where M is the slope and C is a constant. The
slopes M of all data pairs are calculated as:
M = (Xj − Xk) / (j − k)
where j > k. The median of the N values of M, ranked from smallest to largest, is called Sen's
slope. The confidence interval about the slope (Gilbert 1987) can be calculated as:
C.I.α = Z(1−α/2) √Var(S)
where Var(S) is defined in equation (3) and Z(1−α/2) is taken from the standard normal
distribution table.
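Both steps are short in R. The Mann-Kendall test comes from the `Kendall` package listed earlier; Sen's slope is computed here by hand as the median of all pairwise slopes (dedicated functions exist in other packages, e.g. `trend`, which is not in the list above):

```r
# Mann-Kendall trend test plus a hand-rolled Sen's slope:
# the median of all pairwise slopes (Xj - Xk)/(j - k), j > k.
library(Kendall)
set.seed(4)
x <- 1:30 + rnorm(30, sd = 2)       # noisy increasing series (true slope 1)
MannKendall(x)                      # small p-value indicates a monotonic trend
idx    <- combn(seq_along(x), 2)    # all index pairs (k in row 1, j in row 2)
slopes <- (x[idx[2, ]] - x[idx[1, ]]) / (idx[2, ] - idx[1, ])
median(slopes)                      # Sen's slope, close to 1 here
```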
Part 3 GIS with R Programming
Polygon Shapefile, Line Shapefile, Point Shapefile & Contouring
In this section, we will open and plot point, line, contour and polygon vector data stored
in shapefile format in R.
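A sketch with the `rgdal` package listed earlier. The layer names "wells", "rivers" and "districts" are hypothetical; substitute your own shapefiles in the working directory. Contouring is shown on `volcano`, a built-in elevation matrix:

```r
# Read and plot point, line and polygon shapefiles, then draw contours.
library(rgdal)
pts   <- readOGR(".", "wells")      # point shapefile (wells.shp)
lns   <- readOGR(".", "rivers")     # line shapefile (rivers.shp)
polys <- readOGR(".", "districts")  # polygon shapefile (districts.shp)
plot(polys)
plot(lns, add = TRUE, col = "blue")
points(pts, pch = 20)
contour(volcano)  # contouring a built-in elevation grid with base R
```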
Clipping and Crop
This tutorial will guide you step by step through masking and cropping a raster with a
shapefile in R.
(Source: https://desktop.arcgis.com/en/arcmap/10.3/tools/analysis-toolbox/clip.htm)
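The core of that workflow is two calls in the `raster` package. A sketch; the file names "dem.tif" and the "basin" shapefile layer are hypothetical placeholders for your own data:

```r
# Crop a raster to a polygon's extent, then mask cells outside the polygon.
library(raster)
library(rgdal)
r    <- raster("dem.tif")        # raster to be clipped
poly <- readOGR(".", "basin")    # clipping polygon (basin.shp)
r_crop <- crop(r, extent(poly))  # rectangular crop to the polygon's bounding box
r_mask <- mask(r_crop, poly)     # cells outside the polygon become NA
plot(r_mask)
plot(poly, add = TRUE)
```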
Spatial Plot, Level plot
In this part we will get an introduction to analyzing spatial data in R, specifically through
map-making with R's 'base' graphics and various dedicated map-making packages. It teaches the
basics of using R as a fast, user-friendly and extremely powerful command-line Geographic
Information System (GIS).
Image Stacking, Pixel-to-Pixel Regression and Box Plots over Rasters
In this part we will learn about image stacking of different raster bands, and about
regression and box plots over raster surfaces.
Image stacking of raster Surfaces
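All three operations fit in a few lines with the `raster` package. A self-contained sketch using two small synthetic layers in place of real imagery:

```r
# Stack two synthetic raster layers, run a pixel-to-pixel regression,
# and draw box plots of each layer's cell values.
library(raster)
set.seed(2)
r1 <- raster(matrix(runif(100), 10, 10))
r2 <- r1 * 2 + raster(matrix(rnorm(100, sd = 0.1), 10, 10))
s  <- stack(r1, r2)                # image stacking of the two layers
names(s) <- c("band1", "band2")
fit <- lm(values(s$band2) ~ values(s$band1))  # pixel-to-pixel regression
coef(fit)                          # slope close to 2 by construction
boxplot(s)                         # box plots over the raster surfaces
```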
References
Drápela, K. & Drápelová, I. 2011. "Application of Mann-Kendall Test and Sen's Slope Estimates
for Trend Detection in Deposition Data from Bílý Kříž (Beskydy Mts.)." Beskydy 4(2): 133–146.
Gilbert, R. O. 1988. “Statistical Methods for Environmental Pollution Monitoring.” Biometrics 44(1): 319.
https://desktop.arcgis.com/en/arcmap/10.3/tools/analysis-toolbox/clip.htm
https://statanalytica.com/blog/types-of-error-in-statistics/
http://mathcenter.oxford.emory.edu/site/home/futurePages/rProjectCentralLimitTheorem
https://towardsdatascience.com/q-q-plots-explained-5aa8495426c0
https://www.investopedia.com/terms/n/normaldistribution.asp
https://www.r-project.org/
https://www.rstudio.com/products/rstudio/download/
https://www.udemy.com/user/sandeepkumar1/
Kampata, J. M., B. P. Parida, and D. B. Moalafhi. 2008. “Trend Analysis of Rainfall in the
Headstreams of the Zambezi River Basin in Zambia.” Physics and Chemistry of the Earth 33(8–
13): 621–25.
Pyrcz, M. J., & Deutsch, C. V. (2014). Geostatistical reservoir modeling. Oxford university press.
Silva, Richarde Marques et al. 2015. “Rainfall and River Flow Trends Using Mann–Kendall and
Sen’s Slope Estimator Statistical Tests in the Cobres River Basin.” Natural Hazards 77(2): 1205–
21.
Vekaria, R. M., Shirley, D. G., Sévigny, J., & Unwin, R. J. (2006). Immunolocalization of
ectonucleotidases along the rat nephron. American Journal of Physiology-Renal
Physiology, 290(2), F550-F560.
Vieira, Sidney Rosa, José Ruy Porto de Carvalho, Marcos Bacis Ceddia, and Antonio Paz
González. 2010. “Detrending Non Stationary Data for Geostatistical Applications.” Bragantia
69(suppl): 01–08.
Wu, Chunfa et al. 2011. “Spatial Interpolation of Severely Skewed Data with Several Peak Values
by the Approach Integrating Kriging and Triangular Irregular Network Interpolation.”
Environmental Earth Sciences 63(5): 1093–1103.