A course work on R programming for
basics to advance statistics and GIS
SEEMAB AKHTAR
PREFACE
R has been available as free software since 1995 and is today one of the most popular
programming languages among data scientists around the world. Its many packages and functions
make it attractive for data analysis, data wrangling, data visualization and machine learning,
and it is open source. This course covers R applications from traditional to advanced statistics
and GIS, including models, graphs, descriptive statistics, mathematical trend modeling and
spatial plots. R is designed as a tool that helps scientists analyze data, and it has many
excellent functions for making plots and fitting models to data. Because of this, many
statisticians learn to use R as if it were a piece of software: they discover which functions
accomplish what they need and ignore the rest.
Speaker & Instructor
SEEMAB AKHTAR
M.Tech (Mineral Exploration)
IIT (ISM) Dhanbad
M.Sc. (Applied Geology)
University of Allahabad
Email: akhtariitdhn@gmail.com
Social site: https://www.linkedin.com/in/seemab-akhtar-3b7856139/
YouTube: https://www.youtube.com/c/KnowledgeEducationHub
Specialization: Geostatistics, GIS & Groundwater resource management
Experience: Six years’ experience in the field of Geostatistics, GIS and Groundwater resource
management
A course work on R programming for basics to advance statistics and GIS
Serial No.   Contents                                                      Time
1            R and RStudio installation, packages                          10:00 AM to 10:30 AM
2            Part 1 Basic statistics by R programming                      10:30 AM to 12:00 PM
             1. The Fundamentals of Descriptive Statistics
             2. Box Plot, Bar Plot, Histogram Plot, QQ Plot
             3. Measures of Central Tendency (Mean, Median & Mode)
             4.1 Skewed Plot
             4.2 Normal Distribution
             4.3 Standard Normal Distribution
             4.4 Central Limit Theorem
             4.5 Different Statistical Errors (Introduction)
             5. P value (Introduction)
             6. Regression Analysis
3            Part 2 Advanced Statistics by R Programming                   12:30 PM to 3:30 PM
             1.1 Mathematical Polynomial Trend Surface Identification
             1.2 Trend Removal
             2.1 Mann Kendall Test
             2.3 Sen’s Slope
4            Part 3 GIS with R Programming                                 4:00 PM to 5:30 PM
             1. Polygonal Shape File, Line Shape File, Point Shape File
             2. Clipping and Mask
             3. Spatial Plot, Level Plot
             4. Contouring
             5. Image Stacking and Regression, Pixel-Pixel, Box Plots over Raster Surfaces
R and RStudio Installation
First install R (version 4.1.0 or above), then install RStudio, from these websites:
(https://www.r-project.org/ & https://www.rstudio.com/products/rstudio/download/)
Code for installing packages
install.packages("package name")
Alternatively, open RStudio, click Install in the Packages pane and type the package name in the dialog box (figure 1).
(Figure 1: IDE interface of RStudio)
Preliminary requirements for R programming
• Laptop or PC (4 GB RAM)
• Good internet connection, such as Jio 4G VoLTE
• Code (this will be sent to all participants before the course starts)
• Make a folder on your desktop and name it R
• After installing R and RStudio, open the RStudio (Integrated Development
Environment) interface.
• Set the working directory by giving the path to your folder (R):
• setwd("C:/Users/dell/Desktop/R")
Install the following packages
• raster
• rasterVis
• zoo
• xts
• ppcor
• rts
• rgdal
• spatialEco
• Kendall
• readr
• readxl
• gstat
• sp
• lattice
• ggplot2
• rgeos
• spacetime
• RColorBrewer
• latticeExtra
• map
• if(!require(psych)){install.packages("psych")}
• if(!require(DescTools)){install.packages("DescTools")}
• if(!require(Rmisc)){install.packages("Rmisc")}
• if(!require(FSA)){install.packages("FSA")}
• if(!require(plyr)){install.packages("plyr")}
• if(!require(boot)){install.packages("boot")}
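The install-if-missing pattern used in the last few lines can be applied to a whole package list at once. The following is a minimal sketch, not part of the original course code: the helper name missing_packages is hypothetical, the package vector is only a subset of the list above, and the install line is left commented out because it needs an internet connection.

```r
# Return the packages from `pkgs` that are not yet installed on this machine.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# A subset of the course packages, used here only as an illustration.
course_pkgs <- c("raster", "rasterVis", "zoo", "xts", "Kendall",
                 "readr", "readxl", "gstat", "sp", "lattice", "ggplot2")
to_install <- missing_packages(course_pkgs)
print(to_install)

# Uncomment to install whatever is missing (requires internet access):
# if (length(to_install) > 0) install.packages(to_install)
```

Running the check first avoids reinstalling packages that are already present.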
Part 1 Basic statistics by R programming
Descriptive Statistics
(Source: Sandeep Kumar, https://www.udemy.com/user/sandeepkumar1/)
Kurtosis
Kurtosis is a measure of the tailedness of a distribution. Tailedness is how
often outliers occur. Excess kurtosis is the tailedness of a distribution relative to a normal
distribution.
• Distributions with medium kurtosis (medium tails) are mesokurtic.
• Distributions with low kurtosis (thin tails) are platykurtic.
• Distributions with high kurtosis (fat tails) are leptokurtic.
Tails are the tapering ends on either side of a distribution. They represent the probability or
frequency of values that are extremely high or low compared to the mean. In other words, tails
represent how often outliers occur.
(Source https://www.scribbr.com/statistics/kurtosis/)
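The three kurtosis classes can be illustrated numerically. This is a minimal base R sketch under stated assumptions: the function names are hypothetical, the data are deterministic quantile grids standing in for real samples, and the cut-off of 0.5 used to label a distribution is an arbitrary illustrative choice, not a standard.

```r
# Excess kurtosis from central moments: m4 / m2^2 - 3 (0 for a normal distribution).
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Label a distribution; the cut-off `tol` is an arbitrary illustrative choice.
classify_kurtosis <- function(k, tol = 0.5) {
  if (k > tol) "leptokurtic" else if (k < -tol) "platykurtic" else "mesokurtic"
}

p <- ppoints(2000)            # evenly spaced probabilities in (0, 1)
samples <- list(
  normal  = qnorm(p),         # medium tails
  uniform = qunif(p),         # thin tails
  t5      = qt(p, df = 5)     # fat tails
)
kurt <- sapply(samples, excess_kurtosis)
labels <- sapply(kurt, classify_kurtosis)
print(round(kurt, 2))
print(labels)
```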
Box and Whisker Plots
A box and whisker plot shows the five-number summary of a data set: the minimum, lower
quartile, median, upper quartile and maximum.
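The five-number summary behind a box plot can be computed directly in base R. A small sketch with made-up values; the vector x is illustrative only.

```r
x <- c(2, 4, 4, 5, 7, 8, 9, 11, 30)   # illustrative data with one extreme value

# Minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(x)
print(five)

# boxplot.stats() applies the usual 1.5 x box-length rule to flag outliers.
outliers <- boxplot.stats(x)$out
print(outliers)

if (interactive()) {
  boxplot(x, horizontal = TRUE, main = "Box and whisker plot")
}
```

Points flagged in outliers are the ones drawn individually beyond the whiskers.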
QQ Plot
Q-Q (quantile-quantile) plots are used in statistics to graphically compare two probability
distributions by plotting their quantiles against each other. If the two distributions under
consideration are exactly equal, the points on the Q-Q plot lie on the straight line y = x. As a
data scientist or statistician, you often need to know whether data are normally distributed
before applying statistical methods that assume normality, and the Q-Q plot gives an easily
interpreted visual check. Q-Q plots can also be used to assess whether a random variable follows
another type of distribution, such as a uniform or exponential distribution. (Source:
https://towardsdatascience.com/q-q-plots-explained-5aa8495426c0)
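The construction of a normal Q-Q plot can be sketched in a few lines of base R. The helper qq_points is a hypothetical name, and the two data sets are deterministic quantile grids used as stand-ins for real samples.

```r
# Pair each sorted sample value with the matching standard normal quantile.
qq_points <- function(x) {
  data.frame(theoretical = qnorm(ppoints(length(x))), sample = sort(x))
}

p <- ppoints(100)
normal_like <- 10 + 2 * qnorm(p)   # normal shape with mean 10 and sd 2
skewed      <- qexp(p)             # right-skewed (exponential shape)

# Correlation of the Q-Q points: close to 1 when the points lie on a line.
r_normal <- with(qq_points(normal_like), cor(theoretical, sample))
r_skewed <- with(qq_points(skewed), cor(theoretical, sample))
print(c(normal = r_normal, skewed = r_skewed))

if (interactive()) {
  qqnorm(skewed); qqline(skewed)   # base R draws the same plot directly
}
```

The straight-line pattern for normal_like and the curved pattern for skewed are what the visual check looks for.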
Normal Distribution
Normal distribution, also known as the Gaussian distribution, is a probability
distribution that is symmetric about the mean, showing that data near the mean are more frequent
in occurrence than data far from the mean. In graphical form, the normal distribution appears as a
"bell curve". (Source: https://www.investopedia.com/terms/n/normaldistribution.asp)
Normality Test in R: A normal distribution has a skewness of zero and a kurtosis of 3 (excess
kurtosis of zero). There are seven common methods for checking the normality of a data set:
1. (Visual Method) Create a histogram.
• If the histogram is roughly “bell-shaped”, then the data is assumed to be normally
distributed.
2. (Visual Method) Create a Q-Q plot.
• If the points in the plot roughly fall along a straight diagonal line, then the data is assumed
to be normally distributed.
3. (Formal Statistical Test) Perform a Shapiro-Wilk Test.
• If the p-value of the test is greater than α = .05, normality is not rejected and the data is
assumed to be normally distributed.
4. (Formal Statistical Test) Perform a Kolmogorov-Smirnov Test.
• If the p-value of the test is greater than α = .05, normality is not rejected and the data is
assumed to be normally distributed.
5. (Formal Statistical Test) Perform a Jarque-Bera Normality Test.
• If the p-value of the test is greater than α = .05, normality is not rejected and the data is
assumed to be normally distributed.
6. (Formal Statistical Test) Perform an Anderson-Darling Test.
• An Anderson-Darling Test is a goodness-of-fit test that measures how well your data fit a
specified distribution. The null hypothesis for the A-D test is that the data do follow a
normal distribution. Thus, if the p-value for the test is below the significance level
(common choices are 0.10, 0.05, and 0.01), we can reject the null hypothesis and
conclude that we have sufficient evidence to say the data do not follow a normal
distribution.
7. (Formal Statistical Test) Perform a Chi-Square Goodness of Fit Test.
• The Chi-Square Test for Normality allows us to check whether or not the data follow an
approximately normal distribution. To apply it to any data set, let your null hypothesis be
that your data is sampled from a normal distribution and apply the Chi-Square Goodness of
Fit Test. Given your mean and standard deviation, group the data into classes and calculate
the expected frequency of each class under the normal distribution. Then use the formula
\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \]
where \(O_i\) and \(E_i\) are the observed and expected frequencies of class i, to find the
chi-square statistic. Compare this to the critical chi-square value from a chi-square table,
given your degrees of freedom and desired alpha level. If your chi-square statistic is larger
than the table value, you may conclude your data is not normal.
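Three of the formal tests listed above can be sketched with base R alone. This is an illustrative sketch under stated assumptions: the data are deterministic quantile grids rather than real measurements, chisq_normality is a hypothetical helper, and the choice of eight equal-probability classes is arbitrary. The Jarque-Bera and Anderson-Darling tests are not in base R and are left out here.

```r
p <- ppoints(200)
normal_like <- qnorm(p)   # deterministic stand-in for normally distributed data
skewed      <- qexp(p)    # deterministic stand-in for right-skewed data

# 3. Shapiro-Wilk test (package stats, shipped with R).
sw_normal <- shapiro.test(normal_like)$p.value
sw_skewed <- shapiro.test(skewed)$p.value

# 4. Kolmogorov-Smirnov test against a normal with the sample mean and sd.
#    Estimating the parameters from the same data makes this p-value approximate.
ks_skewed <- ks.test(skewed, "pnorm", mean(skewed), sd(skewed))$p.value

# 7. Chi-square goodness of fit with k equal-probability classes.
chisq_normality <- function(x, k = 8) {
  breaks   <- qnorm(seq(0, 1, length.out = k + 1), mean(x), sd(x))  # -Inf ... Inf
  observed <- as.vector(table(cut(x, breaks)))
  expected <- rep(length(x) / k, k)
  stat     <- sum((observed - expected)^2 / expected)
  df       <- k - 1 - 2            # two parameters (mean, sd) were estimated
  c(statistic = stat, p.value = pchisq(stat, df, lower.tail = FALSE))
}

print(c(shapiro_normal = sw_normal, shapiro_skewed = sw_skewed, ks_skewed = ks_skewed))
print(chisq_normality(normal_like))
print(chisq_normality(skewed))
```

In every case a p-value above the chosen α means normality is not rejected, while a p-value below it is evidence against normality.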
Reasons for the Non-Normal Distribution
1. Outliers can cause your data to become skewed. The mean is especially sensitive to outliers.
Try removing any extreme high or low values and testing your data again.
2. Multiple distributions may be combined in your data, giving the appearance of
a bimodal or multimodal distribution. For example, two sets of normally distributed test
results can combine to give the appearance of bimodal data.
3. Insufficient data can cause a normal distribution to look completely scattered.
Dealing with Non Normal Distributions
You have several options for handling your non normal data. Many tests, including the one
sample Z test, T test and ANOVA assume normality. You may still be able to run these tests if
your sample size is large enough (usually over 20 items). You can also choose to transform the data
with a function, forcing it to fit a normal model. However, if you have a very small sample, a sample
that is skewed or one that naturally fits another distribution type, you may want to run a non-
parametric test. A non-parametric test is one that doesn’t assume the data fits a specific distribution
type. Non parametric tests include the Wilcoxon signed rank test, the Mann-Whitney U Test and
the Kruskal-Wallis test.
(Source: https://www.statisticshowto.com/probability-and-statistics/non-normal-distributions/)
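Two of the options above, transforming the data and using a non-parametric test, can be sketched in base R. The data are deterministic log-normal quantiles chosen for illustration, and the skewness helper is a hypothetical function written for this example.

```r
# Moment-based skewness: 0 for a symmetric distribution.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Option 1: transform. Log-normal data become normal after taking logs.
x <- qlnorm(ppoints(200))        # strongly right-skewed stand-in data
skew_raw <- skewness(x)
skew_log <- skewness(log(x))
print(c(raw = skew_raw, log = skew_log))

# Option 2: non-parametric test. The Mann-Whitney U test is wilcox.test()
# with two samples; it makes no normality assumption.
group_a <- qlnorm(ppoints(30))
group_b <- 3 * group_a           # same shape, shifted upwards on the log scale
mw <- wilcox.test(group_a, group_b)
print(mw$p.value)
```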
Standard Normal Distribution
The standard normal distribution, also called the z-distribution, is a special normal
distribution where the mean is 0 and the standard deviation is 1. Any normal distribution can be
standardized by converting its values into z-scores. Z-scores tell you how many standard
deviations from the mean each value lies. (Source:
https://www.scribbr.com/statistics/standard-normaldistribution/). The probability density
function for the normal distribution having mean μ and standard deviation σ is given by the
function
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
If we let the mean μ = 0 and the standard deviation σ = 1 in this probability density function,
we get the probability density function for the standard normal distribution:
\[ f(x) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{x^2}{2}} \]
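Standardizing to z-scores and evaluating the normal density can be checked numerically. A short sketch with made-up values; normal_pdf is a hypothetical helper that writes out the density of a normal distribution with mean mu and standard deviation sigma, and it is compared against the built-in dnorm().

```r
x <- c(52, 61, 47, 70, 58, 66, 49, 63)   # illustrative measurements

# z-score: how many standard deviations each value lies from the mean.
z <- (x - mean(x)) / sd(x)
print(round(z, 2))

# Normal probability density written out explicitly.
normal_pdf <- function(x, mu = 0, sigma = 1) {
  exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
}

print(normal_pdf(0))          # peak height of the standard normal curve
print(dnorm(0))               # built-in equivalent
```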
68%-95%-99.7% Rule
The 68% - 95% - 99.7% is a rule of thumb that allows practitioners of statistics to estimate
the probability that a randomly selected number from the standard normal distribution occurs
within 1, 2, and 3 standard deviations of the mean at zero.
(Source: https://mse.redwoods.edu/darnold/math15/UsingRInStatistics/StandardNormal.php)
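The three percentages in the rule follow directly from the standard normal cumulative distribution function, which base R provides as pnorm(). A minimal check:

```r
k <- 1:3                               # number of standard deviations
coverage <- pnorm(k) - pnorm(-k)       # P(-k < Z < k) for standard normal Z
names(coverage) <- paste0("within_", k, "_sd")
print(round(100 * coverage, 2))        # percentages behind the rule of thumb
```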
Central Limit Theorem
The Central Limit Theorem tells us that the distribution of sample means x̄, of samples of
size n taken from any given population (with finite variance):
1. Becomes more "normal" in shape as n increases;
2. Has a mean that agrees with the population mean, μ; and
3. Has a standard deviation equal to σ/√n, where σ is the standard deviation of the population.
(Source: http://mathcenter.oxford.emory.edu/site/home/futurePages/rProjectCentralLimitTheorem)
In this project, we will construct a population, and then approximate the distributions of sample
means for various sample sizes through repeated sampling, so that we can "see" this theorem in
action through a sequence of histograms -- as suggested by the below graphic
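A repeated-sampling experiment of this kind might look like the sketch below. The choices here are assumptions made for illustration, not the course's own project code: the population is exponential with rate 1 (so μ = 1 and σ = 1), 5000 samples are drawn per sample size, and sample_means is a hypothetical helper.

```r
set.seed(123)
mu <- 1; sigma <- 1          # mean and sd of an exponential(rate = 1) population

# Draw `reps` samples of size n and return their means.
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

sizes <- c(2, 10, 30)
means <- lapply(sizes, sample_means)

summary_table <- data.frame(
  n                 = sizes,
  mean_of_means     = sapply(means, mean),   # should be close to mu
  sd_of_means       = sapply(means, sd),     # should be close to sigma / sqrt(n)
  sigma_over_sqrt_n = sigma / sqrt(sizes)
)
print(summary_table)

if (interactive()) {
  op <- par(mfrow = c(1, 3))
  for (i in seq_along(sizes)) hist(means[[i]], main = paste("n =", sizes[i]), xlab = "sample mean")
  par(op)
}
```

The histograms become narrower and more bell-shaped as n grows, even though the population itself is strongly skewed.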
Types of Errors in Statistics
There are two types of error in statistics: type I and type II. In a statistical test,
a type I error is the rejection of a true null hypothesis. In contrast, a type II error is the
non-rejection of a false null hypothesis. Much of statistical methodology revolves around
reducing one or both kinds of error, although completely eliminating either of them is
impossible. By choosing a low threshold value and adjusting the alpha level, the quality of
the hypothesis test can be improved. Knowledge of type I and type II errors is used in
biometrics, medical science and computer science.
(https://statanalytica.com/blog/types-of-error-in-statistics/).
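Both error rates can be estimated by simulation. The sketch below is illustrative only: it assumes a one-sample t-test with 20 observations, α = 0.05, and a true mean of 0.5 for the alternative case; all of these numbers are arbitrary choices for the example.

```r
set.seed(42)
alpha  <- 0.05
n_sims <- 2000

# Type I error: the null hypothesis (mean = 0) is TRUE, but the test rejects it.
p_null <- replicate(n_sims, t.test(rnorm(20, mean = 0), mu = 0)$p.value)
type1_rate <- mean(p_null < alpha)

# Type II error: the null hypothesis is FALSE (true mean = 0.5), but the test
# fails to reject it.
p_alt <- replicate(n_sims, t.test(rnorm(20, mean = 0.5), mu = 0)$p.value)
type2_rate <- mean(p_alt >= alpha)

print(c(type1 = type1_rate, type2 = type2_rate))
```

The type I rate stays near α by construction, while the type II rate depends on the sample size and on how far the truth is from the null hypothesis.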
Part 2 Advanced Statistics by R Programming
Trend Surface Analysis
Trend surface analysis attempts to decompose each observation on a spatially distributed
variable into a component associated with any regional trends present in the data and a
component associated with purely local effects. This separation into two components is
accomplished by fitting a best-fit surface using standard regression techniques. The values
predicted by this trend surface are assigned to the regional effects, whereas the local
departures of the observed data from it, or residuals, are assigned to the local effects. In
order to plot values on a map the geographer needs three pieces of information: the x, y spatial
co-ordinates of each point together with the height above some datum, the z co-ordinate. The z
values might relate to any variable, but the whole operation defines a spatial series in which
the z observations are ordered with respect to the two spatial co-ordinates, x and y. The map
would be completed by drawing lines of equal z value (contours or isolines) through the points.
The resulting contour-type map defines a complex surface which in most cases will reveal a
spatial structure or form. A trend surface analysis assumes that each mapped value can be
decomposed into two components that arise from two scales of process:
A) The trend. According to Krumbein and Graybill (1965) this trend is 'associated with "large
scale" systematic changes that extend from one map edge to the other'. Similarly, Grant (1961)
defines trend as '. . . that part of the data that varies smoothly. In other words, it is the
function that behaves predictably'.
B) The combined result of processes that operate over an area substantially smaller than the
study area, random fluctuations and errors of measurement. This forms an assumed error, local
component, or residual, defined by Krumbein and Graybill as '. . . apparently non-systematic
fluctuations that are superimposed on the large scale patterns'. It is important to notice that
at the scale of observation, these residuals appear to be spatially random; they may prove to be
systematically related to a spatial process, but at this scale they do not vary systematically
over the mapped area.
Mathematically:
observed value of the surface at a point = trend component at that point + residual at that point
If component (A), the trend, varies smoothly over space, its value (height) at any particular
point can be expressed in terms of the spatial co-ordinates of that point, so that the basic
equation of any trend analysis becomes
\[ Z_i = f(X_i, Y_i) + U_i \]
\(Z_i\) = the observed value of the surface at the ith data point.
\(X_i\) = the co-ordinate on the x-axis (easting) of the ith data point.
\(Y_i\) = the co-ordinate on the y-axis (northing) of the ith data point.
\(U_i\) = the residual at the ith data point.
f denotes “some function”, and thus the term \(f(X_i, Y_i)\) indicates the trend component. By
function we simply mean that if we know the location of any point i as a pair of spatial
co-ordinates \(X_i\) and \(Y_i\), then the height of the trend component at this point can be
found by simply plugging these \(X_i\) and \(Y_i\) into a known equation (or function). It
follows that we can calculate a trend component for any combination of X and Y, so that
\(f(X_i, Y_i)\) denotes a complete surface of trend components, called the trend surface. Trend
surface analysis is a fundamental concept in geostatistics for dealing mathematically with
spatial data that exhibit areas of high values and other areas of low values. A geographical
trend then arises that is related to the position of the spatial data. This matters because
geostatistical estimation assumes stationarity, and locations far from the data are estimated
with the global mean (simple kriging) (S R Vieira et al., 2009). The order of the stationarity
hypothesis depends on the order of the statistical moments required to be stationary. Thus, when
second-order stationarity is needed, at least the first two moments (mean and variance) should
be stationary. For a collection of n values \(Z(X_i)\), the mean value \(E\{Z(X_i)\}\) exists
and does not depend on the geographic location \(X_i\); \(Z(X_i)\) is said to be intrinsic if it
follows equations 1 and 2:
\[ E\{Z(X_i)\} = m \qquad (1) \]
The variance of the increment \([Z(X_i) - Z(X_i + h)]\) is finite and does not depend on the
geographic location \(X_i\). This condition can be written in the form of an equation as:
\[ \mathrm{VAR}[Z(X_i) - Z(X_i + h)] = E[Z(X_i) - Z(X_i + h)]^2 \qquad (2) \]
where m is the mean and h is a small increment in position from \(X_i\).
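The decomposition into trend and residual can be sketched with an ordinary regression in base R. This is a minimal illustration on synthetic data, not the course data set: the grid, the true coefficients and the noise level are all invented for the example, and a first-order (planar) surface is assumed.

```r
set.seed(1)

# Synthetic observations on a 10 x 10 grid: planar trend plus local noise.
pts <- expand.grid(x = 0:9, y = 0:9)
pts$z <- 5 + 0.8 * pts$x - 0.3 * pts$y + rnorm(nrow(pts), sd = 0.2)

# First-order trend surface: z = b0 + b1 * x + b2 * y + residual.
fit <- lm(z ~ x + y, data = pts)
print(round(coef(fit), 3))

# Trend removal: the fitted values are the regional component and the
# residuals are the local component left after detrending.
pts$trend    <- fitted(fit)
pts$residual <- residuals(fit)
print(round(c(mean_residual = mean(pts$residual), sd_residual = sd(pts$residual)), 3))
```

The residual column is what would then be passed to a stationarity-based method such as kriging.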
Order of the model
The order of the polynomial model is kept as low as possible. Some transformations can be used
to keep the model first order. If this is not satisfactory, then a second-order polynomial is
tried. Arbitrary fitting of higher-order polynomials can be a serious abuse of regression
analysis. A model which is consistent with the knowledge of the data and its environment
should be preferred. It is always possible for a polynomial of order n − 1 to pass through
n points, so a polynomial of sufficiently high degree can always be found that provides a
“good” fit to the data. Such models neither enhance the understanding of the unknown function
nor make good predictors.
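The warning about high-order polynomials can be demonstrated on a tiny made-up data set: with six points, a fifth-order polynomial reproduces every observation exactly, yet that says nothing about how well it predicts. The numbers below are invented for illustration.

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)   # roughly linear, with small scatter

fit1 <- lm(y ~ x)             # first-order model
fit5 <- lm(y ~ poly(x, 5))    # order n - 1 = 5 passes through all n = 6 points

max_resid1 <- max(abs(residuals(fit1)))
max_resid5 <- max(abs(residuals(fit5)))
print(c(order1 = max_resid1, order5 = max_resid5))

# Predictions just outside the data range can differ considerably.
new_x <- data.frame(x = 7)
print(c(order1 = unname(predict(fit1, new_x)), order5 = unname(predict(fit5, new_x))))
```

A zero residual for the fifth-order fit reflects the number of parameters, not a better understanding of the underlying function.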
First Degree Polynomial Function
First degree polynomials have terms with a maximum degree of 1. In other words, you
wouldn’t usually find any exponents in the terms of a first degree polynomial. For example,
the following are first degree polynomials:
• 2x + 1,
• x + y + z + 50,
• 10a + 4b + 20.
Second Degree Polynomial Function
Second degree polynomials have at least one second degree term in the expression (e.g.
\(2x^2\), \(a^2\), \(xy\)). There are no higher terms (like \(x^3\) or \(abc^5\)). The quadratic
function \(f(x) = ax^2 + bx + c\) is an example of a second degree polynomial.
General Form of Nth Degree Polynomial
The general form used to represent an nth degree polynomial is
\[ P(x) = a_n x^n + a_{n-1} x^{n-1} + a_{n-2} x^{n-2} + \dots + a_1 x + a_0 \]
Here, \(a_0, a_1, a_2, \dots, a_n\) are the coefficients, which take numerical values, \(x\) is
the variable, and \(n\) is the degree of the polynomial, which is a whole number.
Mathematical Polynomial Surface Trend Identification
(Source: David J. Unwin, 1978)
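One way to identify the order of a polynomial trend surface is to fit successive orders and compare them. The sketch below uses synthetic data with a known quadratic term; the grid, coefficients and noise level are assumptions made for the example, and the comparison by F-test and AIC is one common choice rather than the only one.

```r
set.seed(2)

# Synthetic surface on an 11 x 11 grid with a genuine second-order term in x.
pts <- expand.grid(x = seq(0, 1, by = 0.1), y = seq(0, 1, by = 0.1))
pts$z <- 1 + 2 * pts$x - pts$y + 3 * pts$x^2 + rnorm(nrow(pts), sd = 0.05)

# First-order surface: plane in x and y.
order1 <- lm(z ~ x + y, data = pts)

# Second-order surface: adds the squared and cross-product terms.
order2 <- lm(z ~ x + y + I(x^2) + I(x * y) + I(y^2), data = pts)

# F-test for whether the extra second-order terms improve the fit.
comparison <- anova(order1, order2)
print(comparison)
print(c(AIC_order1 = AIC(order1), AIC_order2 = AIC(order2)))
```

If the higher order does not give a significant improvement, the lower-order surface is kept.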
Third Degree Polynomial Function
A cubic function (or third-degree polynomial) can be written as:
\[ f(x) = ax^3 + bx^2 + cx + d \]
where a, b, c, and d are constants, and a is nonzero.
Mann Kendall Test on Time Series Data
The Mann-Kendall Trend Test (sometimes called the M-K test) is used to analyze data
collected over time for consistently increasing or decreasing (monotonic) trends in Y values. It
is a non-parametric test, which means it works for all distributions (i.e. your data doesn’t
have to meet the assumption of normality), but your data should have no serial correlation.
Sen’s slope is a method for estimating the slope of a trend in a sample of N pairs of data. The
trend is assumed to follow a linear model Y(t) = Mt + C, where M is the slope and C is a
constant. The slopes of all data pairs are calculated as:
\[ M_i = \frac{X_j - X_k}{j - k} \]
where i = 1, 2, 3, ..., N and j > k.
The N values of \(M_i\) are ranked from the smallest to the largest, and their median is Sen’s
slope. The confidence interval about the slope (Gilbert 1987) can be calculated as:
\[ C_\alpha = Z_{1-\alpha/2} \sqrt{\mathrm{Var}(S)} \]
Var(S) is defined in equation (3) and \(Z_{1-\alpha/2}\) is obtained from the standard normal
distribution table.
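Both statistics can be written out in base R to show what is being computed. This is a sketch under stated assumptions: the function names are hypothetical, the series is made up, and the variance uses the standard formula for data without tied values, Var(S) = n(n − 1)(2n + 5)/18, with no tie correction.

```r
# Mann-Kendall test for a series without tied values.
mann_kendall <- function(x) {
  n   <- length(x)
  idx <- combn(n, 2)                          # all index pairs with k < j
  s   <- sum(sign(x[idx[2, ]] - x[idx[1, ]])) # S statistic
  var_s <- n * (n - 1) * (2 * n + 5) / 18     # variance of S (no ties)
  z <- if (s > 0) (s - 1) / sqrt(var_s) else if (s < 0) (s + 1) / sqrt(var_s) else 0
  list(S = s, var_S = var_s, Z = z, p.value = 2 * pnorm(abs(z), lower.tail = FALSE))
}

# Sen's slope: median of the slopes between all pairs of observations.
sens_slope <- function(x) {
  idx <- combn(length(x), 2)
  median((x[idx[2, ]] - x[idx[1, ]]) / (idx[2, ] - idx[1, ]))
}

# Illustrative yearly series with an upward trend.
series <- c(12.1, 12.4, 12.2, 12.9, 13.1, 13.0, 13.6, 13.8, 14.1, 14.0)
print(mann_kendall(series))
print(sens_slope(series))
```

For real analyses, the Kendall package from the install list provides a MannKendall() function that also handles ties.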
Part 3 GIS with R Programming
Polygonal Shape file, Line Shape File, Point Shape File & Contouring
In this section, we will open and plot point, line, contour and polygon vector data stored
in shapefile format in R.
Clipping and Crop
This tutorial will guide you through a step-by-step process to mask and crop a raster using a
shapefile in R.
(Source: https://desktop.arcgis.com/en/arcmap/10.3/tools/analysis-toolbox/clip.htm)
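The difference between the two operations can be shown on a plain matrix before working with real rasters. This is a package-free sketch of the idea only, not the raster package's interface: the matrix r stands in for a raster, and the logical matrix inside stands in for the cells covered by a polygon; both are invented for the example.

```r
# A 5 x 6 "raster" of cell values.
r <- matrix(1:30, nrow = 5, ncol = 6)

# Cells covered by a hypothetical polygon (TRUE = inside).
inside <- matrix(FALSE, nrow = 5, ncol = 6)
inside[2:4, 2:5] <- TRUE
inside[2, 2] <- FALSE          # make the shape non-rectangular

# Crop: cut the raster down to the bounding box of the shape.
rows <- range(which(rowSums(inside) > 0))
cols <- range(which(colSums(inside) > 0))
cropped <- r[rows[1]:rows[2], cols[1]:cols[2]]

# Mask: keep the full extent, but set cells outside the shape to NA.
masked <- r
masked[!inside] <- NA

print(cropped)
print(masked)
```

Cropping changes the extent, whereas masking changes the values; the two are usually applied together when clipping a raster to a shapefile.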
Spatial Plot, Level plot
In this part we will see an introduction to analyzing spatial data in R, specifically through
map-making with R’s ‘base’ graphics and various dedicated map-making packages. It teaches the
basics of using R as a fast, user-friendly and extremely powerful command-line Geographic
Information System (GIS).
Image Stacking and Regression pixel to pixel and box plots over Raster
In this part we will learn about image stacking of different raster bands, and about
regression and box plots of raster surfaces.
Image stacking of raster Surfaces
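The idea of pixel-to-pixel regression over a stack can be sketched with a three-dimensional array in base R. This is a package-free stand-in, assuming each array slice plays the role of one raster layer in time; the dimensions, slopes and noise level are synthetic. With real data, the layers would come from the raster package's stack() instead.

```r
set.seed(3)
n_row <- 4; n_col <- 5; n_time <- 6
time <- 1:n_time

# Each pixel gets its own true trend, from -1 to 1 per time step.
true_slope <- matrix(seq(-1, 1, length.out = n_row * n_col), n_row, n_col)

# "Stack" the layers: one n_row x n_col slice per time step.
layers <- array(NA_real_, dim = c(n_row, n_col, n_time))
for (t in time) {
  layers[, , t] <- 10 + true_slope * t + rnorm(n_row * n_col, sd = 0.01)
}

# Pixel-to-pixel regression: fit value ~ time separately for every pixel
# and keep the slope, giving a slope "raster".
pixel_slope <- apply(layers, c(1, 2), function(v) coef(lm(v ~ time))[2])
print(round(pixel_slope, 2))

# Five-number summary of each layer, as drawn in a box plot per layer.
layer_summary <- apply(layers, 3, fivenum)
print(round(layer_summary, 2))
```

The resulting slope matrix has the same rows and columns as the input layers, so it can be mapped like any other raster surface.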
References
Drapela K. & Drapelova I. 2011. “Application of Mann-Kendall test and Sen’s slope estimates for trend detection in deposition data from Bily Kriz, Mendelova Univerzita V Brne, Beskydy.” Vol 4(2), pp 133-146.
Gilbert, R. O. 1988. “Statistical Methods for Environmental Pollution Monitoring.” Biometrics 44(1): 319.
https://desktop.arcgis.com/en/arcmap/10.3/tools/analysis-toolbox/clip.htm
https://statanalytica.com/blog/types-of-error-in-statistics/
http://mathcenter.oxford.emory.edu/site/home/futurePages/rProjectCentralLimitTheorem
https://towardsdatascience.com/q-q-plots-explained-5aa8495426c0
https://www.investopedia.com/terms/n/normaldistribution.asp
https://www.r-project.org/
https://www.rstudio.com/products/rstudio/download/
https://www.udemy.com/user/sandeepkumar1/
Kampata, J. M., B. P. Parida, and D. B. Moalafhi. 2008. “Trend Analysis of Rainfall in the Headstreams of the Zambezi River Basin in Zambia.” Physics and Chemistry of the Earth 33(8–13): 621–25.
Pyrcz, M. J., & Deutsch, C. V. 2014. Geostatistical Reservoir Modeling. Oxford University Press.
Silva, Richarde Marques et al. 2015. “Rainfall and River Flow Trends Using Mann–Kendall and Sen’s Slope Estimator Statistical Tests in the Cobres River Basin.” Natural Hazards 77(2): 1205–21.
Vekaria, R. M., Shirley, D. G., Sévigny, J., & Unwin, R. J. 2006. “Immunolocalization of ectonucleotidases along the rat nephron.” American Journal of Physiology-Renal Physiology 290(2): F550–F560.
Vieira, Sidney Rosa, José Ruy Porto de Carvalho, Marcos Bacis Ceddia, and Antonio Paz González. 2010. “Detrending Non Stationary Data for Geostatistical Applications.” Bragantia 69(suppl): 01–08.
Wu, Chunfa et al. 2011. “Spatial Interpolation of Severely Skewed Data with Several Peak Values by the Approach Integrating Kriging and Triangular Irregular Network Interpolation.” Environmental Earth Sciences 63(5): 1093–1103.

BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 

Recently uploaded (20)

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 

A course work on R programming for basics to advance statistics and GIS.pdf

  • 1. A course work on R programming for basics to advance statistics and GIS
  • 2. SEEMAB AKHTAR
    PREFACE
    R has been around since 1995 and has become one of the most popular programming languages among data scientists around the world. It includes many packages and functions, which makes it an attractive language for data science. R is an open-source platform for data analysis, data wrangling, data visualization and machine learning. This course covers traditional to advanced statistics and GIS applications of R, such as models, graphs, descriptive statistics, mathematical trend modeling and spatial plots. R is designed to be a tool that helps scientists analyze data, and it has many excellent functions for making plots and fitting models to data. Because of this, many statisticians learn to use R as if it were a piece of software; they discover which functions accomplish what they need and ignore the rest.
    Speaker & Instructor
    SEEMAB AKHTAR
    M.Tech (Mineral Exploration), IIT (ISM) Dhanbad
    M.Sc. (Applied Geology), University of Allahabad
    Email: akhtariitdhn@gmail.com
    Social site: https://www.linkedin.com/in/seemab-akhtar-3b7856139/
    YouTube: https://www.youtube.com/c/KnowledgeEducationHub
    Specialization: Geostatistics, GIS & Groundwater resource management
    Experience: Six years' experience in the field of Geostatistics, GIS and Groundwater resource management
  • 3. A course work on R programming for basics to advance statistics and GIS
    Serial No. | Contents | Time
    1 | R and R studio installation, Packages | 10:00 AM to 10:30 AM
    2 | Part 1 Basic statistics by R programming | 10:30 AM to 12:00 PM
        1. The Fundamentals of Descriptive Statistics
        2. Box Plot, Bar Plot, Histogram Plot, QQ plot
        3. Measures of Central Tendency (Mean, Median & Mode)
        4.1 Skewed Plot
        4.2 Normal Distribution
        4.3 Standard Normal Distribution
        4.4 Central Limit Theorem
        4.5 Different Statistical Errors (Introduction)
        5. P value (Introduction)
        6. Regression Analysis
    3 | Part 2 Advanced Statistics by R Programming | 12:30 PM to 3:30 PM
        1.1 Mathematical Polynomial Trend Surface Identification
        1.2 Trend Removal
        2.1 Mann Kendall Test
        2.3 Sen's Slope
    4 | Part 3 GIS with R Programming | 4:00 PM to 5:30 PM
        1. Polygonal Shape File, Line Shape File, Point Shape File
        2. Clipping and Mask
        3. Spatial Plot, Level Plot
        4. Contouring
        5. Image Stacking and Regression, Pixel-Pixel, Box Plots over Raster Surfaces
  • 4. R and R Studio Installation
    First install R (version 4.1.0 or above), then install R studio, from these websites: https://www.r-project.org/ and https://www.rstudio.com/products/rstudio/download/
    Code for installing packages:
        install.packages("package name")
    Alternatively, open R studio, click on Install and type the package name in the dialog (figure 1).
    (Figure 1: IDE interface of R studio)
  • 5. Preliminary requirements for R programming
    • Laptop or PC (4 GB RAM)
    • Good internet connection, such as Jio 4G VoLTE
    • Code (this will be sent to all participants before the course starts)
    • Make a folder on your desktop and name it R
    • After installing R and R studio, open the R studio (Integrated Development Environment) interface
    • Set the working directory to the path of your folder (R):
        setwd("C:/Users/dell/Desktop/R")
    Install the following packages
    • raster
    • rasterVis
    • zoo
    • xts
    • ppcor
    • rts
    • rgdal
    • spatialEco
    • Kendall
    • readr
    • readxl
    • gstat
    • sp
    • lattice
    • ggplot2
    • rgeos
    • spacetime
    • RColorBrewer
    • latticeExtra
    • map
    • if(!require(psych)){install.packages("psych")}
    • if(!require(DescTools)){install.packages("DescTools")}
    • if(!require(Rmisc)){install.packages("Rmisc")}
    • if(!require(FSA)){install.packages("FSA")}
    • if(!require(plyr)){install.packages("plyr")}
    • if(!require(boot)){install.packages("boot")}
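    The `if(!require(...))` pattern above can be applied to the whole package list in one step. The following is a minimal sketch, not part of the original course code: the helper name `missing_pkgs` is hypothetical, and the package vector is only a subset of the list above. Note that rgdal and rgeos were archived on CRAN in 2023, so they may not install on recent setups.

```r
# Hypothetical helper: return the packages from `pkgs` that are not yet installed.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Subset of the course package list (extend as needed).
course_pkgs <- c("raster", "rasterVis", "zoo", "xts", "Kendall",
                 "readr", "readxl", "gstat", "sp", "lattice", "ggplot2")

to_install <- missing_pkgs(course_pkgs)
print(to_install)

# Uncomment to install whatever is missing (needs an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```

    Checking with `requireNamespace()` avoids attaching every package just to test whether it is installed.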
  • 6. Part 1 Basic statistics by R programming
    Descriptive Statistics
    (Source: Sandeep Kumar, https://www.udemy.com/user/sandeepkumar1/) (Source: Sandeep Kumar, https://www.udemy.com/user/sandeepkumar1/)
  • 7. (Source: Sandeep Kumar, https://www.udemy.com/user/sandeepkumar1/) (Source: Sandeep Kumar, https://www.udemy.com/user/sandeepkumar1/)
  • 8. (Source: Sandeep Kumar, https://www.udemy.com/user/sandeepkumar1/)
  • 9. Kurtosis
    Kurtosis is a measure of the tailedness of a distribution. Tailedness is how often outliers occur. Excess kurtosis is the tailedness of a distribution relative to a normal distribution.
    • Distributions with medium kurtosis (medium tails) are mesokurtic.
    • Distributions with low kurtosis (thin tails) are platykurtic.
    • Distributions with high kurtosis (fat tails) are leptokurtic.
    Tails are the tapering ends on either side of a distribution. They represent the probability or frequency of values that are extremely high or low compared to the mean. In other words, tails represent how often outliers occur. (Source: https://www.scribbr.com/statistics/kurtosis/)
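    The three kurtosis classes above can be illustrated numerically. This is a minimal base-R sketch under stated assumptions: the function name `excess_kurtosis` is hypothetical, it uses the simple moment-based estimator (fourth central moment divided by the squared variance, minus 3), and the simulated samples are illustrative only.

```r
# Moment-based excess kurtosis: m4 / m2^2 - 3 (0 for a normal distribution).
excess_kurtosis <- function(x) {
  m  <- mean(x)
  m2 <- mean((x - m)^2)   # second central moment
  m4 <- mean((x - m)^4)   # fourth central moment
  m4 / m2^2 - 3
}

set.seed(1)
k_normal <- excess_kurtosis(rnorm(1e5))        # close to 0: mesokurtic
k_flat   <- excess_kurtosis(runif(1e5))        # negative: platykurtic (thin tails)
k_heavy  <- excess_kurtosis(rt(1e5, df = 5))   # positive: leptokurtic (fat tails)

print(c(normal = k_normal, uniform = k_flat, t5 = k_heavy))
```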
  • 10. Box and Whisker Plots
    A box and whisker plot shows the five-number summary of a data set: minimum, lower quartile, median, upper quartile and maximum.
    QQ Plot
    Q-Q (quantile-quantile) plots are used in statistics to graphically analyze and compare two probability distributions by plotting their quantiles against each other. If the two distributions under consideration are exactly equal, the points on the Q-Q plot lie on the straight line y = x. As a data scientist or statistician, you often need to know whether a distribution is normal before applying various statistical measures to the data, and you want to see this in an easily understandable visual form, which is where the Q-Q plot comes in. Q-Q plots are used to check whether a random variable follows a given type of distribution, such as a Gaussian (normal), uniform or exponential distribution. (Source: https://towardsdatascience.com/q-q-plots-explained-5aa8495426c0)
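    Both plots described above are available in base R. The sketch below uses made-up data; `fivenum()` returns the five-number summary that the box plot draws, and `qqnorm()` with `qqline()` draws a normal Q-Q plot.

```r
# Illustrative data (not from the course material).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(x)
print(five)
boxplot(x, horizontal = TRUE, main = "Box and whisker plot")

# Normal Q-Q plot of a simulated normal sample.
set.seed(10)
y <- rnorm(200, mean = 50, sd = 5)
qq <- qqnorm(y, main = "Normal Q-Q plot")
qqline(y)

# For normal data the points fall close to a straight line,
# so sample and theoretical quantiles are almost perfectly correlated.
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```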
  • 12. Normal Distribution
    The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graphical form, the normal distribution appears as a "bell curve". (Source: https://www.investopedia.com/terms/n/normaldistribution.asp)
    Normality Test in R: A data set is consistent with a normal distribution if its skewness is close to zero and its kurtosis is close to 3. There are several methods in R for testing the normality of a data set:
    1. (Visual Method) Create a histogram.
       If the histogram is roughly "bell-shaped", then the data is assumed to be normally distributed.
    2. (Visual Method) Create a Q-Q plot.
       If the points in the plot roughly fall along a straight diagonal line, then the data is assumed to be normally distributed.
    3. (Formal Statistical Test) Perform a Shapiro-Wilk Test.
       If the p-value of the test is greater than α = .05, then the data is assumed to be normally distributed.
    4. (Formal Statistical Test) Perform a Kolmogorov-Smirnov Test.
       If the p-value of the test is greater than α = .05, then the data is assumed to be normally distributed.
    5. (Formal Statistical Test) Perform a Jarque-Bera Normality Test.
       If the p-value of the test is greater than α = .05, then the data is assumed to be normally distributed.
  • 13. 6. (Formal Statistical Test) Perform an Anderson-Darling Test.
       An Anderson-Darling Test is a goodness of fit test that measures how well your data fit a specified distribution. The null hypothesis for the A-D test is that the data does follow a normal distribution. Thus, if the p-value for the test is below the significance level (common choices are 0.10, 0.05, and 0.01), we can reject the null hypothesis and conclude that we have sufficient evidence to say the data does not follow a normal distribution.
    7. (Formal Statistical Test) Perform a Chi-Square Goodness of Fit Test.
       The Chi-Square Test for Normality allows us to check whether or not the data follow an approximately normal distribution. To apply it to any data set, let your null hypothesis be that your data is sampled from a normal distribution and apply the Chi-Square Goodness of Fit Test. Given your mean and standard deviation, calculate the expected values $E_i$ under the normal distribution and compare them with the observed values $O_i$. Then use the formula
       $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$
       to find the chi-square statistic. Compare this to the critical chi-square value from a chi-square table, given your degrees of freedom and desired alpha level. If your chi-square statistic is larger than the table value, you may conclude your data is not normal.
    Reasons for the Non Normal Distribution
    1. Outliers can cause your data to become skewed. The mean is especially sensitive to outliers. Try removing any extreme high or low values and testing your data again.
    2. Multiple distributions may be combined in your data, giving the appearance of a bimodal or multimodal distribution. For example, two sets of normally distributed test results can be combined to give the appearance of bimodal data.
    3. Insufficient data can cause a normal distribution to look completely scattered.
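    Several of the normality checks listed above can be run with base R alone. This is a sketch under stated assumptions: the data are simulated, `jarque_bera` is a hypothetical hand-written helper (base R has no Jarque-Bera function), and the Kolmogorov-Smirnov p-value is only approximate here because the mean and standard deviation are estimated from the same sample. The Anderson-Darling test needs an add-on package and is left out.

```r
set.seed(42)
x_norm <- rnorm(300, mean = 10, sd = 2)   # simulated normal data
x_skew <- rexp(300, rate = 1)             # simulated right-skewed data

# Visual methods: histogram and Q-Q plot.
hist(x_norm, main = "Histogram of simulated normal data")
qqnorm(x_skew); qqline(x_skew)

# Shapiro-Wilk test (H0: data are normally distributed).
sw_norm <- shapiro.test(x_norm)$p.value
sw_skew <- shapiro.test(x_skew)$p.value

# Kolmogorov-Smirnov test against a normal with the sample mean and sd.
ks_skew <- ks.test(x_skew, "pnorm", mean = mean(x_skew), sd = sd(x_skew))$p.value

# Hypothetical helper: Jarque-Bera statistic from sample skewness and kurtosis.
jarque_bera <- function(x) {
  n  <- length(x)
  m  <- mean(x)
  s2 <- mean((x - m)^2)
  skew <- mean((x - m)^3) / s2^1.5
  kurt <- mean((x - m)^4) / s2^2
  jb <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
  c(statistic = jb, p_value = pchisq(jb, df = 2, lower.tail = FALSE))
}
jb_skew <- jarque_bera(x_skew)

print(c(shapiro_normal = sw_norm, shapiro_skewed = sw_skew,
        ks_skewed = ks_skew, jb_skewed = unname(jb_skew["p_value"])))
```

    A p-value below α = 0.05 for the skewed sample leads to rejecting normality, matching the decision rule in the list above.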
  • 14. Dealing with Non Normal Distributions
    You have several options for handling non normal data. Many tests, including the one sample Z test, T test and ANOVA, assume normality. You may still be able to run these tests if your sample size is large enough (usually over 20 items). You can also choose to transform the data with a function, forcing it to fit a normal model. However, if you have a very small sample, a sample that is skewed or one that naturally fits another distribution type, you may want to run a non-parametric test. A non-parametric test is one that doesn't assume the data fits a specific distribution type. Non-parametric tests include the Wilcoxon signed rank test, the Mann-Whitney U Test and the Kruskal-Wallis test. (Source: https://www.statisticshowto.com/probability-and-statistics/non-normal-distributions/)
    Standard Normal Distribution
    The standard normal distribution, also called the z-distribution, is a special normal distribution where the mean is 0 and the standard deviation is 1. Any normal distribution can be standardized by converting its values into z-scores. Z-scores tell you how many standard deviations from the mean each value lies. (Source: https://www.scribbr.com/statistics/standard-normal-distribution/)
    The probability density function for the normal distribution having mean μ and standard deviation σ is given by the function
    $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
    If we let the mean μ = 0 and the standard deviation σ = 1 in this probability density function, we get the probability density function for the standard normal distribution
    $\varphi(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right)$
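    The standardization and the density function described above can be checked in a few lines of base R. This is a minimal sketch: `normal_pdf` is a hypothetical helper that writes out the density formula by hand so it can be compared with R's built-in `dnorm()`, and the data are made up.

```r
# Hand-written normal density, for comparison with dnorm().
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

grid <- seq(-3, 3, by = 0.5)
same_as_dnorm <- isTRUE(all.equal(normal_pdf(grid, mu = 1, sigma = 2),
                                  dnorm(grid, mean = 1, sd = 2)))
print(same_as_dnorm)

# Converting values to z-scores: subtract the mean, divide by the standard deviation.
x <- c(12, 15, 9, 20, 14, 11, 17)   # illustrative values
z <- (x - mean(x)) / sd(x)
print(round(z, 3))

# After standardization the mean is 0 and the standard deviation is 1.
print(c(mean = mean(z), sd = sd(z)))
```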
  • 15. 68%-95%-99.7% Rule
    The 68%-95%-99.7% rule is a rule of thumb that allows practitioners of statistics to estimate the probability that a randomly selected number from the standard normal distribution falls within 1, 2, and 3 standard deviations of the mean at zero. (Source: https://mse.redwoods.edu/darnold/math15/UsingRInStatistics/StandardNormal.php)
    Central Limit Theorem
    The Central Limit Theorem tells us that the distribution of sample means $\bar{x}$, of samples of size n taken from any given population with finite variance:
    1. Becomes more "normal" in shape as n increases;
    2. Has a mean that agrees with the population mean, μ; and
    3. Has a standard deviation equal to $\sigma/\sqrt{n}$, where σ is the standard deviation of the population.
    (Source: http://mathcenter.oxford.emory.edu/site/home/futurePages/rProjectCentralLimitTheorem)
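    Both the 68%-95%-99.7% rule and the Central Limit Theorem stated above can be checked numerically in base R. The sketch below is illustrative only: the exponential population, the sample size of 30 and the 5000 repetitions are assumptions chosen for the demonstration.

```r
# 68%-95%-99.7% rule: probability within k standard deviations of the mean.
within_k <- sapply(1:3, function(k) pnorm(k) - pnorm(-k))
print(round(within_k, 4))   # 0.6827 0.9545 0.9973

# Central Limit Theorem by simulation.
# Population: exponential with rate 1, so mu = 1 and sigma = 1 (strongly skewed).
set.seed(2024)
n    <- 30      # sample size
reps <- 5000    # number of repeated samples
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

hist(sample_means, breaks = 40,
     main = "Sample means, n = 30 (roughly bell-shaped)")

# Mean of the sample means is close to mu; their sd is close to sigma / sqrt(n).
print(c(mean_of_means = mean(sample_means),
        sd_of_means   = sd(sample_means),
        sigma_over_sqrt_n = 1 / sqrt(n)))
```

    Repeating the simulation with a larger n should give a narrower and more symmetric histogram.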
  • 16. In this project, we will construct a population, and then approximate the distributions of sample means for various sample sizes through repeated sampling, so that we can "see" this theorem in action through a sequence of histograms, as suggested by the graphic below.
    Types of Errors in Statistics
    There are two types of error in statistics: type I and type II. In a statistical test, a type I error is the rejection of a true null hypothesis. In contrast, a type II error is the failure to reject a false null hypothesis. Much of statistical methodology revolves around reducing one or both kinds of error, although completely eliminating either of them is impossible. By choosing a suitable threshold value and adjusting the alpha level, the quality of the hypothesis test can be improved. Knowledge of type I and type II errors is used in biometrics, medical science and computer science. (Source: https://statanalytica.com/blog/types-of-error-in-statistics/)
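    The two error types described above can be estimated by simulation. This is a sketch under stated assumptions: a one-sample t-test at α = 0.05 with n = 20, a true mean of 0 for the type I case and a true mean of 0.5 for the type II case; all of these numbers are illustrative choices.

```r
set.seed(123)
alpha <- 0.05
n     <- 20
reps  <- 2000

# Type I error: H0 (mean = 0) is true, but the test rejects it.
p_null <- replicate(reps, t.test(rnorm(n, mean = 0), mu = 0)$p.value)
type1_rate <- mean(p_null < alpha)

# Type II error: H0 is false (true mean = 0.5), but the test fails to reject it.
p_alt <- replicate(reps, t.test(rnorm(n, mean = 0.5), mu = 0)$p.value)
type2_rate <- mean(p_alt >= alpha)

print(c(type_I = type1_rate, type_II = type2_rate, power = 1 - type2_rate))
```

    The type I rate should land near the chosen alpha level, while the type II rate depends on the sample size and on how far the true mean is from the hypothesized one.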
  • 17. (Source: Sandeep Kumar, https://www.udemy.com/user/sandeepkumar1/)
  • 18. (Source: Sandeep Kumar, https://www.udemy.com/user/sandeepkumar1/) (Source: https://towardsdatascience.com/q-q-plots-explained-5aa8495426c0)
  • 19. Part 2 Advanced Statistics by R Programming
    Trend Surface Analysis
    Trend surface analysis attempts to decompose each observation on a spatially distributed variable into a component associated with any regional trends present in the data and a component associated with purely local effects. This separation into two components is accomplished by fitting a best-fit surface using standard regression techniques. The values predicted by this trend surface are assigned to the regional effects, whereas the local departures of the observed data from it, the residuals, are assigned to the local effects.
    In order to plot values on a map the geographer needs three pieces of information: the x, y spatial co-ordinates of each point together with the height above some datum, the z co-ordinate. The z values might relate to any variable, but the whole operation defines a spatial series in which the z observations are ordered with respect to the two spatial co-ordinates, x and y. The map would be completed by drawing lines of equal z value (contours or isolines) through the points. The resulting contour-type map defines a complex surface which in most cases will reveal a spatial structure or form.
    A trend surface analysis assumes that each mapped value can be decomposed into two components that arise from two scales of process:
    A) The trend. According to Krumbein and Graybill (1965) this trend is 'associated with "large scale" systematic changes that extend from one map edge to the other'. Similarly, Grant (1961) defines trend as '. . . that part of the data that varies smoothly. In other words, it is the function that behaves predictably'.
    B) The combined result of two processes that operate over an area substantially smaller than the study area: random fluctuations and errors of measurement. This forms an assumed error, local component, or residual, defined by Krumbein and Graybill as '. . . apparently non-systematic fluctuations that are superimposed on the large scale patterns'. It is important to notice that at the scale of observation these residuals appear to be spatially random; they may prove to be systematically related to a spatial process, but at this scale they do not vary systematically over the mapped area.
  • 20. Mathematically:
    observed value of surface at a point = trend component at that point + residual at that point
    If component (A), the trend, varies smoothly over space, its value (height) at any particular point can be expressed in terms of the spatial co-ordinates of that point, so that the basic equation of any trend analysis becomes
    $Z_i = f(X_i, Y_i) + U_i$
    $Z_i$ = the observed value of the surface at the ith point.
    $X_i$ = the co-ordinate on the x-axis (easting) of the ith data point.
    $Y_i$ = the co-ordinate on the y-axis (northing) of the ith data point.
    $U_i$ = the residual at the ith data point.
    f denotes "some function", and thus the term $f(X_i, Y_i)$ indicates the trend component. By function we simply mean that if we know the location of any point i as a pair of spatial co-ordinates $X_i$ and $Y_i$, then the height of the trend component at this point can be found by simply plugging these $X_i$ and $Y_i$ into a known equation (or function). It follows that we can calculate a trend component for any combination of X and Y, so that $f(X_i, Y_i)$ denotes a complete surface of trend components, called the trend surface.
    Trend surface analysis is a fundamental concept in geostatistics, which deals mathematically with spatial information that exhibits areas of high values and other areas of low values. The concept of a geographical trend, which is tied to the position of the spatial data, then arises because geostatistical estimation assumes stationarity, and locations far away from the data would be estimated with the global mean (simple kriging) (S R Vieira et al., 2009). The order of the stationarity hypothesis depends on the order of the statistical moments that are required to be stationary. Thus, when second order stationarity is needed, at least the first two moments (mean and variance) should be stationary.
    A collection of n values $Z(X_i)$ is said to be intrinsic if it follows equations 1 and 2. First, the expected value exists and does not depend on the geographic location $X_i$:
    $E\{Z(X_i)\} = m$    (1)
  • 21. Second, the variance of the increment $[Z(X_i) - Z(X_i + h)]$ is finite and does not depend on the geographic location $X_i$. This condition can be written in the form of an equation as:
    $\mathrm{Var}[Z(X_i) - Z(X_i + h)] = E\{[Z(X_i) - Z(X_i + h)]^2\}$    (2)
    where m is the mean and h is the increment (separation) in position from $X_i$.
    Order of the model
    The order of the polynomial model is kept as low as possible. Some transformations can be used to keep the model first order. If this is not satisfactory, then a second-order polynomial is tried. Arbitrary fitting of higher-order polynomials can be a serious abuse of regression analysis. A model which is consistent with the knowledge of the data and its environment should be used. It is always possible for a polynomial of order n − 1 to pass through n points, so a polynomial of sufficiently high degree can always be found that provides a "good" fit to the data. Such models neither enhance the understanding of the unknown function nor serve as good predictors.
    First Degree Polynomial Function
    First degree polynomials have terms with a maximum degree of 1. In other words, you wouldn't usually find any exponents in the terms of a first degree polynomial. For example, the following are first degree polynomials:
    • 2x + 1,
    • x + y + z + 50,
    • 10a + 4b + 20.
    Second Degree Polynomial Function
    Second degree polynomials have at least one second degree term in the expression (e.g. $2x^2$, $a^2$, $xy$). There are no higher terms (like $x^3$ or $abc^5$). The quadratic function $f(x) = ax^2 + bx + c$ is an example of a second degree polynomial.
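    The warning above about high-order polynomials can be demonstrated with a small example. This is a minimal sketch with made-up data: five points that lie close to a straight line are fitted with a first-order model and with a polynomial of order n − 1 = 4.

```r
# Five illustrative points lying close to the line y = 2x.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit_line <- lm(y ~ x)            # first-order model
fit_full <- lm(y ~ poly(x, 4))   # order n - 1 = 4: passes through every point

# The high-order fit has (numerically) zero residuals ...
max_res_line <- max(abs(residuals(fit_line)))
max_res_full <- max(abs(residuals(fit_full)))
print(c(line = max_res_line, order4 = max_res_full))

# ... but the two models can disagree strongly just outside the data range.
new_point <- data.frame(x = 6)
print(c(line   = unname(predict(fit_line, new_point)),
        order4 = unname(predict(fit_full, new_point))))
```

    A perfect fit to the observed points is therefore not evidence of a good model, which is why the lowest adequate order is preferred.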
  • 22. General Form of Nth Degree Polynomial
    The general form used to represent an nth degree polynomial is
    $P(x) = a_n x^n + a_{n-1} x^{n-1} + a_{n-2} x^{n-2} + \dots + a_1 x + a_0$
    Here, $a_0, a_1, a_2, \dots, a_n$ are the coefficients, which take numerical values, x is the variable, and n is the degree of the polynomial, which is a whole number.
    Mathematical Polynomial Surface Trend Identification
    (Figure. Source: David J. Unwin, 1978)
    Third Degree Polynomial Function
    A cubic function (or third-degree polynomial) can be written as:
    $f(x) = ax^3 + bx^2 + cx + d$
    where a, b, c, and d are constant terms, and a is nonzero.
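    Trend surface identification and trend removal, as described in this part, can be sketched with ordinary regression in base R. The example below is an assumption-laden illustration, not the course's own code: the data are simulated with a known first-order trend, and the column names `x`, `y`, `z` are placeholders for easting, northing and the mapped variable.

```r
set.seed(1)
n <- 200
d <- data.frame(x = runif(n, 0, 100),    # easting (simulated)
                y = runif(n, 0, 100))    # northing (simulated)
# Simulated surface: linear regional trend plus local random residual.
d$z <- 10 + 0.5 * d$x - 0.2 * d$y + rnorm(n, sd = 2)

# First-order (planar) and second-order (quadratic) trend surfaces.
fit1 <- lm(z ~ x + y, data = d)
fit2 <- lm(z ~ x + y + I(x^2) + I(x * y) + I(y^2), data = d)

# Is the extra complexity of the second-order surface justified?
print(anova(fit1, fit2))

# Trend removal: observed value = trend component + residual.
d$trend    <- fitted(fit1)
d$residual <- residuals(fit1)
print(round(coef(fit1), 3))
print(head(d))
```

    The `residual` column is the detrended variable that would be passed on to later geostatistical steps such as variogram modeling.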
  • 24. Mann Kendall Test on Time Series Data
    The Mann Kendall Trend Test (sometimes called the M-K test) is used to analyze data collected over time for consistently increasing or decreasing (monotonic) trends in Y values. It is a non-parametric test, which means it works for all distributions (i.e. your data doesn't have to meet the assumption of normality), but your data should have no serial correlation.
    Sen's slope is a method for estimating the slope of the trend in a sample of N pairs of data. It assumes a linear model Y(t), which can be described as Y(t) = Mt + C, where M is the slope and C is a constant. The slopes of all data pairs are calculated as:
    $M_i = \frac{X_j - X_k}{j - k}$
    where i = 1, 2, 3, ..., N and j > k. The N values of $M_i$ are ranked from the smallest to the largest, and their median is called Sen's slope. The confidence interval about the slope (Gilbert 1987) can be calculated as:
    $C.I._{\alpha} = Z_{1-\alpha/2} \sqrt{\mathrm{Var}(S)}$
    Var(S) is defined in equation (3), and $Z_{1-\alpha/2}$ is obtained from the standard normal distribution table.
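    The Mann Kendall statistic and Sen's slope described above can be computed in base R without extra packages. This is a minimal sketch under stated assumptions: the function names `mann_kendall` and `sens_slope` are hypothetical, the variance formula n(n − 1)(2n + 5)/18 assumes no tied values, and the short series is made up. The Kendall package from the course list offers a ready-made test.

```r
# Hypothetical helper: Mann Kendall S statistic, normal approximation, no ties.
mann_kendall <- function(x) {
  n <- length(x)
  d <- outer(x, x, "-")          # d[j, k] = x[j] - x[k]
  later <- lower.tri(d)          # TRUE where j > k
  S <- sum(sign(d[later]))
  var_S <- n * (n - 1) * (2 * n + 5) / 18
  Z <- if (S > 0) (S - 1) / sqrt(var_S) else if (S < 0) (S + 1) / sqrt(var_S) else 0
  list(S = S, var_S = var_S, Z = Z, p_value = 2 * pnorm(-abs(Z)))
}

# Hypothetical helper: Sen's slope = median of all pairwise slopes with j > k.
sens_slope <- function(x, t = seq_along(x)) {
  dx <- outer(x, x, "-")
  dt <- outer(t, t, "-")
  later <- lower.tri(dx)
  median(dx[later] / dt[later])
}

# Illustrative yearly series with a steady increase.
x <- c(2, 4, 5, 7, 9, 12)
mk <- mann_kendall(x)
slope <- sens_slope(x)
print(mk)
print(slope)
```

    For this strictly increasing series every one of the 15 pairs contributes +1, so S equals 15, and the median pairwise slope is 2 units per time step.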
  • 25. Part 3 GIS with R Programming
    Polygonal Shape File, Line Shape File, Point Shape File & Contouring
    In this section, we will open and plot point, line, contour and polygon vector data stored in shape file format in R.
    Clipping and Crop
    This tutorial will guide you through a step-by-step process to mask and crop a raster using a shape file in R. (Source: https://desktop.arcgis.com/en/arcmap/10.3/tools/analysis-toolbox/clip.htm)
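    The difference between cropping and masking can be seen on a small synthetic raster. This sketch assumes the raster package from the course list is installed; the raster and the rectangular polygon are generated in memory as stand-ins, whereas in practice the polygon would be read from a shape file.

```r
library(raster)

# Synthetic 10 x 10 raster covering x = 0..10, y = 0..10 (stand-in for real data).
r <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
values(r) <- 1:100

# Rectangular area of interest, used both as an extent and as a polygon.
e <- extent(2, 6, 3, 8)
area_poly <- as(e, "SpatialPolygons")
crs(area_poly) <- crs(r)

# Crop: cut the raster down to the bounding box of the area (smaller grid).
r_crop <- crop(r, e)

# Mask: keep the original grid, but set cells outside the polygon to NA.
r_mask <- mask(r, area_poly)

print(dim(r_crop))                      # rows, columns, layers after cropping
print(sum(!is.na(values(r_mask))))      # cells that survive the mask

plot(r_mask, main = "Masked raster")
plot(area_poly, add = TRUE)
```

    Cropping first and masking second is a common order, since it keeps the masked raster small.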
  • 26. Spatial Plot, Level Plot
    This part gives an introduction to analyzing spatial data in R, specifically through map-making with R's 'base' graphics and various dedicated map-making packages. It teaches the basics of using R as a fast, user-friendly and extremely powerful command-line Geographic Information System (GIS).
    Image Stacking and Regression Pixel to Pixel and Box Plots over Raster
    In this part we will learn about image stacking of different raster bands, pixel-to-pixel regression and box plots of raster surfaces.
    Image stacking of raster surfaces
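    Image stacking, pixel-to-pixel regression and box plots can be sketched on two synthetic bands. The example assumes the raster package is installed; the two layers and the names `band1` and `band2` are made up for illustration, with the second band constructed as a noisy linear function of the first.

```r
library(raster)

set.seed(7)
# Two synthetic bands on the same 10 x 10 grid (stand-ins for real imagery).
band1 <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
values(band1) <- runif(ncell(band1))
band2 <- setValues(band1, 2 * values(band1) + 1 + rnorm(ncell(band1), sd = 0.05))

# Image stacking: combine the layers into one multi-layer object.
s <- stack(band1, band2)
names(s) <- c("band1", "band2")

# Pixel-to-pixel regression: one row per pixel, one column per band.
pix <- as.data.frame(values(s))
fit <- lm(band2 ~ band1, data = pix)
print(round(coef(fit), 3))

# Box plots of the pixel values of each band.
boxplot(pix, main = "Pixel values per band")
```

    With real imagery the bands must share the same extent and resolution before they can be stacked.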
  • 27. References
    Drapela K. & Drapelova I. 2011. "Application of Mann-Kendall test and Sen's slope estimates for trend detection in deposition data from Biky Kriz, Mendelova Univerzita V Brne, Beskydy." Vol 4(2), pp 133-146.
    Gilbert, R. O. 1988. "Statistical Methods for Environmental Pollution Monitoring." Biometrics 44(1): 319.
    https://desktop.arcgis.com/en/arcmap/10.3/tools/analysis-toolbox/clip.htm
    https://statanalytica.com/blog/types-of-error-in-statistics/
    http://mathcenter.oxford.emory.edu/site/home/futurePages/rProjectCentralLimitTheorem
    https://towardsdatascience.com/q-q-plots-explained-5aa8495426c0
    https://www.investopedia.com/terms/n/normaldistribution.asp
    https://www.r-project.org/
    https://www.rstudio.com/products/rstudio/download/
    https://www.udemy.com/user/sandeepkumar1/
    Kampata, J. M., B. P. Parida, and D. B. Moalafhi. 2008. "Trend Analysis of Rainfall in the Headstreams of the Zambezi River Basin in Zambia." Physics and Chemistry of the Earth 33(8–13): 621–25.
    Pyrcz, M. J., & Deutsch, C. V. (2014). Geostatistical reservoir modeling. Oxford University Press.
    Silva, Richarde Marques et al. 2015. "Rainfall and River Flow Trends Using Mann–Kendall and Sen's Slope Estimator Statistical Tests in the Cobres River Basin." Natural Hazards 77(2): 1205–21.
    Vekaria, R. M., Shirley, D. G., Sévigny, J., & Unwin, R. J. (2006). Immunolocalization of ectonucleotidases along the rat nephron. American Journal of Physiology-Renal Physiology, 290(2), F550-F560.
    Vieira, Sidney Rosa, José Ruy Porto de Carvalho, Marcos Bacis Ceddia, and Antonio Paz González. 2010. "Detrending Non Stationary Data for Geostatistical Applications." Bragantia 69(suppl): 01–08.
    Wu, Chunfa et al. 2011. "Spatial Interpolation of Severely Skewed Data with Several Peak Values by the Approach Integrating Kriging and Triangular Irregular Network Interpolation." Environmental Earth Sciences 63(5): 1093–1103.