Statistical Analysis with R -I

Statistical Analysis-
Part 1
USING R
©Akhila Prabhakaran

Statistical Analysis
Using R
Data collection, presentation
and visuals
Measures of central tendency,
dispersion and correlation

What is Statistical Analysis?
• Collection
• Examination
• Summarization
• Manipulation
• Interpretation
Of What? Data
Why? To discover its underlying causes, patterns,
relationships, and trends.

Data Collection: Sources of Data
Countless Data Sources
Static:
Tabulated data, such as databases and Excel sheets
Files, which may hold organized or free-form text
JSON, XML, CSV, etc.
Dynamic Data:
Web click-stream data, computer network monitoring data,
telecommunication connection data, readings from sensor nets
and stock quotes

Data Collection: R
Static Data
read.csv
read.table
Dynamic Data
stream: data handling, plotting and easy scripting
rstream: random numbers are typically created as streams
quantmod: financial data can be obtained; intra-day price and trading volume can be considered a data stream
streamR and twitteR: interfaces to retrieve live Twitter feeds

Reading Data in
R
Basic R function calls
getwd() # get current working directory
setwd("<new path>") # set working directory
setwd("C:/MyDoc")
?read.table
?read.csv
Reading from an Excel File
library(gdata) # load gdata package
help(read.xls) # documentation
mydata = read.xls("mydata.xls") # read from first sheet
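As a minimal sketch of the static-data readers named above, the following writes a small made-up table to a temporary file and reads it back. The file and its columns (`player`, `runs`) are hypothetical and are not part of the slides.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("player,runs", "A,36", "B,10", "C,53"), tmp)

# read.csv assumes a header row and comma-separated fields
scores <- read.csv(tmp)

# read.table is the general reader; header and separator must be spelled out
scores2 <- read.table(tmp, header = TRUE, sep = ",")

str(scores)    # structure of the imported data frame
head(scores2)  # first rows, read the other way
```

Both calls return a data frame; `read.csv` is simply `read.table` with defaults suited to comma-separated files.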

Reading Data in
R
library(data.table)
mydat <- fread('http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat')
head(mydat)
V1 V2 V3 V4 V5
1: 1 307 930 36.58 0
2: 2 307 940 36.73 0
3: 3 307 950 36.93 0
4: 4 307 1000 37.15 0
5: 5 307 1010 37.23 0
6: 6 307 1020 37.24 0
nrow(mydat)
ncol(mydat)
colnames(mydat)
str(mydat)
summary(mydat)

What to
analyze?
Summary Statistics
How many ODIs, Tests and T20s has the player played?
How many innings?
What were his ODI, Test and T20 averages?
How consistent was his performance?

Definitions
A variable is a symbol (A, B, x, y, etc.) that can take on any of a specified set of values.
When the value of a variable is the outcome of a statistical experiment, that variable is a random variable.
Sample Space = set of all possible outcomes of an experiment.
Event = a subset of the Sample Space (example: in a coin toss with Sample Space {heads, tails}, "heads" is an event).
Generally, statisticians use a capital letter to represent a random variable and a lower-case letter to represent one of its values.
For example,
X represents the random variable X.
P(X) represents the probability of X.
P(X = x) refers to the probability that the random variable X is equal to a particular value, denoted by x. As an example, P(X = 1)
refers to the probability that the random variable X is equal to 1.
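The notation P(X = 1) can be made concrete with a simulated coin toss. This is only an illustrative sketch: the coding of heads as 1 and tails as 0, the seed and the number of tosses are all assumptions chosen for the example.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# Sample space of a single toss: X = 1 for heads, X = 0 for tails
sample_space <- c(0, 1)

# Simulate 10,000 tosses of a fair coin
x <- sample(sample_space, size = 10000, replace = TRUE)

# Relative frequency of the event {X = 1}, an estimate of P(X = 1)
p_hat <- mean(x == 1)
p_hat
```

For a fair coin the estimate should be close to 0.5, and it gets closer as the number of tosses grows.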

Visualizing Data in R
install.packages("devtools")
devtools::install_github("ropenscilabs/cricinfo")
library(cricketdata)
Sachin <- fetch_player_data(35320, "ODI", "batting")
> str(Sachin)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 463 obs. of 13 variables:
$ Start_Date: Date, format: "1989-12-18" "1990-03-01" "1990-03-06" "1990-04-25" ...
$ Innings : int 2 2 1 1 2 2 2 1 2 1 ...
$ Opposition: chr "Pakistan" "New Zealand" "New Zealand" "Sri Lanka" ...
$ Ground : chr "Gujranwala" "Dunedin" "Wellington" "Sharjah" ...
$ Runs : num 0 0 36 10 20 19 31 36 53 30 ...
$ Mins : chr NA "2" "51" NA ...
$ BF : chr "2" "2" "39" "12" ...
$ X4s : chr "0" "0" "5" "0" ...
$ X6s : chr "0" "0" "0" "0" ...
$ SR : chr "0.00" "0.00" "92.30" "83.33" ...
$ Pos : chr "5" "5" "6" "5" ...
$ Dismissal : chr "caught" "caught" "caught" "run out" ...
$ Inns : chr "2" "2" "1" "1" ...

Measures of central
tendency, dispersion
and correlation
Central Tendency
Describes what is typical of
a dataset, example: Mean
Mean: 44.8

Measures of Central Tendency
• Examining the raw data is an essential first step before proceeding to statistical analysis.
• Two key sample statistics that may be calculated from a dataset are a measure of the central tendency of the sample distribution and of the spread of the data about this central tendency.
• Inferential statistical analysis is dependent on a knowledge of these descriptive statistics.
• Different measures of central tendency attempt to determine what might variously be termed the typical, normal, expected or average value of a dataset. Three of them are in general use for most types of data: the mode, median, and mean.

Measures of central
tendency, dispersion and
correlation
The gold standard by which cricketers are
remembered - their average
For batsmen, this is the mean number of runs they
have scored per completed innings.
For bowlers the mean number of runs conceded per
wicket.
And these are the numbers most keenly studied by
students of the game, used to judge one player
against another, or to assess the “form” of an
individual player over the course of his career.
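The batting average described above (runs per completed innings) can be sketched as follows. The innings data are made up for illustration, and treating every dismissal other than "not out" as a completed innings is an assumption of this sketch.

```r
# Hypothetical innings: runs scored and how each innings ended
innings <- data.frame(
  runs      = c(36, 10, 53, 0, 120, 45),
  dismissal = c("caught", "run out", "not out", "bowled", "caught", "not out")
)

total_runs <- sum(innings$runs)                    # all runs count
completed  <- sum(innings$dismissal != "not out")  # only dismissals count as completed

batting_average <- total_runs / completed  # runs per completed innings
plain_mean      <- mean(innings$runs)      # runs per innings, completed or not

batting_average
plain_mean
```

Because not-out innings add runs without adding a dismissal, the batting average can be higher than the plain mean of the scores.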

How good is the Mean?
ADVANTAGES
The mean uses every value in the data and hence is a good representative of the data. Ironically, this value often does not appear in the raw data.
Repeated samples drawn from the same population tend to have similar means. The mean is therefore the measure of central tendency that best resists fluctuation between different samples.
It is closely related to the standard deviation, the most common measure of dispersion.
DISADVANTAGES
The main disadvantage of the mean is that it is sensitive to extreme values/outliers, especially when the sample size is small. It is not an appropriate measure of central tendency for skewed distributions.
The mean cannot be calculated for nominal or non-numerical ordinal data. Even though the mean can be calculated for numerical ordinal data, it often does not give a meaningful value, e.g. stage of cancer.
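The sensitivity of the mean to outliers in a small sample can be seen in a short sketch. The scores below are made up for illustration.

```r
scores       <- c(8, 20, 28, 31, 36)  # made-up small sample
with_outlier <- c(scores, 200)        # same sample plus one extreme value

mean_before   <- mean(scores)
mean_after    <- mean(with_outlier)
median_before <- median(scores)
median_after  <- median(with_outlier)

# The mean moves a long way; the median barely changes
c(mean_before = mean_before, mean_after = mean_after)
c(median_before = median_before, median_after = median_after)
```

A single extreme score more than doubles the mean here, while the median shifts by only 1.5 runs.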

Median
The value that occupies the middle position when all the observations are arranged in ascending or descending order.
It divides the frequency distribution exactly into two halves. The median is the 50th percentile.
The median is also known as the 'positional average'.
If the number of observations n is odd, the ((n + 1)/2)th observation (in the ordered set) is the median.
When n is even, the median is the mean of the (n/2)th and (n/2 + 1)th observations.
It is not distorted by outliers or skewed data.
It does not take into account the precise value of each observation and hence does not use all the information available in the data.
The median of a pooled group cannot be expressed in terms of the medians of the individual groups.
Categorical Variables
• The mean cannot be computed for categorical variables.
• The mode is probably the best way to represent the central tendency of these variables.
• Mode: the value that occurs most frequently in the data. A dataset can have more than one mode.
• The mode is not a good representative for small samples.
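Base R has no built-in function for the statistical mode (`mode()` reports an object's storage type), so one possible sketch uses `table()`. The helper name `modes` and the sample values are assumptions made for illustration.

```r
# Return every value that attains the highest count (there may be more than one)
modes <- function(x) {
  counts <- table(x)                     # frequency of each distinct value
  names(counts)[counts == max(counts)]   # keep the value(s) with the top count
}

# Made-up categorical data with a tie for the most frequent value
opposition <- c("Pakistan", "Sri Lanka", "Australia", "Sri Lanka", "Pakistan")

modes(opposition)
```

Here two categories tie for the highest count, so the function returns both of them.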

Dispersion
VARIABILITY/SPREAD IN THE
VALUES

Measurement of spread of data
(variability)
• The first step in assessing the spread of data is to examine it in either a table or an appropriate graphical form.
• A graph often makes clear any symmetry (or lack of it) in the spread of data, whether there are obvious atypical values (outliers) and whether the data is skewed in one direction or the other (a tendency for more values to fall in the upper or lower tail of the distribution).
• Range: highest value - lowest value
• Percentiles: if Q% of the data is <= x, then x is the Qth percentile
• Variance / standard deviation: the average squared deviation from the mean, and its square root
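The range and percentiles listed above map onto base R functions. This is a minimal sketch on made-up scores; note that `quantile()` interpolates between observations by default (type 7), so a percentile need not be one of the observed values.

```r
runs <- c(0, 4, 8, 15, 28, 36, 53, 63, 110, 200)  # made-up scores

# Range as a single number: highest value minus lowest value
spread <- diff(range(runs))
spread

# 25th, 50th and 75th percentiles (the three quartiles)
q <- quantile(runs, probs = c(0.25, 0.5, 0.75))
q

# The 50th percentile is the median
unname(q["50%"]) == median(runs)
```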

Percentile
Percentile      Meaning
1st Quartile    25% of values are at or below this
2nd Quartile    50% of values are at or below this (the median)
3rd Quartile    75% of values are at or below this
4th Quartile    100% of values are at or below this (the maximum)
> summary(Sachin$Runs)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00    8.00   28.50   40.77   63.00  200.00      11

Variance &
Standard Deviation
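• Variance: the average of the squared differences from the mean (R's var() divides by n - 1, giving the sample variance).
• SD = square root of the variance
> var(Sachin$Runs)                # NA, because Runs contains missing values
NA
> var(Sachin$Runs, na.rm=TRUE)    # drop missing values first
1603.16
> sd(Sachin$Runs)
NA
> sd(Sachin$Runs, na.rm=TRUE)
40.03948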
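To make the definition concrete, the variance can be computed by hand and compared with `var()` and `sd()`. The data are made up for illustration. The sketch shows both the plain average of squared differences (dividing by n) and the sample variance (dividing by n - 1), which is what R reports.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up values with mean 5
n <- length(x)

sq_diff <- (x - mean(x))^2      # squared differences from the mean

pop_var  <- sum(sq_diff) / n        # average squared difference
samp_var <- sum(sq_diff) / (n - 1)  # sample variance, as computed by var()
samp_sd  <- sqrt(samp_var)          # standard deviation = square root of variance

c(pop_var = pop_var, samp_var = samp_var, var = var(x))
c(samp_sd = samp_sd, sd = sd(x))
```

The two versions differ noticeably in small samples and converge as n grows.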

Covariance
A MEASURE OF THE JOINT
VARIABILITY OF
TWO RANDOM VARIABLES.
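As a sketch of what covariance measures, the sample covariance can be computed from its definition and checked against `cov()`. The two vectors are made-up paired observations, loosely styled as runs scored and balls faced.

```r
runs  <- c(36, 10, 53, 0, 120)  # made-up paired observations
balls <- c(39, 12, 60, 2, 110)
n <- length(runs)

# Sample covariance: average product of paired deviations, with n - 1 in the denominator
cov_by_hand <- sum((runs - mean(runs)) * (balls - mean(balls))) / (n - 1)

cov_by_hand
cov(runs, balls)  # built-in equivalent
```

A positive value indicates that the two variables tend to be above (or below) their means together. Its size depends on the units of both variables.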

Pearson
Correlation
Coefficient
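The Pearson correlation coefficient rescales the covariance by the two standard deviations, which gives a unit-free value between -1 and 1. A minimal sketch on made-up paired data:

```r
runs  <- c(36, 10, 53, 0, 120)  # made-up paired observations
balls <- c(39, 12, 60, 2, 110)

# Pearson r = covariance divided by the product of the standard deviations
r_by_hand <- cov(runs, balls) / (sd(runs) * sd(balls))

r_by_hand
cor(runs, balls)  # built-in equivalent; method = "pearson" is the default
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.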

Describing data by tables and graphs
Qualitative variables (categorical variables)
The number of observations that fall into a particular class (or category) of the qualitative variable is called the frequency (or count) of that class. A table listing all classes and their frequencies is called a frequency distribution.

Histograms
for
Frequency
Distributions
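A histogram of a frequency distribution can be drawn with base R's `hist()`. This is a minimal sketch on made-up scores; the bin width of 50 and the labels are arbitrary choices for the example.

```r
runs <- c(0, 4, 8, 15, 28, 36, 53, 63, 110, 200)  # made-up scores

# Draw the histogram with explicit bin edges and keep the returned object
h <- hist(runs,
          breaks = seq(0, 200, by = 50),
          main   = "Frequency distribution of runs",
          xlab   = "Runs")

h$breaks  # bin edges
h$counts  # number of observations in each bin
```

The object returned by `hist()` holds the same counts that the bars display, so it can also be used as a quick way to tabulate binned data.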

Describing data by tables and graphs
Relative Frequency Distribution
> as.data.frame(table(Sachin$Opposition)/nrow(Sachin))
Var1 Freq
1 Australia 0.153347732
2 Bangladesh 0.025917927
3 Bermuda 0.002159827
4 England 0.079913607
5 Ireland 0.004319654
6 Kenya 0.021598272
7 Namibia 0.002159827
8 Netherlands 0.004319654
9 New Zealand 0.090712743
10 Pakistan 0.149028078
11 South Africa 0.123110151
12 Sri Lanka 0.181425486
13 U.A.E. 0.004319654
14 West Indies 0.084233261
15 Zimbabwe 0.073434125
> as.data.frame(table(Sachin$Opposition))
Var1 Freq
1 Australia 71
2 Bangladesh 12
3 Bermuda 1
4 England 37
5 Ireland 2
6 Kenya 10
7 Namibia 1
8 Netherlands 2
9 New Zealand 42
10 Pakistan 69
11 South Africa 57
12 Sri Lanka 84
13 U.A.E. 2
14 West Indies 39
15 Zimbabwe 34
Frequency Distribution
The number of observations that fall into each class

Quantitative Variables
Frequency distribution (Binning)
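> runs <- Sachin$Runs[!is.na(Sachin$Runs)]  # drop innings with no recorded score
> range(runs)
[1] 0 200
> breaks <- seq(0, 200, by = 10)  # bin edges 0, 10, ..., 200
> breaks
[1] 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200
> # cut() builds intervals of the form (a, b], so scores of exactly 0 fall outside
> # the first bin; pass include.lowest = TRUE to count them
> cut(runs, breaks)
> as.data.frame(table(cut(runs, breaks)))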
Var1 Freq
1 (0,10] 112
2 (10,20] 56
3 (20,30] 47
4 (30,40] 46
5 (40,50] 28
6 (50,60] 21
7 (60,70] 29
8 (70,80] 9
9 (80,90] 18
10 (90,100] 23
11 (100,110] 8
12 (110,120] 12
13 (120,130] 7
14 (130,140] 5
15 (140,150] 6
16 (150,160] 1
17 (160,170] 1
18 (170,180] 1
19 (180,190] 1
20 (190,200] 1
Exercise: Find the relative frequency distribution of runs of a player

Cumulative frequency
Cumulative relative frequency
Exercise 2: Find the cumulative frequency distribution of
runs of a player (?cumsum)
Exercise 3: Find the relative cumulative frequency
distribution of runs of a player

Cumulative frequency
Cumulative relative frequency
breaks <- seq(0, 200, by = 10)  # bin edges 0, 10, ..., 200
breaks
#[1] 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200
cut(Sachin$Runs, breaks)  # assign each score to its bin
freq <- as.data.frame(table(cut(Sachin$Runs, breaks)))  # frequency per bin
runs.cumfreq <- cumsum(freq$Freq)  # cumulative frequency: running total of the bin counts
cbind(as.character(freq$Var1), runs.cumfreq)  # bins alongside their cumulative frequency
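One possible way to put frequency, relative frequency and their cumulative versions side by side is sketched below. The scores and the bin width of 50 are made up, and `include.lowest = TRUE` is used so that a score of 0 lands in the first bin.

```r
runs   <- c(0, 4, 8, 15, 28, 36, 53, 63, 110, 200)  # made-up scores
breaks <- seq(0, 200, by = 50)

# Frequency per bin; include.lowest = TRUE makes the first bin [0, 50]
freq <- table(cut(runs, breaks, include.lowest = TRUE))

rel_freq     <- prop.table(freq)  # share of observations in each bin
cum_freq     <- cumsum(freq)      # running total of counts
cum_rel_freq <- cumsum(rel_freq)  # running total of shares, ending at 1

dist <- data.frame(
  bin          = names(freq),
  freq         = as.vector(freq),
  rel_freq     = as.vector(rel_freq),
  cum_freq     = as.vector(cum_freq),
  cum_rel_freq = as.vector(cum_rel_freq)
)
dist
```

The last cumulative frequency equals the number of observations, and the last cumulative relative frequency equals 1, which makes a quick sanity check.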

Resources for R
https://www.cs.upc.edu/~robert/teaching/estadistica/rprogramming.pdf
https://www.r-bloggers.com/
https://www.r-bloggers.com/how-to-make-a-histogram-with-basic-r/
https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf
http://t-redactyl.io/blog/2015/12/creating-plots-in-r-using-ggplot2-part-1-line-plots.html

Data Visualization using ggplot2
library(ggplot2)
# charts.data is not defined on the slide; it is expected to be a data frame
# with columns year, product and export (one row per product per year)
colour <- c("#40b8d0", "#b2d183")  # one colour per product
p1 <- ggplot() +
  geom_line(aes(y = export, x = year, colour = product), size = 1.5,
            data = charts.data, stat = "identity") +
  theme(legend.position = "bottom", legend.direction = "horizontal",
        legend.title = element_blank()) +
  scale_x_continuous(breaks = seq(2006, 2014, 1)) +
  labs(x = "Year", y = "USD million") +
  ggtitle("Composition of Exports to China ($)") +
  scale_colour_manual(values = colour) +
  theme(axis.line = element_line(size = 1, colour = "black"),
        panel.grid.major = element_line(colour = "#d3d3d3"),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(), panel.background = element_blank()) +
  # each "+" must end a line, not start one, or R stops reading the expression early
  # the "Tahoma" font must be available to the graphics device
  theme(plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
        text = element_text(family = "Tahoma"),
        axis.text.x = element_text(colour = "black", size = 10),
        axis.text.y = element_text(colour = "black", size = 10),
        legend.key = element_rect(fill = "white", colour = "white"))
p1  # print the plot
