Statistical Analysis-
Part 1
USING R
©Akhila Prabhakaran
Statistical Analysis
Using R
Data collection, presentation
and visuals
Measures of central tendency,
dispersion and correlation
What is Statistical Analysis?
• Collection
• Examination
• Summarization
• Manipulation
• Interpretation
Of What? Data
Why? To discover its underlying causes, patterns,
relationships, and trends.
Data Collection: Sources of Data
Countless Data Sources
Static:
Tabulated Data like Database, Excel
Files- could be organized or random text
JSON, XML, CSV etc.
Dynamic Data:
Web click-stream data, computer network monitoring data,
telecommunication connection data, readings from sensor nets
and stock quotes
Data Collection: R
Static Data
read.csv
read.table
Dynamic Data
stream: data handling, plotting and easy scripting
rstream: random numbers are typically created as streams
quantmod: financial data can be obtained; intra-day price and trading volume can be
considered a data stream
streamR and twitteR provide interfaces to retrieve live Twitter feeds
Reading Data in
R
Basic R function calls
getwd() # get current working directory
setwd("<new path>") # set working directory
setwd("C:/MyDoc")
?read.table
?read.csv
Reading from an Excel File
library(gdata) # load gdata package
help(read.xls) # documentation
mydata = read.xls("mydata.xls") # read from first sheet
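The calls above can be tried without any external file. The following is a minimal sketch, assuming a small made-up table (the column names and values are hypothetical) written to a temporary CSV file; it reads the file with read.csv and again with the more general read.table.

```r
# Hypothetical sample data, written to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("player,format,runs",
             "A,ODI,36",
             "A,ODI,10",
             "B,Test,120"), csv_path)

# read.csv assumes a header row and comma-separated fields
scores <- read.csv(csv_path, stringsAsFactors = FALSE)
str(scores)

# read.table is the general form; the header and separator must be given explicitly
scores2 <- read.table(csv_path, header = TRUE, sep = ",", stringsAsFactors = FALSE)
all.equal(scores, scores2)
```

For this simple file, read.csv behaves like read.table with header = TRUE and sep = ",".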
Reading Data in
R
library(data.table)
mydat <- fread('http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat')
head(mydat)
V1 V2 V3 V4 V5
1: 1 307 930 36.58 0
2: 2 307 940 36.73 0
3: 3 307 950 36.93 0
4: 4 307 1000 37.15 0
5: 5 307 1010 37.23 0
6: 6 307 1020 37.24 0
nrow(mydat)
ncol(mydat)
colnames(mydat)
str(mydat)
summary(mydat)
What to
analyze?
Summary Statistics
How many ODIs, Tests and T20s has he played?
How many innings?
What was his ODI average, his Test average and his T20 average?
How consistent was his performance?
Definitions
A variable is a symbol (A, B, x, y, etc.) that can take on any of a specified set of values.
When the value of a variable is the outcome of a statistical experiment, that variable is a random variable.
Sample Space = set of all possible outcomes of an experiment.
Event = subset of the Sample Space (example: a coin toss landing heads).
Generally, statisticians use a capital letter to represent a random variable and a lower-case letter to represent one of its values.
For example,
X represents the random variable X.
P(X) represents the probability of X.
P(X = x) refers to the probability that the random variable X is equal to a particular value, denoted by x. As an example, P(X = 1)
refers to the probability that the random variable X is equal to 1.
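To make these definitions concrete, here is a small sketch for two tosses of a fair coin. The choice of experiment and the variable names are assumptions made for illustration; X is taken to be the number of heads.

```r
# Sample space for two tosses of a fair coin (all four outcomes equally likely)
sample_space <- expand.grid(toss1 = c("H", "T"), toss2 = c("H", "T"),
                            stringsAsFactors = FALSE)

# Random variable X: number of heads in each outcome
X <- rowSums(sample_space == "H")

# Event {X = 1}: the subset of outcomes with exactly one head
event <- sample_space[X == 1, ]

# P(X = 1) = number of outcomes in the event / number of outcomes in the sample space
p_x1 <- nrow(event) / nrow(sample_space)
p_x1   # 0.5
```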
Visualizing Data in R
install.packages("devtools")
devtools::install_github("ropenscilabs/cricinfo")
library(cricketdata)
Sachin <- fetch_player_data(35320, "ODI", "batting")
> str(Sachin)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 463 obs. of 13 variables:
$ Start_Date: Date, format: "1989-12-18" "1990-03-01" "1990-03-06" "1990-04-25" ...
$ Innings : int 2 2 1 1 2 2 2 1 2 1 ...
$ Opposition: chr "Pakistan" "New Zealand" "New Zealand" "Sri Lanka" ...
$ Ground : chr "Gujranwala" "Dunedin" "Wellington" "Sharjah" ...
$ Runs : num 0 0 36 10 20 19 31 36 53 30 ...
$ Mins : chr NA "2" "51" NA ...
$ BF : chr "2" "2" "39" "12" ...
$ X4s : chr "0" "0" "5" "0" ...
$ X6s : chr "0" "0" "0" "0" ...
$ SR : chr "0.00" "0.00" "92.30" "83.33" ...
$ Pos : chr "5" "5" "6" "5" ...
$ Dismissal : chr "caught" "caught" "caught" "run out" ...
$ Inns : chr "2" "2" "1" "1" ...
Measures of central
tendency, dispersion
and correlation
Central Tendency
Describes what is typical of
a dataset, example: Mean
Mean: 44.8
Measures of Central Tendency
• Examining the raw data is an essential first step before proceeding to statistical analysis.
• Two key sample statistics that may be calculated from a dataset are a measure of the central tendency of the sample distribution and a measure of the spread of the data about this central tendency.
• Inferential statistical analysis depends on a knowledge of these descriptive statistics.
• Different measures of central tendency attempt to determine what might variously be termed the typical, normal, expected or average value of a dataset. Three of them are in general use for most types of data: the mode, median, and mean.
Measures of central
tendency, dispersion and
correlation
The gold standard by which cricketers are
remembered - their average
For batsmen, this is the mean number of runs they
have scored per completed innings.
For bowlers the mean number of runs conceded per
wicket.
And these are the numbers most keenly studied by
students of the game, used to judge one player
against another, or to assess the “form” of an
individual player over the course of his career.
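The batting average described above divides by completed innings (dismissals), not by all innings, so it can differ from the plain mean of the scores. A minimal sketch with made-up innings (the numbers and variable names are hypothetical):

```r
# Hypothetical innings: runs scored and whether the batsman was dismissed
runs      <- c(45, 10, 72, 0, 33)
dismissed <- c(TRUE, TRUE, FALSE, TRUE, TRUE)   # FALSE = not out

# Plain mean over all innings: 160 / 5
mean_per_innings <- mean(runs)

# Batting average: total runs per completed innings: 160 / 4
batting_average <- sum(runs) / sum(dismissed)

mean_per_innings   # 32
batting_average    # 40
```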
How good is the Mean?
ADVANTAGES
• The mean uses every value in the data and hence is a good representative of the data. Ironically, this value often does not appear in the raw data.
• Repeated samples drawn from the same population tend to have similar means. The mean is therefore the measure of central tendency that best resists the fluctuation between different samples.
• It is closely related to standard deviation, the most common measure of dispersion.
DISADVANTAGES
• The main disadvantage of the mean is that it is sensitive to extreme values/outliers, especially when the sample size is small. It is not an appropriate measure of central tendency for skewed distributions.
• The mean cannot be calculated for nominal or non-numerical ordinal data. Even though the mean can be calculated for numerical ordinal data, it often does not give a meaningful value, e.g. stage of cancer.
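The sensitivity to outliers can be seen on a small made-up sample (the values are hypothetical): adding one extreme score moves the mean a long way, while the median barely changes.

```r
# Hypothetical scores without an extreme value
scores <- c(20, 25, 30, 35, 40)
mean(scores)     # 30
median(scores)   # 30

# The same scores plus one outlier
with_outlier <- c(scores, 200)
mean(with_outlier)     # 350 / 6, about 58.3
median(with_outlier)   # 32.5
```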
Median
The value which occupies the middle position when all the observations are arranged in ascending or descending order.
It divides the frequency distribution exactly into two halves. The median is the 50th percentile.
The median is also known as the 'positional average'.
If the number of observations n is odd, the ((n + 1)/2)th observation (in the ordered set) is the median.
When the number of observations is even, the median is the mean of the (n/2)th and (n/2 + 1)th observations.
It is not distorted by outliers/skewed data.
It does not take into account the precise value of each observation and hence does not use all information available in the data.
The median of a pooled group cannot be expressed in terms of the individual medians of the groups that were pooled.
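The positional rule can be written out directly and checked against R's built-in median(). This is a sketch; median_by_position is a hypothetical helper name and the sample vectors are made up.

```r
# Median via the positional rule: middle value for odd n,
# mean of the two middle values for even n
median_by_position <- function(x) {
  x <- sort(x)            # sort() also drops NA values
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]
  } else {
    mean(x[c(n / 2, n / 2 + 1)])
  }
}

odd_sample  <- c(7, 1, 5)      # ordered: 1 5 7
even_sample <- c(7, 1, 5, 3)   # ordered: 1 3 5 7

median_by_position(odd_sample)    # 5
median_by_position(even_sample)   # 4
```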
Categorical Variables
• The mean cannot be computed for categorical variables.
• The mode is probably the best way to represent the central tendency of these variables.
• Mode: the value that occurs most frequently in the data. There can be more than one mode.
• It is not a good representative for small samples.
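Base R has no function for the statistical mode (the built-in mode() reports the storage type of an object), so one is usually written with table(). A minimal sketch; stat_mode is a hypothetical helper name and the dismissal values are made up.

```r
# Return every value that attains the highest frequency (there may be ties)
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts) == max(counts)]
}

# Hypothetical categorical data
dismissals <- c("caught", "bowled", "caught", "run out", "lbw", "bowled")

table(dismissals)
stat_mode(dismissals)   # "bowled" "caught" (two modes)
```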
Dispersion
VARIABILITY/SPREAD IN THE
VALUES
Measurement of spread of data
(variability)
• The first step in assessing spread of data is to examine it in either a table or an appropriate graphical form.
• A graph often makes clear any symmetry (or lack of it) in the spread of data, whether there are obvious atypical values (outliers) and whether the data is skewed in one direction or the other (a tendency for more values to fall in the upper or lower tail of the distribution).
• Range: highest value - lowest value
• Percentiles: if Q% of the data is <= x, then x is the Qth percentile
• Variance / Standard Deviation
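Range and percentiles map onto the base R functions range() and quantile(). A short sketch on made-up scores (the values are hypothetical); note that quantile() interpolates between observations by default, so results on other data need not be actual data values.

```r
# Hypothetical scores
runs <- c(0, 8, 15, 28, 36, 53, 63, 99, 200)

range(runs)         # lowest and highest value: 0 200
diff(range(runs))   # range as a single number: 200

# 25th, 50th and 75th percentiles (the three quartiles)
quartiles <- quantile(runs, probs = c(0.25, 0.50, 0.75))
quartiles           # 15 36 63

# Interquartile range: spread of the middle half of the data
IQR(runs)           # 48
```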
Percentile
Percentile      Meaning
1st Quartile    25% of values are less than or equal to this
2nd Quartile    50% of values are less than or equal to this
3rd Quartile    75% of values are less than or equal to this
4th Quartile    100% of values are less than or equal to this
> summary(Sachin$Runs)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00    8.00   28.50   40.77   63.00  200.00      11
Variance &
Standard Deviation
• Variance: the average of the squared differences from the mean.
• SD = square root of variance
> var(Sachin$Runs)
NA
> var(Sachin$Runs, na.rm=TRUE)
1603.16
> sd(Sachin$Runs)
NA
> sd(Sachin$Runs, na.rm=TRUE)
40.03948
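One detail worth checking by hand: R's var() and sd() compute the sample variance, dividing the sum of squared differences by n - 1 and not by n. The sketch below uses a small made-up vector (hypothetical values) to show both versions and the effect of a missing value.

```r
# Hypothetical data with mean 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

squared_diffs  <- (x - mean(x))^2         # squared differences from the mean
population_var <- sum(squared_diffs) / n          # divide by n:     32 / 8 = 4
sample_var     <- sum(squared_diffs) / (n - 1)    # divide by n - 1: 32 / 7

all.equal(sample_var, var(x))        # var() uses the n - 1 form
all.equal(sqrt(sample_var), sd(x))   # sd() is the square root of var()

# A single NA makes the result NA unless it is removed
x_na <- c(x, NA)
var(x_na)                 # NA
var(x_na, na.rm = TRUE)   # same as var(x)
```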
Covariance
A measure of the joint variability of two random variables.
Pearson Correlation Coefficient
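As a sketch of the standard sample formulas: the covariance is the sum of (x_i - mean(x)) * (y_i - mean(y)) divided by n - 1, and the Pearson correlation coefficient is the covariance divided by the product of the two standard deviations. The vectors below are made up for illustration; the hand computation is compared with R's cov() and cor().

```r
# Hypothetical paired observations
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Sample covariance by hand: products of deviations, divided by n - 1
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)   # 6 / 4 = 1.5

# Pearson correlation: covariance scaled by both standard deviations
r_xy <- cov_xy / (sd(x) * sd(y))                         # about 0.775

all.equal(cov_xy, cov(x, y))
all.equal(r_xy, cor(x, y))   # cor() uses method = "pearson" by default
```

Because it is scaled by the standard deviations, the correlation always lies between -1 and 1, whereas the covariance depends on the units of the two variables.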
Describing data by tables and graphs
Qualitative variable (Category Variables)
The number of observations that fall into
particular class (or category) of the
qualitative variable is called the frequency
(or count) of that class. A table listing all
classes and their frequencies is called a
frequency distribution.
Histograms
for
Frequency
Distributions
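A histogram is the graphical form of a binned frequency distribution. The sketch below uses base R's hist() on made-up scores (the values and bin width are assumptions); the counts are computed first without drawing, and the plot is drawn only in an interactive session.

```r
# Hypothetical scores and bin edges of width 20
runs   <- c(0, 5, 12, 18, 25, 25, 37, 41, 58, 99)
breaks <- seq(0, 100, by = 20)

# plot = FALSE returns the bin counts without drawing anything
h <- hist(runs, breaks = breaks, plot = FALSE)
h$counts   # 4 3 2 0 1

# Draw the histogram when running interactively
if (interactive()) {
  plot(h, main = "Distribution of runs", xlab = "Runs", ylab = "Frequency")
}
```

Unlike cut(), hist() includes the lowest break in the first bin by default, so a score of 0 is counted.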
Describing data by tables and graphs
Relative Frequency Distribution
> as.data.frame(table(Sachin$Opposition)/nrow(Sachin))
Var1 Freq
1 Australia 0.153347732
2 Bangladesh 0.025917927
3 Bermuda 0.002159827
4 England 0.079913607
5 Ireland 0.004319654
6 Kenya 0.021598272
7 Namibia 0.002159827
8 Netherlands 0.004319654
9 New Zealand 0.090712743
10 Pakistan 0.149028078
11 South Africa 0.123110151
12 Sri Lanka 0.181425486
13 U.A.E. 0.004319654
14 West Indies 0.084233261
15 Zimbabwe 0.073434125
> as.data.frame(table(Sachin$Opposition))
Var1 Freq
1 Australia 71
2 Bangladesh 12
3 Bermuda 1
4 England 37
5 Ireland 2
6 Kenya 10
7 Namibia 1
8 Netherlands 2
9 New Zealand 42
10 Pakistan 69
11 South Africa 57
12 Sri Lanka 84
13 U.A.E. 2
14 West Indies 39
15 Zimbabwe 34
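The same two tables can be reproduced on a small made-up vector without downloading any data. This sketch assumes hypothetical opposition values and uses prop.table(), which divides each count by the total.

```r
# Hypothetical categorical data
opposition <- c("Australia", "Pakistan", "Australia", "Sri Lanka",
                "Pakistan", "Australia", "England", "Sri Lanka")

freq     <- table(opposition)   # frequency distribution
rel_freq <- prop.table(freq)    # relative frequency: count / total

data.frame(opposition = names(freq),
           freq       = as.vector(freq),
           rel_freq   = as.vector(rel_freq))
```

The relative frequencies always sum to 1, which is a quick check that no observation was dropped.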
Frequency Distribution
The number of observations
that fall into particular class
Quantitative Variables
Frequency distribution (Binning)
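> runs <- Sachin$Runs[!is.na(Sachin$Runs)]  # non-missing scores as a plain numeric vector
> range(runs)
[1] 0 200
> breaks = seq(0, 200, by=10)  # bin edges 0, 10, ..., 200
> breaks
[1] 0 10 20 30 40 50 60 70 80 90 100 110
120 130 140 150 160 170 180 190 200
> # cut() makes left-open bins by default, so scores of exactly 0 fall outside (0,10]
> # and are not counted below; pass include.lowest=TRUE to count them in the first bin
> cut(Sachin$Runs, breaks)
> as.data.frame(table(cut(Sachin$Runs, breaks)))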
Var1 Freq
1 (0,10] 112
2 (10,20] 56
3 (20,30] 47
4 (30,40] 46
5 (40,50] 28
6 (50,60] 21
7 (60,70] 29
8 (70,80] 9
9 (80,90] 18
10 (90,100] 23
11 (100,110] 8
12 (110,120] 12
13 (120,130] 7
14 (130,140] 5
15 (140,150] 6
16 (150,160] 1
17 (160,170] 1
18 (170,180] 1
19 (180,190] 1
20 (190,200] 1
Exercise: Find the relative frequency distribution of runs of a player
Quantitative Variables
Cumulative frequency
Cumulative relative frequency
Exercise 2: Find the cumulative frequency distribution of
runs of a player (?cumsum)
Exercise 3: Find the relative cumulative frequency
distribution of runs of a player
breaks = seq(0, 200, by=10)  # bin edges 0, 10, ..., 200
breaks
#[1] 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200
cut(Sachin$Runs, breaks)
freq <- as.data.frame(table(cut(Sachin$Runs, breaks)))  # frequency per bin
runs.cumfreq = cumsum(table(cut(Sachin$Runs, breaks)))  # running total of the bin counts
cbind(as.character(freq$Var1), cumsum(freq$Freq))       # bins alongside cumulative frequency
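For the relative and cumulative relative versions, one possible approach is prop.table() followed by cumsum(). The sketch below runs on made-up scores (hypothetical values, including one missing score) and passes include.lowest = TRUE so that a score of 0 lands in the first bin.

```r
# Hypothetical scores; NA marks an innings with no recorded score
runs   <- c(0, 5, 12, 18, 25, 25, 37, 41, 58, 99, NA)
breaks <- seq(0, 100, by = 20)

bins <- cut(runs, breaks, include.lowest = TRUE)   # first bin is [0,20]
freq <- table(bins)                                # NA is excluded by default

rel_freq     <- prop.table(freq)    # relative frequency
cum_freq     <- cumsum(freq)        # cumulative frequency
cum_rel_freq <- cumsum(rel_freq)    # cumulative relative frequency

data.frame(bin          = names(freq),
           freq         = as.vector(freq),
           rel_freq     = as.vector(rel_freq),
           cum_freq     = as.vector(cum_freq),
           cum_rel_freq = as.vector(cum_rel_freq))
```

The last cumulative frequency equals the number of non-missing observations, and the last cumulative relative frequency equals 1.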
Resources for R
https://www.cs.upc.edu/~robert/teaching/estadistica/rprogramming.pdf
https://www.r-bloggers.com/
https://www.r-bloggers.com/how-to-make-a-histogram-with-basic-r/
https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf
http://t-redactyl.io/blog/2015/12/creating-plots-in-r-using-ggplot2-part-1-line-plots.html
Data Visualization using ggplot2
library(ggplot2)
# charts.data is assumed to be a data frame with columns year, product and export
colour <- c("#40b8d0", "#b2d183")
p1 <- ggplot() +
  # one line per product, exports over time
  geom_line(aes(y = export, x = year, colour = product), size = 1.5,
            data = charts.data, stat = "identity") +
  theme(legend.position = "bottom", legend.direction = "horizontal",
        legend.title = element_blank()) +
  scale_x_continuous(breaks = seq(2006, 2014, 1)) +
  labs(x = "Year", y = "USD million") +
  ggtitle("Composition of Exports to China ($)") +
  scale_colour_manual(values = colour) +
  # plain background with light major grid lines
  theme(axis.line = element_line(size = 1, colour = "black"),
        panel.grid.major = element_line(colour = "#d3d3d3"),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(), panel.background = element_blank()) +
  # fonts and text colours
  theme(plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
        text = element_text(family = "Tahoma"),
        axis.text.x = element_text(colour = "black", size = 10),
        axis.text.y = element_text(colour = "black", size = 10),
        legend.key = element_rect(fill = "white", colour = "white"))
p1

More Related Content

What's hot

Probability basics and bayes' theorem
Probability basics and bayes' theoremProbability basics and bayes' theorem
Probability basics and bayes' theoremBalaji P
 
Standard normal distribution
Standard normal distributionStandard normal distribution
Standard normal distributionNadeem Uddin
 
Introduction to R for data science
Introduction to R for data scienceIntroduction to R for data science
Introduction to R for data scienceLong Nguyen
 
Linear regression and correlation analysis ppt @ bec doms
Linear regression and correlation analysis ppt @ bec domsLinear regression and correlation analysis ppt @ bec doms
Linear regression and correlation analysis ppt @ bec domsBabasab Patil
 
Probability Distributions for Discrete Variables
Probability Distributions for Discrete VariablesProbability Distributions for Discrete Variables
Probability Distributions for Discrete Variablesgetyourcheaton
 
Exploring bivariate data
Exploring bivariate dataExploring bivariate data
Exploring bivariate dataUlster BOCES
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statisticsAnand Thokal
 
Applications to Central Limit Theorem and Law of Large Numbers
Applications to Central Limit Theorem and Law of Large NumbersApplications to Central Limit Theorem and Law of Large Numbers
Applications to Central Limit Theorem and Law of Large NumbersUniversity of Salerno
 
Relative frequency distribution
Relative frequency distributionRelative frequency distribution
Relative frequency distributionNadeem Uddin
 
Introduction to Rstudio
Introduction to RstudioIntroduction to Rstudio
Introduction to RstudioOlga Scrivner
 
Basic Descriptive statistics
Basic Descriptive statisticsBasic Descriptive statistics
Basic Descriptive statisticsAjendra Sharma
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statisticsAttaullah Khan
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statisticsMmedsc Hahm
 

What's hot (20)

Probability basics and bayes' theorem
Probability basics and bayes' theoremProbability basics and bayes' theorem
Probability basics and bayes' theorem
 
Standard normal distribution
Standard normal distributionStandard normal distribution
Standard normal distribution
 
Introduction to R for data science
Introduction to R for data scienceIntroduction to R for data science
Introduction to R for data science
 
Linear regression and correlation analysis ppt @ bec doms
Linear regression and correlation analysis ppt @ bec domsLinear regression and correlation analysis ppt @ bec doms
Linear regression and correlation analysis ppt @ bec doms
 
Probability Distributions for Discrete Variables
Probability Distributions for Discrete VariablesProbability Distributions for Discrete Variables
Probability Distributions for Discrete Variables
 
Exploring bivariate data
Exploring bivariate dataExploring bivariate data
Exploring bivariate data
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Applications to Central Limit Theorem and Law of Large Numbers
Applications to Central Limit Theorem and Law of Large NumbersApplications to Central Limit Theorem and Law of Large Numbers
Applications to Central Limit Theorem and Law of Large Numbers
 
Hypothesis Testing
Hypothesis TestingHypothesis Testing
Hypothesis Testing
 
Relative frequency distribution
Relative frequency distributionRelative frequency distribution
Relative frequency distribution
 
Introduction to Rstudio
Introduction to RstudioIntroduction to Rstudio
Introduction to Rstudio
 
Normal distribution
Normal distributionNormal distribution
Normal distribution
 
3 Data Structure in R
3 Data Structure in R3 Data Structure in R
3 Data Structure in R
 
Basic Descriptive statistics
Basic Descriptive statisticsBasic Descriptive statistics
Basic Descriptive statistics
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Confidence Intervals
Confidence IntervalsConfidence Intervals
Confidence Intervals
 
6. R data structures
6. R data structures6. R data structures
6. R data structures
 
Probability
ProbabilityProbability
Probability
 
Random variables
Random variablesRandom variables
Random variables
 

Similar to Statistical Analysis with R -I

Biostatistics
BiostatisticsBiostatistics
Biostatisticspriyarokz
 
STATISTICAL PROCEDURES (Discriptive Statistics).pptx
STATISTICAL PROCEDURES (Discriptive Statistics).pptxSTATISTICAL PROCEDURES (Discriptive Statistics).pptx
STATISTICAL PROCEDURES (Discriptive Statistics).pptxMuhammadNafees42
 
Graphical presentation of data
Graphical presentation of dataGraphical presentation of data
Graphical presentation of datadrasifk
 
UNIT-4.docx
UNIT-4.docxUNIT-4.docx
UNIT-4.docxscet315
 
Basic Statistical Descriptions of Data.pptx
Basic Statistical Descriptions of Data.pptxBasic Statistical Descriptions of Data.pptx
Basic Statistical Descriptions of Data.pptxAnusuya123
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientistsAjay Ohri
 
Statistical treatment and data processing copy
Statistical treatment and data processing   copyStatistical treatment and data processing   copy
Statistical treatment and data processing copySWEET PEARL GAMAYON
 
QT1 - 03 - Measures of Central Tendency
QT1 - 03 - Measures of Central TendencyQT1 - 03 - Measures of Central Tendency
QT1 - 03 - Measures of Central TendencyPrithwis Mukerjee
 
QT1 - 03 - Measures of Central Tendency
QT1 - 03 - Measures of Central TendencyQT1 - 03 - Measures of Central Tendency
QT1 - 03 - Measures of Central TendencyPrithwis Mukerjee
 
Normal Curve in Total Quality Management
Normal Curve in Total Quality ManagementNormal Curve in Total Quality Management
Normal Curve in Total Quality ManagementDr.Raja R
 
R for statistics session 1
R for statistics session 1R for statistics session 1
R for statistics session 1Ashwini Mathur
 
BRM_Data Analysis, Interpretation and Reporting Part II.ppt
BRM_Data Analysis, Interpretation and Reporting Part II.pptBRM_Data Analysis, Interpretation and Reporting Part II.ppt
BRM_Data Analysis, Interpretation and Reporting Part II.pptAbdifatahAhmedHurre
 
PG STAT 531 Lecture 2 Descriptive statistics
PG STAT 531 Lecture 2 Descriptive statisticsPG STAT 531 Lecture 2 Descriptive statistics
PG STAT 531 Lecture 2 Descriptive statisticsAashish Patel
 
MSC III_Research Methodology and Statistics_Descriptive statistics.pdf
MSC III_Research Methodology and Statistics_Descriptive statistics.pdfMSC III_Research Methodology and Statistics_Descriptive statistics.pdf
MSC III_Research Methodology and Statistics_Descriptive statistics.pdfSuchita Rawat
 
Descriptive Statistics
Descriptive StatisticsDescriptive Statistics
Descriptive StatisticsBhagya Silva
 
Properties of Standard Deviation
Properties of Standard DeviationProperties of Standard Deviation
Properties of Standard DeviationRizwan Sharif
 

Similar to Statistical Analysis with R -I (20)

Biostatistics
BiostatisticsBiostatistics
Biostatistics
 
STATISTICAL PROCEDURES (Discriptive Statistics).pptx
STATISTICAL PROCEDURES (Discriptive Statistics).pptxSTATISTICAL PROCEDURES (Discriptive Statistics).pptx
STATISTICAL PROCEDURES (Discriptive Statistics).pptx
 
Graphical presentation of data
Graphical presentation of dataGraphical presentation of data
Graphical presentation of data
 
UNIT-4.docx
UNIT-4.docxUNIT-4.docx
UNIT-4.docx
 
Basic Statistical Descriptions of Data.pptx
Basic Statistical Descriptions of Data.pptxBasic Statistical Descriptions of Data.pptx
Basic Statistical Descriptions of Data.pptx
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientists
 
Stat11t chapter3
Stat11t chapter3Stat11t chapter3
Stat11t chapter3
 
Statistical treatment and data processing copy
Statistical treatment and data processing   copyStatistical treatment and data processing   copy
Statistical treatment and data processing copy
 
QT1 - 03 - Measures of Central Tendency
QT1 - 03 - Measures of Central TendencyQT1 - 03 - Measures of Central Tendency
QT1 - 03 - Measures of Central Tendency
 
QT1 - 03 - Measures of Central Tendency
QT1 - 03 - Measures of Central TendencyQT1 - 03 - Measures of Central Tendency
QT1 - 03 - Measures of Central Tendency
 
statistics
statisticsstatistics
statistics
 
Descriptive Analysis.pptx
Descriptive Analysis.pptxDescriptive Analysis.pptx
Descriptive Analysis.pptx
 
Normal Curve in Total Quality Management
Normal Curve in Total Quality ManagementNormal Curve in Total Quality Management
Normal Curve in Total Quality Management
 
Rj Prashant's ppts on statistics
Rj Prashant's ppts on statisticsRj Prashant's ppts on statistics
Rj Prashant's ppts on statistics
 
R for statistics session 1
R for statistics session 1R for statistics session 1
R for statistics session 1
 
BRM_Data Analysis, Interpretation and Reporting Part II.ppt
BRM_Data Analysis, Interpretation and Reporting Part II.pptBRM_Data Analysis, Interpretation and Reporting Part II.ppt
BRM_Data Analysis, Interpretation and Reporting Part II.ppt
 
PG STAT 531 Lecture 2 Descriptive statistics
PG STAT 531 Lecture 2 Descriptive statisticsPG STAT 531 Lecture 2 Descriptive statistics
PG STAT 531 Lecture 2 Descriptive statistics
 
MSC III_Research Methodology and Statistics_Descriptive statistics.pdf
MSC III_Research Methodology and Statistics_Descriptive statistics.pdfMSC III_Research Methodology and Statistics_Descriptive statistics.pdf
MSC III_Research Methodology and Statistics_Descriptive statistics.pdf
 
Descriptive Statistics
Descriptive StatisticsDescriptive Statistics
Descriptive Statistics
 
Properties of Standard Deviation
Properties of Standard DeviationProperties of Standard Deviation
Properties of Standard Deviation
 

More from Akhila Prabhakaran

More from Akhila Prabhakaran (9)

Re Imagining Education
Re Imagining EducationRe Imagining Education
Re Imagining Education
 
Introduction to OpenMP
Introduction to OpenMPIntroduction to OpenMP
Introduction to OpenMP
 
Introduction to OpenMP (Performance)
Introduction to OpenMP (Performance)Introduction to OpenMP (Performance)
Introduction to OpenMP (Performance)
 
Hypothesis testing Part1
Hypothesis testing Part1Hypothesis testing Part1
Hypothesis testing Part1
 
Statistical Analysis with R- III
Statistical Analysis with R- IIIStatistical Analysis with R- III
Statistical Analysis with R- III
 
Statistical Analysis with R -II
Statistical Analysis with R -IIStatistical Analysis with R -II
Statistical Analysis with R -II
 
Introduction to MPI
Introduction to MPIIntroduction to MPI
Introduction to MPI
 
Introduction to OpenMP
Introduction to OpenMPIntroduction to OpenMP
Introduction to OpenMP
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel Computing
 

Recently uploaded

Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 

Recently uploaded (20)

Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...

Statistical Analysis with R -I

  • 1. Statistical Analysis- Part 1 USING R ©Akhila Prabhakaran
  • 2. Statistical Analysis Using R: data collection, presentation and visuals; measures of central tendency, dispersion and correlation
  • 3. What is Statistical Analysis? Collection, examination, summarization, manipulation and interpretation. Of what? Data. Why? To discover its underlying causes, patterns, relationships, and trends.
  • 4. Data Collection: Sources of Data. Countless data sources. Static data: tabulated data such as databases and Excel files, or files holding organized or free-form text (JSON, XML, CSV, etc.). Dynamic data: web click-stream data, computer network monitoring data, telecommunication connection data, readings from sensor nets, and stock quotes.
  • 5. Data Collection: R. Static data: read.csv, read.table. Dynamic data: stream: data handling, plotting and easy scripting. rstream: random numbers are typically created as streams. quantmod: financial data can be obtained; intra-day price and trading volume can be considered a data stream. streamR and twitteR provide interfaces to retrieve live Twitter feeds.
  • 6. Reading Data in R
      Basic R function calls
      getwd()                          # get current working directory
      setwd("<new path>")              # set working directory
      setwd("C:/MyDoc")
      ?read.table
      ?read.csv
      Reading from an Excel File
      library(gdata)                   # load gdata package
      help(read.xls)                   # documentation
      mydata = read.xls("mydata.xls")  # read from first sheet
  • 7. Reading Data in R
      library(data.table)
      mydat <- fread('http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat')
      head(mydat)
         V1  V2   V3    V4 V5
      1:  1 307  930 36.58  0
      2:  2 307  940 36.73  0
      3:  3 307  950 36.93  0
      4:  4 307 1000 37.15  0
      5:  5 307 1010 37.23  0
      6:  6 307 1020 37.24  0
      nrow(mydat)
      ncol(mydat)
      colnames(mydat)
      str(mydat)
      summary(mydat)
  • 8. What to analyze? Summary Statistics. How many ODIs, Tests and T20s has he played? How many innings? What was his ODI average, his Test average and his T20 average? How consistent was his performance?
  • 9. Definitions. A variable is a symbol (A, B, x, y, etc.) that can take on any of a specified set of values. When the value of a variable is the outcome of a statistical experiment, that variable is a random variable. Sample space = the set of all possible outcomes of an experiment. Event = a subset of the sample space (example: a coin toss). Generally, statisticians use a capital letter to represent a random variable and a lower-case letter to represent one of its values. For example, X represents the random variable X. P(X) represents the probability of X. P(X = x) refers to the probability that the random variable X is equal to a particular value, denoted by x. As an example, P(X = 1) refers to the probability that the random variable X is equal to 1.
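The definitions above can be made concrete with a short simulation. This is a minimal sketch, assuming a fair coin encoded as X = 1 for heads and X = 0 for tails; the encoding, the seed and the number of tosses are illustrative choices, not part of the original slides.

```r
# Sample space of a single coin toss, encoded numerically (assumed encoding).
sample_space <- c(0, 1)   # 0 = tails, 1 = heads

set.seed(42)              # arbitrary seed so the run is reproducible
tosses <- sample(sample_space, size = 10000, replace = TRUE)

# The relative frequency of the event {X = 1} estimates P(X = 1).
p_x_equals_1 <- mean(tosses == 1)
print(p_x_equals_1)
```

With a fair coin the estimate should settle near 0.5 as the number of tosses grows.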
  • 10. Visualizing Data in R
      install.packages("devtools")
      devtools::install_github("ropenscilabs/cricinfo")
      library(cricketdata)
      Sachin <- fetch_player_data(35320, "ODI", "batting")
      > str(Sachin)
      Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 463 obs. of 13 variables:
       $ Start_Date: Date, format: "1989-12-18" "1990-03-01" "1990-03-06" "1990-04-25" ...
       $ Innings   : int 2 2 1 1 2 2 2 1 2 1 ...
       $ Opposition: chr "Pakistan" "New Zealand" "New Zealand" "Sri Lanka" ...
       $ Ground    : chr "Gujranwala" "Dunedin" "Wellington" "Sharjah" ...
       $ Runs      : num 0 0 36 10 20 19 31 36 53 30 ...
       $ Mins      : chr NA "2" "51" NA ...
       $ BF        : chr "2" "2" "39" "12" ...
       $ X4s       : chr "0" "0" "5" "0" ...
       $ X6s       : chr "0" "0" "0" "0" ...
       $ SR        : chr "0.00" "0.00" "92.30" "83.33" ...
       $ Pos       : chr "5" "5" "6" "5" ...
       $ Dismissal : chr "caught" "caught" "caught" "run out" ...
       $ Inns      : chr "2" "2" "1" "1" ...
  • 11. Measures of central tendency, dispersion and correlation. Central tendency describes what is typical of a dataset. Example: the mean (Mean: 44.8).
  • 12. Measures of Central Tendency. Examining the raw data is an essential first step before proceeding to statistical analysis. Two key sample statistics that may be calculated from a dataset are a measure of the central tendency of the sample distribution and of the spread of the data about this central tendency. Inferential statistical analysis is dependent on a knowledge of these descriptive statistics. Different measures of central tendency attempt to determine what might variously be termed the typical, normal, expected or average value of a dataset. Three of them are in general use for most types of data: the mode, median, and mean.
  • 13. Measures of central tendency, dispersion and correlation. The gold standard by which cricketers are remembered: their average. For batsmen, this is the mean number of runs they have scored per completed innings. For bowlers, it is the mean number of runs conceded per wicket. These are the numbers most keenly studied by students of the game, used to judge one player against another, or to assess the “form” of an individual player over the course of his career.
  • 14. How good is the Mean? ADVANTAGES: The mean uses every value in the data and hence is a good representative of the data. The irony is that, most of the time, this value never appears in the raw data. Repeated samples drawn from the same population tend to have similar means. The mean is therefore the measure of central tendency that best resists the fluctuation between different samples. It is closely related to standard deviation, the most common measure of dispersion. DISADVANTAGES: The important disadvantage of the mean is that it is sensitive to extreme values/outliers, especially when the sample size is small. It is not an appropriate measure of central tendency for a skewed distribution. The mean cannot be calculated for nominal or non-numerical ordinal data. Even though the mean can be calculated for numerical ordinal data, it often does not give a meaningful value, e.g. stage of cancer.
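The sensitivity of the mean to outliers in a small sample can be seen in a few lines. This is a minimal sketch with made-up innings scores (not taken from the Sachin dataset); a single large score is appended and both measures are recomputed.

```r
# Made-up innings scores, assumed purely for illustration.
scores <- c(10, 20, 30, 40, 50)
scores_outlier <- c(scores, 200)   # append one extreme score

mean_before   <- mean(scores)
median_before <- median(scores)
mean_after    <- mean(scores_outlier)
median_after  <- median(scores_outlier)

# The mean shifts far more than the median once the outlier is added.
print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
```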
  • 15. Median. The value which occupies the middle position when all the observations are arranged in ascending/descending order. It divides the frequency distribution exactly into two halves. The median is the 50th percentile and is also known as the ‘positional average’. If the number of observations is odd, the ((n + 1)/2)th observation (in the ordered set) is the median. When the number of observations is even, the median is the mean of the (n/2)th and (n/2 + 1)th observations. It is not distorted by outliers/skewed data. It does not take into account the precise value of each observation and hence does not use all information available in the data. The median of a pooled group cannot be expressed in terms of the individual medians of the groups that were pooled.
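The positional rule above can be written out directly and compared with R's built-in median(). This is a sketch; the helper name median_by_position and the sample scores are hypothetical.

```r
# Hypothetical helper implementing the positional definition of the median.
median_by_position <- function(x) {
  x <- sort(x[!is.na(x)])          # drop missing values, then order
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                 # odd n: the middle observation
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2  # even n: mean of the two middle observations
  }
}

odd_scores  <- c(36, 0, 10, 53, 20)      # made-up scores, n = 5
even_scores <- c(36, 0, 10, 53, 20, 31)  # made-up scores, n = 6

median_odd  <- median_by_position(odd_scores)
median_even <- median_by_position(even_scores)
print(c(median_odd, median_even))
```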
  • 16. Categorical Variables. The mean cannot be computed for categorical variables. The mode is probably the best way to represent the central tendency of these variables. Mode: the value that occurs most frequently in the data; there can be more than one such value. It is not a good representative for small samples.
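Base R has no function that returns the statistical mode (the built-in mode() reports an object's storage type), so a small helper is needed. The following is a sketch; the name stat_mode and the sample vector are hypothetical. It returns every value tied for the highest count, matching the remark that the mode can have more than one value.

```r
# Hypothetical helper: all values that occur most frequently in x.
stat_mode <- function(x) {
  counts <- table(x)               # frequency of each distinct value (NA dropped)
  names(counts)[as.vector(counts) == max(counts)]
}

# Made-up categorical data, assumed for illustration.
opposition <- c("Pakistan", "Sri Lanka", "Australia",
                "Sri Lanka", "Pakistan", "Kenya")

modes <- stat_mode(opposition)
print(modes)
```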
  • 18. Measurement of spread of data (variability). The first step in assessing spread of data is to examine it in either a table or an appropriate graphical form. A graph often makes clear any symmetry (or lack of it) in the spread of data, whether there are obvious atypical values (outliers) and whether the data is skewed in one direction or the other (a tendency for more values to fall in the upper or lower tail of the distribution). Range: highest value - lowest value. Percentiles: if Q% of the data is <= x, then x is the Qth percentile. Variance / Standard Deviation.
  • 19. Percentile
      Percentile     Meaning
      1st Quartile   25% of values are less than this
      2nd Quartile   50% of values are less than this
      3rd Quartile   75% of values are less than this
      4th Quartile   100% of values are less than this
      > summary(Sachin$Runs)
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
         0.00    8.00   28.50   40.77   63.00  200.00      11
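The quartiles reported by summary() can also be requested individually with quantile(). This is a minimal sketch on a made-up vector (not the Sachin data), with one missing value to show why na.rm = TRUE is needed; it uses quantile()'s default interpolation method.

```r
# Made-up scores with one missing value, assumed for illustration.
runs <- c(0, 8, 12, 28, 29, 45, 63, 98, 200, NA)

# 1st, 2nd and 3rd quartiles (25th, 50th and 75th percentiles).
quartiles <- quantile(runs, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)
print(quartiles)

# The interquartile range is the distance between the 1st and 3rd quartiles.
iqr <- IQR(runs, na.rm = TRUE)
print(iqr)
```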
  • 20. Variance & Standard Deviation. Variance: the average of the squared differences from the mean. SD = square root of the variance.
      > var(Sachin$Runs)
      NA
      > var(Sachin$Runs, na.rm=TRUE)
      1603.16
      > sd(Sachin$Runs)
      NA
      > sd(Sachin$Runs, na.rm=TRUE)
      40.03948
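One detail worth spelling out: R's var() and sd() compute the sample variance, dividing the sum of squared differences by n - 1 rather than n. The sketch below reproduces var() by hand on a made-up vector (not the Sachin data) and shows the population version for comparison.

```r
# Made-up scores with a missing value, assumed for illustration.
runs <- c(0, 36, 10, 20, NA, 53)

x <- runs[!is.na(runs)]            # same effect as na.rm = TRUE
n <- length(x)
squared_diffs <- (x - mean(x))^2

sample_var     <- sum(squared_diffs) / (n - 1)  # what var() computes
population_var <- sum(squared_diffs) / n        # plain average of squared differences
sample_sd      <- sqrt(sample_var)              # what sd() computes

print(c(sample_var = sample_var, population_var = population_var))
print(var(runs))                   # NA, because of the missing value
print(var(runs, na.rm = TRUE))     # matches sample_var
```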
  • 21. Covariance: a measure of the joint variability of two random variables.
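As a sketch of how covariance is computed, the example below pairs two made-up numeric vectors (hypothetical balls faced and runs scored), calculates the sample covariance by hand, and checks it against cov(). Dividing by both standard deviations gives the correlation, which cor() returns directly.

```r
# Made-up paired observations, assumed for illustration.
balls_faced <- c(2, 39, 12, 25, 60)
runs        <- c(0, 36, 10, 20, 53)
n <- length(runs)

# Sample covariance: average product of deviations, with n - 1 in the denominator.
cov_manual  <- sum((balls_faced - mean(balls_faced)) * (runs - mean(runs))) / (n - 1)
cov_builtin <- cov(balls_faced, runs)

# Correlation rescales covariance to the range [-1, 1].
cor_manual  <- cov_manual / (sd(balls_faced) * sd(runs))
cor_builtin <- cor(balls_faced, runs)

print(c(cov_manual = cov_manual, cov_builtin = cov_builtin))
print(c(cor_manual = cor_manual, cor_builtin = cor_builtin))
```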
  • 25. Describing data by tables and graphs. Qualitative variables (category variables): the number of observations that fall into a particular class (or category) of the qualitative variable is called the frequency (or count) of that class. A table listing all classes and their frequencies is called a frequency distribution.
  • 27. Describing data by tables and graphs
      Frequency Distribution: the number of observations that fall into a particular class
      > as.data.frame(table(Sachin$Opposition))
                 Var1 Freq
      1     Australia   71
      2    Bangladesh   12
      3       Bermuda    1
      4       England   37
      5       Ireland    2
      6         Kenya   10
      7       Namibia    1
      8   Netherlands    2
      9   New Zealand   42
      10     Pakistan   69
      11 South Africa   57
      12    Sri Lanka   84
      13       U.A.E.    2
      14  West Indies   39
      15     Zimbabwe   34
      Relative Frequency Distribution
      > as.data.frame(table(Sachin$Opposition)/nrow(Sachin))
                 Var1        Freq
      1     Australia 0.153347732
      2    Bangladesh 0.025917927
      3       Bermuda 0.002159827
      4       England 0.079913607
      5       Ireland 0.004319654
      6         Kenya 0.021598272
      7       Namibia 0.002159827
      8   Netherlands 0.004319654
      9   New Zealand 0.090712743
      10     Pakistan 0.149028078
      11 South Africa 0.123110151
      12    Sri Lanka 0.181425486
      13       U.A.E. 0.004319654
      14  West Indies 0.084233261
      15     Zimbabwe 0.073434125
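An alternative to dividing by nrow() is prop.table(), which divides each count by the table's own total. This is a sketch on a small made-up vector rather than the Sachin data; the two approaches agree whenever the column has no missing values.

```r
# Made-up categorical data, assumed for illustration.
opposition <- c("Pakistan", "Sri Lanka", "Australia",
                "Sri Lanka", "Pakistan", "Kenya", "Sri Lanka", "Australia")

freq     <- table(opposition)   # frequency distribution
rel_freq <- prop.table(freq)    # relative frequency distribution

print(as.data.frame(freq))
print(as.data.frame(rel_freq))
```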
  • 28. Quantitative Variables: Frequency distribution (Binning)
              Var1 Freq
      1     (0,10]  112
      2    (10,20]   56
      3    (20,30]   47
      4    (30,40]   46
      5    (40,50]   28
      6    (50,60]   21
      7    (60,70]   29
      8    (70,80]    9
      9    (80,90]   18
      10  (90,100]   23
      11 (100,110]    8
      12 (110,120]   12
      13 (120,130]    7
      14 (130,140]    5
      15 (140,150]    6
      16 (150,160]    1
      17 (160,170]    1
      18 (170,180]    1
      19 (180,190]    1
      20 (190,200]    1
  • 29. Quantitative Variables: Frequency distribution (Binning)
      > runs <- Sachin[!is.na(Sachin$Runs), "Runs"]
      > range(runs)
      [1] 0 200
      > breaks = seq(0, 200, by=10)  # bin edges in steps of 10
      > breaks
      [1] 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200
      > cut(Sachin$Runs, breaks)
      > as.data.frame(table(cut(Sachin$Runs, breaks)))
              Var1 Freq
      1     (0,10]  112
      2    (10,20]   56
      3    (20,30]   47
      4    (30,40]   46
      5    (40,50]   28
      6    (50,60]   21
      7    (60,70]   29
      8    (70,80]    9
      9    (80,90]   18
      10  (90,100]   23
      11 (100,110]    8
      12 (110,120]   12
      13 (120,130]    7
      14 (130,140]    5
      15 (140,150]    6
      16 (150,160]    1
      17 (160,170]    1
      18 (170,180]    1
      19 (180,190]    1
      20 (190,200]    1
      Exercise: Find the relative frequency distribution of runs of a player
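One caveat about the binning above: by default cut() builds intervals that are open on the left, so a score of exactly 0 falls outside (0,10] and is turned into NA. The counts in the table sum to 432, fewer than the 452 non-missing scores implied by the summary() output, which is consistent with scores of 0 being dropped. The sketch below uses made-up scores to show the effect and one possible remedy, include.lowest = TRUE; it also covers the relative-frequency exercise.

```r
# Made-up scores including two ducks, assumed for illustration.
runs   <- c(0, 0, 5, 10, 11, 20, 200)
breaks <- seq(0, 200, by = 10)

default_bins <- cut(runs, breaks)                         # first bin is (0,10]
closed_bins  <- cut(runs, breaks, include.lowest = TRUE)  # first bin is [0,10]

dropped <- sum(is.na(default_bins))  # scores of 0 that fell outside every bin
print(dropped)

freq     <- table(closed_bins)       # frequency distribution, zeros included
rel_freq <- prop.table(freq)         # relative frequency distribution
print(head(as.data.frame(rel_freq), 3))
```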
  • 30. Quantitative Variables Cumulative frequency Cumulative relative frequency Exercise 2: Find the cumulative frequency distribution of runs of a player (?cumsum) Exercise 3: Find the relative cumulative frequency distribution of runs of a player
  • 31. Quantitative Variables: Cumulative frequency, Cumulative relative frequency
      breaks = seq(0, 200, by=10)  # bin edges in steps of 10
      breaks
      #[1] 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200
      cut(Sachin$Runs, breaks)
      freq <- as.data.frame(table(cut(Sachin$Runs, breaks)))
      runs.cumfreq = cumsum(table(cut(Sachin$Runs, breaks)))  # running total of bin counts
      cbind(as.character(freq$Var1), cumsum(freq$Freq))
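For the cumulative relative frequency exercise, one possible approach is to divide the running total by the overall count. This is a sketch on made-up scores, assuming include.lowest = TRUE so that scores of 0 are kept in the first bin; the column names in the result are arbitrary.

```r
# Made-up scores, assumed for illustration.
runs   <- c(0, 5, 10, 11, 20, 35, 200)
breaks <- seq(0, 200, by = 10)

freq         <- table(cut(runs, breaks, include.lowest = TRUE))
cum_freq     <- cumsum(freq)            # cumulative frequency per bin
cum_rel_freq <- cum_freq / sum(freq)    # cumulative relative frequency per bin

result <- data.frame(
  bin          = names(freq),
  cum_freq     = as.vector(cum_freq),
  cum_rel_freq = as.vector(cum_rel_freq)
)
print(head(result, 4))
```

The last row of the cumulative relative frequency column is always 1, which is a quick sanity check on the binning.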
  • 34. Data Visualization using ggplot2
      library(ggplot2)  # load ggplot2 before building the plot
      colour <- c("#40b8d0", "#b2d183")
      p1 <- ggplot() +
        geom_line(aes(y = export, x = year, colour = product), size = 1.5,
                  data = charts.data, stat = "identity") +
        theme(legend.position = "bottom", legend.direction = "horizontal",
              legend.title = element_blank()) +
        scale_x_continuous(breaks = seq(2006, 2014, 1)) +
        labs(x = "Year", y = "USD million") +
        ggtitle("Composition of Exports to China ($)") +
        scale_colour_manual(values = colour) +
        theme(axis.line = element_line(size = 1, colour = "black"),
              panel.grid.major = element_line(colour = "#d3d3d3"),
              panel.grid.minor = element_blank(),
              panel.border = element_blank(),
              panel.background = element_blank()) +
        theme(plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
              text = element_text(family = "Tahoma"),
              axis.text.x = element_text(colour = "black", size = 10),
              axis.text.y = element_text(colour = "black", size = 10),
              legend.key = element_rect(fill = "white", colour = "white"))
      p1