SlideShare a Scribd company logo
1 of 21
DESCRIPTIVE STATISTICS USING R PROGRAMMING ALONG
WITH THE DEMONSTRATION
IRIS FLOWER FEATURES:
IRIS-DATA-SET:
Descriptive statistics consist of describing simply the data using some summary
statistics and graphics.
Import your data into R
1. Prepare your data
2. Save your data in an external .txt tab or .csv files
3. Import your data into R as follow:
# If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())
Here, we’ll use the built-in R data set named iris.
# Store the data in the variable my_data
my_data <- iris
Check your data
You can inspect your data using the functions head() and tails(), which will display the first
and the last part of the data, respectively.
# Print the first 6 rows
head(my_data, 6)
----------------------------------------------------------------------------------
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
R functions for computing descriptive statistics
Some R functions for computing descriptive statistics:
Description R function
Mean mean()
Standard deviation sd()
Variance var()
Minimum min()
Maximum maximum()
Median median()
Range of values (minimum and maximum) range()
Sample quantiles quantile()
Generic function summary()
Interquartile range IQR()
The function mfv(), for most frequent value, [in modeest package] can be used to
find the statistical mode of a numeric vector. (For Finding the MODE)
Descriptive statistics for a single group
Measure of central tendency: mean, median, mode
The central tendency measures the “average” or the “middle” of your data. The most
commonly used measures include:
● the mean: the average value. It’s sensitive to outliers.
● the median: the middle value. It’s a robust alternative to mean.
● and the mode: the most frequent value
In R,
● The function mean() and median() can be used to compute the mean and
the median, respectively;
● The function mfv() [in the modeest R package] can be used to compute the
mode of a variable.
The R code below computes the mean, median and the mode of the variable Sepal.Length
[in my_data data set]:
# Compute the mean value
mean(my_data$Sepal.Length)
[1] 5.843333
# Compute the median value
median(my_data$Sepal.Length)
[1] 5.8
# Compute the mode
# install.packages("modeest")
require(modeest)
mfv(my_data$Sepal.Length)
[1] 5
Measure of variability
Measures of variability gives how “spread out” the data are.
Range: minimum & maximum
● Range corresponds to the biggest value minus the smallest value. It gives
you the full spread of the data.
# Compute the minimum value
min(my_data$Sepal.Length)
[1] 4.3
# Compute the maximum value
max(my_data$Sepal.Length)
[1] 7.9
# Range
range(my_data$Sepal.Length)
[1] 4.3 7.9
(IQR) Interquartile range
Recall that, quartiles divide the data into 4 parts. Note that the interquartile range (IQR)
- corresponding to the difference between the first and third quartiles - is sometimes used
as a robust alternative to the standard deviation.
● R function:
quantile(x, probs = seq(0, 1, 0.25))
● x: numeric vector whose sample quantiles are wanted.
● probs: numeric vector of probabilities with values in [0,1].
● Example:
quantile(my_data$Sepal.Length)
0% 25% 50% 75% 100%
4.3 5.1 5.8 6.4 7.9
By default, the function returns the minimum, the maximum and three quartiles
(the 0.25, 0.50 and 0.75 quartiles).
##CUSTOMISE THE INTERQUARTILE RANGE
To compute deciles (0.1, 0.2, 0.3, …., 0.9), use this:
quantile(my_data$Sepal.Length, seq(0, 1, 0.1))
To compute the interquartile range, type this:
IQR(my_data$Sepal.Length)
[1] 1.3
Variance and standard deviation
The variance represents the average squared deviation from the mean. The standard
deviation is the square root of the variance. It measures the average deviation of the
values, in the data, from the mean value.
# Compute the variance
var(my_data$Sepal.Length)
# Compute the standard deviation =
# square root of the variance
sd(my_data$Sepal.Length)
Median absolute deviation
The median absolute deviation (MAD) measures the deviation of the values, in the data,
from the median value.
# Compute the median
median(my_data$Sepal.Length)
# Compute the median absolute deviation
mad(my_data$Sepal.Length)
Which measure to use?
● Range. It’s not often used because it’s very sensitive to outliers.
● Interquartile range. It’s pretty robust to outliers. It’s used a lot in combination
with the median.
● Variance. It’s completely uninterpretable because it doesn’t use the same
units as the data. It’s almost never used except as a mathematical tool
● Standard deviation. This is the square root of the variance. It’s expressed in
the same units as the data. The standard deviation is often used in the
situation where the mean is the measure of central tendency.
● Median absolute deviation. It’s a robust way to estimate the standard
deviation, for data with outliers. It’s not used very often.
In summary, the IQR and the standard deviation are the two most common measures
used to report the variability of the data.
Computing an overall summary of a variable and an
entire data frame
summary() function
The function summary() can be used to display several statistic summaries of either one
variable or an entire data frame.
● Summary of a single variable. Five values are returned: the mean, median,
25th and 75th quartiles, min and max in one single line call:
summary(my_data$Sepal.Length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 5.100 5.800 5.843 6.400 7.900
● Summary of a data frame. In this case, the function summary() is
automatically applied to each column. The format of the result depends on
the type of the data contained in the column. For example:
○ If the column is a numeric variable, mean, median, min, max and
quartiles are returned.
○ If the column is a factor variable, the number of observations in
each group is returned.
summary(my_data, digits = 1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4 Min. :2 Min. :1 Min. :0.1 setosa :50
1st Qu.:5 1st Qu.:3 1st Qu.:2 1st Qu.:0.3 versicolor:50
Median :6 Median :3 Median :4 Median :1.3 virginica :50
Mean :6 Mean :3 Mean :4 Mean :1.2
3rd Qu.:6 3rd Qu.:3 3rd Qu.:5 3rd Qu.:1.8
Max. :8 Max. :4 Max. :7 Max. :2.5
sapply() function
It’s also possible to use the function sapply() to apply a particular function over a list or
vector.
For instance, we can use it to compute for each column in a data frame, the mean, sd,
var, min, quantile, …
# Compute the mean of each column
sapply(my_data[, -5], mean)
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.843333 3.057333 3.758000 1.199333
# Compute quartiles
sapply(my_data[, -5], quantile)
Sepal.Length Sepal.Width Petal.Length Petal.Width
0% 4.3 2.0 1.00 0.1
25% 5.1 2.8 1.60 0.3
50% 5.8 3.0 4.35 1.3
75% 6.4 3.3 5.10 1.8
100% 7.9 4.4 6.90 2.5
Case of missing values
Note that, when the data contains missing values, some R functions will return errors or
NA even if just a single value is missing.
For example, the mean() function will return NA if even only one value is missing in a
vector. This can be avoided using the argument na.rm = TRUE, which tells to the function
to remove any NAs before calculations. An example using the mean function is as follow:
mean(my_data$Sepal.Length, na.rm = TRUE)
Graphical display of distributions
The R package ggpubr will be used to create graphs.
Installation and loading ggpubr
● Install the latest version from GitHub as follow:
# Install
install.packages("ggpubr")
● Load ggpubr as follow:
library(ggpubr)
Box plots
ggboxplot(my_data, y = "Sepal.Length", width = 0.5)
Histogram
Histograms show the number of observations that fall within specified divisions (i.e., bins).
Histogram plot of Sepal.Length with mean line (dashed line).
gghistogram(my_data, x = "Sepal.Length", bins = 9,
add = "mean")
Empirical cumulative distribution function (ECDF)
ECDF is the fraction of data smaller than or equal to x.
ggecdf(my_data, x = "Sepal.Length")
Q-Q plots
QQ plots are used to check whether the data is normally distributed.
ggqqplot(my_data, x = "Sepal.Length")
Descriptive statistics by groups
To compute summary statistics by groups, the functions group_by() and summarise() [in
dplyr package] can be used.
● We want to group the data by Species and then:
○ compute the number of element in each group. R function: n()
○ compute the mean. R function mean()
○ and the standard deviation. R function sd()
The function %>% is used to chain operations.
● Install ddplyr as follow:
install.packages("dplyr")
● Descriptive statistics by groups:
library(dplyr)
group_by(my_data, Species) %>%
summarise(
count = n(),
mean = mean(Sepal.Length, na.rm = TRUE),
sd = sd(Sepal.Length, na.rm = TRUE)
)
Source: local data frame [3 x 4]
Species count mean sd
(fctr) (int) (dbl) (dbl)
1 setosa 50 5.006 0.3524897
2 versicolor 50 5.936 0.5161711
3 virginica 50 6.588 0.6358796
● Graphics for grouped data:
library("ggpubr")
# Box plot colored by groups: Species
ggboxplot(my_data, x = "Species", y = "Sepal.Length",
color = "Species",
palette = c("#00AFBB", "#E7B800", "#FC4E07"))
# Stripchart colored by groups: Species
ggstripchart(my_data, x = "Species", y = "Sepal.Length",
color = "Species",
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
add = "mean_sd")
Note that, when the number of observations per groups is small, it’s recommended to
use strip chart compared to box plots.
Summerization notes for descriptive statistics using r

More Related Content

What's hot

Gradient Boosted trees
Gradient Boosted treesGradient Boosted trees
Gradient Boosted treesNihar Ranjan
 
Summary statistics
Summary statisticsSummary statistics
Summary statisticsRupak Roy
 
De-Cluttering-ML | TechWeekends
De-Cluttering-ML | TechWeekendsDe-Cluttering-ML | TechWeekends
De-Cluttering-ML | TechWeekendsDSCUSICT
 
Machine Learning Decision Tree Algorithms
Machine Learning Decision Tree AlgorithmsMachine Learning Decision Tree Algorithms
Machine Learning Decision Tree AlgorithmsRupak Roy
 
Chapter 04-discriminant analysis
Chapter 04-discriminant analysisChapter 04-discriminant analysis
Chapter 04-discriminant analysisRaman Kannan
 
SVM - Functional Verification
SVM - Functional VerificationSVM - Functional Verification
SVM - Functional VerificationSai Kiran Kadam
 
Random forest
Random forestRandom forest
Random forestUjjawal
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Parth Khare
 
An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)Dataspora
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysisguest0edcaf
 
[M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization [M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization Andrea Rubio
 
WEKA: Data Mining Input Concepts Instances And Attributes
WEKA: Data Mining Input Concepts Instances And AttributesWEKA: Data Mining Input Concepts Instances And Attributes
WEKA: Data Mining Input Concepts Instances And AttributesDataminingTools Inc
 
Understanding random forests
Understanding random forestsUnderstanding random forests
Understanding random forestsMarc Garcia
 
Matrix decomposition and_applications_to_nlp
Matrix decomposition and_applications_to_nlpMatrix decomposition and_applications_to_nlp
Matrix decomposition and_applications_to_nlpankit_ppt
 
Data mining-primitives-languages-and-system-architectures2641
Data mining-primitives-languages-and-system-architectures2641Data mining-primitives-languages-and-system-architectures2641
Data mining-primitives-languages-and-system-architectures2641Aiswaryadevi Jaganmohan
 
3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methodsKrish_ver2
 
Random Forest and KNN is fun
Random Forest and KNN is funRandom Forest and KNN is fun
Random Forest and KNN is funZhen Li
 

What's hot (20)

Gradient Boosted trees
Gradient Boosted treesGradient Boosted trees
Gradient Boosted trees
 
Summary statistics
Summary statisticsSummary statistics
Summary statistics
 
De-Cluttering-ML | TechWeekends
De-Cluttering-ML | TechWeekendsDe-Cluttering-ML | TechWeekends
De-Cluttering-ML | TechWeekends
 
Machine Learning Decision Tree Algorithms
Machine Learning Decision Tree AlgorithmsMachine Learning Decision Tree Algorithms
Machine Learning Decision Tree Algorithms
 
Chapter 04-discriminant analysis
Chapter 04-discriminant analysisChapter 04-discriminant analysis
Chapter 04-discriminant analysis
 
SVM - Functional Verification
SVM - Functional VerificationSVM - Functional Verification
SVM - Functional Verification
 
Random forest
Random forestRandom forest
Random forest
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
 
DCSM report2
DCSM report2DCSM report2
DCSM report2
 
Classification Continued
Classification ContinuedClassification Continued
Classification Continued
 
An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
 
[M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization [M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization
 
WEKA: Data Mining Input Concepts Instances And Attributes
WEKA: Data Mining Input Concepts Instances And AttributesWEKA: Data Mining Input Concepts Instances And Attributes
WEKA: Data Mining Input Concepts Instances And Attributes
 
Understanding random forests
Understanding random forestsUnderstanding random forests
Understanding random forests
 
Matrix decomposition and_applications_to_nlp
Matrix decomposition and_applications_to_nlpMatrix decomposition and_applications_to_nlp
Matrix decomposition and_applications_to_nlp
 
Data mining-primitives-languages-and-system-architectures2641
Data mining-primitives-languages-and-system-architectures2641Data mining-primitives-languages-and-system-architectures2641
Data mining-primitives-languages-and-system-architectures2641
 
Random forest
Random forestRandom forest
Random forest
 
3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methods
 
Random Forest and KNN is fun
Random Forest and KNN is funRandom Forest and KNN is fun
Random Forest and KNN is fun
 

Similar to Summerization notes for descriptive statistics using r

R programming slides
R  programming slidesR  programming slides
R programming slidesPankaj Saini
 
Unit I - introduction to r language 2.pptx
Unit I - introduction to r language 2.pptxUnit I - introduction to r language 2.pptx
Unit I - introduction to r language 2.pptxSreeLaya9
 
Data Manipulation with Numpy and Pandas in PythonStarting with N
Data Manipulation with Numpy and Pandas in PythonStarting with NData Manipulation with Numpy and Pandas in PythonStarting with N
Data Manipulation with Numpy and Pandas in PythonStarting with NOllieShoresna
 
Sas basis imp intrw ques
Sas basis imp intrw quesSas basis imp intrw ques
Sas basis imp intrw quesVinod Kumar
 
MODULE 5- EDA.pptx
MODULE 5- EDA.pptxMODULE 5- EDA.pptx
MODULE 5- EDA.pptxnikshaikh786
 
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONUNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONNandakumar P
 
House price prediction
House price predictionHouse price prediction
House price predictionSabahBegum
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Yao Yao
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reductionKrish_ver2
 
C programming session 04
C programming session 04C programming session 04
C programming session 04Dushmanta Nath
 
maXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIImaXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIIMax Kleiner
 
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxINFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxcarliotwaycave
 
Two dimensional array
Two dimensional arrayTwo dimensional array
Two dimensional arrayRajendran
 
Data Science Using Scikit-Learn
Data Science Using Scikit-LearnData Science Using Scikit-Learn
Data Science Using Scikit-LearnDucat India
 
Strategic Sourcing Quantitative Decision-Making and Analytics Mo.docx
Strategic Sourcing Quantitative Decision-Making and Analytics Mo.docxStrategic Sourcing Quantitative Decision-Making and Analytics Mo.docx
Strategic Sourcing Quantitative Decision-Making and Analytics Mo.docxcpatriciarpatricia
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine LearningAmanBhalla14
 

Similar to Summerization notes for descriptive statistics using r (20)

Aggregate.pptx
Aggregate.pptxAggregate.pptx
Aggregate.pptx
 
R programming slides
R  programming slidesR  programming slides
R programming slides
 
Unit I - introduction to r language 2.pptx
Unit I - introduction to r language 2.pptxUnit I - introduction to r language 2.pptx
Unit I - introduction to r language 2.pptx
 
Numpy ndarrays.pdf
Numpy ndarrays.pdfNumpy ndarrays.pdf
Numpy ndarrays.pdf
 
Data Manipulation with Numpy and Pandas in PythonStarting with N
Data Manipulation with Numpy and Pandas in PythonStarting with NData Manipulation with Numpy and Pandas in PythonStarting with N
Data Manipulation with Numpy and Pandas in PythonStarting with N
 
Sas basis imp intrw ques
Sas basis imp intrw quesSas basis imp intrw ques
Sas basis imp intrw ques
 
MODULE 5- EDA.pptx
MODULE 5- EDA.pptxMODULE 5- EDA.pptx
MODULE 5- EDA.pptx
 
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONUNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
 
Bigdata analytics
Bigdata analyticsBigdata analytics
Bigdata analytics
 
User biglm
User biglmUser biglm
User biglm
 
House price prediction
House price predictionHouse price prediction
House price prediction
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reduction
 
C programming session 04
C programming session 04C programming session 04
C programming session 04
 
maXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIImaXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VII
 
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxINFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
 
Two dimensional array
Two dimensional arrayTwo dimensional array
Two dimensional array
 
Data Science Using Scikit-Learn
Data Science Using Scikit-LearnData Science Using Scikit-Learn
Data Science Using Scikit-Learn
 
Strategic Sourcing Quantitative Decision-Making and Analytics Mo.docx
Strategic Sourcing Quantitative Decision-Making and Analytics Mo.docxStrategic Sourcing Quantitative Decision-Making and Analytics Mo.docx
Strategic Sourcing Quantitative Decision-Making and Analytics Mo.docx
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
 

More from Ashwini Mathur

S3 classes and s4 classes
S3 classes and s4 classesS3 classes and s4 classes
S3 classes and s4 classesAshwini Mathur
 
R programming lab 3 - jupyter notebook
R programming lab   3 - jupyter notebookR programming lab   3 - jupyter notebook
R programming lab 3 - jupyter notebookAshwini Mathur
 
R programming lab 2 - jupyter notebook
R programming lab   2 - jupyter notebookR programming lab   2 - jupyter notebook
R programming lab 2 - jupyter notebookAshwini Mathur
 
R programming lab 1 - jupyter notebook
R programming lab   1 - jupyter notebookR programming lab   1 - jupyter notebook
R programming lab 1 - jupyter notebookAshwini Mathur
 
R for statistics session 1
R for statistics session 1R for statistics session 1
R for statistics session 1Ashwini Mathur
 
Object oriented programming in r
Object oriented programming in r Object oriented programming in r
Object oriented programming in r Ashwini Mathur
 
Linear programming optimization in r
Linear programming optimization in r Linear programming optimization in r
Linear programming optimization in r Ashwini Mathur
 
Introduction to python along with the comparitive analysis with r
Introduction to python   along with the comparitive analysis with r Introduction to python   along with the comparitive analysis with r
Introduction to python along with the comparitive analysis with r Ashwini Mathur
 
Example 2 summerization notes for descriptive statistics using r
Example   2    summerization notes for descriptive statistics using r Example   2    summerization notes for descriptive statistics using r
Example 2 summerization notes for descriptive statistics using r Ashwini Mathur
 
Descriptive statistics assignment
Descriptive statistics assignment Descriptive statistics assignment
Descriptive statistics assignment Ashwini Mathur
 
Descriptive analytics in r programming language
Descriptive analytics in r programming languageDescriptive analytics in r programming language
Descriptive analytics in r programming languageAshwini Mathur
 
Data analysis for covid 19
Data analysis for covid 19Data analysis for covid 19
Data analysis for covid 19Ashwini Mathur
 
Correlation and linear regression
Correlation and linear regression Correlation and linear regression
Correlation and linear regression Ashwini Mathur
 
Calling c functions from r programming unit 5
Calling c functions from r programming    unit 5Calling c functions from r programming    unit 5
Calling c functions from r programming unit 5Ashwini Mathur
 
Anova (analysis of variance) test
Anova (analysis of variance) test Anova (analysis of variance) test
Anova (analysis of variance) test Ashwini Mathur
 

More from Ashwini Mathur (19)

S3 classes and s4 classes
S3 classes and s4 classesS3 classes and s4 classes
S3 classes and s4 classes
 
Rpy2 demonstration
Rpy2 demonstrationRpy2 demonstration
Rpy2 demonstration
 
Rpy package
Rpy packageRpy package
Rpy package
 
R programming lab 3 - jupyter notebook
R programming lab   3 - jupyter notebookR programming lab   3 - jupyter notebook
R programming lab 3 - jupyter notebook
 
R programming lab 2 - jupyter notebook
R programming lab   2 - jupyter notebookR programming lab   2 - jupyter notebook
R programming lab 2 - jupyter notebook
 
R programming lab 1 - jupyter notebook
R programming lab   1 - jupyter notebookR programming lab   1 - jupyter notebook
R programming lab 1 - jupyter notebook
 
R for statistics session 1
R for statistics session 1R for statistics session 1
R for statistics session 1
 
R for statistics 2
R for statistics 2R for statistics 2
R for statistics 2
 
Play with matrix in r
Play with matrix in rPlay with matrix in r
Play with matrix in r
 
Object oriented programming in r
Object oriented programming in r Object oriented programming in r
Object oriented programming in r
 
Linear programming optimization in r
Linear programming optimization in r Linear programming optimization in r
Linear programming optimization in r
 
Introduction to python along with the comparitive analysis with r
Introduction to python   along with the comparitive analysis with r Introduction to python   along with the comparitive analysis with r
Introduction to python along with the comparitive analysis with r
 
Example 2 summerization notes for descriptive statistics using r
Example   2    summerization notes for descriptive statistics using r Example   2    summerization notes for descriptive statistics using r
Example 2 summerization notes for descriptive statistics using r
 
Descriptive statistics assignment
Descriptive statistics assignment Descriptive statistics assignment
Descriptive statistics assignment
 
Descriptive analytics in r programming language
Descriptive analytics in r programming languageDescriptive analytics in r programming language
Descriptive analytics in r programming language
 
Data analysis for covid 19
Data analysis for covid 19Data analysis for covid 19
Data analysis for covid 19
 
Correlation and linear regression
Correlation and linear regression Correlation and linear regression
Correlation and linear regression
 
Calling c functions from r programming unit 5
Calling c functions from r programming    unit 5Calling c functions from r programming    unit 5
Calling c functions from r programming unit 5
 
Anova (analysis of variance) test
Anova (analysis of variance) test Anova (analysis of variance) test
Anova (analysis of variance) test
 

Recently uploaded

Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/managementakshesh doshi
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Servicejennyeacort
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationBoston Institute of Analytics
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 

Recently uploaded (20)

Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/management
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
 
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health Classification
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 

Summerization notes for descriptive statistics using r

  • 1. DESCRIPTIVE STATISTICS USING R PROGRAMMING ALONG WITH THE DEMONSTRATION IRIS FLOWER FEATURES: IRIS-DATA-SET:
  • 2. Descriptive statistics consist of describing simply the data using some summary statistics and graphics. Import your data into R 1. Prepare your data 2. Save your data in an external .txt tab or .csv files 3. Import your data into R as follow: # If .txt tab file, use this my_data <- read.delim(file.choose()) # Or, if .csv file, use this my_data <- read.csv(file.choose()) Here, we’ll use the built-in R data set named iris. # Store the data in the variable my_data my_data <- iris
  • 3. Check your data You can inspect your data using the functions head() and tails(), which will display the first and the last part of the data, respectively. # Print the first 6 rows head(my_data, 6) ---------------------------------------------------------------------------------- Output: Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa
  • 4. R functions for computing descriptive statistics Some R functions for computing descriptive statistics: Description R function Mean mean() Standard deviation sd() Variance var() Minimum min() Maximum maximum() Median median() Range of values (minimum and maximum) range() Sample quantiles quantile() Generic function summary() Interquartile range IQR() The function mfv(), for most frequent value, [in modeest package] can be used to find the statistical mode of a numeric vector. (For Finding the MODE)
  • 5. Descriptive statistics for a single group Measure of central tendency: mean, median, mode The central tendency measures the “average” or the “middle” of your data. The most commonly used measures include: ● the mean: the average value. It’s sensitive to outliers. ● the median: the middle value. It’s a robust alternative to mean. ● and the mode: the most frequent value In R, ● The function mean() and median() can be used to compute the mean and the median, respectively; ● The function mfv() [in the modeest R package] can be used to compute the mode of a variable. The R code below computes the mean, median and the mode of the variable Sepal.Length [in my_data data set]: # Compute the mean value mean(my_data$Sepal.Length) [1] 5.843333 # Compute the median value median(my_data$Sepal.Length)
  • 6. [1] 5.8 # Compute the mode # install.packages("modeest") require(modeest) mfv(my_data$Sepal.Length) [1] 5
  • 7. Measure of variability Measures of variability gives how “spread out” the data are. Range: minimum & maximum ● Range corresponds to the biggest value minus the smallest value. It gives you the full spread of the data. # Compute the minimum value min(my_data$Sepal.Length) [1] 4.3 # Compute the maximum value max(my_data$Sepal.Length) [1] 7.9 # Range range(my_data$Sepal.Length) [1] 4.3 7.9
  • 8. (IQR) Interquartile range Recall that, quartiles divide the data into 4 parts. Note that the interquartile range (IQR) - corresponding to the difference between the first and third quartiles - is sometimes used as a robust alternative to the standard deviation. ● R function: quantile(x, probs = seq(0, 1, 0.25)) ● x: numeric vector whose sample quantiles are wanted. ● probs: numeric vector of probabilities with values in [0,1]. ● Example: quantile(my_data$Sepal.Length) 0% 25% 50% 75% 100% 4.3 5.1 5.8 6.4 7.9 By default, the function returns the minimum, the maximum and three quartiles (the 0.25, 0.50 and 0.75 quartiles). ##CUSTOMISE THE INTERQUARTILE RANGE To compute deciles (0.1, 0.2, 0.3, …., 0.9), use this: quantile(my_data$Sepal.Length, seq(0, 1, 0.1)) To compute the interquartile range, type this: IQR(my_data$Sepal.Length)
  • 9. [1] 1.3 Variance and standard deviation The variance represents the average squared deviation from the mean. The standard deviation is the square root of the variance. It measures the average deviation of the values, in the data, from the mean value. # Compute the variance var(my_data$Sepal.Length) # Compute the standard deviation = # square root of the variance sd(my_data$Sepal.Length) Median absolute deviation The median absolute deviation (MAD) measures the deviation of the values, in the data, from the median value. # Compute the median median(my_data$Sepal.Length) # Compute the median absolute deviation mad(my_data$Sepal.Length) Which measure to use? ● Range. It’s not often used because it’s very sensitive to outliers. ● Interquartile range. It’s pretty robust to outliers. It’s used a lot in combination with the median. ● Variance. It’s completely uninterpretable because it doesn’t use the same units as the data. It’s almost never used except as a mathematical tool ● Standard deviation. This is the square root of the variance. It’s expressed in the same units as the data. The standard deviation is often used in the situation where the mean is the measure of central tendency. ● Median absolute deviation. It’s a robust way to estimate the standard deviation, for data with outliers. It’s not used very often.
  • 10. In summary, the IQR and the standard deviation are the two most common measures used to report the variability of the data. Computing an overall summary of a variable and an entire data frame summary() function The function summary() can be used to display several statistic summaries of either one variable or an entire data frame. ● Summary of a single variable. Five values are returned: the mean, median, 25th and 75th quartiles, min and max in one single line call: summary(my_data$Sepal.Length) Min. 1st Qu. Median Mean 3rd Qu. Max. 4.300 5.100 5.800 5.843 6.400 7.900 ● Summary of a data frame. In this case, the function summary() is automatically applied to each column. The format of the result depends on the type of the data contained in the column. For example: ○ If the column is a numeric variable, mean, median, min, max and quartiles are returned. ○ If the column is a factor variable, the number of observations in each group is returned. summary(my_data, digits = 1) Sepal.Length Sepal.Width Petal.Length Petal.Width Species Min. :4 Min. :2 Min. :1 Min. :0.1 setosa :50 1st Qu.:5 1st Qu.:3 1st Qu.:2 1st Qu.:0.3 versicolor:50 Median :6 Median :3 Median :4 Median :1.3 virginica :50
  • 11. Mean :6 Mean :3 Mean :4 Mean :1.2 3rd Qu.:6 3rd Qu.:3 3rd Qu.:5 3rd Qu.:1.8 Max. :8 Max. :4 Max. :7 Max. :2.5 sapply() function It’s also possible to use the function sapply() to apply a particular function over a list or vector. For instance, we can use it to compute for each column in a data frame, the mean, sd, var, min, quantile, … # Compute the mean of each column sapply(my_data[, -5], mean) Sepal.Length Sepal.Width Petal.Length Petal.Width 5.843333 3.057333 3.758000 1.199333 # Compute quartiles sapply(my_data[, -5], quantile) Sepal.Length Sepal.Width Petal.Length Petal.Width 0% 4.3 2.0 1.00 0.1 25% 5.1 2.8 1.60 0.3 50% 5.8 3.0 4.35 1.3 75% 6.4 3.3 5.10 1.8 100% 7.9 4.4 6.90 2.5
  • 12. Case of missing values Note that, when the data contains missing values, some R functions will return errors or NA even if just a single value is missing. For example, the mean() function will return NA if even only one value is missing in a vector. This can be avoided using the argument na.rm = TRUE, which tells to the function to remove any NAs before calculations. An example using the mean function is as follow: mean(my_data$Sepal.Length, na.rm = TRUE)
  • 13. Graphical display of distributions The R package ggpubr will be used to create graphs. Installation and loading ggpubr ● Install the latest version from GitHub as follow: # Install install.packages("ggpubr") ● Load ggpubr as follow: library(ggpubr) Box plots ggboxplot(my_data, y = "Sepal.Length", width = 0.5)
  • 15. Histograms show the number of observations that fall within specified divisions (i.e., bins). Histogram plot of Sepal.Length with mean line (dashed line). gghistogram(my_data, x = "Sepal.Length", bins = 9, add = "mean") Empirical cumulative distribution function (ECDF)
  • 16. ECDF is the fraction of data smaller than or equal to x. ggecdf(my_data, x = "Sepal.Length")
  • 17. Q-Q plots QQ plots are used to check whether the data is normally distributed. ggqqplot(my_data, x = "Sepal.Length")
  • 18. Descriptive statistics by groups To compute summary statistics by groups, the functions group_by() and summarise() [in dplyr package] can be used. ● We want to group the data by Species and then: ○ compute the number of element in each group. R function: n() ○ compute the mean. R function mean() ○ and the standard deviation. R function sd() The function %>% is used to chain operations. ● Install ddplyr as follow: install.packages("dplyr") ● Descriptive statistics by groups: library(dplyr) group_by(my_data, Species) %>% summarise( count = n(), mean = mean(Sepal.Length, na.rm = TRUE), sd = sd(Sepal.Length, na.rm = TRUE) ) Source: local data frame [3 x 4] Species count mean sd (fctr) (int) (dbl) (dbl) 1 setosa 50 5.006 0.3524897 2 versicolor 50 5.936 0.5161711 3 virginica 50 6.588 0.6358796 ● Graphics for grouped data: library("ggpubr") # Box plot colored by groups: Species
  • 19. ggboxplot(my_data, x = "Species", y = "Sepal.Length", color = "Species", palette = c("#00AFBB", "#E7B800", "#FC4E07"))
  • 20. # Stripchart colored by groups: Species ggstripchart(my_data, x = "Species", y = "Sepal.Length", color = "Species", palette = c("#00AFBB", "#E7B800", "#FC4E07"), add = "mean_sd") Note that, when the number of observations per groups is small, it’s recommended to use strip chart compared to box plots.