Introduction to Data Analysis With R and R Studio

Data Analysis with
R & R Studio
Introduction
drtamil@gmail.com

Download & Install
• You can download and install R for free from
https://r-project.org/
• After installing R, download and install the
free version of R Studio Desktop from
https://rstudio.com
• Instructions at
https://youtu.be/hXb47dmPCR8

Uniqueness of R & R Studio
• R is one of the programming languages that provide an
intensive environment for you to analyze, process,
transform and visualize information.
• It is the primary choice for many statisticians who want
to involve themselves in designing statistical models for
solving complex problems.
• Data are usually entered and manipulated using a
spreadsheet such as Microsoft Excel.
• Each analysis requires specific commands, so you
must know exactly which command the analysis
needs.

Choosing the appropriate
statistical tests
Use these tables to choose the
appropriate statistical tests.

Parametric Statistical Tests
Variable 1 | Variable 2 | Criteria | Type of Test
Qualitative | Qualitative | Sample size > 20 and no expected value < 5 | Chi-Square Test (X^2)
Qualitative, Dichotomous | Qualitative, Dichotomous | Sample size > 30 | Proportionate Test
Qualitative, Dichotomous | Qualitative, Dichotomous | Sample size > 40 but with at least one expected value < 5 | X^2 Test with Yates Correction
Qualitative, Dichotomous | Quantitative | Normally distributed data | Student's t Test
Qualitative, Polytomous | Quantitative | Normally distributed data | ANOVA
Quantitative | Quantitative | Repeated measurement of the same individual & item (e.g. Hb level before & after treatment). Normally distributed data | Paired t Test
Quantitative - continuous | Quantitative - continuous | Normally distributed data | Pearson Correlation & Linear Regression

Non-parametric Statistical Tests
Variable 1 | Variable 2 | Criteria | Type of Test
Qualitative, Dichotomous | Qualitative, Dichotomous | Sample size < 20 or (< 40 but with at least one expected value < 5) | Fisher Test
Qualitative, Dichotomous | Quantitative | Data not normally distributed | Wilcoxon Rank Sum Test or Mann-Whitney U Test
Qualitative, Polytomous | Quantitative | Data not normally distributed | Kruskal-Wallis One Way ANOVA Test
Quantitative | Quantitative | Repeated measurement of the same individual & item | Wilcoxon Signed Rank Test
Quantitative - continuous/ordinal | Quantitative - continuous | Data not normally distributed | Spearman/Kendall Rank Correlation

Statistical Tests for Qualitative Data
For two qualitative variables, the tables above give the Chi-Square Test (X^2), the X^2 Test with Yates Correction and the Fisher Test.

R Hands-on Exercise
Text in blue on the slides is a
command to be typed in
the Console window.

URL for data & submit answers
• Data -
https://drive.google.com/file/d/1PzcqCzm5t9KQkkXAtlvO56bZlMojM8-b/view?usp=sharing
• The analysis required - https://wp.me/p4mYLF-vA
• Submit answers at this link -
https://docs.google.com/forms/d/1o_L7ZjXF9Q1PON2zDs_VwkKsLCHT4v-8WruXhCiVq2Q/viewform

Data – Factors Related to SGA

A study was conducted to identify factors that can cause small for
gestational age (SGA) babies. Among the factors studied was the
mothers’ body mass index (BMI). It is believed that mothers with
lower BMI are at higher risk of having SGA babies.
• 1. Create a new variable mBMI (Mothers’ Body Mass Index) from the mothers’ HEIGHT (in metres) & WEIGHT (first trimester weight in kg). mBMI = weight in kg / (height in metres)^2. Calculate the following for mBMI:
– Mean
– Standard deviation
• 2. Create a new variable OBESCLAS (Classification of Obesity) from mBMI. Use the following cutoff points:
– <20 = Underweight
– 20 – 24.99 = Normal
– 25 or larger = Overweight
– Create a frequency table for OBESCLAS.
• 3. Conduct the appropriate statistical test to test whether there is any association between OBESCLAS (Underweight/Normal/Overweight) and OUTCOME.
• 4. Conduct the appropriate statistical test to test whether there is any association between BMI and OUTCOME.
• 5. Conduct the appropriate statistical test to find any association between OBESCLAS (Underweight/Normal/Overweight) and BIRTHWGT.
• 6. Assuming that both variables mBMI & BIRTHWGT are normally distributed, conduct an appropriate statistical test to test the association between the two variables.
– Demonstrate the association using the appropriate chart. Determine the coefficient of determination.
• 7. Conduct Simple Linear Regression using BIRTHWGT as the dependent variable. Derive a formula that predicts the baby’s birth weight from the mother’s BMI.
– y = a + bx

Online form for answers
https://docs.google.com/forms/d/1o_L7ZjXF9Q1PON2zDs_VwkKsLCHT4v-8WruXhCiVq2Q/viewform

New R project titled “Tutor”.

Import Excel into R Studio
• Select the Excel file you
downloaded earlier;
“SGA.xls”

Import Excel into R Studio
• Click “Import” and the following commands are executed:
– library(readxl)
– sga <- read_excel("C:/…./sga.xls")
– View(sga)

R Studio - compute
A study was conducted to identify factors that can cause small for
gestational age (SGA) babies. Among the factors studied was the
mothers’ body mass index (BMI). It is believed that mothers with
lower BMI are at higher risk of having SGA babies.
1. Create a new variable mBMI (Mothers’ Body Mass Index)
from the mothers’ HEIGHT (in metres) & WEIGHT (first
trimester weight in kg). mBMI = weight in kg / (height in
metres)^2. Calculate the following for mBMI:
– Mean
– Standard deviation
Copy and paste your answers into your Word file.

Compute mBMI = weight/height^2 (height in metres)
• sga$mBMI <- sga$weight/(sga$height)^2
• View(sga)
• mean(sga$mBMI)
– [1] 24.49576
• sd(sga$mBMI)
– [1] 4.767109
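
The BMI step can be tried on a small made-up data frame before running it on the real file. This is only a sketch: the data frame `toy` and its values are hypothetical, and it assumes, as the exercise states, that height is recorded in metres and weight in kg.

```r
# Hypothetical toy data; the real values come from SGA.xls
toy <- data.frame(
  weight = c(50, 62, 80),       # first trimester weight in kg
  height = c(1.55, 1.60, 1.70)  # height in metres
)

# Guard against heights accidentally entered in centimetres
stopifnot(all(toy$height < 3))

# BMI = weight in kg / (height in metres)^2
toy$mBMI <- toy$weight / toy$height^2

mean(toy$mBMI)  # mean BMI of the toy sample
sd(toy$mBMI)    # standard deviation of the toy sample
```

If the heights were in centimetres, the formula would need `(height / 100)^2` instead.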


Recode
• 2. Create a new variable OBESCLAS
(Classification of Obesity) from mBMI. Use the
following cutoff point;
– <20 = Underweight
– 20 – 24.99 = Normal
– 25 or larger = Overweight
– Create a frequency table for OBESCLAS.

Recode mBMI into OBESCLAS
• sga$obesclas <- ""
• sga$obesclas[sga$mBMI < 20] <- 1
• sga$obesclas[sga$mBMI >= 20 & sga$mBMI < 25] <- 2
• sga$obesclas[sga$mBMI >= 25] <- 3
• table(sga$obesclas)
• sga$obesclas <- factor(sga$obesclas, levels = c(1,2,3), labels = c('Under', 'Normal', 'Over'))
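
The same recode can also be written in one step with base R's `cut()`. The sketch below is an alternative, not the method used on the slides; the vector `bmi` holds hypothetical values chosen to sit on either side of the cutoffs.

```r
# Hypothetical BMI values around the cutoff points
bmi <- c(18.2, 19.99, 20, 24.99, 25, 31.4)

# right = FALSE makes the intervals [-Inf, 20), [20, 25), [25, Inf),
# which matches "<20", "20 - 24.99" and "25 or larger"
obesclas <- cut(bmi,
                breaks = c(-Inf, 20, 25, Inf),
                right  = FALSE,
                labels = c("Under", "Normal", "Over"))

table(obesclas)              # counts per class
prop.table(table(obesclas))  # proportions per class
```

Because `cut()` returns a factor directly, no separate `factor()` call is needed.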

Frequency table for OBESCLAS
• table(sga$obesclas)
– Under Normal Over
– 17 40 43
• prop.table(table(sga$obesclas))
– Under Normal Over
– 0.17 0.40 0.43
– i.e. 17%, 40% and 43%


Exercise 3
• 3. Conduct the appropriate statistical test to test whether
there is any association between OBESCLAS
(Underweight/Normal/Overweight) and OUTCOME.
• Both variables are qualitative (a 3 x 2 table), so the most
suitable analysis is the Pearson Chi-square test.
SGA Normal TOTAL
UnderW
Normal
OverW
TOTAL 50 50 100

Chi-Square Analysis
• library(gmodels)
• CrossTable(sga$obesclas, sga$outcome, digits=2, max.width = 5, expected=TRUE, prop.r=TRUE, prop.c=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE, format="SPSS")
– Pearson's Chi-squared test
– Chi^2 = 24.39111 d.f. = 2 p = 5.052871e-06
– Minimum expected frequency: 8.5
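
For readers without the gmodels package, base R's `chisq.test()` gives the same test. In the sketch below the cell counts are not taken from the data file: they are back-calculated from the group sizes (17, 40, 43), the column totals (50, 50) and the SGA rates reported on the slides, so treat them as an assumption.

```r
# Assumed cell counts, back-calculated from the reported totals and rates
counts <- matrix(c(16,  1,
                   23, 17,
                   11, 32),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(obesclas = c("Under", "Normal", "Over"),
                                 outcome  = c("SGA", "Normal")))

res <- chisq.test(counts)  # Pearson chi-square; no Yates correction for a 3 x 2 table

res$statistic      # chi-square statistic
res$parameter      # degrees of freedom
res$p.value        # p-value
min(res$expected)  # smallest expected frequency, to check the "< 5" criterion

# Row percentages, i.e. the SGA rate within each weight class
round(100 * prop.table(counts, margin = 1), 1)
```

With these counts the statistic and the minimum expected frequency agree with the CrossTable output above, which suggests the back-calculation is right.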

Chi-Square Results from R Studio
• R not only reports a significant association
(p = 5 x 10^-6) between the mother’s weight classification and
small for gestational age, but also shows which group has the
higher rate of SGA.

Results From R Studio
• Underweight mothers
have a higher rate (94%)
of SGA than
normal mothers (58%)
and overweight
mothers (26%).

Exercise 4
• 4. Conduct the
appropriate statistical test
to test whether there is
any association between
BMI and OUTCOME.
• Basically we are
comparing the mean BMI
of SGA babies’ mothers
against mean BMI of
Normal babies’ mothers.
• Therefore the appropriate
test is Student’s t-test.

Student’s T-Test
• library("car")
• leveneTest(sga$mBMI, sga$outcome)
– Levene's Test for Homogeneity of Variance (center = median)
–        Df F value Pr(>F)
– group   1  0.0827 0.7743
–        98
• Levene's test shows that the variances are not
significantly different (P = 0.7743).
• Therefore the t-test is run assuming equal
variances.

T-Test Results from R Studio
• t.test(sga$mBMI ~ sga$outcome, var.equal=TRUE)
– Two Sample t-test
– data: sga$mBMI by sga$outcome
– t = 4.5164, df = 98, p-value = 1.756e-05
– alternative hypothesis: true difference in means is not equal to 0
– 95 percent confidence interval: 2.207433 5.667658
– sample estimates:
– mean in group Normal 26.46453
– mean in group SGA 22.52699
• R Studio reports a significant difference in mean BMI
(p = 1.756 x 10^-5) between SGA babies’ mothers (22.53) and
normal babies’ mothers (26.46).
• Therefore the mean BMI of SGA babies’ mothers is
significantly lower than the mean BMI of normal babies’
mothers.
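
As a sanity check, the confidence interval and p-value in this output can be rebuilt from the reported group means, t statistic and degrees of freedom. The sketch below uses only those published numbers; the variable names are illustrative.

```r
# Values copied from the t-test output above
mean_normal <- 26.46453
mean_sga    <- 22.52699
t_stat      <- 4.5164
df          <- 98

diff <- mean_normal - mean_sga  # difference in means
se   <- diff / t_stat           # standard error implied by t = diff / se

# 95% confidence interval: diff +/- t critical value * se
ci <- diff + c(-1, 1) * qt(0.975, df) * se

# Two-sided p-value from the t distribution
p <- 2 * pt(-abs(t_stat), df)

ci
p
```

The interval does not include 0, which is another way of seeing that the difference is significant at the 5% level.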

Exercise 5
• 5. Conduct the appropriate statistical test to find
any association between OBESCLAS
(Underweight/Normal/Overweight) and
BIRTHWGT.
• Basically we are comparing the mean birth weight
of babies born to underweight, normal
weight and overweight mothers.
• Therefore the appropriate test is Analysis of
Variance (ANOVA).

ANOVA
• library("car")
• leveneTest(sga$birthwgt, sga$obesclas)
– Levene's Test for Homogeneity of Variance (center = median)
–        Df F value  Pr(>F)
– group   2  3.1702 0.04638 *
–        97
– ---
– Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
• The variances of birthwgt are
significantly different
between the obesclas
groups (P = 0.046).
• Therefore the equal-variances
assumption of ANOVA is
not met.

ANOVA – command
• tapply(sga$birthwgt, sga$obesclas, mean)
– 2.187059 2.768250 3.245116
• tapply(sga$birthwgt, sga$obesclas, sd)
– 0.3403999 0.6712861 0.6606179
• levels(sga$obesclas)
• summary(aov(sga$birthwgt ~ sga$obesclas))

ANOVA – Results
• > levels(sga$obesclas)
• [1] "Under" "Normal" "Over"
• > summary(aov(sga$birthwgt ~ sga$obesclas))
• Df Sum Sq Mean Sq F value Pr(>F)
• sga$obesclas 2 14.39 7.196 18.49 1.58e-07 ***
• Residuals 97 37.76 0.389
• ---
• Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

ANOVA Results from R Studio
• R Studio reports a significant difference in mean birth weight
(p < 0.0001) between babies of underweight mothers (2.187),
normal mothers (2.768) & overweight mothers (3.245).
• Unfortunately Levene's test also shows unequal variances across the
three groups, so the homogeneity of variances assumption is not met.
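
When the variances differ, one common option is Welch's one-way test, available in base R as `oneway.test()` with `var.equal = FALSE`. The slides do not use it, so the sketch below is a suggested alternative. The data are simulated: only the group sizes, means and standard deviations are borrowed from the slides, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed so the simulated example is repeatable

# Simulated stand-in for sga: group sizes, means and SDs loosely follow the slides
toy <- data.frame(
  obesclas = factor(rep(c("Under", "Normal", "Over"), times = c(17, 40, 43)),
                    levels = c("Under", "Normal", "Over")),
  birthwgt = c(rnorm(17, mean = 2.19, sd = 0.34),
               rnorm(40, mean = 2.77, sd = 0.67),
               rnorm(43, mean = 3.25, sd = 0.66))
)

# Welch's ANOVA: does not assume equal variances across groups
welch <- oneway.test(birthwgt ~ obesclas, data = toy, var.equal = FALSE)

welch$method     # states that equal variances are not assumed
welch$statistic  # F statistic
welch$parameter  # numerator and (fractional) denominator degrees of freedom
welch$p.value
```

On the real data the call would be `oneway.test(birthwgt ~ obesclas, data = sga, var.equal = FALSE)`.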

ANOVA Results – post hoc
• pairwise.t.test(sga$birthwgt, sga$obesclas, p.adjust.method = "bonferroni")
• Post-hoc tests indicate a significant
difference in birth weight between all
three groups.

Question 5
tapply(sga$birthwgt, sga$obesclas, mean)
Under Normal Over
2.187059 2.768250 3.245116
tapply(sga$birthwgt, sga$obesclas, sd)
Under Normal Over
0.3403999 0.6712861 0.6606179

Exercise 6
• 6. Assuming that both variables
mBMI & BIRTHWGT are normally
distributed, conduct an appropriate
statistical test to test the
association between the two
variables.
– Demonstrate the association using the
appropriate chart. Determine the
coefficient of determination.

Pearson Correlation
• mBMI and birth weight are both normally distributed
continuous data. Since the aim is to measure the
strength and direction of the association between
these two continuous variables, Pearson
Correlation is the most appropriate test.

Pearson’s Correlation Command
• cor.test(sga$mBMI, sga$birthwgt, method="pearson")
– Pearson's product-moment correlation
– data: sga$mBMI and sga$birthwgt
– t = 5.4379, df = 98, p-value = 3.959e-07
– alternative hypothesis: true correlation is not equal to 0
– 95 percent confidence interval: 0.3148037 0.6193051
– sample estimates:
– cor 0.4814521
Discussion
• r = 0.4814521
• p-value = 3.959 x 10^-7
• Fair & positive correlation between mBMI and birth weight.
• Therefore as mothers’ BMI increases, the birth weight also
increases.
• r^2 = 0.4814521^2 = 0.2318
• 23.18% (r^2 = 0.2318) of the variability in birth weight is
explained by the variability in the mothers’ BMI.
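
The coefficient of determination and the test statistic follow directly from r and the sample size. The sketch below redoes that arithmetic from the reported values; n = 100 is inferred from df = 98 (df = n - 2), and the variable names are illustrative.

```r
# Values copied from the cor.test output above
r <- 0.4814521
n <- 100  # inferred from df = n - 2 = 98

r2 <- r^2  # coefficient of determination

# t statistic for testing correlation = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value
p <- 2 * pt(-abs(t_stat), df = n - 2)

round(r2, 4)
round(t_stat, 4)
p
```

On the real data, `cor(sga$mBMI, sga$birthwgt)^2` gives r^2 in a single step.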

plot(x = sga$mBMI, y = sga$birthwgt, type = 'p')


Exercise 7
• 7. Conduct Simple Linear Regression
using BIRTHWGT as the dependent
variable. Derive a
formula that predicts the baby’s
birth weight from the mother’s
BMI.
– y = a + bx

Simple Linear Regression
• mBMI and birth weight are both normally distributed
continuous data. Since the aim is to derive a
regression formula relating these two continuous
variables, Simple Linear Regression is the
most appropriate method.

plot(x = sga$mBMI, y = sga$birthwgt, type = 'p')
abline(lm(sga$birthwgt ~ sga$mBMI), col = 'red', lty = 2)

Simple Linear Regression
• summary(lm(sga$birthwgt ~ sga$mBMI))

SLR Results from R Studio
• R Studio reports a significant regression
coefficient (b = 0.07330).
• The constant (a) is 1.07895.
• 23.18% (r^2 = 0.2318) of the variability in birth weight is
explained by the variability in the mothers’ BMI.
• BW = 1.079 + 0.073 x BMI
• For every 1-unit increase in BMI, BW increases by 0.07 kg.
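
The fitted equation can be turned into a small prediction helper. In the sketch below, `predict_bw` is a hypothetical function built from the reported coefficients. The toy data frame is constructed to lie exactly on that line, purely to show how `lm()` with a `data` argument and `predict()` fit together.

```r
# Coefficients as reported above: BW = a + b * BMI
a <- 1.07895
b <- 0.07330

# Hypothetical helper: predicted birth weight (kg) for a given mother's BMI
predict_bw <- function(bmi) a + b * bmi
predict_bw(c(20, 25, 30))

# Toy data lying exactly on the reported line (not real observations)
toy <- data.frame(mBMI = c(18, 22, 26, 30))
toy$birthwgt <- a + b * toy$mBMI

# Fitting with a data argument keeps the model usable with predict()
fit <- lm(birthwgt ~ mBMI, data = toy)
coef(fit)                                      # recovers a and b
predict(fit, newdata = data.frame(mBMI = 25))  # same as predict_bw(25)
```

Writing the model as `lm(birthwgt ~ mBMI, data = sga)` rather than with `sga$` inside the formula is what allows `predict()` to accept new BMI values.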

The End
Credits to Prof Lin Naing @ Ayub
for the original notes.
