Data Analysis with
R & R Studio
Introduction
drtamil@gmail.com
Download & Install
• You can download and install R for free from https://r-project.org/
• After installing R, download and install the free version of R Studio Desktop from https://rstudio.com
• Instructions at https://youtu.be/hXb47dmPCR8
Uniqueness of R & R Studio
• R is a programming language that provides an extensive environment for analysing, processing, transforming and visualising data.
• It is the primary choice of many statisticians who design statistical models to solve complex problems.
• Data are usually entered and manipulated in a spreadsheet such as Microsoft Excel.
• Each analysis requires its own command, so you must know exactly which command the analysis needs.
Choosing the appropriate
statistical tests
Use these tables to choose the
appropriate statistical tests.
Parametric Statistical Tests
Variable 1 | Variable 2 | Criteria | Type of Test
Qualitative | Qualitative | Sample size > 20 and no expected value < 5 | Chi-Square Test (χ²)
Qualitative, dichotomous | Qualitative, dichotomous | Sample size > 30 | Proportionate Test
Qualitative, dichotomous | Qualitative, dichotomous | Sample size > 40 but with at least one expected value < 5 | χ² Test with Yates' Correction
Qualitative, dichotomous | Quantitative | Normally distributed data | Student's t Test
Qualitative, polychotomous | Quantitative | Normally distributed data | ANOVA
Quantitative | Quantitative | Repeated measurement of the same individual & item (e.g. Hb level before & after treatment); normally distributed data | Paired t Test
Quantitative, continuous | Quantitative, continuous | Normally distributed data | Pearson Correlation & Linear Regression
Non-parametric Statistical Tests
Variable 1 | Variable 2 | Criteria | Type of Test
Qualitative, dichotomous | Qualitative, dichotomous | Sample size < 20 or (< 40 but with at least one expected value < 5) | Fisher's Exact Test
Qualitative, dichotomous | Quantitative | Data not normally distributed | Wilcoxon Rank-Sum Test or Mann-Whitney U Test
Qualitative, polychotomous | Quantitative | Data not normally distributed | Kruskal-Wallis One-Way ANOVA Test
Quantitative | Quantitative | Repeated measurement of the same individual & item | Wilcoxon Signed-Rank Test
Quantitative, continuous/ordinal | Quantitative, continuous | Data not normally distributed | Spearman/Kendall Rank Correlation
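The rules for two dichotomous variables in the tables above can be sketched as a small R helper. This is only an illustration: the function name choose_2x2_test is hypothetical, and the treatment of the boundary cases (a sample size of exactly 20 or 40), which the tables leave open, is an assumption.

```r
# Hypothetical helper that applies the table rules to a 2 x 2 table of counts.
# Boundary cases (n exactly 20 or 40) are an assumption, not stated in the tables.
choose_2x2_test <- function(tab) {
  n <- sum(tab)
  # Expected counts under independence: row total * column total / n
  expected <- outer(rowSums(tab), colSums(tab)) / n
  small_cell <- any(expected < 5)
  if (n < 20 || (n < 40 && small_cell)) {
    "Fisher's exact test"
  } else if (small_cell) {
    "Chi-square test with Yates' correction"
  } else {
    "Chi-square test"
  }
}

small  <- choose_2x2_test(matrix(c(3, 1, 2, 6), nrow = 2))      # n = 12
sparse <- choose_2x2_test(matrix(c(2, 3, 20, 25), nrow = 2))    # n = 50, one expected count < 5
large  <- choose_2x2_test(matrix(c(30, 20, 25, 25), nrow = 2))  # n = 100
c(small, sparse, large)
```

The helper only names a test; running it is still done with the commands shown later in these notes.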
R Hands-on Exercise
Lines shown as commands (blue in the original slides) are to be typed in the Console window.
URL for data & submit answers
• Data - https://drive.google.com/file/d/1PzcqCzm5t9KQkkXAtlvO56bZlMojM8-b/view?usp=sharing
• The analysis required - https://wp.me/p4mYLF-vA
• Submit answers at this link - https://docs.google.com/forms/d/1o_L7ZjXF9Q1PON2zDs_VwkKsLCHT4v-8WruXhCiVq2Q/viewform
Data – Factors Related to SGA
A study was conducted to identify factors that can cause small for gestational age (SGA). Among the factors studied was the mothers’ body mass index (BMI). It is believed that mothers with a lower BMI are at higher risk of having SGA babies.
• 1. Create a new variable mBMI (Mothers’ Body Mass Index) from the mothers’ HEIGHT (in metres) & WEIGHT (first trimester weight in kg). mBMI = weight in kg/(height in metres)^2. Calculate the following for mBMI;
– Mean
– Standard deviation
• 2. Create a new variable OBESCLAS
(Classification of Obesity) from mBMI. Use
the following cutoff point;
– <20 = Underweight
– 20 – 24.99 = Normal
– 25 or larger = Overweight
– Create a frequency table for OBESCLAS.
• 3. Conduct the appropriate statistical test
to test whether there is any association
between OBESCLAS (Underweight/
Normal/Overweight) and OUTCOME.
• 4. Conduct the appropriate statistical test
to test whether there is any association
between BMI and OUTCOME.
• 5. Conduct the appropriate statistical test
to find any association between OBESCLAS
(Underweight/Normal/Overweight) and
BIRTHWGT.
• 6. Assuming that both variables mBMI &
BIRTHWGT are normally distributed,
conduct an appropriate statistical test to
prove the association between the two
variables.
– Demonstrate the association using the
appropriate chart. Determine the
coefficient of determination.
• 7. Conduct Simple Linear Regression using BIRTHWGT as the dependent variable. Try to come up with a formula that will predict the baby’s birth weight based on the mother’s BMI.
– y = a + bx
Online form for answers
https://docs.google.com/forms/d/1o_L7ZjXF9Q1PON2zDs_VwkKsLCHT4v-8WruXhCiVq2Q/viewform
Create a new R project titled “Tutor”.
Import Excel into R Studio
• Select the Excel file you
downloaded earlier;
“SGA.xls”
Import Excel into R Studio
• Click “Import” and the following commands are executed;
– library(readxl)
– sga <- read_excel("C:/…./sga.xls")
– View(sga)
R Studio - compute
A study was conducted to identify factors that can cause small for gestational age (SGA). Among the factors studied was the mothers’ body mass index (BMI). It is believed that mothers with a lower BMI are at higher risk of having SGA babies.
1. Create a new variable mBMI (Mothers’ Body Mass Index) from the mothers’ HEIGHT (in metres) & WEIGHT (first trimester weight in kg). mBMI = weight in kg/(height in metres)^2. Calculate the following for mBMI;
– Mean
– Standard deviation
Copy and paste your answers into your Word file.
Compute mBMI = weight/height^2 (height is already in metres)
• sga$mBMI <- sga$weight/(sga$height)^2
• View(sga)
• mean(sga$mBMI)
– [1] 24.49576
• sd(sga$mBMI)
– [1] 4.767109
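The computation above can be tried on a small stand-in data frame before running it on the real data. The sketch below assumes the same column names (weight, height) as the sga data; the three rows are made up for illustration and are not taken from SGA.xls.

```r
# Hypothetical stand-in for a few rows of the sga data frame.
toy <- data.frame(weight = c(50, 62, 80),        # kg
                  height = c(1.55, 1.60, 1.70))  # metres

# BMI = weight in kg / (height in metres)^2
toy$mBMI <- toy$weight / toy$height^2

# If height had been recorded in centimetres, it would need converting first:
# toy$mBMI <- toy$weight / (toy$height / 100)^2

bmi_mean <- mean(toy$mBMI, na.rm = TRUE)  # na.rm = TRUE skips missing values
bmi_sd   <- sd(toy$mBMI, na.rm = TRUE)
round(c(mean = bmi_mean, sd = bmi_sd), 2)
```

A mean BMI far outside the usual 15 to 40 range is a quick sign that height is in the wrong unit.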
Recode
• 2. Create a new variable OBESCLAS
(Classification of Obesity) from mBMI. Use the
following cutoff point;
– <20 = Underweight
– 20 – 24.99 = Normal
– 25 or larger = Overweight
– Create a frequency table for OBESCLAS.
Recode mBMI into OBESCLAS
• sga$obesclas <- NA
• sga$obesclas[sga$mBMI < 20] <- 1
• sga$obesclas[sga$mBMI >= 20 & sga$mBMI < 25] <- 2
• sga$obesclas[sga$mBMI >= 25] <- 3
• table(sga$obesclas)
• sga$obesclas <- factor(sga$obesclas, levels = c(1, 2, 3), labels = c('Under', 'Normal', 'Over'))
• table(sga$obesclas)
Frequency table for OBESCLAS
• table(sga$obesclas)
– Under Normal Over
– 17 40 43
• prop.table(table(sga$obesclas))
– Under Normal Over
– 0.17 0.40 0.43
– 17% 40% 43%
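As an alternative to recoding step by step, base R's cut() can assign the three classes in one call. The sketch below uses made-up BMI values in place of sga$mBMI; it is a swapped-in technique, not the method used on the slides.

```r
# Hypothetical BMI values standing in for sga$mBMI, chosen to sit on the cutoffs.
bmi <- c(17.5, 19.99, 20, 24.99, 25, 31.2)

# right = FALSE makes each interval closed on the left:
# [0, 20) = Under, [20, 25) = Normal, [25, Inf) = Over
obesclas <- cut(bmi,
                breaks = c(0, 20, 25, Inf),
                labels = c("Under", "Normal", "Over"),
                right  = FALSE)

counts <- table(obesclas)
counts
round(100 * prop.table(counts), 1)  # percentages instead of proportions
```

With right = FALSE a BMI of exactly 20 falls in Normal and exactly 25 in Overweight, matching the cutoff points given in the exercise.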
Exercise 3
• 3. Conduct the
appropriate statistical
test to test whether
there is any association
between OBESCLAS
(Underweight/Normal/
Overweight) and
OUTCOME.
• Both variables are qualitative, so the most suitable analysis is the Pearson chi-square test.
OBESCLAS | SGA | Normal | TOTAL
UnderW | | |
Normal | | |
OverW | | |
TOTAL | 50 | 50 | 100
Chi-Square Analysis
• library(gmodels)
• CrossTable(sga$obesclas, sga$outcome, digits=2, max.width = 5, expected=TRUE, prop.r=TRUE, prop.c=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE, format="SPSS")
– Pearson's Chi-squared test
– Chi^2 = 24.39111 d.f. = 2 p = 5.052871e-06
– Minimum expected frequency: 8.5
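The same test can be run in base R without the gmodels package. The sketch below is an illustration only: the cell counts are inferred from the group sizes (17, 40, 43) and the SGA rates reported in these notes, not read from SGA.xls, so treat them as an assumption.

```r
# Cell counts inferred from the reported group sizes and SGA rates
# (an assumption for illustration, not read from the data file).
tab <- matrix(c(16,  1,
                23, 17,
                11, 32),
              nrow = 3, byrow = TRUE,
              dimnames = list(obesclas = c("Under", "Normal", "Over"),
                              outcome  = c("SGA", "Normal")))

res <- chisq.test(tab)  # Pearson chi-squared test; Yates' correction applies only to 2 x 2 tables
res$expected            # check that no expected count is below 5
res$statistic
res$p.value

row_pct <- round(100 * prop.table(tab, margin = 1), 1)  # SGA rate within each BMI class
row_pct
```

With the real data, chisq.test(table(sga$obesclas, sga$outcome)) should give the same kind of output.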
Chi-Square Results from R Studio
• R reports a significant association (p ≈ 5 × 10^-6) between the mother’s weight classification and small for gestational age.
• The row percentages also show which group has the higher rate of SGA.
Results From R Studio
• Underweight mothers have a higher rate of SGA (94%) than normal-weight mothers (58%) and overweight mothers (26%).
Exercise 4
• 4. Conduct the
appropriate statistical test
to test whether there is
any association between
BMI and OUTCOME.
• Basically we are
comparing the mean BMI
of SGA babies’ mothers
against mean BMI of
Normal babies’ mothers.
• Therefore the appropriate
test is Student’s t-test.
Student’s T-Test
• library("car")
• leveneTest(sga$mBMI, sga$outcome)
– Levene's Test for Homogeneity of Variance (center = median)
– Df F value Pr(>F)
– group 1 0.0827 0.7743
– 98
• Levene's test shows that the variances are not significantly different (p = 0.7743).
• Therefore we run the t-test assuming equal variances.
T-Test Results from R Studio
• t.test(sga$mBMI ~ sga$outcome, var.equal=TRUE)
– Two Sample t-test
– data: sga$mBMI by sga$outcome
– t = 4.5164, df = 98, p-value = 1.756e-05
– alternative hypothesis: true difference in means is not equal to 0
– 95 percent confidence interval: 2.207433 5.667658
– sample estimates:
– mean in group Normal 26.46453
– mean in group SGA 22.52699
• R Studio reports a significant difference in mean BMI (p = 1.756 × 10^-5) between the mothers of SGA babies (22.53) and the mothers of normal babies (26.46).
• Therefore the mean BMI of SGA babies’ mothers is significantly lower than the mean BMI of normal babies’ mothers.
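The var.equal argument is what links the Levene result to the t-test. The sketch below runs both versions on simulated data; the data frame toy and its values are made up for illustration and are not the SGA data.

```r
set.seed(1)  # simulated data, for illustration only
toy <- data.frame(
  outcome = factor(rep(c("Normal", "SGA"), each = 30)),
  mBMI    = c(rnorm(30, mean = 26, sd = 4), rnorm(30, mean = 22, sd = 4))
)

# Student's t-test: pooled variance, used when Levene's test is not significant
pooled <- t.test(mBMI ~ outcome, data = toy, var.equal = TRUE)

# Welch's t-test: no equal-variance assumption, used when Levene's test is significant
welch <- t.test(mBMI ~ outcome, data = toy, var.equal = FALSE)

pooled$parameter  # degrees of freedom = n1 + n2 - 2
welch$parameter   # Welch degrees of freedom, never larger than n1 + n2 - 2
pooled$estimate   # group means, listed in factor-level order
```

Note that t.test() uses var.equal = FALSE by default, so the pooled test has to be requested explicitly, as the command in these notes does.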
Exercise 5
• 5. Conduct the appropriate statistical test to find
any association between OBESCLAS
(Underweight/Normal/Overweight) and
BIRTHWGT.
• We are comparing the mean birth weight of babies born to underweight, normal-weight and overweight mothers.
• Therefore the appropriate test is Analysis of Variance (ANOVA).
ANOVA
• library("car")
• leveneTest(sga$birthwgt, sga$obesclas)
– Levene's Test for Homogeneity of Variance (center = median)
– Df F value Pr(>F)
– group 2 3.1702 0.04638 *
– 97
– ---
– Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
• The variances of birthwgt differ significantly between the obesclas groups (p = 0.04638).
• Therefore the ANOVA should allow for unequal variances.
ANOVA – command
• tapply(sga$birthwgt, sga$obesclas, mean)
– Under Normal Over
– 2.187059 2.768250 3.245116
• tapply(sga$birthwgt, sga$obesclas, sd)
– Under Normal Over
– 0.3403999 0.6712861 0.6606179
• levels(sga$obesclas)
• summary(aov(sga$birthwgt ~ sga$obesclas))
ANOVA – Results
• > levels(sga$obesclas)
• [1] "Under" "Normal" "Over"
• > summary(aov(sga$birthwgt ~ sga$obesclas))
• Df Sum Sq Mean Sq F value Pr(>F)
• sga$obesclas 2 14.39 7.196 18.49 1.58e-07 ***
• Residuals 97 37.76 0.389
• ---
• Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ANOVA Results from R Studio
• R Studio reports a significant difference in mean birth weight (p < 0.0001) between the babies of underweight mothers (2.187 kg), normal-weight mothers (2.768 kg) and overweight mothers (3.245 kg).
• However, Levene's test showed that the variances of the three groups are unequal, so the homogeneity of variances assumption is not met.
ANOVA Results – post hoc
• pairwise.t.test(sga$birthwgt, sga$obesclas, p.adjust.method = "bonferroni")
• Post-hoc tests indicate a significant difference in birth weight between all three groups.
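Because Levene's test was significant here, one option is Welch's one-way ANOVA, which does not assume equal variances. The sketch below is a swapped-in alternative to the aov() call used in these notes, run on simulated data; the group sizes mirror the notes, but the values are made up.

```r
set.seed(2)  # simulated birth weights in kg, for illustration only
toy <- data.frame(
  obesclas = factor(rep(c("Under", "Normal", "Over"), times = c(17, 40, 43)),
                    levels = c("Under", "Normal", "Over")),
  birthwgt = c(rnorm(17, 2.2, 0.34), rnorm(40, 2.8, 0.67), rnorm(43, 3.2, 0.66))
)

# Welch's one-way ANOVA: does not assume equal variances across groups
welch <- oneway.test(birthwgt ~ obesclas, data = toy, var.equal = FALSE)
welch

# Pairwise t-tests without a pooled standard deviation, Bonferroni-adjusted
posthoc <- pairwise.t.test(toy$birthwgt, toy$obesclas,
                           p.adjust.method = "bonferroni", pool.sd = FALSE)
posthoc
```

On the real data the equivalent call would be oneway.test(birthwgt ~ obesclas, data = sga, var.equal = FALSE).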
Exercise 6
• 6. Assuming that both variables
mBMI & BIRTHWGT are normally
distributed, conduct an appropriate
statistical test to prove the
association between the two
variables.
–Demonstrate the association using the
appropriate chart. Determine the
coefficient of determination.
Pearson Correlation
• mBMI and birth weight are both normally distributed continuous data. Since the aim is to measure the strength and direction of the association between these two continuous variables, Pearson Correlation is the most appropriate test.
Pearson’s Correlation Command
• cor.test(sga$mBMI, sga$birthwgt, method="pearson")
– Pearson's product-moment correlation
– data: sga$mBMI and sga$birthwgt
– t = 5.4379, df = 98, p-value = 3.959e-07
– alternative hypothesis: true correlation is not equal to 0
– 95 percent confidence interval:
– 0.3148037 0.6193051
– sample estimates:
– cor
– 0.4814521
Discussion
• r = 0.4814521
• p-value = 3.959 × 10^-7
• Fair, positive correlation between mBMI and birth weight.
• Therefore, as the mothers’ BMI increases, the birth weight also increases.
• r^2 = 0.4814521^2 = 0.2318
• 23.18% (r^2 = 0.2318) of the variability in birth weight is explained by the variability in the mothers’ BMI.
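The coefficient of determination can be taken straight from the cor.test() result rather than squared by hand. The sketch below uses simulated values for x and y (stand-ins for mBMI and birth weight) and checks that r^2 matches the R-squared of the corresponding simple linear regression.

```r
set.seed(3)  # simulated data, for illustration only
x <- rnorm(50, mean = 24, sd = 4)          # stand-in for mBMI
y <- 1 + 0.07 * x + rnorm(50, sd = 0.6)    # stand-in for birth weight

ct <- cor.test(x, y, method = "pearson")
r  <- unname(ct$estimate)  # Pearson correlation coefficient
r2 <- r^2                  # coefficient of determination

# For one predictor, r^2 equals the R-squared of the simple linear regression
r2_lm <- summary(lm(y ~ x))$r.squared
all.equal(r2, r2_lm)

# The arithmetic reported in these notes
round(0.4814521^2, 4)
```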
plot(x = sga$mBMI, y = sga$birthwgt, type = 'p')
Exercise 7
• 7. Conduct Simple Linear Regression using BIRTHWGT as the dependent variable. Try to come up with a formula that will predict the baby’s birth weight based on the mother’s BMI.
– y = a + bx
Simple Linear Regression
• mBMI and birth weight are both normally distributed continuous data. Since the aim is to derive a regression formula relating these two continuous variables, Simple Linear Regression is the most appropriate test.
plot(x = sga$mBMI, y = sga$birthwgt, type = 'p')
abline(lm(sga$birthwgt ~ sga$mBMI), col='red', lty=2)
Simple Linear Regression
• summary(lm(sga$birthwgt ~ sga$mBMI))
SLR Results from R Studio
• R Studio reports a significant regression coefficient (b = 0.07330).
• The constant (a) is 1.07895.
• 23.18% (r^2 = 0.2318) of the variability in birth weight is explained by the variability in the mothers’ BMI.
• BW = 1.079 + 0.073 × BMI
• For every 1-unit increase in BMI, BW increases by 0.073 kg.
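The fitted formula can be used to predict birth weight for a given BMI. The sketch below first applies the coefficients reported in these notes, then shows the same idea with predict() on simulated data. The function name predict_bw and the data frame toy are hypothetical.

```r
# Coefficients as reported in these notes
a <- 1.07895
b <- 0.07330
predict_bw <- function(bmi) a + b * bmi  # hypothetical helper: BW = a + b * BMI
by_hand <- predict_bw(c(20, 25, 30))
by_hand

# The same idea with predict(), on simulated data (for illustration only)
set.seed(4)
toy <- data.frame(mBMI = rnorm(100, mean = 24.5, sd = 4.8))
toy$birthwgt <- 1.08 + 0.073 * toy$mBMI + rnorm(100, sd = 0.6)

fit <- lm(birthwgt ~ mBMI, data = toy)  # data = keeps plain column names in the model
coef(fit)
new_mothers <- data.frame(mBMI = c(20, 25, 30))
fitted_bw <- predict(fit, newdata = new_mothers)
fitted_bw
```

Writing the model as lm(birthwgt ~ mBMI, data = sga) rather than lm(sga$birthwgt ~ sga$mBMI) is what lets predict() accept new BMI values through newdata.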
The End
Credits to Prof Lin Naing @ Ayub
for the original notes.
drtamil@gmail.com

Introduction to Data Analysis With R and R Studio

  • 1.
    Data Analysis with R& R Studio Introduction drtamil@gmail.com
  • 2.
    Download & Install •You can download and install R for free from https://r-project.org/ • Upon installation, download and install the free version of R Studio Desktop from https://rstudio.com • Instructions at https://youtu.be/hXb47dmPCR8 drtamil@gmail.com
  • 3.
    Uniqueness of R& R Studio • R is one of the programming languages that provide an intensive environment for you to analyze, process, transform and visualize information. • It is the primary choice for many statisticians who want to involve themselves in designing statistical models for solving complex problems. • Data are usually entered and manipulated using spreadsheet such as Microsoft Excel. • Specific analysis requires specific commands. So you must know exactly what command is required for the analysis. drtamil@gmail.com
  • 4.
    Choosing the appropriate statisticaltests Use these tables to choose the appropriate statistical tests. drtamil@gmail.com
  • 5.
    Parametric Statistical Tests Variable1 Variable 2 Criteria Type of Test Qualitative Qualitative Sample size > 20 dan no expected value < 5 Chi Square Test (X2 ) Qualitative Dichotomus Qualitative Dichotomus Sample size > 30 Proportionate Test Qualitative Dichotomus Qualitative Dichotomus Sample size > 40 but with at least one expected value < 5 X2 Test with Yates Correction Qualitative Dichotomus Quantitative Normally distributed data Student's t Test Qualitative Polinomial Quantitative Normally distributed data ANOVA Quantitative Quantitative Repeated measurement of the same individual & item (e.g. Hb level before & after treatment). Normally distributed data Paired t Test Quantitative - continous Quantitative - continous Normally distributed data Pearson Correlation & Linear Regresssion drtamil@gmail.com
  • 6.
    Non-parametric Statistical Tests Variable1 Variable 2 Criteria Type of Test Qualitative Dichotomus Qualitative Dichotomus Sample size < 20 or (< 40 but with at least one expected value < 5) Fisher Test Qualitative Dichotomus Quantitative Data not normally distributed Wilcoxon Rank Sum Test or U Mann- Whitney Test Qualitative Polinomial Quantitative Data not normally distributed Kruskal-Wallis One Way ANOVA Test Quantitative Quantitative Repeated measurement of the same individual & item Wilcoxon Rank Sign Test Quantitative - continous/ordina l Quantitative - continous Data not normally distributed Spearman/Kendall Rank Correlation drtamil@gmail.com
  • 7.
    Statistical Tests forQualitative Data Variable 1 Variable 2 Criteria Type of Test Qualitative Qualitative Sample size > 20 dan no expected value < 5 Chi Square Test (X2 ) Qualitative Dichotomus Qualitative Dichotomus Sample size > 30 Proportionate Test Qualitative Dichotomus Qualitative Dichotomus Sample size > 40 but with at least one expected value < 5 X2 Test with Yates Correction Qualitative Dichotomus Quantitative Normally distributed data Student's t Test Qualitative Polinomial Quantitative Normally distributed data ANOVA Quantitative Quantitative Repeated measurement of the same individual & item (e.g. Hb level before & after treatment). Normally Paired t Test Variable 1 Variable 2 Criteria Type of Test Qualitative Dichotomus Qualitative Dichotomus Sample size < 20 or (< 40 but with at least one expected value < 5) Fisher Test Qualitative Dichotomus Quantitative Data not normally distributed Wilcoxon Rank Sum Test or U Mann- Whitney Test Qualitative Quantitative Data not normally distributed Kruskal-Wallis Onedrtamil@gmail.com
  • 8.
    R Hands-on Exercise Textin this blue colour are the commands to be typed in the Console window. drtamil@gmail.com
  • 9.
    URL for data& submit answers • Data - https://drive.google.com/file/d/1PzcqCzm5t9KQk kXAtlvO56bZlMojM8-b/view?usp=sharing • The analysis required https://wp.me/p4mYLF-vA • Submit answers at this link https://docs.google.com/forms/d/1o_L7ZjXF9Q1 PON2zDs_VwkKsLCHT4v- 8WruXhCiVq2Q/viewform drtamil@gmail.com
  • 10.
    Data – FactorsRelated to SGA drtamil@gmail.com
  • 11.
    A study toidentify factors that can cause small for gestational age (SGA) was conducted. Among the factors studied were the mothers’ body mass index (BMI). It is believed that mothers with lower BMI were of higher risk to get SGA babies. • 1. Create a new variable mBMI (Mothers’ Body Mass Index) from the mothers’ HEIGHT (in metre) & WEIGHT (first trimester weight in kg). mBMI = weight in kg/(height in metre)2. Calculate the following for mBMI; – Mean – Standard deviation • 2. Create a new variable OBESCLAS (Classification of Obesity) from mBMI. Use the following cutoff point; – <20 = Underweight – 20 – 24.99 = Normal – 25 or larger = Overweight – Create a frequency table for OBESCLAS. • 3. Conduct the appropriate statistical test to test whether there is any association between OBESCLAS (Underweight/ Normal/Overweight) and OUTCOME. • 4. Conduct the appropriate statistical test to test whether there is any association between BMI and OUTCOME. • 5. Conduct the appropriate statistical test to find any association between OBESCLAS (Underweight/Normal/Overweight) and BIRTHWGT. • 6. Assuming that both variables mBMI & BIRTHWGT are normally distributed, conduct an appropriate statistical test to prove the association between the two variables. – Demonstrate the association using the appropriate chart. Determine the coefficient of determination. • 7. Conduct Simple Linear Regression using BIRTHWGT as the dependent variable. Try to come out with a formula that will predict the baby’s birthweight based on the mother’s BMI. – y = a + bx drtamil@gmail.com
  • 12.
    Online form foranswers drtamil@gmail.com https://docs.google.com/forms/d/1o_L7ZjXF9Q1PON2zDs_VwkKsLCHT4v-8WruXhCiVq2Q/viewform
  • 13.
    New R projecttitled “Tutor”. drtamil@gmail.com
  • 14.
    Import Excel intoR Studio • Select the Excel file you downloaded earlier; “SGA.xls” drtamil@gmail.com
  • 15.
    Import Excel intoR Studio • Click “Import” and the following command are executed; – library(readxl) – sga <- read_excel ("C:/…./sga.xls") – View(sga) drtamil@gmail.com
  • 16.
    R Studio -compute A study to identify factors that can cause small for gestational age (SGA) was conducted. Among the factors studied were the mothers’ body mass index (BMI). It is believed that mothers with lower BMI were of higher risk to get SGA babies. 1. Create a new variable mBMI (Mothers’ Body Mass Index) from the mothers’ HEIGHT (in metre) & WEIGHT (first trimester weight in kg). mBMI = weight in kg/(height in metre)2. Calculate the following for mBMI; – Mean – Standard deviation Copy and paste your answers into your Word file.
  • 17.
    Compute mBMI=weight/(height/100)^2 • sga$mBMI<- (sga$weight/(sga$height)^2) • View(sga) • mean(sga$mBMI) – [1] 24.49576 • sd(sga$mBMI) – [1] 4.767109
  • 18.
    Question 1 –BMI • mean(sga$mBMI) – [1] 24.49576 • sd(sga$mBMI) – [1] 4.767109
  • 19.
    Recode • 2. Createa new variable OBESCLAS (Classification of Obesity) from mBMI. Use the following cutoff point; – <20 = Underweight – 20 – 24.99 = Normal – 25 or larger = Overweight – Create a frequency table for OBESCLAS. drtamil@gmail.com
  • 20.
    Recode mBMI intoOBESCLAS • sga$obesclas<-"" • sga$obesclas[sga$mBMI<20] <- 1 • sga$obesclas[sga$mBMI>=20 & sga$mBMI<25] <- 2 • sga$obesclas[sga$mBMI>=25] <- 3 • table(sga$obesclas) • sga$obesclas <- factor(sga$obesclas, levels = c(1,2,3),labels = c('Under', 'Normal', 'Over')) • table(sga$obesclas) drtamil@gmail.com
  • 21.
    Frequency table forOBESCLAS • table(sga$obesclas) – Under Normal Over – 17 40 43 • prop.table(table(sga$obesclas)) – Under Normal Over – 0.17 0.40 0.43 – 17% 40% 43% drtamil@gmail.com
  • 22.
    Question 2 –Obese Classification • table(sga$obesclas) – Under Normal Over – 17 40 43 • prop.table(table(sga$obesclas)) – Under Normal Over – 0.17 0.40 0.43 – 17% 40% 43% drtamil@gmail.com
  • 23.
    Exercise 3 • 3.Conduct the appropriate statistical test to test whether there is any association between OBESCLAS (Underweight/Normal/ Overweight) and OUTCOME. • Therefore most suitable analysis is Pearson Chi- square. SGA Normal TOTAL UnderW Normal OverW TOTAL 50 50 100 drtamil@gmail.com Variable 1 Variable 2 Criteria Type of Test Qualitative Qualitative Sample size > 20 dan no expected value < 5 Chi Square Test (X2 ) Qualitative Dichotomus Qualitative Dichotomus Sample size > 30 Proportionate Test Qualitative Dichotomus Qualitative Dichotomus Sample size > 40 but with at least one expected value < 5 X2 Test with Yates Correction Qualitative Dichotomus Quantitative Normally distributed data Student's t Test Qualitative Polinomial Quantitative Normally distributed data ANOVA Quantitative Quantitative Repeated measurement of the same individual & item (e.g. Hb level before & after treatment). Normally distributed data Paired t Test Variable 1 Variable 2 Criteria Type of Test Qualitative Dichotomus Qualitative Dichotomus Sample size < 20 or (< 40 but with at least one expected value < 5) Fisher Test Qualitative Dichotomus Quantitative Data not normally distributed Wilcoxon Rank Sum Test or U Mann- Whitney Test Qualitative Polinomial Quantitative Data not normally distributed Kruskal-Wallis One Way ANOVA Test
  • 24.
    Chi-Square Analysis • library(gmodels) •CrossTable(sga$obesclas, sga$outcome, digits=2, max.width = 5, expected=TRUE, prop.r=TRUE, prop.c=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE, format="SPSS") – Pearson's Chi-squared test – Chi^2 = 24.39111 d.f. = 2 p = 5.052871e-06 – Minimum expected frequency: 8.5 drtamil@gmail.com
  • 25.
    Chi-Square Results fromR Studio • R not only states that there is a significant association (p=5x10-6) between mother’s weight classification and small for gestational age. • But it also show which group has the higher rate of SGA. drtamil@gmail.com
  • 26.
    Results From RStudio • Underweight mothers has a higher rate (94%) of SGA, compared to normal mothers (58%) and overweight mothers (26%). drtamil@gmail.com
  • 27.
  • 28.
  • 29.
    Exercise 4 • 4.Conduct the appropriate statistical test to test whether there is any association between BMI and OUTCOME. • Basically we are comparing the mean BMI of SGA babies’ mothers against mean BMI of Normal babies’ mothers. • Therefore the appropriate test is Student’s t-test. drtamil@gmail.com Variable 1 Variable 2 Criteria Type of Test Qualitative Qualitative Sample size > 20 dan no expected value < 5 Chi Square Test (X2 ) Qualitative Dichotomus Qualitative Dichotomus Sample size > 30 Proportionate Test Qualitative Dichotomus Qualitative Dichotomus Sample size > 40 but with at least one expected value < 5 X2 Test with Yates Correction Qualitative Dichotomus Quantitative Normally distributed data Student's t Test Qualitative Polinomial Quantitative Normally distributed data ANOVA Quantitative Quantitative Repeated measurement of the same individual & item (e.g. Hb level before & after treatment). Normally distributed data Paired t Test Quantitative - continous Quantitative - continous Normally distributed data Pearson Correlation & Linear Regresssion
  • 30.
    Student’s T-Test • library("car") •leveneTest(sga$mBMI, sga$outcome) – Levene's Test for Homogeneity of Variance (center = median) – Df F value Pr(>F) – group 1 0.0827 0.7743 – 98 • Levene test reveals that variances are not significantly different (P = 0.7743). • Therefore when we run the t-test, it is for equal variances. drtamil@gmail.com
  • 31.
    T-Test Results fromStudio R • t.test(sga$mBMI ~ sga$outcome, var.equal=TRUE) – Two Sample t-test – data: sga$mBMI by sga$outcome – t = 4.5164, df = 98, p-value = 1.756e-05 – alternative hypothesis: true difference in means is not equal to 0 – 95 percent confidence interval: 2.207433 5.667658 – sample estimates: – mean in group Normal 26.46453 – mean in group SGA 22.52699 • Studio R states that there is a significant mean difference of BMI (p = 1.756x10-5) between SGA babies’ mothers (22.52) and normal babies’ mothers (26.46). • Therefore mean BMI of SGA babies’ mothers is significantly lower than the mean BMI of normal babies’ mothers. drtamil@gmail.com
  • 32.
  • 33.
  • 34.
    Exercise 5 • 5.Conduct the appropriate statistical test to find any association between OBESCLAS (Underweight/Normal/Overweight) and BIRTHWGT. • Basically we are comparing the mean BIRTHWEIGHT of underweight mothers, normal weight mothers and overweight mothers. • Therefore the appropriate test is Analysis of Variance (ANOVA). drtamil@gmail.com
  • 35.
    ANOVA • library("car") • leveneTest(sga$birthwgt, sga$obesclas) –Levene's Test for Homogeneity of Variance (center = median) • Df F value Pr(>F) – group 2 3.1702 0.04638 * – 97 – --- – Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 • Variance of birthwgt are significantly different between the groups obesclas. • Therefore when we run the ANOVA, it is for unequal variances. drtamil@gmail.com
  • 36.
    ANOVA – command •tapply(sga$birthwgt, sga$obesclas, mean) – Under Normal Over – 2.187059 2.768250 3.245116 • tapply(sga$birthwgt, sga$obesclas, sd) – Under Normal Over – 0.3403999 0.6712861 0.6606179 • levels(sga$obesclas) • summary(aov(sga$birthwgt ~ sga$obesclas)) drtamil@gmail.com
  • 37.
    ANOVA – Results •> levels(sga$obesclas) • [1] "Under" "Normal" "Over" • > summary(aov(sga$birthwgt ~ sga$obesclas)) • Df Sum Sq Mean Sq F value Pr(>F) • sga$obesclas 2 14.39 7.196 18.49 1.58e-07 *** • Residuals 97 37.76 0.389 • --- • Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 drtamil@gmail.com
  • 38.
    ANOVA Results fromStudio R • Studio R states that there is a significant mean difference of mean birth weight (p < 0.0001) between underweight mothers’ (2.187), normal mothers ‘(2.768) & overweight mothers’(3.245). • Unfortunately it also proves that there is unequal variances of the three means. So it fails the homogeneity of variances assumption. drtamil@gmail.com
  • 39.
    ANOVA Results – post hoc
    • pairwise.t.test(sga$birthwgt, sga$obesclas, p.adjust.method = "bonferroni")
    • Post-hoc tests indicate that birth weight differs significantly between ALL three groups.
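The post-hoc command returns a table of Bonferroni-adjusted p-values, one per pair of groups. The sketch below shows how to read it, using simulated stand-in data (hypothetical values). The second call, with `pool.sd = FALSE`, is an optional variant that does not pool the standard deviation across groups; it is shown as an assumption about what may suit unequal variances, not as the slide's method.

```r
# Simulated stand-in data (hypothetical group sizes and values)
set.seed(4)
sim <- data.frame(
  obesclas = factor(rep(c("Under", "Normal", "Over"), times = c(17, 40, 43)),
                    levels = c("Under", "Normal", "Over")),
  birthwgt = c(rnorm(17, 2.19, 0.34),
               rnorm(40, 2.77, 0.67),
               rnorm(43, 3.25, 0.66))
)

# Same call as on the slide: pooled SD, Bonferroni-adjusted p-values
ph <- pairwise.t.test(sim$birthwgt, sim$obesclas,
                      p.adjust.method = "bonferroni")
ph$p.value   # rows: Normal, Over; columns: Under, Normal; upper cell is NA

# Optional variant: separate (Welch-type) t tests for each pair
ph_welch <- pairwise.t.test(sim$birthwgt, sim$obesclas,
                            p.adjust.method = "bonferroni", pool.sd = FALSE)
ph_welch$p.value
```

Each filled cell is the adjusted p-value for the row group against the column group; a value below 0.05 indicates that pair differs.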
  • 40.
    Question 5
    • tapply(sga$birthwgt, sga$obesclas, mean)
      –    Under   Normal     Over
      – 2.187059 2.768250 3.245116
    • tapply(sga$birthwgt, sga$obesclas, sd)
      –     Under    Normal      Over
      – 0.3403999 0.6712861 0.6606179
  • 41.
  • 42.
    Exercise 6
    • 6. Assuming that both variables mBMI & BIRTHWGT are normally distributed, conduct an appropriate statistical test to assess the association between the two variables.
      – Demonstrate the association using the appropriate chart. Determine the coefficient of determination.
  • 43.
    Pearson Correlation
    • mBMI and birth weight are both normally distributed continuous data. Since the aim is to measure the strength and direction of the association between these two continuous variables, Pearson Correlation is the most appropriate test.
    • In the Parametric Statistical Tests table this is the row: Quantitative (continuous) vs Quantitative (continuous), normally distributed data: Pearson Correlation & Linear Regression.
  • 44.
    Pearson’s Correlation Command
    • cor.test(sga$mBMI, sga$birthwgt, method="pearson")
      – Pearson's product-moment correlation
      – data: sga$mBMI and sga$birthwgt
      – t = 5.4379, df = 98, p-value = 3.959e-07
      – alternative hypothesis: true correlation is not equal to 0
      – 95 percent confidence interval:
      – 0.3148037 0.6193051
      – sample estimates:
      – cor
      – 0.4814521
    Discussion
    • r = 0.4814521
    • p-value = 3.959 × 10⁻⁷
    • Fair & positive correlation between mBMI and birth weight.
    • Therefore as mothers’ BMI increases, the birth weight also increases.
    • r² = 0.4814521² = 0.2318
    • 23.18% (r² = 0.2318) of the variability of the birth weight is explained by the variability of the mothers’ BMI.
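The coefficient of determination can be computed directly from the `cor.test()` result rather than by hand. The sketch below first checks the slide's arithmetic, then repeats the calculation on simulated stand-in data (the vectors `mBMI` and `birthwgt` here are hypothetical, not the course data) and confirms that r² equals the R-squared reported by simple linear regression.

```r
# Arithmetic from the slide: r squared, rounded to 4 decimal places
r_slide <- 0.4814521
round(r_slide^2, 4)   # 0.2318

# Simulated stand-in data (hypothetical values)
set.seed(5)
mBMI     <- rnorm(100, mean = 25, sd = 5)
birthwgt <- 1.08 + 0.073 * mBMI + rnorm(100, mean = 0, sd = 0.65)

ct <- cor.test(mBMI, birthwgt, method = "pearson")
r  <- unname(ct$estimate)   # Pearson's r
r2 <- r^2                   # coefficient of determination

# For two variables, r^2 is the same as R-squared from lm()
all.equal(r2, summary(lm(birthwgt ~ mBMI))$r.squared)
```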
  • 45.
    plot(x = sga$mBMI, y = sga$birthwgt, type = 'p')
  • 46.
  • 47.
  • 48.
    Exercise 7
    • 7. Conduct Simple Linear Regression using BIRTHWGT as the dependent variable. Try to come up with a formula that will predict the baby’s birth weight based on the mother’s BMI.
      – y = a + bx
  • 49.
    Simple Linear Regression
    • mBMI and birth weight are both normally distributed continuous data. Since the aim is to come up with a regression formula relating these two continuous variables, Simple Linear Regression is the most appropriate test.
    • In the Parametric Statistical Tests table this is the row: Quantitative (continuous) vs Quantitative (continuous), normally distributed data: Pearson Correlation & Linear Regression.
  • 50.
    plot(x = sga$mBMI, y = sga$birthwgt, type = 'p')
    abline(lm(sga$birthwgt ~ sga$mBMI), col = "red", lty = 2)
  • 51.
    Simple Linear Regression
    • summary(lm(sga$birthwgt ~ sga$mBMI))
  • 52.
    SLR Results from R Studio
    • R Studio reports a significant regression coefficient (b = 0.07330).
    • The constant (a) is 1.07895.
    • 23.18% (r² = 0.2318) of the variability of the birth weight is explained by the variability of the mothers’ BMI.
    • BW = 1.079 + 0.073 × BMI
    • For every 1-unit increase in BMI, BW increases by 0.073 kg.
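The regression formula can be turned into a small prediction helper. In the sketch below, `a` and `b` are the coefficients reported on the slide, while `predict_bw` is a hypothetical helper name and `sim` is simulated stand-in data used only to show the equivalent `predict()` workflow.

```r
# Coefficients as reported on the slide (a = constant, b = slope)
a <- 1.07895
b <- 0.07330

# Hypothetical helper: predicted birth weight (kg) for a given maternal BMI
predict_bw <- function(bmi) a + b * bmi

predict_bw(25)                    # 1.07895 + 0.07330 * 25 = 2.91145
predict_bw(26) - predict_bw(25)   # one extra BMI unit adds b = 0.0733 kg

# The same idea with a fitted model, using simulated stand-in data
set.seed(7)
sim <- data.frame(mBMI = rnorm(100, mean = 25, sd = 5))
sim$birthwgt <- a + b * sim$mBMI + rnorm(100, sd = 0.6)

fit  <- lm(birthwgt ~ mBMI, data = sim)   # data = lets predict() accept newdata
coef(fit)                                 # estimated a and b for the simulated data
pred <- predict(fit, newdata = data.frame(mBMI = c(20, 25, 30)))
pred
```

Fitting with `data = sim` rather than `sim$birthwgt ~ sim$mBMI` is a design choice: `predict()` can only match `newdata` columns to the model when the formula uses plain column names.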
  • 53.
  • 54.
  • 55.
  • 56.
    The End
    Credits to Prof Lin Naing @ Ayub for the original notes.