# Introduction to Data Analysis With R and R Studio


1. Data Analysis with R & R Studio: Introduction (drtamil@gmail.com)
3. Uniqueness of R & R Studio
   • R is one of the programming languages that provide an intensive environment in which to analyze, process, transform and visualize information.
   • It is the primary choice for many statisticians who design statistical models to solve complex problems.
   • Data are usually entered and manipulated using a spreadsheet such as Microsoft Excel.
   • Each analysis requires its own specific command, so you must know exactly which command the analysis needs.
4. Choosing the appropriate statistical tests
   • Use these tables to choose the appropriate statistical test.
5. Parametric Statistical Tests

| Variable 1 | Variable 2 | Criteria | Type of Test |
|---|---|---|---|
| Qualitative | Qualitative | Sample size > 20 and no expected value < 5 | Chi-square test (χ²) |
| Qualitative, dichotomous | Qualitative, dichotomous | Sample size > 30 | Proportionate test |
| Qualitative, dichotomous | Qualitative, dichotomous | Sample size > 40 but with at least one expected value < 5 | χ² test with Yates correction |
| Qualitative, dichotomous | Quantitative | Normally distributed data | Student's t test |
| Qualitative, polytomous | Quantitative | Normally distributed data | ANOVA |
| Quantitative | Quantitative | Repeated measurement of the same individual & item (e.g. Hb level before & after treatment). Normally distributed data | Paired t test |
| Quantitative, continuous | Quantitative, continuous | Normally distributed data | Pearson correlation & linear regression |
6. Non-parametric Statistical Tests

| Variable 1 | Variable 2 | Criteria | Type of Test |
|---|---|---|---|
| Qualitative, dichotomous | Qualitative, dichotomous | Sample size < 20 or (< 40 but with at least one expected value < 5) | Fisher test |
| Qualitative, dichotomous | Quantitative | Data not normally distributed | Wilcoxon rank sum test or Mann-Whitney U test |
| Qualitative, polytomous | Quantitative | Data not normally distributed | Kruskal-Wallis one-way ANOVA test |
| Quantitative | Quantitative | Repeated measurement of the same individual & item | Wilcoxon signed-rank test |
| Quantitative, continuous/ordinal | Quantitative, continuous | Data not normally distributed | Spearman/Kendall rank correlation |
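The sample-size and expected-value criteria for two dichotomous variables can be sketched as a small decision helper. This is only an illustration: `choose_2x2_test` is a hypothetical name, the three example tables are made up, and the handling of boundary cases (for example a sample size of exactly 20 or 40) is an assumption, because the tables do not cover them.

```r
# Hypothetical helper: encodes the sample-size and expected-value criteria
# from the two tables for a 2 x 2 table of counts. Not a built-in R function.
choose_2x2_test <- function(tab) {
  stopifnot(all(dim(tab) == c(2, 2)))
  n <- sum(tab)
  # Expected counts under independence: (row total * column total) / n
  expected <- outer(rowSums(tab), colSums(tab)) / n
  small_expected <- any(expected < 5)

  if (n < 20 || (n < 40 && small_expected)) {
    "Fisher exact test"
  } else if (small_expected) {
    # Assumption: a sample size of exactly 40 is treated like > 40
    "Chi-square test with Yates correction"
  } else {
    "Chi-square test"
  }
}

# Made-up 2 x 2 tables of counts, one per branch
small  <- matrix(c(3, 1, 1, 3), nrow = 2)      # n = 8
sparse <- matrix(c(40, 2, 8, 2), nrow = 2)     # n = 52, one expected count < 5
large  <- matrix(c(30, 30, 20, 20), nrow = 2)  # n = 100, all expected counts >= 5

print(choose_2x2_test(small))
print(choose_2x2_test(sparse))
print(choose_2x2_test(large))
```

In base R the three choices correspond to `fisher.test(tab)`, `chisq.test(tab, correct = TRUE)` and `chisq.test(tab, correct = FALSE)`.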
7. Statistical Tests for Qualitative Data (the parametric and non-parametric tables from slides 5 and 6, shown together)
8. R Hands-on Exercise
   • On the original slides, text in blue marks the commands to be typed in the Console window.
9. URL for data & submit answers
   • Data: https://drive.google.com/file/d/1PzcqCzm5t9KQkkXAtlvO56bZlMojM8-b/view?usp=sharing
   • The analysis required: https://wp.me/p4mYLF-vA
   • Submit answers at this link: https://docs.google.com/forms/d/1o_L7ZjXF9Q1PON2zDs_VwkKsLCHT4v-8WruXhCiVq2Q/viewform
10. Data – Factors Related to SGA
11. A study was conducted to identify factors that can cause small for gestational age (SGA). Among the factors studied was the mothers' body mass index (BMI). It is believed that mothers with lower BMI are at higher risk of having SGA babies.
    • 1. Create a new variable mBMI (Mothers' Body Mass Index) from the mothers' HEIGHT (in metres) & WEIGHT (first trimester weight in kg). mBMI = weight in kg/(height in metres)². Calculate the following for mBMI:
      – Mean
      – Standard deviation
    • 2. Create a new variable OBESCLAS (Classification of Obesity) from mBMI. Use the following cutoff points:
      – <20 = Underweight
      – 20 – 24.99 = Normal
      – 25 or larger = Overweight
      – Create a frequency table for OBESCLAS.
    • 3. Conduct the appropriate statistical test to test whether there is any association between OBESCLAS (Underweight/Normal/Overweight) and OUTCOME.
    • 4. Conduct the appropriate statistical test to test whether there is any association between BMI and OUTCOME.
    • 5. Conduct the appropriate statistical test to find any association between OBESCLAS (Underweight/Normal/Overweight) and BIRTHWGT.
    • 6. Assuming that both variables mBMI & BIRTHWGT are normally distributed, conduct an appropriate statistical test to assess the association between the two variables.
      – Demonstrate the association using the appropriate chart. Determine the coefficient of determination.
    • 7. Conduct Simple Linear Regression using BIRTHWGT as the dependent variable. Try to come up with a formula that will predict the baby's birth weight based on the mother's BMI.
      – y = a + bx
13. New R project titled “Tutor”.
14. Import Excel into R Studio
    • Select the Excel file you downloaded earlier, “SGA.xls”.
15. Import Excel into R Studio
    • Click “Import” and the following commands are executed:
      – `library(readxl)`
      – `sga <- read_excel("C:/…./sga.xls")`
      – `View(sga)`
16. R Studio – compute
    • A study was conducted to identify factors that can cause small for gestational age (SGA). Among the factors studied was the mothers' body mass index (BMI). It is believed that mothers with lower BMI are at higher risk of having SGA babies.
    • 1. Create a new variable mBMI (Mothers' Body Mass Index) from the mothers' HEIGHT (in metres) & WEIGHT (first trimester weight in kg). mBMI = weight in kg/(height in metres)². Calculate the following for mBMI:
      – Mean
      – Standard deviation
    • Copy and paste your answers into your Word file.
17. Compute mBMI = weight/height^2 (height is already in metres, so no division by 100 is needed)
    • `sga$mBMI <- sga$weight / sga$height^2`
    • `View(sga)`
    • `mean(sga$mBMI)`
      – [1] 24.49576
    • `sd(sga$mBMI)`
      – [1] 4.767109
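The BMI computation can be tried without the Excel file. The sketch below uses a made-up three-row data frame named `toy` (hypothetical values, not taken from SGA.xls), with the same lowercase column names as the slide's commands.

```r
# Hypothetical toy data; the real sga data frame comes from SGA.xls.
toy <- data.frame(
  weight = c(50, 64, 81),        # first trimester weight in kg
  height = c(1.60, 1.60, 1.80)   # height in metres
)

# BMI = weight (kg) / height (m)^2
toy$mBMI <- toy$weight / toy$height^2

# If height had been recorded in centimetres instead (assumed column name
# height_cm), it would need converting first:
# toy$mBMI <- toy$weight / (toy$height_cm / 100)^2

bmi_mean <- mean(toy$mBMI, na.rm = TRUE)  # na.rm guards against missing values
bmi_sd   <- sd(toy$mBMI, na.rm = TRUE)

print(round(toy$mBMI, 2))
print(c(mean = bmi_mean, sd = bmi_sd))
```

A quick range check on the result is a cheap way to catch a unit mix-up: adult BMI values far outside roughly 10 to 60 usually mean the height is in the wrong unit.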
18. Question 1 – BMI
    • `mean(sga$mBMI)`
      – [1] 24.49576
    • `sd(sga$mBMI)`
      – [1] 4.767109
19. Recode
    • 2. Create a new variable OBESCLAS (Classification of Obesity) from mBMI. Use the following cutoff points:
      – <20 = Underweight
      – 20 – 24.99 = Normal
      – 25 or larger = Overweight
      – Create a frequency table for OBESCLAS.
20. Recode mBMI into OBESCLAS
    • `sga$obesclas <- ""`
    • `sga$obesclas[sga$mBMI < 20] <- 1`
    • `sga$obesclas[sga$mBMI >= 20 & sga$mBMI < 25] <- 2`
    • `sga$obesclas[sga$mBMI >= 25] <- 3`
    • `table(sga$obesclas)`
    • `sga$obesclas <- factor(sga$obesclas, levels = c(1,2,3), labels = c('Under', 'Normal', 'Over'))`
    • `table(sga$obesclas)`
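The same recoding can also be written in one step with base R's `cut()`. This is an alternative sketch, not the method used on the slide, and the `bmi` vector holds made-up values chosen to sit on the cutoff points.

```r
# Hypothetical BMI values, including the boundary cases 20, 24.99 and 25
bmi <- c(18.5, 20, 24.99, 25, 31.2)

obesclas <- cut(
  bmi,
  breaks = c(-Inf, 20, 25, Inf),
  right  = FALSE,  # intervals closed on the left: [20, 25) is Normal
  labels = c("Under", "Normal", "Over")
)

print(data.frame(bmi, obesclas))
print(table(obesclas))
```

With `right = FALSE` a BMI of exactly 20 falls in Normal and exactly 25 falls in Overweight, which matches the cutoffs given in the exercise.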
21. Frequency table for OBESCLAS
    • `table(sga$obesclas)`
      – Under Normal Over
      – 17 40 43
    • `prop.table(table(sga$obesclas))`
      – Under Normal Over
      – 0.17 0.40 0.43
      – 17% 40% 43%
22. Question 2 – Obese Classification
    • `table(sga$obesclas)`
      – Under Normal Over
      – 17 40 43
    • `prop.table(table(sga$obesclas))`
      – Under Normal Over
      – 0.17 0.40 0.43
      – 17% 40% 43%
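Counts and percentages can be combined into a single table for pasting into a report. The sketch below starts from the three counts reported on the slide; the object names `freq`, `pct` and `summary_tab` are illustrative.

```r
# Counts as reported on the slide (Under, Normal, Over)
freq <- as.table(c(Under = 17, Normal = 40, Over = 43))

# Convert proportions to percentages, rounded to one decimal place
pct <- round(100 * prop.table(freq), 1)

# One row per category, with count and percentage side by side
summary_tab <- cbind(n = freq, percent = pct)
print(summary_tab)
```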
23. Exercise 3
    • 3. Conduct the appropriate statistical test to test whether there is any association between OBESCLAS (Underweight/Normal/Overweight) and OUTCOME.
    • Both variables are qualitative and the sample size is above 20, so the most suitable analysis is the Pearson chi-square test, provided no expected value is below 5.

|        | SGA | Normal | TOTAL |
|--------|-----|--------|-------|
| UnderW |     |        |       |
| Normal |     |        |       |
| OverW  |     |        |       |
| TOTAL  | 50  | 50     | 100   |
24. Chi-Square Analysis
    • `library(gmodels)`
    • `CrossTable(sga$obesclas, sga$outcome, digits=2, max.width = 5, expected=TRUE, prop.r=TRUE, prop.c=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE, format="SPSS")`
      – Pearson's Chi-squared test
      – Chi^2 = 24.39111, d.f. = 2, p = 5.052871e-06
      – Minimum expected frequency: 8.5
25. Chi-Square Results from R Studio
    • R not only reports a significant association (p = 5 × 10⁻⁶) between the mother's weight classification and small for gestational age, but also shows which group has the higher rate of SGA.
26. Results From R Studio
    • Underweight mothers have a higher rate of SGA (94%) than normal-weight mothers (58%) and overweight mothers (26%).
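The chi-square result can be cross-checked in base R without the `gmodels` package. The cell counts below are not printed on the slides: they are an inference from the reported group sizes (17, 40, 43), the SGA rates (94%, 58%, 26%) and the column totals (50, 50), so treat them as an assumption. They do give back the reported chi-square value and minimum expected frequency.

```r
# Cell counts inferred from the reported percentages and totals (assumption)
tab <- matrix(
  c(16,  1,
    23, 17,
    11, 32),
  nrow = 3, byrow = TRUE,
  dimnames = list(obesclas = c("Under", "Normal", "Over"),
                  outcome  = c("SGA", "Normal"))
)

# Pearson chi-square test; the Yates correction only applies to 2 x 2 tables
res <- chisq.test(tab)
print(res)

# Expected counts, to check that none is below 5
print(res$expected)

# Row percentages: the SGA rate within each weight class
print(round(100 * prop.table(tab, margin = 1)))
```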
27. Question 3
28. Question 3
29. Exercise 4
    • 4. Conduct the appropriate statistical test to test whether there is any association between BMI and OUTCOME.
    • Basically we are comparing the mean BMI of SGA babies' mothers against the mean BMI of normal babies' mothers.
    • Therefore the appropriate test is Student's t-test.
30. Student's T-Test
    • `library("car")`
    • `leveneTest(sga$mBMI, sga$outcome)`
      – Levene's Test for Homogeneity of Variance (center = median)
      – Df F value Pr(>F)
      – group 1 0.0827 0.7743
      – 98
    • Levene's test shows that the variances are not significantly different (p = 0.7743).
    • Therefore the t-test is run assuming equal variances.
31. T-Test Results from R Studio
    • `t.test(sga$mBMI ~ sga$outcome, var.equal=TRUE)`
      – Two Sample t-test
      – data: sga$mBMI by sga$outcome
      – t = 4.5164, df = 98, p-value = 1.756e-05
      – alternative hypothesis: true difference in means is not equal to 0
      – 95 percent confidence interval: 2.207433 5.667658
      – sample estimates:
      – mean in group Normal 26.46453
      – mean in group SGA 22.52699
    • R Studio reports a significant difference in mean BMI (p = 1.756 × 10⁻⁵) between SGA babies' mothers (22.53) and normal babies' mothers (26.46).
    • Therefore the mean BMI of SGA babies' mothers is significantly lower than the mean BMI of normal babies' mothers.
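The pieces of the t-test output can also be pulled out of the result object rather than read off the console. The sketch below uses a made-up six-row data frame (`toy`, hypothetical values) and compares the pooled-variance test with Welch's version, which is R's default when `var.equal` is not set.

```r
# Hypothetical toy data: three mothers per outcome group
toy <- data.frame(
  mBMI    = c(20, 22, 24, 26, 28, 30),
  outcome = factor(c("SGA", "SGA", "SGA", "Normal", "Normal", "Normal"))
)

# Student's t-test, assuming equal variances in the two groups
pooled <- t.test(mBMI ~ outcome, data = toy, var.equal = TRUE)

# Welch's t-test, which does not assume equal variances (R's default)
welch <- t.test(mBMI ~ outcome, data = toy)

print(pooled$estimate)   # group means, in factor-level order (Normal, SGA)
print(pooled$conf.int)   # 95% confidence interval for the difference in means
print(c(pooled = pooled$p.value, welch = welch$p.value))
```

The groups are compared in factor-level order, so the difference is Normal minus SGA here, as in the slide's output.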
32. Question 4
33. Question 4
34. Exercise 5
    • 5. Conduct the appropriate statistical test to find any association between OBESCLAS (Underweight/Normal/Overweight) and BIRTHWGT.
    • Basically we are comparing the mean BIRTHWGT of babies born to underweight, normal-weight and overweight mothers.
    • Therefore the appropriate test is Analysis of Variance (ANOVA).
35. ANOVA
    • `library("car")`
    • `leveneTest(sga$birthwgt, sga$obesclas)`
      – Levene's Test for Homogeneity of Variance (center = median)
      – Df F value Pr(>F)
      – group 2 3.1702 0.04638 *
      – 97
      – ---
      – Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    • The variances of birthwgt differ significantly between the obesclas groups (p = 0.046).
    • Therefore the ANOVA should be treated as a case of unequal variances.
36. ANOVA – command
    • `tapply(sga$birthwgt, sga$obesclas, mean)`
      – Under Normal Over
      – 2.187059 2.768250 3.245116
    • `tapply(sga$birthwgt, sga$obesclas, sd)`
      – Under Normal Over
      – 0.3403999 0.6712861 0.6606179
    • `levels(sga$obesclas)`
    • `summary(aov(sga$birthwgt ~ sga$obesclas))`
37. ANOVA – Results
    • `levels(sga$obesclas)`
      – [1] "Under" "Normal" "Over"
    • `summary(aov(sga$birthwgt ~ sga$obesclas))`
      – Df Sum Sq Mean Sq F value Pr(>F)
      – sga$obesclas 2 14.39 7.196 18.49 1.58e-07 ***
      – Residuals 97 37.76 0.389
      – ---
      – Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
38. ANOVA Results from R Studio
    • R Studio reports a significant difference in mean birth weight (p < 0.0001) between babies of underweight mothers (2.187), normal-weight mothers (2.768) and overweight mothers (3.245).
    • Unfortunately, Levene's test indicates that the variances of the three groups are unequal, so the homogeneity of variances assumption is not met.
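When the equal-variance assumption fails, one common option in base R is Welch's one-way test, `oneway.test()` with `var.equal = FALSE`. This is a suggested alternative, not something the slides run, and the nine-row `toy` data frame below holds made-up values.

```r
# Hypothetical toy data: three babies per maternal weight class
toy <- data.frame(
  birthwgt = c(2.0, 2.2, 2.4,  2.1, 2.8, 3.4,  2.5, 3.3, 4.0),
  obesclas = factor(rep(c("Under", "Normal", "Over"), each = 3),
                    levels = c("Under", "Normal", "Over"))
)

# Classical one-way ANOVA (assumes equal variances); same F as aov()
classic <- oneway.test(birthwgt ~ obesclas, data = toy, var.equal = TRUE)

# Welch's one-way test (does not assume equal variances)
welch <- oneway.test(birthwgt ~ obesclas, data = toy, var.equal = FALSE)

print(classic)
print(welch)
```

Welch's test adjusts the denominator degrees of freedom for the unequal group variances, so its F statistic and p-value generally differ from the classical ANOVA.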
39. ANOVA Results – post hoc
    • `pairwise.t.test(sga$birthwgt, sga$obesclas, p.adjust.method = "bonferroni")`
    • Post-hoc tests indicate significant differences in birth weight between all three groups.
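By default `pairwise.t.test()` pools the standard deviation across all groups, which again assumes equal variances. A hedged variant is to set `pool.sd = FALSE`, so that each pair is compared with its own Welch t-test. The sketch below uses made-up data (`toy`) and also shows what the Bonferroni adjustment does to the raw p-values.

```r
# Hypothetical toy data: three babies per maternal weight class
toy <- data.frame(
  birthwgt = c(2.0, 2.2, 2.4,  2.1, 2.8, 3.4,  2.5, 3.3, 4.0),
  obesclas = factor(rep(c("Under", "Normal", "Over"), each = 3),
                    levels = c("Under", "Normal", "Over"))
)

# Pairwise comparisons without pooling the SD, Bonferroni-adjusted
adjusted <- pairwise.t.test(toy$birthwgt, toy$obesclas,
                            p.adjust.method = "bonferroni", pool.sd = FALSE)

# The same comparisons with no adjustment, for reference
unadjusted <- pairwise.t.test(toy$birthwgt, toy$obesclas,
                              p.adjust.method = "none", pool.sd = FALSE)

print(adjusted$p.value)
print(unadjusted$p.value)
```

With three groups there are three comparisons, so each Bonferroni-adjusted p-value is the raw p-value multiplied by 3 and capped at 1.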
40. Question 5
    • `tapply(sga$birthwgt, sga$obesclas, mean)`
      – Under Normal Over
      – 2.187059 2.768250 3.245116
    • `tapply(sga$birthwgt, sga$obesclas, sd)`
      – Under Normal Over
      – 0.3403999 0.6712861 0.6606179
41. Question 5
42. Exercise 6
    • 6. Assuming that both variables mBMI & BIRTHWGT are normally distributed, conduct an appropriate statistical test to assess the association between the two variables.
      – Demonstrate the association using the appropriate chart. Determine the coefficient of determination.
43. Pearson Correlation
    • mBMI and birth weight are both normally distributed continuous data. Since the aim is to measure the strength and direction of the association between these two continuous variables, Pearson correlation is the most appropriate test.
44. Pearson's Correlation Command
    • `cor.test(sga$mBMI, sga$birthwgt, method="pearson")`
      – Pearson's product-moment correlation
      – data: sga$mBMI and sga$birthwgt
      – t = 5.4379, df = 98, p-value = 3.959e-07
      – alternative hypothesis: true correlation is not equal to 0
      – 95 percent confidence interval: 0.3148037 0.6193051
      – sample estimates: cor 0.4814521
    • Discussion
      – r = 0.4814521
      – p-value = 3.959 × 10⁻⁷
      – Fair & positive correlation between mBMI and birth weight.
      – Therefore as mothers' BMI increases, the birth weight also increases.
      – r² = 0.4814521² = 0.2318
      – 23.18% (r² = 0.2318) of the variability in birth weight is explained by the variability in the mothers' BMI.
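The numbers in this output are linked by simple formulas, which makes them easy to double-check. The sketch below starts from the reported r and degrees of freedom and recomputes the coefficient of determination, the t statistic and the two-sided p-value. The variable names are illustrative.

```r
r  <- 0.4814521  # correlation coefficient reported by cor.test
df <- 98         # degrees of freedom reported by cor.test (n - 2)

# Coefficient of determination
r_squared <- r^2

# Test statistic for H0: true correlation is 0
t_stat <- r * sqrt(df / (1 - r^2))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

print(round(r_squared, 4))
print(round(t_stat, 4))
print(signif(p_value, 4))
```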
45. `plot(x = sga$mBMI, y = sga$birthwgt, type = 'p')`
46. Question 6
47. Question 6
    • r² = 0.4814521² = 0.2318
48. Exercise 7
    • 7. Conduct Simple Linear Regression using BIRTHWGT as the dependent variable. Try to come up with a formula that will predict the baby's birth weight based on the mother's BMI.
      – y = a + bx
49. Simple Linear Regression
    • mBMI and birth weight are both normally distributed continuous data. Since the aim is to come up with a regression formula relating these two continuous variables, Simple Linear Regression is the most appropriate test.
50. • `plot(x = sga$mBMI, y = sga$birthwgt, type = 'p')`
    • `abline(lm(sga$birthwgt ~ sga$mBMI), col='red', lty=2)`
51. Simple Linear Regression
    • `summary(lm(sga$birthwgt ~ sga$mBMI))`
52. SLR Results from R Studio
    • R Studio reports a significant regression coefficient (b = 0.07330).
    • The constant (a) is 1.07895.
    • 23.18% (r² = 0.2318) of the variability in birth weight is explained by the variability in the mothers' BMI.
    • BW = 1.079 + 0.073 × BMI
    • For every 1-unit increase in BMI, BW increases by 0.07 kg.
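The fitted formula can be turned into a small prediction function. The first part of the sketch below uses the coefficients reported on the slide; `predict_bw` is a hypothetical helper name. The second part shows the general pattern with `lm()` and `predict()` on a made-up four-row data frame, since the real data are not reproduced here.

```r
# Coefficients as reported on the slide
a <- 1.07895  # intercept (constant)
b <- 0.07330  # slope: change in birth weight (kg) per unit of BMI

# Hypothetical helper applying BW = a + b * BMI
predict_bw <- function(bmi) a + b * bmi
print(predict_bw(c(20, 25, 30)))

# General pattern on made-up data: fit, inspect coefficients, predict
toy <- data.frame(
  mBMI     = c(18, 22, 26, 30),
  birthwgt = c(2.7, 3.3, 3.5, 4.1)
)
fit <- lm(birthwgt ~ mBMI, data = toy)
print(coef(fit))
print(predict(fit, newdata = data.frame(mBMI = 24)))
```

Fitting with `data = toy` and plain column names, rather than `toy$birthwgt ~ toy$mBMI`, is what lets `predict()` accept new BMI values through `newdata`.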
53. Question 7
54. Question 7
55. Question 7
56. The End
    • Credits to Prof Lin Naing @ Ayub for the original notes.
