
Introduction to Data Analysis With R and R Studio

Published in: Education

  1. Data Analysis with R & R Studio: Introduction. drtamil@gmail.com
  2. Download & Install
     • Download and install R for free from https://r-project.org/
     • After installing R, download and install the free version of R Studio Desktop from https://rstudio.com
     • Instructions: https://youtu.be/hXb47dmPCR8
  3. Uniqueness of R & R Studio
     • R is a programming language that provides an extensive environment for analyzing, processing, transforming and visualizing data.
     • It is the primary choice of many statisticians who design statistical models to solve complex problems.
     • Data are usually entered and manipulated in a spreadsheet such as Microsoft Excel.
     • Each analysis requires a specific command, so you must know exactly which command the analysis needs.
  4. Choosing the appropriate statistical tests: use these tables to choose the appropriate statistical tests.
  5. Parametric Statistical Tests
     Variable 1 | Variable 2 | Criteria | Type of Test
     Qualitative | Qualitative | Sample size > 20 and no expected value < 5 | Chi-Square Test (χ²)
     Qualitative Dichotomous | Qualitative Dichotomous | Sample size > 30 | Proportionate Test
     Qualitative Dichotomous | Qualitative Dichotomous | Sample size > 40 but with at least one expected value < 5 | χ² Test with Yates Correction
     Qualitative Dichotomous | Quantitative | Normally distributed data | Student's t Test
     Qualitative Polytomous | Quantitative | Normally distributed data | ANOVA
     Quantitative | Quantitative | Repeated measurement of the same individual & item (e.g. Hb level before & after treatment); normally distributed data | Paired t Test
     Quantitative, continuous | Quantitative, continuous | Normally distributed data | Pearson Correlation & Linear Regression
  6. Non-parametric Statistical Tests
     Variable 1 | Variable 2 | Criteria | Type of Test
     Qualitative Dichotomous | Qualitative Dichotomous | Sample size < 20 or (< 40 but with at least one expected value < 5) | Fisher's Exact Test
     Qualitative Dichotomous | Quantitative | Data not normally distributed | Wilcoxon Rank Sum Test or Mann-Whitney U Test
     Qualitative Polytomous | Quantitative | Data not normally distributed | Kruskal-Wallis One-Way ANOVA Test
     Quantitative | Quantitative | Repeated measurement of the same individual & item | Wilcoxon Signed-Rank Test
     Quantitative, continuous/ordinal | Quantitative, continuous | Data not normally distributed | Spearman/Kendall Rank Correlation
  7. Statistical Tests for Qualitative Data (the slide repeats the parametric and non-parametric tables from slides 5 and 6).
  8. R Hands-on Exercise: text shown in blue on the slides marks the commands to be typed in the Console window.
  9. URL for data & submit answers
     • Data: https://drive.google.com/file/d/1PzcqCzm5t9KQkkXAtlvO56bZlMojM8-b/view?usp=sharing
     • The analysis required: https://wp.me/p4mYLF-vA
     • Submit answers at https://docs.google.com/forms/d/1o_L7ZjXF9Q1PON2zDs_VwkKsLCHT4v-8WruXhCiVq2Q/viewform
  10. Data – Factors Related to SGA
  11. A study was conducted to identify factors that can cause small for gestational age (SGA). Among the factors studied was the mothers’ body mass index (BMI). It is believed that mothers with lower BMI are at higher risk of having SGA babies.
      • 1. Create a new variable mBMI (Mothers’ Body Mass Index) from the mothers’ HEIGHT (in metres) & WEIGHT (first trimester weight in kg). mBMI = weight in kg / (height in metres)^2. Calculate the following for mBMI:
        – Mean
        – Standard deviation
      • 2. Create a new variable OBESCLAS (Classification of Obesity) from mBMI. Use the following cutoff points:
        – < 20 = Underweight
        – 20 – 24.99 = Normal
        – 25 or larger = Overweight
        – Create a frequency table for OBESCLAS.
      • 3. Conduct the appropriate statistical test to test whether there is any association between OBESCLAS (Underweight/Normal/Overweight) and OUTCOME.
      • 4. Conduct the appropriate statistical test to test whether there is any association between BMI and OUTCOME.
      • 5. Conduct the appropriate statistical test to find any association between OBESCLAS (Underweight/Normal/Overweight) and BIRTHWGT.
      • 6. Assuming that both variables mBMI & BIRTHWGT are normally distributed, conduct an appropriate statistical test to assess the association between the two variables.
        – Demonstrate the association using the appropriate chart. Determine the coefficient of determination.
      • 7. Conduct Simple Linear Regression using BIRTHWGT as the dependent variable. Try to come up with a formula that predicts the baby’s birth weight from the mother’s BMI.
        – y = a + bx
  12. Online form for answers: https://docs.google.com/forms/d/1o_L7ZjXF9Q1PON2zDs_VwkKsLCHT4v-8WruXhCiVq2Q/viewform
  13. New R project titled “Tutor”.
  14. Import Excel into R Studio
      • Select the Excel file you downloaded earlier, “SGA.xls”.
  15. Import Excel into R Studio
      • Click “Import” and the following commands are executed:
        – library(readxl)
        – sga <- read_excel("C:/…./sga.xls")
        – View(sga)
  16. R Studio - compute
      A study was conducted to identify factors that can cause small for gestational age (SGA). Among the factors studied was the mothers’ body mass index (BMI). It is believed that mothers with lower BMI are at higher risk of having SGA babies.
      • 1. Create a new variable mBMI (Mothers’ Body Mass Index) from the mothers’ HEIGHT (in metres) & WEIGHT (first trimester weight in kg). mBMI = weight in kg / (height in metres)^2. Calculate the following for mBMI:
        – Mean
        – Standard deviation
      Copy and paste your answers into your Word file.
  17. Compute mBMI = weight/height^2 (height is already in metres, so no division by 100 is needed)
      • sga$mBMI <- sga$weight / sga$height^2
      • View(sga)
      • mean(sga$mBMI)
        – [1] 24.49576
      • sd(sga$mBMI)
        – [1] 4.767109
  18. Question 1 – BMI
      • mean(sga$mBMI)
        – [1] 24.49576
      • sd(sga$mBMI)
        – [1] 4.767109
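The BMI computation can be tried without the spreadsheet. The sketch below uses a small made-up data frame (`toy`, with hypothetical weights and heights, not values from SGA.xls) to show that the formula is applied to every row at once and how a height recorded in centimetres would be handled.

```r
# Hypothetical toy data standing in for the SGA spreadsheet
# (the real file is not reproduced here).
toy <- data.frame(
  weight = c(50, 62, 75, 58),          # kg
  height = c(1.55, 1.60, 1.68, 1.50)   # metres
)

# BMI = weight (kg) / height (m)^2, vectorised over all rows
toy$mBMI <- toy$weight / toy$height^2

# If height were stored in centimetres instead, convert first:
# toy$mBMI <- toy$weight / (toy$height_cm / 100)^2

bmi_mean <- mean(toy$mBMI)
bmi_sd   <- sd(toy$mBMI)   # sample standard deviation (n - 1 denominator)
print(round(c(mean = bmi_mean, sd = bmi_sd), 2))
```

Note that `sd()` returns the sample standard deviation, which is what the exercise reports.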
  19. Recode
      • 2. Create a new variable OBESCLAS (Classification of Obesity) from mBMI. Use the following cutoff points:
        – < 20 = Underweight
        – 20 – 24.99 = Normal
        – 25 or larger = Overweight
        – Create a frequency table for OBESCLAS.
  20. Recode mBMI into OBESCLAS
      • sga$obesclas <- NA
      • sga$obesclas[sga$mBMI < 20] <- 1
      • sga$obesclas[sga$mBMI >= 20 & sga$mBMI < 25] <- 2
      • sga$obesclas[sga$mBMI >= 25] <- 3
      • table(sga$obesclas)
      • sga$obesclas <- factor(sga$obesclas, levels = c(1, 2, 3), labels = c('Under', 'Normal', 'Over'))
      • table(sga$obesclas)
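As an alternative to assigning the codes one range at a time, base R's `cut()` can do the recode in a single step. This is only a sketch on hypothetical BMI values chosen to sit on the cutoffs; it is not the method used on the slides.

```r
# Hypothetical BMI values, including the boundary cases 20 and 25
bmi <- c(17.5, 19.99, 20, 24.99, 25, 31.2)

obesclas <- cut(
  bmi,
  breaks = c(-Inf, 20, 25, Inf),
  labels = c("Under", "Normal", "Over"),
  right  = FALSE   # intervals are [lower, upper): 20 is Normal, 25 is Over
)

print(table(obesclas))
```

Setting `right = FALSE` matters: with the default, a BMI of exactly 20 would be classed as Under, which contradicts the cutoffs in the exercise.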
  21. Frequency table for OBESCLAS
      • table(sga$obesclas)
        – Under Normal Over
        – 17 40 43
      • prop.table(table(sga$obesclas))
        – Under Normal Over
        – 0.17 0.40 0.43
        – 17% 40% 43%
  22. Question 2 – Obese Classification
      • table(sga$obesclas)
        – Under Normal Over
        – 17 40 43
      • prop.table(table(sga$obesclas))
        – Under Normal Over
        – 0.17 0.40 0.43
        – 17% 40% 43%
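The percentage row is not printed by `prop.table()` itself; it comes from multiplying the proportions by 100. A minimal sketch, using the counts reported on the slide as a plain named vector (the variable names are illustrative):

```r
# Counts as reported in the frequency table
counts <- c(Under = 17, Normal = 40, Over = 43)

# Proportions sum to 1; multiply by 100 for percentages
pct <- round(100 * prop.table(counts), 1)
print(pct)
```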
  23. Exercise 3
      • 3. Conduct the appropriate statistical test to test whether there is any association between OBESCLAS (Underweight/Normal/Overweight) and OUTCOME.
      • Both variables are qualitative, so the most suitable analysis is the Pearson Chi-square test.
      • Table to fill in: rows UnderW, Normal, OverW, TOTAL; columns SGA, Normal, TOTAL; column totals 50, 50, 100.
  24. Chi-Square Analysis
      • library(gmodels)
      • CrossTable(sga$obesclas, sga$outcome, digits = 2, max.width = 5, expected = TRUE, prop.r = TRUE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE, format = "SPSS")
        – Pearson's Chi-squared test
        – Chi^2 = 24.39111 d.f. = 2 p = 5.052871e-06
        – Minimum expected frequency: 8.5
  25. Chi-Square Results from R Studio
      • R reports a significant association (p ≈ 5 × 10^-6) between the mother’s weight classification and small for gestational age.
      • The row percentages also show which group has the higher rate of SGA.
  26. Results From R Studio
      • Underweight mothers have a higher rate of SGA (94%) than normal-weight mothers (58%) and overweight mothers (26%).
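The same test can be run with base R's `chisq.test()` on a table of counts. The cell counts below are not given in the transcript; they are inferred (an assumption) from the group sizes 17/40/43, the 50/50 outcome split and the reported SGA rates, and they do reproduce the reported chi-square statistic.

```r
# Cell counts inferred from the slides (assumption): group sizes 17/40/43
# and SGA rates of roughly 94%, 58% and 26%.
tab <- matrix(
  c(16, 23, 11,    # SGA column
     1, 17, 32),   # Normal column
  ncol = 2,
  dimnames = list(obesclas = c("Under", "Normal", "Over"),
                  outcome  = c("SGA", "Normal"))
)

res <- chisq.test(tab)   # Pearson chi-square; no Yates correction for a 3 x 2 table
print(res)
print(res$expected)                                   # check that no expected count is below 5
print(round(100 * prop.table(tab, margin = 1), 1))    # row percentages
```

Checking `res$expected` is the step that justifies choosing the Pearson test over Fisher's exact test.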
  27. Question 3
  28. Question 3
  29. Exercise 4
      • 4. Conduct the appropriate statistical test to test whether there is any association between BMI and OUTCOME.
      • Basically we are comparing the mean BMI of SGA babies’ mothers against the mean BMI of normal babies’ mothers.
      • Therefore the appropriate test is Student’s t-test.
  30. Student’s T-Test
      • library("car")
      • leveneTest(sga$mBMI, sga$outcome)
        – Levene's Test for Homogeneity of Variance (center = median)
        – Df F value Pr(>F)
        – group 1 0.0827 0.7743
        – 98
      • Levene's test shows that the variances are not significantly different (P = 0.7743).
      • Therefore we run the t-test for equal variances.
  31. T-Test Results from R Studio
      • t.test(sga$mBMI ~ sga$outcome, var.equal=TRUE)
        – Two Sample t-test
        – data: sga$mBMI by sga$outcome
        – t = 4.5164, df = 98, p-value = 1.756e-05
        – alternative hypothesis: true difference in means is not equal to 0
        – 95 percent confidence interval: 2.207433 5.667658
        – sample estimates:
        – mean in group Normal 26.46453
        – mean in group SGA 22.52699
      • R Studio reports a significant difference in mean BMI (p = 1.756 × 10^-5) between SGA babies’ mothers (22.53) and normal babies’ mothers (26.46).
      • Therefore the mean BMI of SGA babies’ mothers is significantly lower than the mean BMI of normal babies’ mothers.
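To see what `var.equal = TRUE` does, the pooled-variance t statistic can be computed by hand and compared with `t.test()`. The data below are simulated (hypothetical means and standard deviation, not the SGA dataset), so the numbers will not match the slide.

```r
set.seed(1)  # simulated, hypothetical data; not the SGA dataset
sim <- data.frame(
  outcome = factor(rep(c("Normal", "SGA"), each = 50)),
  mBMI    = c(rnorm(50, mean = 26, sd = 4.5), rnorm(50, mean = 23, sd = 4.5))
)

res <- t.test(mBMI ~ outcome, data = sim, var.equal = TRUE)
print(res)

# Pooled-variance t statistic by hand
m <- tapply(sim$mBMI, sim$outcome, mean)
v <- tapply(sim$mBMI, sim$outcome, var)
n <- tapply(sim$mBMI, sim$outcome, length)
sp2 <- ((n[1] - 1) * v[1] + (n[2] - 1) * v[2]) / (n[1] + n[2] - 2)  # pooled variance
t_manual <- unname((m[1] - m[2]) / sqrt(sp2 * (1 / n[1] + 1 / n[2])))
print(t_manual)
```

With 50 observations per group the degrees of freedom are 50 + 50 - 2 = 98, the same as on the slide.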
  32. Question 4
  33. Question 4
  34. Exercise 5
      • 5. Conduct the appropriate statistical test to find any association between OBESCLAS (Underweight/Normal/Overweight) and BIRTHWGT.
      • Basically we are comparing the mean birth weight of babies born to underweight, normal-weight and overweight mothers.
      • Therefore the appropriate test is Analysis of Variance (ANOVA).
  35. ANOVA
      • library("car")
      • leveneTest(sga$birthwgt, sga$obesclas)
        – Levene's Test for Homogeneity of Variance (center = median)
        – Df F value Pr(>F)
        – group 2 3.1702 0.04638 *
        – 97
        – ---
        – Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
      • The variances of birthwgt differ significantly between the obesclas groups (P = 0.046).
      • Therefore the ANOVA should be interpreted as a case of unequal variances.
  36. ANOVA – command
      • tapply(sga$birthwgt, sga$obesclas, mean)
        – Under Normal Over
        – 2.187059 2.768250 3.245116
      • tapply(sga$birthwgt, sga$obesclas, sd)
        – Under Normal Over
        – 0.3403999 0.6712861 0.6606179
      • levels(sga$obesclas)
      • summary(aov(sga$birthwgt ~ sga$obesclas))
  37. ANOVA – Results
      • > levels(sga$obesclas)
        – [1] "Under" "Normal" "Over"
      • > summary(aov(sga$birthwgt ~ sga$obesclas))
        – Df Sum Sq Mean Sq F value Pr(>F)
        – sga$obesclas 2 14.39 7.196 18.49 1.58e-07 ***
        – Residuals 97 37.76 0.389
        – ---
        – Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  38. ANOVA Results from R Studio
      • R Studio reports a significant difference in mean birth weight (p < 0.0001) between babies of underweight mothers (2.187), normal-weight mothers (2.768) and overweight mothers (3.245).
      • Unfortunately, Levene's test indicates that the variances of the three groups are unequal, so the homogeneity of variances assumption fails. Since aov() assumes equal variances, this result should be read with caution; Welch's ANOVA (oneway.test() with var.equal = FALSE) does not make that assumption.
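When the homogeneity assumption fails, base R's `oneway.test()` offers Welch's ANOVA, which does not assume equal variances. The sketch below runs it next to the classic `aov()` on simulated data (hypothetical group means and standard deviations loosely modelled on the slide; not the SGA dataset), so its output is illustrative only.

```r
set.seed(2)  # simulated, hypothetical data; not the SGA dataset
sim <- data.frame(
  obesclas = factor(rep(c("Under", "Normal", "Over"), times = c(17, 40, 43)),
                    levels = c("Under", "Normal", "Over")),
  birthwgt = c(rnorm(17, 2.2, 0.35), rnorm(40, 2.8, 0.65), rnorm(43, 3.2, 0.65))
)

# Welch's ANOVA: no equal-variance assumption
welch <- oneway.test(birthwgt ~ obesclas, data = sim, var.equal = FALSE)
print(welch)

# Classic one-way ANOVA for comparison (assumes equal variances)
classic <- summary(aov(birthwgt ~ obesclas, data = sim))
print(classic)
```

The Welch version typically reports a non-integer denominator degrees of freedom, which reflects the adjustment for unequal group variances.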
  39. ANOVA Results – post hoc
      • pairwise.t.test(sga$birthwgt, sga$obesclas, p.adjust.method = "bonferroni")
      • Post-hoc tests indicate a significant difference in birth weight between ALL three groups.
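The structure of the post-hoc output is easier to read on a small example. The sketch below runs `pairwise.t.test()` with the Bonferroni adjustment on simulated data (hypothetical values, not the SGA dataset); `pool.sd = FALSE` is shown as an option worth considering when the group variances differ, which is an addition not used on the slides.

```r
set.seed(2)  # simulated, hypothetical data; not the SGA dataset
grp <- factor(rep(c("Under", "Normal", "Over"), times = c(17, 40, 43)),
              levels = c("Under", "Normal", "Over"))
bw  <- c(rnorm(17, 2.2, 0.35), rnorm(40, 2.8, 0.65), rnorm(43, 3.2, 0.65))

# Default: pooled standard deviation across all groups
ph <- pairwise.t.test(bw, grp, p.adjust.method = "bonferroni")
print(ph)

# Option when variances are unequal: separate variances per pair
ph_welch <- pairwise.t.test(bw, grp, p.adjust.method = "bonferroni", pool.sd = FALSE)
print(ph_welch$p.value)
```

The result is a lower-triangular matrix of adjusted p-values, one per pair of groups.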
  40. Question 5
      • tapply(sga$birthwgt, sga$obesclas, mean)
        – Under Normal Over
        – 2.187059 2.768250 3.245116
      • tapply(sga$birthwgt, sga$obesclas, sd)
        – Under Normal Over
        – 0.3403999 0.6712861 0.6606179
  41. Question 5
  42. Exercise 6
      • 6. Assuming that both variables mBMI & BIRTHWGT are normally distributed, conduct an appropriate statistical test to assess the association between the two variables.
        – Demonstrate the association using the appropriate chart. Determine the coefficient of determination.
  43. Pearson Correlation
      • mBMI and birth weight are both normally distributed continuous data. Since the aim is to measure the strength and direction of the association between these two continuous variables, Pearson Correlation is the most appropriate test.
  44. Pearson’s Correlation Command
      • cor.test(sga$mBMI, sga$birthwgt, method="pearson")
        – Pearson's product-moment correlation
        – data: sga$mBMI and sga$birthwgt
        – t = 5.4379, df = 98, p-value = 3.959e-07
        – alternative hypothesis: true correlation is not equal to 0
        – 95 percent confidence interval:
        – 0.3148037 0.6193051
        – sample estimates:
        – cor
        – 0.4814521
      Discussion
      • r = 0.4814521
      • p-value = 3.959 × 10^-7
      • Fair, positive correlation between mBMI and birth weight.
      • Therefore as mothers’ BMI increases, the birth weight also increases.
      • r^2 = 0.4814521^2 = 0.2318
      • 23.18% (r^2 = 0.2318) of the variability in birth weight is explained by the variability in the mothers’ BMI.
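The coefficient of determination and the t statistic both follow directly from r and the degrees of freedom. The short check below uses only the values reported on the slide and the standard relation t = r·sqrt(df)/sqrt(1 − r²); the variable names are illustrative.

```r
r  <- 0.4814521   # correlation reported on the slide
df <- 98          # n - 2, with n = 100

r2     <- r^2                              # coefficient of determination
t_stat <- r * sqrt(df) / sqrt(1 - r^2)     # t statistic for H0: correlation = 0
p_val  <- 2 * pt(-abs(t_stat), df)         # two-sided p-value

print(c(r2 = r2, t = t_stat, p = p_val))
```

The recomputed t agrees with the reported t = 5.4379, which confirms that the slide's r, df and p-value are mutually consistent.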
  45. plot(x = sga$mBMI, y = sga$birthwgt, type = 'p')
  46. Question 6
  47. Question 6: r^2 = 0.4814521^2 = 0.2318
  48. Exercise 7
      • 7. Conduct Simple Linear Regression using BIRTHWGT as the dependent variable. Try to come up with a formula that predicts the baby’s birth weight from the mother’s BMI.
        – y = a + bx
  49. Simple Linear Regression
      • mBMI and birth weight are both normally distributed continuous data. Since the aim is to derive a regression formula relating these two continuous variables, Simple Linear Regression is the most appropriate test.
  50. Scatter plot with regression line
      • plot(x = sga$mBMI, y = sga$birthwgt, type = 'p')
      • abline(lm(sga$birthwgt ~ sga$mBMI), col = 'red', lty = 2)
  51. Simple Linear Regression
      • summary(lm(sga$birthwgt ~ sga$mBMI))
  52. SLR Results from R Studio
      • R Studio reports a significant regression coefficient (b = 0.07330).
      • The constant (a) is 1.07895.
      • 23.18% (r^2 = 0.2318) of the variability in birth weight is explained by the variability in the mothers’ BMI.
      • BW = 1.079 + 0.073 × BMI
      • For every 1-unit increase in BMI, BW increases by 0.07 kg.
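The fitted equation can be turned into a small prediction function. The sketch below hard-codes the intercept and slope reported on the slide; `predict_bw` is a hypothetical helper name introduced here for illustration.

```r
a <- 1.07895   # intercept (constant) reported on the slide
b <- 0.07330   # slope (regression coefficient) reported on the slide

# Hypothetical helper: predicted birth weight (kg) for a given maternal BMI
predict_bw <- function(bmi) a + b * bmi

print(predict_bw(c(20, 25, 30)))
```

With the data loaded, the same predictions would normally come from `predict()` on a model fitted with a `data =` argument, for example `fit <- lm(birthwgt ~ mBMI, data = sga)` followed by `predict(fit, newdata = data.frame(mBMI = 25))`. Predictions are only meaningful within the range of BMI values observed in the data.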
  53. Question 7
  54. Question 7
  55. Question 7
  56. The End. Credits to Prof Lin Naing @ Ayub for the original notes.
