
- 1. Khaled El-Sham’aa 1
- 2. Session Road Map First Steps ANOVA Importing Data into R PCA R Basics Clustering Data Visualization Time Series Correlation & Regression Programming t-Test Publication-Quality output Chi-squared Test 2
- 3. First Steps (1) R is one of the most popular platforms for data analysis and visualization currently available. It is free and open source software: http://www.r-project.org It offers broad coverage and early availability of new, cutting-edge applications/techniques. R will enable us to develop and distribute solutions to our NARS with no hidden license cost. 3
- 4. First Steps (2) 4
- 5. First Steps (3) 5 * 4 [1] 20 a <- (3 * 7) + 1 a [1] 22 b <- c(1, 2, 3, 5, 8) b * 2 [1] 2 4 6 10 16 b[4] [1] 5 b[1:3] [1] 1 2 3 b[c(1,3,5)] [1] 1 3 8 b[b > 4] [1] 5 8 5
- 6. First Steps (4) citation() R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org. 6
- 7. First Steps (5) If you know the name of the function you want help with, you just type a question mark ? at the command line prompt followed by the name of the function: ?read.table 7
- 8. First Steps (6) Sometimes you cannot remember the precise name of the function, but you know the subject on which you want help. Use the help.search function with your query in double quotes like this: help.search("data input") 8
- 9. First Steps (7) To see a worked example just type the function name: example(mean) mean> x <- c(0:10, 50) mean> xm <- mean(x) mean> c(xm, mean(x, trim = 0.10)) [1] 8.75 5.50 mean> mean(USArrests, trim = 0.2) Murder Assault UrbanPop Rape 7.42 167.60 66.20 20.16 9
- 10. First Steps (8) There are hundreds of contributed packages for R, written by many different authors (to implement specialized statistical methods). Most are available for download from CRAN (http://CRAN.R-project.org) List all available packages: library() Load package “ggplot2”: library(ggplot2) Documentation on package library(help=ggplot2) 10
- 11. Importing Data into R (1) data <- read.table("D:/path/file.txt", header=TRUE) data <- read.csv(file.choose(), header=TRUE, sep=";") data <- edit(data) fix(data) head(data) tail(data) tail(data, 10) 11
- 12. Importing Data into R (2) In order to refer to a vector by name within an R session, you need to attach the dataframe containing the vector. Alternatively, you can refer to the dataframe name and the vector name within it, using the element name operator $ like this: mtcars$mpg ?mtcars attach(mtcars) mpg 12
- 13. Importing Data into R (3) 13
- 14. Importing Data into R (4) # Read data left on the clipboard data <- read.table("clipboard", header=T) # ODBC library(RODBC) db1 <- odbcConnect("MY_DB", uid="usr", pwd="pwd") raw <- sqlQuery(db1, "SELECT * FROM table1") # XLSX library(XLConnect) xls <- loadWorkbook("my_file.xlsx", create=F) raw <- as.data.frame(readWorksheet(xls,sheet='Sheet1')) 14
- 15. R Basics (1) max(x) maximum value in x min(x) minimum value in x mean(x) arithmetic average of the values in x median(x) median value in x var(x) sample variance of x sd(x) standard deviation of x cor(x,y) correlation between vectors x and y summary(x) generic function used to produce result summaries of the results of various functions 15
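A quick, hedged sketch of the summary functions listed above, applied to a small made-up vector (the values are chosen purely for illustration):

```r
# Sample vector (illustrative values only)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

max(x)     # 9
min(x)     # 2
mean(x)    # 5
median(x)  # 4.5
var(x)     # sample variance (denominator n - 1)
sd(x)      # square root of the sample variance
summary(x) # min, quartiles, mean, and max in one call
```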
- 16. R Basics (2) abs(x) absolute value of x floor(2.718) largest integer not greater than x ceiling(3.142) smallest integer not less than x asin(x) inverse sine of x in radians round(2.718, digits=2) returns 2.72 x <- 1:12; sample(x) simple randomization RCBD randomization: RCBD <- replicate(3, sample(x)) 16
- 17. R Basics (3) Common data transformations (where x is the vector whose values are to be transformed): Measurements (lengths, weights, etc.): loge via log(x); log10 via log(x, 10) or log10(x); log(x + 1) when zeros are present. Counts (number of individuals, etc.): sqrt(x). Percentages (must be proportions): arcsine via asin(sqrt(x))*180/pi. 17
- 18. R Basics (4) Vectorized computations: Any function or operator applied to a vector automatically operates on all elements of the vector. nchar(month.name) # 7 8 5 5 3 4 4 6 9 7 8 8 The recycling rule: The shorter vector is replicated enough times so that the result has the length of the longer vector, then the operator is applied. 1:10 + 1:3 # 2 4 6 5 7 9 8 10 12 11 18
- 19. R Basics (5) mydata <- matrix(rnorm(30), nrow=6) mydata # calculate the 6 row means apply(mydata, 1, mean) # calculate the 5 column means apply(mydata, 2, mean) apply(mydata, 2, mean, trim=0.2) 19
- 20. R Basics (6) String functions: substr(month.name, 2, 3) paste("*", month.name[1:4], "*", sep=" ") x <- toupper(dna.seq) rna.seq <- chartr("T", "U", x) comp.seq <- chartr("ACTG", "TGAC", dna.seq) 20
- 21. R Basics (7) Surprisingly, the base installation doesn’t provide functions for skewness and kurtosis, but you can add your own: m <- mean(x) n <- length(x) s <- sd(x) skew <- sum((x-m)^3/s^3)/n kurt <- sum((x-m)^4/s^4)/n - 3 21
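The snippet above assumes a numeric vector x already exists in the workspace. A self-contained sketch wraps the same formulas into small reusable functions (the names mySkew and myKurt are our own, not base R):

```r
# Skewness: third standardized moment (0 for symmetric data)
mySkew <- function(x) {
  m <- mean(x); s <- sd(x); n <- length(x)
  sum((x - m)^3 / s^3) / n
}

# Excess kurtosis: fourth standardized moment minus 3
myKurt <- function(x) {
  m <- mean(x); s <- sd(x); n <- length(x)
  sum((x - m)^4 / s^4) / n - 3
}

mySkew(rnorm(1000))  # close to 0 for (near-)normal data
```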
- 22. Data Visualization (1) Pairs for a matrix of scatter plots of every variable against every other: ?mtcars pairs(mtcars) Voilà! 22
- 23. Data Visualization (2) pie(table(cyl)) barplot(table(cyl)) 23
- 24. Data Visualization (3) plot gives a scatter plot if x is continuous, and a box-and-whisker plot if x is a factor. Some people prefer the alternative syntax plot(y~x): attach(mtcars) plot(wt, mpg) plot(cyl, mpg) cyl <- factor(cyl) plot(cyl, mpg) 24
- 25. Data Visualization (4) 25
- 26. Data Visualization (5) Histograms show a frequency distribution hist(qsec, col="gray") 26
- 27. Data Visualization (6) boxplot(qsec, col="gray") boxplot(qsec, mpg, col="gray") 27
- 28. Data Visualization (7) XY <- cbind(LAT, LONG) plot(XY, type='l') library(sp) XY.poly <- Polygon(XY) XY.pnt <- spsample(XY.poly, n=8, type='random') XY.pnt points(XY.pnt) 28
- 29. Data Visualization (8) 29
- 30. Correlation and Regression (1) If you want to determine the significance of a correlation (i.e. the p-value associated with the calculated value of r) then use cor.test rather than cor. cor(wt, mpg) [1] -0.8676594 The value varies from -1 to +1: -1 indicates perfect negative correlation, +1 indicates perfect positive correlation, and 0 means no correlation. 30
- 31. Correlation and Regression (2) cor.test(wt, qsec) Pearson's product-moment correlation data: wt and qsec t = -0.9719, df = 30, p-value = 0.3389 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.4933536 0.1852649 sample estimates: cor -0.1747159 31
- 32. Correlation and Regression (3) cor.test(wt, mpg) Pearson's product-moment correlation data: wt and mpg t = -9.559, df = 30, p-value = 1.294e-10 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.9338264 -0.7440872 sample estimates: cor -0.8676594 32
- 33. Correlation and Regression (4) 33
- 34. Correlation and Regression (5) Fits a linear model with normal errors and constant variance; generally this is used for regression analysis using continuous explanatory variables. fit <- lm(y ~ x) summary(fit) plot(x, y) # Sample of multiple linear regression fit <- lm(y ~ x1 + x2 + x3) 34
- 35. Correlation and Regression (6) Call: lm(formula = mpg ~ wt) Residuals: Min 1Q Median 3Q Max -4.5432 -2.3647 -0.1252 1.4096 6.8727 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 37.2851 1.8776 19.858 < 2e-16 *** wt -5.3445 0.5591 -9.559 1.29e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.046 on 30 degrees of freedom Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446 F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10 35
- 36. Correlation and Regression (7) The great thing about graphics in R is that it is extremely straightforward to add things to your plots. In the present case, we might want to add a regression line through the cloud of data points. The function for this is abline which can take as its argument the linear model object: abline(fit) Note: abline(a, b) function adds a regression line with an intercept of a and a slope of b 36
- 37. Correlation and Regression (8) plot(wt, mpg, xlab="Weight", ylab="Miles/Gallon") abline(fit, col="blue", lwd=2) text(4, 25, "mpg = 37.29 - 5.34 wt") 37
- 38. Correlation and Regression (9) predict is a generic built-in function for predictions from the results of various model-fitting functions: predict(fit, list(wt = 4.5)) [1] 13.23500 38
- 39. Correlation and Regression (10) 39
- 40. Correlation and Regression (11) What do you do if you identify problems? There are four approaches to dealing with violations of regression assumptions: deleting observations, transforming variables, adding or deleting variables, or using another regression approach. 40
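A minimal sketch of the second approach (transforming variables), using the mtcars data from earlier slides: if the residual plot suggests non-constant variance, a log transform of the response sometimes stabilizes it. This is an illustration, not a recommendation for these particular data.

```r
# Fit on the raw and on the log-transformed response
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(log(mpg) ~ wt, data = mtcars)

# Compare the residuals-vs-fitted diagnostic for the two fits
par(mfrow = c(1, 2))
plot(fit1, which = 1)  # raw response
plot(fit2, which = 1)  # log response
```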
- 41. Correlation and Regression (12) You can compare the fit of two nested models using the anova() function in the base installation. A nested model is one whose terms are completely included in the other model. fit1 <- lm(y ~ A + B + C) fit2 <- lm(y ~ A + C) anova(fit1, fit2) If the test is not significant (i.e. p > 0.05), we conclude that B doesn’t add to the linear prediction and we’re justified in dropping it from our model. 41
- 42. Correlation and Regression (13) # Bootstrap 95% CI for R-Squared library(boot) rsq <- function(formula, data, indices) { fit <- lm(formula, data= data[indices,]) return(summary(fit)$r.square) } rs <- boot(data=mtcars, statistic=rsq, R=1000, formula=mpg~wt+disp) boot.ci(rs, type="bca") # try print(rs) and plot(rs) 42
- 43. t-Test (1) Comparing two sample means with normal errors (Student’s t-test, t.test): a <- qsec[cyl == 4] b <- qsec[cyl == 6] c <- qsec[cyl == 8] t.test(a, b) t.test(a, b, paired = TRUE) # alternative argument options: # "two.sided", "less", "greater" 43
- 44. t-Test (2) t.test(a, b) Welch Two Sample t-test data: a and b t = 1.4136, df = 12.781, p-value = 0.1814 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -0.6159443 2.9362040 sample estimates: mean of x mean of y 19.13727 17.97714 44
- 45. t-Test (3) t.test(a, c) Welch Two Sample t-test data: a and c t = 3.9446, df = 17.407, p-value = 0.001005 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 1.102361 3.627899 sample estimates: mean of x mean of y 19.13727 16.77214 45
- 46. t-Test (4) (a) Test the equality of variances assumption: ev <- var.test(a, c)$p.value (b) Test the normality assumption: an <- shapiro.test(a)$p.value bn <- shapiro.test(c)$p.value 46
- 47. Chi-squared Test (1) Construct hypotheses based on qualitative – categorical data: myTable <- table(am, cyl) myTable cyl am 4 6 8 automatic 3 4 12 manual 8 3 2 47
- 48. Chi-squared Test (2) chisq.test(myTable) Pearson's Chi-squared test data: myTable X-squared = 8.7407, df = 2, p-value = 0.01265 The expected counts under the null hypothesis: chisq.test(myTable)$expected cyl am 4 6 8 automatic 6.53125 4.15625 8.3125 manual 4.46875 2.84375 5.6875 48
- 49. Chi-squared Test (3) mosaicplot(myTable, color=rainbow(3)) 49
- 50. ANOVA (1) A method which partitions the total variation in the response into its components (sources of variation) is called the analysis of variance. table(N, S, Rep) N <- factor(N) S <- factor(S) Rep <- factor(Rep) 50
- 51. ANOVA (2) The best way to understand the two significant interaction terms is to plot them using interaction.plot like this: interaction.plot(S, N, Yield) 51
- 52. ANOVA (3) boxplot(Yield~N, col="gray") 52
- 53. ANOVA (4) model <- aov(Yield ~ N * S) #CRD summary(model) Df Sum Sq Mean Sq F value Pr(>F) N 2 4.5818 2.2909 42.7469 1.230e-08 *** S 3 0.9798 0.3266 6.0944 0.003106 ** N:S 6 0.6517 0.1086 2.0268 0.101243 Residuals 24 1.2862 0.0536 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 53
- 54. ANOVA (5) par(mfrow = c(2, 2)) plot(model) ANOVA assumptions: Normality Linearity Constant variance Independence 54
- 55. ANOVA (6) model.tables(model, "means") Tables of means Grand mean 1.104722 N 0 180 230 0.6025 1.3142 1.3975 S 0 10 20 40 0.8289 1.1556 1.1678 1.2667 S N 0 10 20 40 0 0.5600 0.7733 0.5233 0.5533 180 0.8933 1.2900 1.5267 1.5467 230 1.0333 1.4033 1.4533 1.7000 55
- 56. ANOVA (7) model.tables(model, se=TRUE) ....... Standard errors for differences of means N S N:S 0.0945 0.1091 0.1890 replic. 12 9 3 plot.design(Yield ~ N * S) 56
- 57. ANOVA (8) mc <- TukeyHSD(model, "N", ordered = TRUE); mc Tukey multiple comparisons of means 95% family-wise confidence level factor levels have been ordered Fit: aov(formula = Yield ~ N * S) $N diff lwr upr p adj 180-0 0.71166667 0.4756506 0.9476827 0.0000003 230-0 0.79500000 0.5589840 1.0310160 0.0000000 230-180 0.08333333 -0.1526827 0.3193494 0.6567397 57
- 58. ANOVA (9) plot(mc) 58
- 59. ANOVA (10) summary(aov(Yield ~ N * S + Error(Rep))) #RCB Error: Rep Df Sum Sq Mean Sq F value Pr(>F) Residuals 2 0.30191 0.15095 Error: Within Df Sum Sq Mean Sq F value Pr(>F) N 2 4.5818 2.2909 51.2035 5.289e-09 *** S 3 0.9798 0.3266 7.3001 0.001423 ** N:S 6 0.6517 0.1086 2.4277 0.059281 . Residuals 22 0.9843 0.0447 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 59
- 60. ANOVA (11) In a split-plot design, different treatments are applied to plots of different sizes. Each different plot size is associated with its own error variance. The model formula is specified as a factorial, using the asterisk notation. The error structure is defined in the Error term, with the plot sizes listed from left to right, from largest to smallest, with each variable separated by the slash operator /. model <- aov(Yield ~ N * S + Error(Rep/N)) 60
- 61. ANOVA (12) Error: Rep Df Sum Sq Mean Sq F value Pr(>F) Residuals 2 0.30191 0.15095 Error: Rep:N Df Sum Sq Mean Sq F value Pr(>F) N 2 4.5818 2.29088 55.583 0.001206 ** Residuals 4 0.1649 0.04122 Error: Within Df Sum Sq Mean Sq F value Pr(>F) S 3 0.97983 0.32661 7.1744 0.002280 ** N:S 6 0.65171 0.10862 2.3860 0.071313 . Residuals 18 0.81943 0.04552 61
- 62. ANOVA (13) Analysis of Covariance: # f is treatment factor # x is variate acts as covariate model <- aov(y ~ x * f) Split both main effects into linear and quadratic parts. contrasts <- list(N = list(lin=1, quad=2), S = list(lin=1, quad=2)) summary(model, split=contrasts) 62
- 63. PCA (1) The idea of principal components analysis (PCA) is to find a small number of linear combinations of the variables so as to capture most of the variation in the dataframe as a whole. d2 <- cbind(wt, disp/10, hp/10, mpg, qsec) colnames(d2) <- c("wt", "disp", "hp", "mpg", "qsec") 63
- 64. PCA (2) model <- prcomp(d2) model Standard deviations: [1] 14.6949595 3.9627722 2.8306355 1.1593717 Rotation: PC1 PC2 PC3 PC4 wt -0.05887539 0.05015401 -0.07513271 -0.16910728 disp -0.83186362 0.47519625 0.28005113 0.04080894 hp -0.40572567 -0.83180078 0.24611265 -0.28768795 mpg 0.36888799 0.12190490 0.91398919 -0.09385946 qsec 0.06200759 0.25479354 -0.14134625 -0.93710373 64
- 65. PCA (3) summary(model) Importance of components: PC1 PC2 PC3 Standard deviation 14.6950 3.96277 2.83064 Proportion of Variance 0.8957 0.06514 0.03323 Cumulative Proportion 0.8957 0.96082 0.99405 65
- 66. PCA (4) plot(model) biplot(model) 66
- 67. Clustering (1) We define similarity on the basis of the distance between two samples in this n-dimensional space. Several different distance measures could be used to work out the distance from every sample to every other sample. This quantitative dissimilarity structure of the data is stored in a matrix produced by the dist function: rownames(d2) <- rownames(mtcars) my.dist <- dist(d2, method="euclidean") 67
- 68. Clustering (2) Initially, each sample is assigned to its own cluster, and then the hclust algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster (see ?hclust for details). my.hc <- hclust(my.dist, "ward") 68
- 69. Clustering (3) We can plot the object called my.hc, and we specify that the leaves of the hierarchy are labeled by their plot numbers plot(my.hc, hang=-1) g <- rect.hclust(my.hc, k=4, border="red") Note: When the hang argument is set to '-1' then all leaves end on one line and their labels hang down from 0. 69
- 70. Clustering (4) 70
- 71. Clustering (5) Partitioning into a number of clusters specified by the user. gr <- kmeans(cbind(disp, hp), 2) plot(disp, hp, col = gr$cluster, pch=19) points(gr$centers, col = 1:2, pch = 8, cex=2) 71
- 72. Clustering (6) 72
- 73. Clustering (7) K-means clustering with 2 clusters of sizes 18, 14 Cluster means: disp hp 1 135.5389 98.05556 2 353.1000 209.21429 Clustering vector: [1] 1 1 1 1 2 1 2 1 1 1 1 2 2 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 2 1 2 1 Within cluster sum of squares by cluster: [1] 58369.27 93490.74 (between_SS / total_SS = 75.6 %) 73
- 74. Clustering (8) x <- as.matrix(mtcars) heatmap(x, scale="column") 74
- 75. Time Series (1) First, make the data variable into a time series object # create time-series objects beer <- ts(beer, start=1956, freq=12) It is useful to be able to turn a time series into components. The function stl performs seasonal decomposition of a time series into seasonal, trend and irregular components using loess. 75
- 76. Time Series (2) The remainder component is the residuals from the seasonal plus trend fit. The bars at the right-hand side are of equal heights (in user coordinates). # Decompose a time series into seasonal, # trend and irregular components using loess ts.comp <- stl(beer, s.window="periodic") plot(ts.comp) 76
- 77. Time Series (3) 77
- 78. Programming (1) We can extend the functionality of R by writing a function that estimates the standard error of the mean: SEM <- function(x, na.rm = FALSE) { if (na.rm) VAR <- x[!is.na(x)] else VAR <- x SD <- sd(VAR) N <- length(VAR) SE <- SD/sqrt(N) return(SE) } 78
- 79. Programming (2) You can define your own operator of the form %any% using any text string in place of any. The function should be a function of two arguments. "%p%" <- function(x,y) paste(x,y,sep=" ") "Hi" %p% "Khaled" [1] "Hi Khaled" 79
- 80. Programming (3) setwd("path/to/folder") sink("output.txt") cat("Intercept t Slope") a <- fit$coefficients[[1]] b <- fit$coefficients[[2]] cat(paste(a, b, sep="t")) sink() jpeg(filename="graph.jpg", width=600, height=600) plot(wt, mpg); abline(fit) dev.off() 80
- 81. Programming (4) The code for R functions can be viewed, and in most cases modified if so desired, using the fix() function. You can trigger garbage collection by calling the gc() function, which will also report a few memory usage statistics. The basic tool for code timing is system.time(commands). tempfile() gives a unique file name in a temporary writable directory that is deleted at the end of the session. 81
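The utilities mentioned above can be sketched in a few lines (the sorting workload is arbitrary, chosen only to give system.time something measurable):

```r
# Time an expression: returns user/system/elapsed seconds
system.time(sort(runif(1e6)))

# Trigger garbage collection and report memory usage statistics
gc()

# Write to a unique temporary file, removed when the session ends
tmp <- tempfile(fileext = ".csv")
write.csv(head(mtcars), tmp)
read.csv(tmp, nrows = 2)
```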
- 82. Programming (5) Take control of your R code! RStudio is a free and open source integrated development environment for R. You can run it on your desktop (Windows, Mac, or Linux) : Syntax highlighting, code completion, etc... Execute R code directly from the source editor Workspace browser and data viewer Plot history, zooming, and flexible image & PDF export Integrated R help and documentation and more (http://www.rstudio.com/ide/) 82
- 83. Programming (6) 83
- 84. Programming (7) If we want to evaluate the quadratic x^2 - 2x + 4 many times, we can write a function that evaluates it for a specific value of x: my.f <- function(x) { x^2 - 2*x + 4 } my.f(3) [1] 7 plot(my.f, -10, +10) 84
- 85. Programming (8) 85
- 86. Programming (9) We can find the minimum of the function using: optimize(my.f, lower = -10, upper = 10) $minimum [1] 1 $objective [1] 3 which says that the minimum occurs at x=1 and at that point the quadratic has value 3. 86
- 87. Programming (10) We can integrate the function over the interval -10 to 10 using: integrate(my.f, lower = -10, upper = 10) 746.6667 with absolute error < 4.1e-12 which gives an answer together with an estimate of the absolute error. 87
- 88. Programming (11) plot(my.f, -15, +15) v <- seq(-10,10,0.01) x <- c(-10,v,10) y <- c(0,my.f(v),0) polygon(x, y, col='gray') 88
- 89. Publication-Quality Output (1) Research doesn’t end when the last statistical analysis is completed. We need to include the results in a report. The xtable function converts an R object to an xtable object, which can then be printed as a LaTeX table. LaTeX is a document preparation system for high-quality typesetting (http://www.latex-project.org). library(xtable) print(xtable(model)) 89
- 90. Publication-Quality Output (2) library(xtable) example(aov) print(xtable(npk.aov)) 90
- 91. Publication-Quality Output (3) The ggplot2 package is an elegant alternative to the base graphics system. It has two complementary uses: producing publication-quality graphics using very simple syntax that is similar to that of base graphics (ggplot2 tends to make smart default choices for color, scale, etc.), and making more sophisticated/customized plots that go beyond the defaults. 91
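A minimal ggplot2 sketch (assumes the ggplot2 package is installed), reproducing the wt vs mpg scatter plot from the regression slides with a fitted line:

```r
library(ggplot2)

# Scatter plot with a linear fit and confidence band
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(x = "Weight", y = "Miles/Gallon")
print(p)
```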
- 92. Publication-Quality Output (4) 92
- 93. Final words! How Large is Your Family? How many brothers and sisters are there in your family including yourself? The average number of children in families was about 2. Can you explain the difference between this value and the class average? Birthday Problem! The problem is to compute the approximate probability that in a room of n people, at least two have the same birthday. 93
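The birthday probability can be computed directly in R; a sketch that ignores leap years, alongside the built-in helper pbirthday from the stats package (the function name p.same is our own):

```r
# P(at least two of n people share a birthday), 365 equally likely days
p.same <- function(n) 1 - prod((365 - 0:(n - 1)) / 365)

p.same(23)     # about 0.507 -- just over one half
pbirthday(23)  # built-in equivalent in the stats package
```

With 23 people in the room the probability already exceeds 1/2, which is the classic counter-intuitive result.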
- 94. Online Resources http://tryr.codeschool.com http://www.r-project.org http://www.statmethods.net http://www.r-bloggers.com http://www.r-tutor.com http://blog.revolutionanalytics.com/r 94
- 95. Thank You 95