R Language Introduction



Published in: Technology
  • The c() function is short for "concatenate".
    Comparisons are vectorized:
      b > 4
      [1] FALSE FALSE FALSE TRUE TRUE
    For complex conditions you can use the logical operators:
      !     logical negation (NOT)
      &     logical AND
      |     logical OR
      %in%  tests whether each element on the left occurs anywhere in the object on the right
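  • A minimal sketch of the operators listed above, reusing the vector b from the First Steps slide. The result names (mid, out, not_big, hits) are only illustrative.

```r
b <- c(1, 2, 3, 5, 8)

mid     <- b[b > 2 & b < 8]   # AND: keep values strictly between 2 and 8
out     <- b[b < 2 | b > 5]   # OR: keep values below 2 or above 5
not_big <- b[!(b > 4)]        # NOT: keep values that are not greater than 4
hits    <- b %in% c(2, 8)     # membership test, one TRUE/FALSE per element of b

print(mid)
print(out)
print(not_big)
print(hits)
```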
  • help.start()   # Launch the R HTML documentation
    # Argument list of a function
      args(read.csv)
      function (file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
      NULL
  • # User comment
    R doesn't provide multiline or block comments; you must start each line of a multiline comment with #.
    For debugging purposes, you can also surround code that you want the interpreter to ignore with the statement if (FALSE) { … }
  • Packages -> Install package(s)… Select a CRAN mirror, then browse all available packages in the CRAN repository.
      installed.packages()          # List all currently installed packages
      install.packages("ggplot2")   # Install package ggplot2 from the CRAN mirror (the name must be quoted)
    Note: to be sure that you call the function from a specific package, use the full name like this: package::function()
  • You can use the optional row.names parameter to specify one variable (i.e. a column name in your imported data set) as the row identifier (like plot #).
    Set and get the working directory:
      setwd("/path/to/your/directory")
      getwd()
    Note: the setwd() function won’t create a directory that doesn’t exist. If necessary, use the dir.create() function to create the new directory, and then call setwd().
    Read and execute R code from an external file:
      source("filename.R")
  • The detach() function removes the data frame from the search path; it does nothing to the data frame itself. The call is optional, but it is good programming practice and should be included routinely (see also the with() function).
    # List and remove objects:
      ls()
      rm(VAR1, VAR2)
      rm(list = ls())
    # How to add one more calculated column to your data frame:
      data <- transform(data, RYD=SYD/BYD)
    # Example on dates using the R language:
      startday <- as.Date("2002-08-15")
      today <- Sys.Date()
      days <- today - startday
  • Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. The combine function c() is used to form the vector.
      a <- c(1, 2, 5, 3, 6, -2, 4)
      b <- c("one", "two", "three")
      c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
    A matrix is a two-dimensional array where each element has the same mode (numeric, character, or logical). Matrices are created with the matrix() function.
      y <- matrix(1:20, nrow=5, ncol=4, byrow=TRUE)
    Arrays are similar to matrices but can have more than two dimensions. They are created with the array() function.
      z <- array(1:24, c(2, 3, 4))
    A data frame is more general than a matrix in that different columns can contain different modes of data (numeric, character, etc.). All columns must have the same number of rows, so the call below is schematic: the objects a, b, c, y, and z defined above differ in length and cannot be combined as they stand.
      data <- data.frame(a, b, c, y, z)
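  • A small sketch of a data frame built from equal-length columns of different modes. The column names and values (plot_id, yield, irrigated) are made up for illustration and do not come from the session data.

```r
# Three columns of the same length, one per mode
plot_id   <- c("P1", "P2", "P3")      # character
yield     <- c(2.4, 3.1, 2.8)         # numeric
irrigated <- c(TRUE, FALSE, TRUE)     # logical

trial <- data.frame(plot_id, yield, irrigated)

str(trial)                      # one row per plot, one column per variable
mean(trial$yield)               # columns are accessed with $
trial[trial$irrigated, ]        # rows can be selected with a logical condition
```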
  • # XML
      library(XML)
      cdCatalog <- xmlToDataFrame("http://www.w3schools.com/xml/cd_catalog.xml")
      countryCdCatalog <- split(cdCatalog, cdCatalog$COUNTRY)
      class(countryCdCatalog)
      [1] "list"
      names(countryCdCatalog)
      [1] "EU" "Norway" "UK" "USA"
      countryCdCatalog$EU
    Note: be sure that any missing data is properly coded as missing before analyzing the data, or the results will be meaningless. For example, if the value -999 refers to a missing observation in your Yield data, you can fix it using the following command:
      x[x == -999] <- NA
  • mean(x, trim=0.05, na.rm=TRUE)
    Provides the trimmed mean, dropping the highest and lowest 5 percent of scores as well as any missing values.
    The summary() function returns frequencies for factors and logical vectors.
    sqrt(x) is the same as x^(0.5)
    Functions of the form is.datatype() return TRUE or FALSE, while functions of the form as.datatype() convert the argument to that type.
    Data types: numeric, character, logical, vector, factor, matrix, array, data.frame
    The long/detailed way to calculate sd (i.e. standard deviation):
      n <- length(x)
      x.mean <- sum(x) / n
      ss <- sum((x - x.mean)^2)
      x.sd <- sqrt(ss / (n - 1))
  • signif(24+pi/100, digits=6)   # returns 24.0314 (i.e. rounds x to the specified number of significant digits)
    Sequence generation: seq(from, to) or seq(from, to, by)
    The sample() function enables you to take a random sample (with or without replacement) of size n from a dataset (this can be useful in the bootstrapping technique):
      sample(x, n, replace=FALSE)
      sample(c("H", "T"), 10, replace=TRUE, prob=c(0.53, 0.47))
    Note: to ensure that all trainees get the same randomization when they run the code on their own machines, you may use:
      set.seed(123)
      x <- c(1, 4, 9, 16, 25, 36)
      diff(x)   # returns c(3, 5, 7, 9, 11)
    Combine R objects by rows (i.e. rbind) or columns (i.e. cbind):
      X <- c(0, 1, 2, 3, 4)
      Y <- c(5, 6, 7, 8, 9)
      XY <- cbind(X, Y)
  • Back transformation:
      log(x)    vs.  exp(x)
      log10(x)  vs.  10^x
      sqrt(x)   vs.  x^2
    # Standardizes the values of x (mean of 0 and sd of 1).
    # To only center the data, use scale=FALSE
    # To only rescale the data, use center=FALSE
      scale(x, center=TRUE, scale=TRUE)
    # Values such as infinity and not a number (NaN) are represented appropriately
      x <- 1/0   # Inf
      -x         # -Inf
      x-x        # NaN
      1/Inf      # 0
    # Classical example showing the limits of floating-point arithmetic
      a <- sqrt(2)
      a*a == 2   # FALSE
      a*a - 2    # 4.440892e-16
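  • Because of the rounding shown in the sqrt(2) example, numeric results are usually compared with a tolerance rather than with ==. A minimal sketch using base R's all.equal(); the explicit threshold 1e-9 is an arbitrary choice for illustration.

```r
a <- sqrt(2)

exact <- a * a == 2                       # exact comparison, affected by rounding
close <- isTRUE(all.equal(a * a, 2))      # comparison with a small default tolerance
tol   <- abs(a * a - 2) < 1e-9            # the same idea with an explicit threshold

print(c(exact = exact, close = close, tol = tol))
```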
  • # Details of the recycling example calculation:
       1 + 1 =  2
       2 + 2 =  4
       3 + 3 =  6
       4 + 1 =  5
       5 + 2 =  7
       6 + 3 =  9
       7 + 1 =  8
       8 + 2 = 10
       9 + 3 = 12
      10 + 1 = 11
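  • The calculation above is what R does when a vector of length 10 is added to a vector of length 3: the shorter one is recycled. A minimal sketch, assuming the slide's example was 1:10 plus c(1, 2, 3). R also issues a warning here because 10 is not a multiple of 3, so the warning is suppressed explicitly.

```r
x <- 1:10
y <- c(1, 2, 3)

# y is reused as 1 2 3 1 2 3 1 2 3 1 to match the length of x
z <- suppressWarnings(x + y)
print(z)
```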
  • # set the random seed to ensure that you get the same values
      set.seed(123)
    # generate a 6 x 5 matrix containing random normal variates
      mydata <- matrix(rnorm(30), nrow=6)
    # calculate trimmed column means (in this case, means based on the middle 60%
    # of the data, with the bottom 20 percent and top 20 percent of values discarded)
      apply(mydata, 2, mean, trim=0.2)
  • substr(month.name, 2, 3)
    [1] "an" "eb" "ar" "pr" "ay" "un" "ul" "ug" "ep" "ct" "ov" "ec"
    paste("*", month.name[1:4], "*", sep=" ")
    [1] "* January *" "* February *" "* March *" "* April *"
    letters[1:4]
    [1] "a" "b" "c" "d"
    LETTERS[1:4]
    [1] "A" "B" "C" "D"
    sub("\\s", ".", "Hello World")       # returns "Hello.World"
    strsplit("Hello World", "\\s+")      # returns a list holding one character vector with two elements
    strsplit(month.name, c("a", "e"))    # the split patterns follow the recycling rule
    # search for a regular expression pattern in the string and return the position of the first match
    x <- regexpr("pattern", "string", perl=TRUE)
  • Skewness = 0 (symmetric), positive (skewed to the right), negative (skewed to the left)
    Kurtosis is positive (leptokurtic), kurtosis is negative (platykurtic)
      x <- c(2, 8, 1, 9, 7, 5)
      sort(x, decreasing=T)   # 9 8 7 5 2 1
      rank(x)                 # 2 5 1 6 4 3
      which.min(x)            # 3
      which.max(x)            # 4
  • Whereas pie charts are ubiquitous in the business world, they are denigrated by most statisticians, who recommend bar or dot plots over pie charts because people are able to judge length more accurately than volume.
    To add colors to your categorized boxplot you can try this:
      plot(cyl, mpg, col=rainbow(nlevels(cyl)))
    Other vectors of contiguous colors include: heat.colors(), terrain.colors(), topo.colors(), and cm.colors()
    For gray levels you can use something like this: gray(0:n/n) where n <- nlevels(cyl)
  • Mathematical symbols: you can use the expression function to display text that contains mathematical symbols (i.e. use it in xlab, ylab, or main, etc…)
      expression(frac(mu,sqrt(2*pi*sigma^2)))
    The log parameter in the plot function indicates which axes, if any, should be plotted on a logarithmic scale: log="x", log="y", or log="xy" for a log x-axis, a log y-axis, or both axes respectively.
    The tck option in the plot function lets you define the length of the tick marks as a fraction of the plotting region (a negative number is outside the graph, a positive number is inside, 0 suppresses ticks, and 1 creates gridlines); the default, tck=NA, defers to tcl=-0.5.
  • The mtext() function places text in one of the four margins. The format is:
      mtext("text to place", side=n, line=m, ...)
    where side defines which margin to place the text in (1=bottom, 2=left, 3=top, 4=right), while line indicates the line in the margin, starting with 0 (closest to the plot area) and moving out.
    To create a plot based on probability densities rather than frequencies:
      hist(qsec, col="gray", probability = TRUE)
      lines(density(qsec), col = "red", lwd = 3)
    You can define how many breaks there are in your histogram using the breaks option (i.e. breaks=20).
    You can draw a standalone density plot (one that’s not superimposed on another graph) using the following command:
      plot(density(qsec))
  • The boxplot summarizes a great deal of information very clearly. The horizontal line shows the median. The bottom and top of the box show the 25th and 75th percentiles, respectively. The vertical dashed lines show one of two things: either the maximum value or 1.5 times the interquartile range of the data (roughly 2 standard deviations). Points more than 1.5 times the interquartile range above and below are defined as outliers and plotted individually.
    A boxplot can be created for variables by group using a formula instead of the name of the variable alone. For example, y ~ A generates a separate boxplot of the numeric variable y for each value of the categorical variable A, while y ~ A*B produces a boxplot for each combination of levels in the categorical variables A and B.
      quantile(qsec)
           0%     25%     50%     75%    100%
      14.5000 16.8925 17.7100 18.9000 22.9000
      quantile(qsec, probs=seq(0, 1, 0.1))
          0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%
      14.500 15.534 16.734 17.020 17.340 17.710 18.180 18.607 19.332 19.990 22.900
  • # ICARDA Tel-Hadya Farm
      LONG <- c(35.99931833, 36.00396667, 36.03403667, 36.02687333, 36.025495, 36.00249667, 35.99931667, 36.00312667, 36.00401, 35.99931833)
      LAT <- c(36.93153667, 36.91863167, 36.939475, 36.96093, 36.9622, 36.96005167, 36.947595, 36.94750667, 36.93485667, 36.93153667)
  • You can display the relationship between three quantitative variables using a 2D scatter plot and use the size of the plotted point to represent the value of the third variable. This approach is referred to as a bubble plot. You want the areas, rather than the radii, of the circles to be proportional to the values of the third variable. Given that a circle of area a has radius r = sqrt(a/pi), the proper call is:
      r <- sqrt(disp[1:10]/pi)
      symbols(wt[1:10], mpg[1:10], circles=r, inches=0.30, fg="white", bg="lightblue",
              main="Bubble Plot with point size proportional to disp",
              xlab="Weight (lb/1000)", ylab="Miles/(US) gallon")
      text(wt[1:10], mpg[1:10], rownames(mtcars[1:10,]), cex=0.6)
    # 3D graph code/equation
      require(lattice)
      g <- expand.grid(x = seq(-10, 10, 0.1), y = seq(-10, 10, 0.1))
      g$z <- cos(sqrt(g$x^2 + g$y^2))*(1/(g$x^2 + g$y^2)^(1/3))
      wireframe(z ~ x * y, data = g, scales = list(arrows = FALSE), shade = TRUE)
  • r = cov(x, y) / (sdx * sdy)
    r is called the correlation coefficient; the numerator is the covariance, and the two terms in the denominator are the standard deviations of x and y.
    cov(x, y) = [1 / (n - 1)] ∑ (x - meanx)(y - meany)
    where n is the number of observations.
    Note: you can also examine all bivariate relationships in a given data frame in one go using: cor(data)
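  • The two formulas above can be checked against R's built-in cov() and cor(). A minimal sketch; the choice of the mtcars columns wt and mpg is only for illustration.

```r
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)

# Covariance: sum of cross-products of deviations, divided by n - 1
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation: covariance divided by the product of the two standard deviations
r <- cov_xy / (sd(x) * sd(y))

print(c(manual = r, builtin = cor(x, y)))
```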
  • names(cor.test(wt, qsec))
    [1] "statistic" "parameter" "p.value" "estimate" "null.value" "alternative" "method" "data.name" "conf.int"
    cor.test(wt, qsec)$p.value
    [1] 0.3389
    cor.test(wt, qsec)$statistic
          t
    -0.9719
    cor.test(wt, qsec)$estimate
           cor
    -0.1747159
  • Scientific notation is a way of writing numbers that are too large or too small to be conveniently written in standard decimal notation. In scientific notation all numbers are written in the form of a times ten raised to the power of b, where the exponent b is an integer and the coefficient a is any real number:
      1.294e-10 = 1.294 * 10 ^ -10 = 1.294 / 10 ^ 10 = 0.0000000001294
    "Correlation does not imply causation" is a phrase used in science and statistics to emphasize that a correlation between two variables does not necessarily imply that one causes the other.
  • Polynomial regression (i.e. y = a + b*x + c*x^2)
      quadratic <- lm(y ~ x + I(x^2))
      summary(quadratic)
    # to plot it, sort the points by x so the fitted curve is drawn from left to right:
      plot(x, y)
      x2 <- sort(x)
      y2 <- fitted(quadratic)[order(x)]
      lines(x2, y2)
    Mathematical functions can be used in formulas. For example, log(y) ~ x + z + w would predict log(y) from x, z, and w, and y ~ log(x) + sin(z) would fit y = a + b * log(x) + c * sin(z).
  • To add a label for each data point in the graph:
      text(wt, mpg, row.names(mtcars), cex=0.5, pos=4, col="red")
    Change the font name and font size:
      par(family="serif", ps=12)
    Using the identify() function, you can label selected points in a scatter plot with their row number or row name using your mouse.
      identify(wt, mpg, labels=rownames(mtcars))
    The cursor will change from a pointer to a crosshair. Clicking on scatter plot points will label them until you select Stop from the Graphics Device menu, or right-click on the graph and select Stop from the context menu.
    # Confidence and prediction bands:
      x <- seq(min(wt), max(wt), length=100)
      p <- predict(fit, data.frame(wt=x), interval='prediction')
      lines(x, p[,2], col='red')
      lines(x, p[,3], col='red')
      p <- predict(fit, data.frame(wt=x), interval='confidence')
      lines(x, p[,2], col='red', lty=2)
      lines(x, p[,3], col='red', lty=2)
  • In fact, dropping some observations (outliers) often produces a better model fit. But you need to be careful when deleting data. Your models should fit your data, not the other way around! In other cases, the unusual observation may be the most interesting thing about the data you have collected.
    “Which variables are most important in predicting the outcome?” You implicitly want to rank-order the predictors in terms of relative importance. There have been many attempts to develop a means for assessing the relative importance of predictors. The simplest has been to compare standardized regression coefficients. Standardized regression coefficients describe the expected change in the response variable (expressed in standard deviation units) for a standard deviation change in a predictor variable, holding the other predictor variables constant.
    Reference: Listing 8.16 ("R in Action" book), relweights() function for calculating the relative importance of predictors
  • LD50 is the median lethal dose of a toxic substance, i.e. the dose of a chemical that kills half the members of a tested population. Basically, what we have is a predictor that is the dose of a chemical and a binary response variable that indicates whether the individual dies or not. The data consist of the numbers dead and the initial batch size for several doses (e.g. pesticide application), and we wish to know what dose kills 50% of the individuals.
      dead <- c( 0, 10, 16, 53, 76, 83)
      dose <- c( 1, 2, 3, 5, 10, 20)
      batch <- c(85, 85, 85, 85, 85, 85)
      y <- cbind(dead, batch-dead)
      model <- glm(y ~ dose, binomial)
      plot(dose, dead/batch)
      xv <- seq(0, 20, 0.1)
      yv <- predict(model, list(dose=xv), type="response")
      lines(xv, yv)
    Predict doses for a binomial assay model: the function dose.p from the MASS library is run with the model object, specifying the proportion killed.
      library(MASS)
      dose.p(model, p=c(0.5,0.9,0.95))
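  • For the logit link, the dose at which the fitted proportion is 0.5 is where the linear predictor equals zero, i.e. -intercept/slope. The sketch below refits the model from the note above and compares that hand calculation with dose.p(); the names ld50_manual and ld50_dosep are only illustrative, and the MASS package is assumed to be installed.

```r
library(MASS)

dead  <- c( 0, 10, 16, 53, 76, 83)
dose  <- c( 1,  2,  3,  5, 10, 20)
batch <- rep(85, 6)

model <- glm(cbind(dead, batch - dead) ~ dose, family = binomial)

# logit(0.5) = 0, so intercept + slope * LD50 = 0
b <- coef(model)
ld50_manual <- unname(-b[1] / b[2])

# Same quantity from MASS::dose.p (attributes dropped for comparison)
ld50_dosep <- as.numeric(dose.p(model, p = 0.5))

print(c(manual = ld50_manual, dose.p = ld50_dosep))
```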
  • library(boot)
    # function to obtain the regression coefficients from a bootstrap sample
    # (bootstrapping several statistics at once)
    bs <- function(formula, data, indices) {
      d <- data[indices,]           # allows boot to select the sample
      fit <- lm(formula, data=d)
      return(coef(fit))
    }
    # bootstrapping with 1000 replications
    results <- boot(data=mtcars, statistic=bs, R=1000, formula=mpg~wt+disp, parallel="multicore")   # Linux
    # view results
    results
    plot(results, index=1)   # intercept
    plot(results, index=2)   # wt
    plot(results, index=3)   # disp
    # get 95% confidence intervals
    boot.ci(results, type="bca", index=1)   # intercept
    boot.ci(results, type="bca", index=2)   # wt
    boot.ci(results, type="bca", index=3)   # disp
  • Unpaired case: ab.t = (mean(a)-mean(b)) / sqrt(var(a)/length(a) + var(b)/length(b))
    Paired case:   ab.t = mean(a-b) / sqrt(var(a-b) / length(a-b))
    Luckily, most numeric functions have an na.rm=TRUE option that removes missing values prior to the calculation and applies the function to the remaining values.
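  • The two hand formulas above can be compared with the statistic reported by t.test(). A minimal sketch on simulated data; the sample sizes, means, and seed are arbitrary choices for illustration.

```r
set.seed(123)
a <- rnorm(12, mean = 10, sd = 2)
b <- rnorm(12, mean = 11, sd = 2)

# Unpaired case (unequal variances, the t.test default)
t_unpaired <- (mean(a) - mean(b)) / sqrt(var(a) / length(a) + var(b) / length(b))

# Paired case: a one-sample statistic on the differences
t_paired <- mean(a - b) / sqrt(var(a - b) / length(a - b))

print(c(unpaired = t_unpaired, t.test = unname(t.test(a, b)$statistic)))
print(c(paired = t_paired, t.test = unname(t.test(a, b, paired = TRUE)$statistic)))
```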
  • (a) Test the equality of variances assumption: if ev > 0.05, use the var.equal=TRUE option in t.test; otherwise use var.equal=FALSE (the default value).
    (b) Test the normality assumption: if an < 0.05 or bn < 0.05, use wilcox.test instead of t.test.
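  • A sketch of the decision rule above. It assumes that ev is the p-value of var.test() and that an and bn are the p-values of shapiro.test() for the two samples; the note does not show how they were computed, so treat those definitions, and the simulated data, as assumptions.

```r
set.seed(123)
a <- rnorm(20, mean = 10, sd = 2)
b <- rnorm(20, mean = 11, sd = 2)

ev <- var.test(a, b)$p.value      # assumed: F test for equality of variances
an <- shapiro.test(a)$p.value     # assumed: normality test for sample a
bn <- shapiro.test(b)$p.value     # assumed: normality test for sample b

if (an < 0.05 || bn < 0.05) {
  # normality rejected for at least one sample: use the rank-based test
  result <- wilcox.test(a, b)
} else {
  # normality not rejected: t-test, pooling variances only if they look equal
  result <- t.test(a, b, var.equal = ev > 0.05)
}

print(result$method)
print(result$p.value)
```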
  • You can turn your frequency table into proportions using the prop.table() function: prop.table(myTable)
    You can transpose a matrix using: t(myTable)
    Note: attributes can be attached to any R object; all attributes can be retrieved using the attributes function, and any particular attribute can be accessed or modified using the attr function.
    A matrix is represented as an object/vector of data with a "dim" attribute; in this example there is also an extra attribute called "dimnames".
      rownames(myTable) <- c("Automatic", "Manual")   # see also the colnames function
  • The test is not applicable if the expected count for any of the cells is less than 5. R will warn you if this is the case and suggest that the validity of the test results is questionable.
  • Mosaic Plots are the swiss army knife of categorical data displays. Whereas bar charts are stuck in their univariate limits, mosaic plots and their variants open up the powerful visualization of multivariate categorical data.
  • # If you import data from a Turkish Excel file, you have to use dec=","
    # Data file name is "2F RCB.csv"
      data <- read.csv(file.choose(), header=TRUE, sep=";", dec=".")
      attach(data)
    You can check factor levels using this function: levels(x)
    You can check the number of factor levels using this function: nlevels(x)
    Note: in R versions before 4.0.0, character variables are converted to factors by default when importing data; to suppress this, include the following option in the read.table function: stringsAsFactors=FALSE
    Note: you can undo the effect of the factor function with as.character(x); for numeric data use as.numeric(as.character(x)), because as.numeric(x) alone returns the internal level codes.
    Attributes can be attached to any R object; all attributes can be retrieved using the attributes function, and any particular attribute can be accessed or modified using the attr function.
    A factor is represented as an object/vector of data with two extra attributes: "levels", holding the distinct values, and "class", set to "factor".
    In the factor function, if the ordered argument is TRUE, the factor levels are assumed to be ordered. For compatibility with S there is also a function ordered.
  • A graph of yield vs. S at the 3 levels of N seems to indicate a classical nutrient response interaction: no response to S at 0 N, contrasted with a strong response to S when N is not limiting.
    Experiment design: two-factor factorial RCBD in 3 reps
    Treatments, 3 x 4 factorial: 12 treatments covering all possible combinations of two factors, nitrogen N (3 levels: 0, 180 and 230 kg/ha) and sulphur S (4 levels: 0, 10, 20 and 40 kg/ha)
  • Broad-sense heritability (h²) of the trait is the ratio of genetic variability (σ²g) to phenotypic variability (σ²g + σ²e). Generally, estimation of variance components is based on the ANOVA table:
      σ²e = Residual Mean Sq
      σ²g = (Genotypes Mean Sq - Residual Mean Sq) / Replications
    Thus an estimate of heritability is:
      h² = (VR - 1) / (VR + Replications - 1)
    where VR is the variance ratio for genotypes.
      aov.table <- summary(model)[[1]]
      svr <- aov.table$"F value"[2]
      h2 <- (svr - 1) / (svr + nlevels(Rep) - 1)
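  • A quick numeric check that the variance-ratio shortcut above agrees with the definition based on variance components. The mean squares and the number of replications below are made-up values for illustration only, not results from the trial.

```r
ms_genotype <- 30   # hypothetical Genotypes Mean Sq
ms_residual <- 6    # hypothetical Residual Mean Sq
reps        <- 3    # hypothetical number of replications

# Definition: genetic variance over phenotypic variance
var_e <- ms_residual
var_g <- (ms_genotype - ms_residual) / reps
h2_components <- var_g / (var_g + var_e)

# Shortcut: the same quantity written in terms of the variance ratio (F value)
vr <- ms_genotype / ms_residual
h2_ratio <- (vr - 1) / (vr + reps - 1)

print(c(components = h2_components, ratio = h2_ratio))
```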
  • Symbols commonly used in R formulas:
      ~    Separates the response on the left from the explanatory variables on the right
      +    Separates explanatory variables
      :    Denotes an interaction between predictor variables
      *    A shortcut for denoting all possible interactions
      ^    Denotes interactions up to a specified degree. The code y ~ (x + z + w)^2 expands to y ~ x + z + w + x:z + x:w + z:w
      .    A placeholder for all other variables in the data frame except the dependent variable
      -    A minus sign removes a variable from the equation. For example, y ~ (x + z + w)^2 - x:w expands to y ~ x + z + w + x:z + z:w
      -1   Suppresses the intercept. For example, the formula y ~ x - 1 fits a regression of y on x, and forces the line through the origin at x=0
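  • The expansions in the list above can be inspected without fitting a model, by asking terms() for the term labels of a formula. A minimal sketch; expand_terms is a helper name introduced here for illustration.

```r
# Return the expanded term labels of a formula
expand_terms <- function(f) attr(terms(f), "term.labels")

two_way <- expand_terms(y ~ (x + z + w)^2)         # main effects plus all two-way interactions
dropped <- expand_terms(y ~ (x + z + w)^2 - x:w)   # the same, with x:w removed
crossed <- expand_terms(y ~ x * z)                 # x, z and their interaction

# The intercept flag is 1 by default and 0 when suppressed with -1
no_int <- attr(terms(y ~ x - 1), "intercept")

print(two_way)
print(dropped)
print(crossed)
print(no_int)
```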
  • The order in which the effects appear in a formula matters only when:
      * There’s more than one factor and the design is unbalanced (i.e. the model y~A*B will not produce the same results as the model y~B*A)
      * Covariates are present (i.e. covariates should be listed first, followed by main effects, followed by two-way interactions, and so on)
  • PCA was invented in 1901 by Karl Pearson. Principal component analysis is a variable reduction procedure. It is useful when you have obtained data on a number of variables (possibly a large number of variables), and believe that there is some redundancy in those variables. In this case, redundancy means that some of the variables are correlated with one another, possibly because they are measuring the same construct. Because of this redundancy, you believe that it should be possible to reduce the observed variables into a smaller number of principal components (artificial variables) that will account for most of the variance in the observed variables. Thus, the objective of principal component analysis is to reduce the dimensionality of the data set and to identify meaningful underlying variables. It is more useful as a visualization tool than as an analytical method. The basic idea in PCA is to find the components that explain the maximum amount of variance in original variables by few linearly transformed uncorrelated components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
  • (Important) The scale option in the prcomp function:
      * TRUE: PCA based on the correlation matrix
      * FALSE: PCA based on the covariance matrix (default)
    PC1 = - 0.059 * wt - 0.832 * disp - 0.406 * hp + 0.369 * mpg + 0.062 * qsec
    PC2 = - 0.05 * wt + 0.475 * disp - 0.832 * hp + 0.122 * mpg + 0.255 * qsec
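  • One way to see the difference between the two settings: the component variances (sdev^2) equal the eigenvalues of the correlation matrix when scaling is on, and of the covariance matrix when it is off. A minimal sketch; the five mtcars columns are chosen to match the variables in the equations above, which is an assumption about the data used on the slide. The full argument name in prcomp is scale. (with a trailing dot).

```r
d <- mtcars[, c("wt", "disp", "hp", "mpg", "qsec")]

pca_cor <- prcomp(d, scale. = TRUE)    # PCA on standardized variables
pca_cov <- prcomp(d, scale. = FALSE)   # PCA on the raw variables (default)

# Compare component variances with the eigenvalues of each matrix
print(rbind(prcomp = pca_cor$sdev^2, eigen = eigen(cor(d))$values))
print(rbind(prcomp = pca_cov$sdev^2, eigen = eigen(cov(d))$values))
```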
  • model2 <- prcomp(d2, scale=TRUE)
    summary(model2)
    Importance of components:
                              PC1    PC2     PC3
    Standard deviation     1.9227 0.9803 0.39310
    Proportion of Variance 0.7394 0.1922 0.03091
    Cumulative Proportion  0.7394 0.9316 0.96247
  • The angles between biplot vectors (arrows going from the origin to the factor loading coordinates) clearly show the relationships between the plant attributes measured during the trial (the cosine of the angle between any 2 vectors approximates their correlation). To estimate the level of any variable in any genotype, draw a perpendicular line from the genotype score to the biplot vector of interest.
    Reading from the biplot we can summarize as follows: while cars 25 and 30 have roughly the same level of "hp", car 25 has a much higher "disp" than car 30.
    Note: PC1 explains > 89% of the variance in the dataset, leaving much less for PC2 to explain.
  • The agglomeration method to be used. This should be (an unambiguous abbreviation of) one of the following:
      * In the “single” linkage method, the distance between clusters is taken as the distance between the closest neighbors, and in “complete” linkage the distance between the farthest neighbors determines the distance between clusters.
      * “average” linkage defines the distance between two clusters as the average distance between all pairs of items where one member of a pair belongs to cluster 1 and the other member of the pair belongs to cluster 2.
      * In “centroid” linkage the distance between clusters is defined as the distance between the centers of the clusters. Thus, groups once formed are represented by their mean values for each variable, that is, by their mean vector, and the inter-cluster distance is the distance between two such mean vectors.
      * In the “ward” method, at each step in the analysis the union of every possible pair of clusters is considered, and the two clusters whose fusion results in the minimum increase in information loss are combined. Ward defines information loss in terms of an error sum of squares (ESS) criterion.
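  • A minimal sketch that runs hclust() with each of the linkage methods described above and cuts one tree into four groups. The scaled mtcars data and k = 4 are illustrative choices. Note that since R 3.1.0 the option "ward" is spelled "ward.D", and "ward.D2" is the variant that applies Ward's criterion to unsquared distances; the sketch uses "ward.D2".

```r
d <- dist(scale(mtcars))   # Euclidean distances on standardized variables

methods <- c(single = "single", complete = "complete", average = "average",
             centroid = "centroid", ward = "ward.D2")

# One hierarchical clustering per agglomeration method
fits <- lapply(methods, function(m) hclust(d, method = m))

# Cut the average-linkage tree into four groups and count the members
groups <- cutree(fits$average, k = 4)
print(table(groups))
```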
  • str(g)
    List of 4
     $ : Named int [1:7] 5 12 13 14 22 23 25
      ..- attr(*, "names")= chr [1:7] "Hornet Sportabout" "Merc 450SE" "Merc 450SL" "Merc 450SLC" ...
     $ : Named int [1:7] 7 15 16 17 24 29 31
      ..- attr(*, "names")= chr [1:7] "Duster 360" "Cadillac Fleetwood" "Lincoln Continental" "Chrysler Imperial" ...
     $ : Named int [1:6] 18 19 20 26 27 28
      ..- attr(*, "names")= chr [1:6] "Fiat 128" "Honda Civic" "Toyota Corolla" "Fiat X1-9" ...
     $ : Named int [1:12] 1 2 3 4 6 8 9 10 11 21 ...
      ..- attr(*, "names")= chr [1:12] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
  • A dendrogram is plotted to show the hierarchical relationships between the units, which is ordered according to the results of the cluster analysis.
  • Note: kmeans() does not accept missing values; remove or impute them before clustering.
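  • A minimal sketch of one way to handle this: drop incomplete rows with na.omit() before calling kmeans(). The subset of mtcars columns, the artificially inserted NA, and the choice of 3 centers are assumptions made for illustration.

```r
d <- mtcars[, c("mpg", "disp", "hp", "wt")]
d[3, "hp"] <- NA                      # introduce a missing value for illustration

d_complete <- na.omit(d)              # keep only rows without missing values

set.seed(123)                         # kmeans starts from random centers
fit <- kmeans(scale(d_complete), centers = 3, nstart = 10)

print(nrow(d_complete))
print(table(fit$cluster))
```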
  • Monthly beer production in Australia from Jan. 1956 to Aug. 1995
      data <- read.csv("C:/R Examples/beer.csv", header=TRUE, sep=";", dec=",")
      attach(data)
    Before:
      > beer
       [1]  93.2  96.0  95.2  77.1  70.9  64.8  70.1  77.3  79.5 100.6 100.7 107.1
      [13]  95.9  82.8  83.3  80.0  80.4  67.5  75.7  71.1  89.3 101.1 105.2 114.1
      [25]  96.3  84.4  91.2  81.9  80.5  70.4  74.8  75.9  86.3  98.7 100.9 113.8
      ...
    After:
      > beer
            Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep   Oct   Nov   Dec
      1956 93.2 96.0 95.2 77.1 70.9 64.8 70.1 77.3 79.5 100.6 100.7 107.1
      1957 95.9 82.8 83.3 80.0 80.4 67.5 75.7 71.1 89.3 101.1 105.2 114.1
      1958 96.3 84.4 91.2 81.9 80.5 70.4 74.8 75.9 86.3  98.7 100.9 113.8
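  • The note does not show the step that turns the plain vector into the monthly table. A likely candidate is ts() with a monthly frequency, sketched below on the first 36 values printed above; the exact call used in the session is an assumption.

```r
# First three years of the series, copied from the printed output
beer <- c(93.2, 96.0, 95.2, 77.1, 70.9, 64.8, 70.1, 77.3, 79.5, 100.6, 100.7, 107.1,
          95.9, 82.8, 83.3, 80.0, 80.4, 67.5, 75.7, 71.1, 89.3, 101.1, 105.2, 114.1,
          96.3, 84.4, 91.2, 81.9, 80.5, 70.4, 74.8, 75.9, 86.3,  98.7, 100.9, 113.8)

# Monthly series starting in January 1956
beer <- ts(beer, start = c(1956, 1), frequency = 12)
print(beer)

# Seasonal decomposition of the monthly series
ts.comp <- stl(beer, s.window = "periodic")
feb_1957 <- as.numeric(window(beer, start = c(1957, 2), end = c(1957, 2)))
print(feb_1957)
```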
  • > ts.comp
     Call:
     stl(x = beer, s.window = "periodic")
    Components
                 seasonal    trend   remainder
    Jan 1956     3.571867 91.68633 -2.05819219
    Feb 1956    -5.267166 90.80817 10.45899171
    Mar 1956     6.091303 89.93002 -0.82132590
    Apr 1956    -7.833797 89.10416 -4.17036590
    May 1956   -11.326397 88.27830 -6.05190626
  • The standard error of the mean can be estimated using the formula sd/√n, where sd is the standard deviation of the sample and n is the number of observations.
    The function first assesses whether missing values (values of 'NA') should be removed (based on the value of na.rm supplied by the function user). If the function is called with na.rm=TRUE, the is.na() function is used to deselect such values before the standard deviation and length are calculated using the sd() and length() functions. Finally, the standard error of the mean is calculated and returned.
    Note: you can use either an explicit return command, or let the function return the value of the last statement executed.
    You can define your own operator of the form %any% using any text string in place of any. The function should be a function of two arguments.
      "%p%" <- function(x, y) paste(x, y, sep=" ")
      "Hi" %p% "Khaled"   # "Hi Khaled"
    To combine more than one value in the returned result:
      result <- list(xname=x, yname=y)
      return(result)
    Note: in this case, if the result is stored in value, you can access value$xname and value$yname, or value[["xname"]] and value[["yname"]].
    If a <- "xname" then value$a will not work, while value[[a]] will work.
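  • The function described above is not reproduced in the notes. A minimal sketch that follows the description step by step; the function name sem and the test vector are assumptions.

```r
sem <- function(x, na.rm = FALSE) {
  # Optionally drop missing values before any calculation
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  # Standard error of the mean: sd divided by the square root of n
  sd(x) / sqrt(length(x))
}

scores <- c(1, 2, 3, 4, NA)
print(sem(scores))                  # NA, because the missing value is kept
print(sem(scores, na.rm = TRUE))    # computed on the four remaining values
```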
  • In the sink() function you can also:
      * use the option append=TRUE to append text to the file rather than overwriting it.
      * use the option split=TRUE to send output to both the screen and the file.
    Data output:
      write.table(DATA, "data.csv", quote = F, row.names = T, sep = ",")
    In addition to jpeg(), you can use the functions pdf(), win.metafile(), png(), bmp(), tiff(), xfig(), and postscript() to save graphics in other formats.
    You can run an R script file non-interactively and send the output to another file:
      R CMD BATCH [options] script.R [out-file]
  • The sum function, for example, is a primitive function written in C for performance reasons and can’t be viewed in this manner, while the cor function, like most R functions, is written in R itself.
    The ifelse construct is a compact and vectorized version of the if-else construct:
      y <- ifelse(x<0, 0, log(x))
    An error is raised by a call to stop("your message")
    A warning is raised by a call to warning("your message")
      class(mtcars)         # "data.frame"
      typeof(mtcars)        # "list"
      object.size(mtcars)   # 5336 bytes
      str(mtcars)
  • Sweave is a tool that lets you embed the R code for complete data analyses in LaTeX documents. The purpose is to create dynamic reports, which can be updated automatically if the data or analysis change. To learn more about Sweave, visit the Sweave home page (www.stat.uni-muenchen.de/~leisch/Sweave/). To learn more about LaTeX you can start here: http://www.latex-project.org/intro.html
  • You can use the TeXworks software to render LaTeX markup as PDF. TeXworks lowers the entry barrier to the TeX world; it is also a free and open source package, and you can get it from: http://www.tug.org/texworks/
    To get a valid render, the xtable LaTeX output should be inserted into a LaTeX document template such as the following simple one:
      \documentclass{article}
      \usepackage[utf8]{inputenc}
      \usepackage[frenchb]{babel}
      \begin{document}
      % Your LaTeX goes here
      \end{document}
  • The format is:
      qplot(x, y, data=, color=, shape=, size=, alpha=, geom=, method=, formula=, facets=, xlim=, ylim=, xlab=, ylab=, main=, sub=)
    where the parameters/options are defined below:
      alpha: Alpha transparency for overlapping elements, expressed as a fraction between 0 (complete transparency) and 1 (complete opacity).
      data: Specifies a data frame.
      main, sub: Character vectors specifying the title and subtitle.
      x, y: Specify the variables placed on the horizontal and vertical axes. For univariate plots (for example, histograms), omit y.
      xlab, ylab: Character vectors specifying the horizontal and vertical axis labels.
      xlim, ylim: Two-element numeric vectors giving the minimum and maximum values for the horizontal and vertical axes, respectively.
      color, shape, size, fill: Associate the levels of a variable with symbol color, shape, or size. For line plots, color associates levels of a variable with line color. For density and box plots, fill associates fill colors with a variable. Legends are drawn automatically.
      facets: Creates a trellis graph by specifying conditioning variables. Its value is expressed as rowvar ~ colvar (see the example in figure 16.10). To create trellis graphs based on a single conditioning variable, use rowvar~. or .~colvar.
      geom: Specifies the geometric objects that define the graph type. The geom option is expressed as a character vector with one or more entries. geom values include "point", "smooth", "boxplot", "line", "histogram", "density", "bar", and "jitter".
      method, formula: If geom="smooth", a loess fit line and confidence limits are added by default. When the number of observations is greater than 1,000, a more efficient smoothing algorithm is employed. Methods include "lm" for regression, "gam" for generalized additive models, and "rlm" for robust regression. The formula parameter gives the form of the fit.
    For example, to add simple linear regression lines, you’d specify: geom="smooth", method="lm", formula=y~x. Changing the formula to y~poly(x,2) would produce a quadratic fit. Note that the formula uses the letters x and y, not the names of the variables.
  • library(ggplot2)
    attach(mtcars)
    am <- factor(am, labels=c("automatic", "manual"))
    qplot(wt, mpg, shape=20, color=am,
          main="1974 Motor Trend US magazine (Piece of cake!)",
          xlab="Weight (lb/1000)", ylab="Miles/(US) gallon",
          geom=c("point", "smooth"), method="lm")
  • How Large is Your Family?
    The reason that the estimate is wrong is that families with 0 children could not have sent any to the class! So the calculated average comes from a sample drawn by child, not by family. In this case families with a large number of children are sampled more often: one time for each child.
    Birthday Problem!
      Ṕ(n) = 1 × (1 – 1/365) × (1 – 2/365) × ... × (1 – (n – 1)/365)
    The equation expresses the fact that the first person has no one to share a birthday with, the second person cannot have the same birthday as the first (364/365), the third cannot have the same birthday as the first two (363/365), and in general the nth birthday cannot be the same as any of the n − 1 preceding birthdays.
    The event of at least two of the n persons having the same birthday is complementary to all n birthdays being different. Therefore, its probability P(n) is:
      P(n) = 1 – Ṕ(n)
    This probability surpasses 1/2 for n = 23 (with a value of about 50.7%). For more information: http://en.wikipedia.org/wiki/Birthday_problem
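  • The birthday formula above translates directly into one line of R. A minimal sketch; p_shared is a name introduced here for illustration, and base R's pbirthday() is used only as a cross-check.

```r
# Probability that at least two of n people share a birthday
# (365 equally likely days, leap years ignored)
p_shared <- function(n) {
  p_all_different <- prod(1 - (0:(n - 1)) / 365)
  1 - p_all_different
}

print(p_shared(22))   # still below one half
print(p_shared(23))   # first group size above one half
```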
  • Please don’t hesitate to contact us if you have any questions, comments or feedback related to this session: <khaled.alshamaa@gmail.com>
    Japanese attitude for work: If one can do it, I can do it. If no one can do it, I must do it.
    Middle Eastern attitude for work: Wallahi … if one can do it, let him do it. If no one can do it, ya-habibi how can I do it?
  • R Language Introduction

    1. 1. Khaled El-Sham’aa 1
    2. 2. Session Road Map
        • First Steps
        • Importing Data into R
        • R Basics
        • Data Visualization
        • Correlation & Regression
        • t-Test
        • Chi-squared Test
        • ANOVA
        • PCA
        • Clustering
        • Time Series
        • Programming
        • Publication-Quality output
    3. 3. First Steps (1) R is one of the most popular platforms for data analysis and visualization currently available. It is free and open source software: http://www.r-project.org Take advantage of its coverage and availability of new, cutting edge applications/techniques. R will enable us to develop and distribute solutions to our NARS with no hidden license cost. 3
    4. 4. First Steps (2) 4
    5. 5. First Steps (3)
        5 * 4
        [1] 20
        a <- (3 * 7) + 1
        a
        [1] 22
        b <- c(1, 2, 3, 5, 8)
        b * 2
        [1] 2 4 6 10 16
        b[4]
        [1] 5
        b[1:3]
        [1] 1 2 3
        b[c(1,3,5)]
        [1] 1 3 8
        b[b > 4]
        [1] 5 8
    6. 6. First Steps (4) citation() R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org. 6
    7. 7. First Steps (5) If you know the name of the function you want help with, you just type a question mark ? at the command line prompt followed by the name of the function: ?read.table 7
    8. 8. First Steps (6) Sometimes you cannot remember the precise name of the function, but you know the subject on which you want help. Use the help.search function with your query in double quotes like this: help.search("data input") 8
    9. 9. First Steps (7) To see a worked example, pass the function name to example():
        example(mean)
        mean> x <- c(0:10, 50)
        mean> xm <- mean(x)
        mean> c(xm, mean(x, trim = 0.10))
        [1] 8.75 5.50
        mean> mean(USArrests, trim = 0.2)
          Murder  Assault UrbanPop     Rape
            7.42   167.60    66.20    20.16
    10. 10. First Steps (8) There are hundreds of contributed packages for R, written by many different authors (to implement specialized statistical methods). Most are available for download from CRAN (http://CRAN.R-project.org) List all available packages: library() Load package “ggplot2”: library(ggplot2) Documentation on package library(help=ggplot2) 10
    11. 11. Importing Data into R (1)
        data <- read.table("D:/path/file.txt", header=TRUE)
        data <- read.csv(file.choose(), header=TRUE, sep=";")
        data <- edit(data)
        fix(data)
        head(data)
        tail(data)
        tail(data, 10)
    12. 12. Importing Data into R (2) In order to refer to a vector by name within an R session, you need to attach the dataframe containing the vector. Alternatively, you can refer to the dataframe name and the vector name within it, using the element name operator $ like this:
        mtcars$mpg
        ?mtcars
        attach(mtcars)
        mpg
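The speaker notes also mention with() as an alternative to attach(). A small sketch using the built-in mtcars data frame; the variable names by_dollar, by_with and mean_by_cyl are illustrative only.

```r
# Two ways to reach the mpg column without attaching the data frame
by_dollar <- mtcars$mpg              # element name operator
by_with   <- with(mtcars, mpg)       # evaluate an expression inside mtcars

# with() is handy when several columns are needed at once
mean_by_cyl <- with(mtcars, tapply(mpg, cyl, mean))
mean_by_cyl
```

Because with() only exposes the columns for the duration of one expression, nothing is left on the search path afterwards and no detach() call is needed.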
    13. 13. Importing Data into R (3) 13
    14. 14. Importing Data into R (4)
        # Read data left on the clipboard
        data <- read.table("clipboard", header=TRUE)
        # ODBC
        library(RODBC)
        db1 <- odbcConnect("MY_DB", uid="usr", pwd="pwd")
        raw <- sqlQuery(db1, "SELECT * FROM table1")
        # XLSX
        library(XLConnect)
        xls <- loadWorkbook("my_file.xlsx", create=FALSE)
        raw <- as.data.frame(readWorksheet(xls, sheet="Sheet1"))
    15. 15. R Basics (1)
        max(x)        maximum value in x
        min(x)        minimum value in x
        mean(x)       arithmetic average of the values in x
        median(x)     median value in x
        var(x)        sample variance of x
        sd(x)         standard deviation of x
        cor(x,y)      correlation between vectors x and y
        summary(x)    generic function used to produce summaries of the results of various functions
    16. 16. R Basics (2)
        abs(x)                    absolute value of x
        floor(2.718)              largest integer not greater than x (returns 2)
        ceiling(3.142)            smallest integer not less than x (returns 4)
        asin(x)                   inverse sine of x, in radians
        round(2.718, digits=2)    returns 2.72
        Simple randomization: x <- 1:12; sample(x)
        RCBD randomization: RCBD <- replicate(3, sample(x))
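The two randomization one-liners can be made reproducible by fixing the random seed first. A sketch assuming 12 treatments and 3 blocks as on the slide; the seed value and the names treatments, simple, rcbd and "Block1".."Block3" are arbitrary choices for illustration.

```r
set.seed(42)                               # arbitrary seed, only for reproducibility
treatments <- 1:12

# Simple (completely randomized) order of the 12 treatments
simple <- sample(treatments)

# RCBD: an independent random order of all treatments within each block
rcbd <- replicate(3, sample(treatments))   # 12 x 3 matrix, one column per block
colnames(rcbd) <- paste0("Block", 1:3)
rcbd
```

Each column of rcbd is a permutation of the same 12 treatments, which is what distinguishes the blocked layout from three draws of a simple randomization with replacement.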
    17. 17. R Basics (3) Common Data Transformation:
        Nature of Data                          Transformation   R Syntax
        Measurements (lengths, weights, etc)    loge             log(x)
                                                log10            log(x, 10)
                                                log10            log10(x)
                                                log x+1          log(x + 1)
        Counts (number of individuals, etc)     square root      sqrt(x)
        Percentages (must be proportions)       arcsin           asin(sqrt(x))*180/pi
        * where x is the name of the vector (variable) whose values are to be transformed.
    18. 18. R Basics (4) Vectorized computations: Any function call or operator applied to a vector automatically operates on all elements of the vector.
        nchar(month.name) # 7 8 5 5 3 4 4 6 9 7 8 8
        The recycling rule: The shorter vector is replicated enough times so that the result has the length of the longer vector, then the operator is applied.
        1:10 + 1:3 # 2 4 6 5 7 9 8 10 12 11
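The recycling rule can be spelled out by doing the replication by hand. In this sketch the names a, b, recycled, by_hand and by_rule are illustrative; note that R also emits a warning here because 10 is not a multiple of 3, which the sketch silences with suppressWarnings().

```r
a <- 1:10
b <- 1:3

# Replicate the shorter vector until it is as long as the longer one
recycled <- rep(b, length.out = length(a))   # 1 2 3 1 2 3 1 2 3 1
by_hand  <- a + recycled

# What R does implicitly (with a length-mismatch warning)
by_rule <- suppressWarnings(a + b)

identical(by_hand, by_rule)
```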
    19. 19. R Basics (5)
        mydata <- matrix(rnorm(30), nrow=6)
        mydata
        # calculate the 6 row means
        apply(mydata, 1, mean)
        # calculate the 5 column means
        apply(mydata, 2, mean)
        apply(mydata, 2, mean, trim=0.2)
    20. 20. R Basics (6) String functions:
        substr(month.name, 2, 3)
        paste("*", month.name[1:4], "*", sep=" ")
        x <- toupper(dna.seq)
        rna.seq <- chartr("T", "U", x)
        comp.seq <- chartr("ACTG", "TGAC", dna.seq)
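The slide does not define dna.seq, so the sketch below assumes a short made-up sequence to show what each string function returns. It applies the complement step to the upper-cased copy x, since chartr() is case-sensitive.

```r
dna.seq <- "atgcctga"                     # hypothetical input, not from the slides
x <- toupper(dna.seq)                     # "ATGCCTGA"
rna.seq  <- chartr("T", "U", x)           # transcription: "AUGCCUGA"
comp.seq <- chartr("ACTG", "TGAC", x)     # complement:    "TACGGACT"

short  <- substr(month.name[1:3], 2, 3)               # "an" "eb" "ar"
padded <- paste("*", month.name[1:2], "*", sep = " ") # "* January *" "* February *"
```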
    21. 21. R Basics (7) Surprisingly, the base installation doesn’t provide functions for skew and kurtosis, but you can add your own:
        m <- mean(x)
        n <- length(x)
        s <- sd(x)
        skew <- sum((x-m)^3/s^3)/n
        kurt <- sum((x-m)^4/s^4)/n - 3
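The same formulas can be wrapped into reusable functions. This is a sketch only: the names skewness and kurtosis_excess are chosen for illustration (they are not base R functions), and the definitions follow the slide's formulas, which use sd() with its n - 1 denominator.

```r
# Sample skewness, following the slide's formula
skewness <- function(x) {
  m <- mean(x); s <- sd(x); n <- length(x)
  sum((x - m)^3 / s^3) / n
}

# Excess kurtosis (0 for a normal distribution), following the slide's formula
kurtosis_excess <- function(x) {
  m <- mean(x); s <- sd(x); n <- length(x)
  sum((x - m)^4 / s^4) / n - 3
}

skewness(1:5)          # symmetric data, so the skew is 0
kurtosis_excess(1:5)
```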
    22. 22. Data Visualization (1) Pairs for a matrix of scatter plots of every variable against every other: ?mtcars pairs(mtcars) Voilà! 22
    23. 23. Data Visualization (2)pie(table(cyl)) barplot(table(cyl)) 23
    24. 24. Data Visualization (3) plot(x, y) gives a scatter plot if x is continuous, and a box-and-whisker plot if x is a factor. Some people prefer the alternative syntax plot(y~x):
        attach(mtcars)
        plot(wt, mpg)
        plot(cyl, mpg)
        cyl <- factor(cyl)
        plot(cyl, mpg)
    25. 25. Data Visualization (4) 25
    26. 26. Data Visualization (5) Histograms show a frequency distribution hist(qsec, col="gray") 26
    27. 27. Data Visualization (6) boxplot(qsec, col="gray") boxplot(qsec, mpg, col="gray") 27
    28. 28. Data Visualization (7)
        XY <- cbind(LAT, LONG)
        plot(XY, type="l")
        library(sp)
        XY.poly <- Polygon(XY)
        XY.pnt <- spsample(XY.poly, n=8, type="random")
        XY.pnt
        points(XY.pnt)
    29. 29. Data Visualization (8) 29
    30. 30. Correlation and Regression (1) If you want to determine the significance of a correlation (i.e. the p value associated with the calculated value of r) then use cor.test rather than cor. cor(wt, mpg) [1] -0.8676594 The value will vary from -1 to +1. A -1 indicates perfect negative correlation, and +1 indicates perfect positive correlation. 0 means no correlation. 30
    31. 31. Correlation and Regression (2)
        cor.test(wt, qsec)
                Pearson's product-moment correlation
        data:  wt and qsec
        t = -0.9719, df = 30, p-value = 0.3389
        alternative hypothesis: true correlation is not equal to 0
        95 percent confidence interval:
         -0.4933536  0.1852649
        sample estimates:
               cor
        -0.1747159
    32. 32. Correlation and Regression (3)
        cor.test(wt, mpg)
                Pearson's product-moment correlation
        data:  wt and mpg
        t = -9.559, df = 30, p-value = 1.294e-10
        alternative hypothesis: true correlation is not equal to 0
        95 percent confidence interval:
         -0.9338264 -0.7440872
        sample estimates:
               cor
        -0.8676594
    33. 33. Correlation and Regression (4) 33
    34. 34. Correlation and Regression (5) lm() fits a linear model with normal errors and constant variance; it is generally used for regression analysis with continuous explanatory variables.
        fit <- lm(y ~ x)
        summary(fit)
        plot(x, y)
        # Sample of multiple linear regression
        fit <- lm(y ~ x1 + x2 + x3)
    35. 35. Correlation and Regression (6)
        Call:
        lm(formula = mpg ~ wt)
        Residuals:
            Min      1Q  Median      3Q     Max
        -4.5432 -2.3647 -0.1252  1.4096  6.8727
        Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
        (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
        wt           -5.3445     0.5591  -9.559 1.29e-10 ***
        ---
        Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
        Residual standard error: 3.046 on 30 degrees of freedom
        Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
        F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
    36. 36. Correlation and Regression (7) The great thing about graphics in R is that it is extremely straightforward to add things to your plots. In the present case, we might want to add a regression line through the cloud of data points. The function for this is abline which can take as its argument the linear model object: abline(fit) Note: abline(a, b) function adds a regression line with an intercept of a and a slope of b 36
    37. 37. Correlation and Regression (8)
        plot(wt, mpg, xlab="Weight", ylab="Miles/Gallon")
        abline(fit, col="blue", lwd=2)
        text(4, 25, "mpg = 37.29 - 5.34 wt")
    38. 38. Correlation and Regression (9) predict() is a generic built-in function for predictions from the results of various model fitting functions:
        predict(fit, list(wt = 4.5))
        [1] 13.23500
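A self-contained version of the prediction step, as a sketch: it refits the mpg ~ wt model on the built-in mtcars data with the data= argument instead of attach(), and the name new_cars and the extra weight value 2.5 are illustrative.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# New observations must use the same variable name as the model formula
new_cars <- data.frame(wt = c(2.5, 4.5))

pred <- predict(fit, newdata = new_cars)

# Point predictions together with 95% confidence limits for the mean response
ci <- predict(fit, newdata = new_cars, interval = "confidence")
ci
```

The second row of ci reproduces the slide's prediction for wt = 4.5 in its fit column, with lwr and upr giving the interval bounds.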
    39. 39. Correlation and Regression (10) 39
    40. 40. Correlation and Regression (11) What do you do if you identify problems? There are four approaches to dealing with violations of regression assumptions:
        • Deleting observations
        • Transforming variables
        • Adding or deleting variables
        • Using another regression approach
    41. 41. Correlation and Regression (12) You can compare the fit of two nested models using the anova() function in the base installation. A nested model is one whose terms are completely included in the other model.
        fit1 <- lm(y ~ A + B + C)
        fit2 <- lm(y ~ A + C)
        anova(fit1, fit2)
        If the test is not significant (i.e. p > 0.05), we conclude that B in this case doesn’t add to the linear prediction and we’re justified in dropping it from our model.
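A runnable sketch of the nested-model comparison using mtcars. The choice of predictors (wt, hp, qsec) and the names fit_full, fit_reduced, cmp and p_value are assumptions made for illustration, not taken from the slides.

```r
# Full model and a reduced model nested inside it (qsec dropped)
fit_full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
fit_reduced <- lm(mpg ~ wt + hp, data = mtcars)

# F test of whether the extra term improves the fit
cmp <- anova(fit_reduced, fit_full)
cmp

# p-value for the dropped term (second row of the comparison table)
p_value <- cmp[["Pr(>F)"]][2]
p_value
```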
    42. 42. Correlation and Regression (13)
        # Bootstrap 95% CI for R-Squared
        library(boot)
        rsq <- function(formula, data, indices) {
          fit <- lm(formula, data=data[indices,])
          return(summary(fit)$r.squared)
        }
        rs <- boot(data=mtcars, statistic=rsq, R=1000, formula=mpg~wt+disp)
        boot.ci(rs, type="bca") # try print(rs) and plot(rs)
    43. 43. t-Test (1) Comparing two sample means with normal errors (Student’s t test, t.test):
        a <- qsec[cyl == 4]
        b <- qsec[cyl == 6]
        c <- qsec[cyl == 8]
        t.test(a, b)
        t.test(a, b, paired = TRUE) # paired test; requires samples of equal length
        # alternative argument options:
        # "two.sided", "less", "greater"
    44. 44. t-Test (2)
        t.test(a, b)
                Welch Two Sample t-test
        data:  a and b
        t = 1.4136, df = 12.781, p-value = 0.1814
        alternative hypothesis: true difference in means is not equal to 0
        95 percent confidence interval:
         -0.6159443  2.9362040
        sample estimates:
        mean of x mean of y
         19.13727  17.97714
    45. 45. t-Test (3)
        t.test(a, c)
                Welch Two Sample t-test
        data:  a and c
        t = 3.9446, df = 17.407, p-value = 0.001005
        alternative hypothesis: true difference in means is not equal to 0
        95 percent confidence interval:
         1.102361 3.627899
        sample estimates:
        mean of x mean of y
         19.13727  16.77214
    46. 46. t-Test (4)
        (a) Test the equality of variances assumption:
        ev <- var.test(a, c)$p.value
        (b) Test the normality assumption:
        an <- shapiro.test(a)$p.value
        cn <- shapiro.test(c)$p.value
    47. 47. Chi-squared Test (1) Construct hypotheses based on qualitative (categorical) data:
        myTable <- table(am, cyl)
        myTable
                   cyl
        am           4  6  8
          automatic  3  4 12
          manual     8  3  2
    48. 48. Chi-squared Test (2)
        chisq.test(myTable)
                Pearson's Chi-squared test
        data:  myTable
        X-squared = 8.7407, df = 2, p-value = 0.01265
        The expected counts under the null hypothesis:
        chisq.test(myTable)$expected
                   cyl
        am                4       6      8
          automatic 6.53125 4.15625 8.3125
          manual    4.46875 2.84375 5.6875
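The expected counts and the test statistic can be rebuilt by hand, which makes clear what chisq.test() computes: under independence each expected count is (row total × column total) / grand total. A sketch using mtcars directly; the names am_f, tab, expected and x_squared are illustrative.

```r
am_f <- factor(mtcars$am, labels = c("automatic", "manual"))
tab  <- table(am = am_f, cyl = mtcars$cyl)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expected

# Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected
x_squared <- sum((tab - expected)^2 / expected)
x_squared
```

For example, the automatic/4-cylinder cell is 19 × 11 / 32 = 6.53125, matching the first entry of the table above.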
    49. 49. Chi-squared Test (3)mosaicplot(myTable, color=rainbow(3)) 49
    50. 50. ANOVA (1) A method which partitions the total variation in the response into the components (sources of variation) in the above model is called the analysis of variance. table(N, S, Rep) N <- factor(N) S <- factor(S) Rep <- factor(Rep) 50
    51. 51. ANOVA (2) The best way to understand the two significant interaction terms is to plot them using interaction.plot like this:interaction.plot(S, N, Yield) 51
    52. 52. ANOVA (3)boxplot(Yield~N, col="gray") 52
    53. 53. ANOVA (4)
        model <- aov(Yield ~ N * S) # CRD
        summary(model)
                  Df Sum Sq Mean Sq F value    Pr(>F)
        N          2 4.5818  2.2909 42.7469 1.230e-08 ***
        S          3 0.9798  0.3266  6.0944  0.003106 **
        N:S        6 0.6517  0.1086  2.0268  0.101243
        Residuals 24 1.2862  0.0536
        ---
        Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    54. 54. ANOVA (5)
        par(mfrow = c(2, 2))
        plot(model)
        ANOVA assumptions:
        • Normality
        • Linearity
        • Constant variance
        • Independence
    55. 55. ANOVA (6)
        model.tables(model, "means")
        Tables of means
        Grand mean
        1.104722
         N
             0    180    230
        0.6025 1.3142 1.3975
         S
             0     10     20     40
        0.8289 1.1556 1.1678 1.2667
         N:S
             S
        N     0      10     20     40
          0   0.5600 0.7733 0.5233 0.5533
          180 0.8933 1.2900 1.5267 1.5467
          230 1.0333 1.4033 1.4533 1.7000
    56. 56. ANOVA (7)
        model.tables(model, se=TRUE)
        .......
        Standard errors for differences of means
                     N      S    N:S
                0.0945 0.1091 0.1890
        replic.     12      9      3
        plot.design(Yield ~ N * S)
    57. 57. ANOVA (8)
        mc <- TukeyHSD(model, "N", ordered = TRUE); mc
          Tukey multiple comparisons of means
            95% family-wise confidence level
            factor levels have been ordered
        Fit: aov(formula = Yield ~ N * S)
        $N
                      diff        lwr       upr     p adj
        180-0   0.71166667  0.4756506 0.9476827 0.0000003
        230-0   0.79500000  0.5589840 1.0310160 0.0000000
        230-180 0.08333333 -0.1526827 0.3193494 0.6567397
    58. 58. ANOVA (9)plot(mc) 58
    59. 59. ANOVA (10)
        summary(aov(Yield ~ N * S + Error(Rep))) # RCB
        Error: Rep
                  Df  Sum Sq Mean Sq F value Pr(>F)
        Residuals  2 0.30191 0.15095
        Error: Within
                  Df Sum Sq Mean Sq F value    Pr(>F)
        N          2 4.5818  2.2909 51.2035 5.289e-09 ***
        S          3 0.9798  0.3266  7.3001  0.001423 **
        N:S        6 0.6517  0.1086  2.4277  0.059281 .
        Residuals 22 0.9843  0.0447
        ---
        Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    60. 60. ANOVA (11) In a split-plot design, different treatments are applied to plots of different sizes. Each different plot size is associated with its own error variance. The model formula is specified as a factorial, using the asterisk notation. The error structure is defined in the Error term, with the plot sizes listed from left to right, from largest to smallest, with each variable separated by the slash operator /. model <- aov(Yield ~ N * S + Error(Rep/N)) 60
    61. 61. ANOVA (12)
        Error: Rep
                  Df  Sum Sq Mean Sq F value Pr(>F)
        Residuals  2 0.30191 0.15095
        Error: Rep:N
                  Df Sum Sq Mean Sq F value   Pr(>F)
        N          2 4.5818 2.29088  55.583 0.001206 **
        Residuals  4 0.1649 0.04122
        Error: Within
                  Df  Sum Sq Mean Sq F value   Pr(>F)
        S          3 0.97983 0.32661  7.1744 0.002280 **
        N:S        6 0.65171 0.10862  2.3860 0.071313 .
        Residuals 18 0.81943 0.04552
    62. 62. ANOVA (13) Analysis of Covariance:
        # f is the treatment factor
        # x is the variate that acts as covariate
        model <- aov(y ~ x * f)
        Split both main effects into linear and quadratic parts:
        contrasts <- list(N = list(lin=1, quad=2), S = list(lin=1, quad=2))
        summary(model, split=contrasts)
    63. 63. PCA (1) The idea of principal components analysis (PCA) is to find a small number of linear combinations of the variables so as to capture most of the variation in the dataframe as a whole.
        d2 <- cbind(wt, disp/10, hp/10, mpg, qsec)
        colnames(d2) <- c("wt", "disp", "hp", "mpeg", "qsec")
    64. 64. PCA (2)
        model <- prcomp(d2)
        model
        Standard deviations:
        [1] 14.6949595  3.9627722  2.8306355  1.1593717
        Rotation:
                     PC1         PC2         PC3         PC4
        wt   -0.05887539  0.05015401 -0.07513271 -0.16910728
        disp -0.83186362  0.47519625  0.28005113  0.04080894
        hp   -0.40572567 -0.83180078  0.24611265 -0.28768795
        mpeg  0.36888799  0.12190490  0.91398919 -0.09385946
        qsec  0.06200759  0.25479354 -0.14134625 -0.93710373
    65. 65. PCA (3)
        summary(model)
        Importance of components:
                                   PC1     PC2     PC3
        Standard deviation     14.6950 3.96277 2.83064
        Proportion of Variance  0.8957 0.06514 0.03323
        Cumulative Proportion   0.8957 0.96082 0.99405
    66. 66. PCA (4)plot(model) biplot(model) 66
    67. 67. Clustering (1) We define similarity on the basis of the distance between two samples in this n-dimensional space. Several different distance measures could be used to work out the distance from every sample to every other sample. This quantitative dissimilarity structure of the data is stored in a matrix produced by the dist function:
        rownames(d2) <- rownames(mtcars)
        my.dist <- dist(d2, method="euclidean")
    68. 68. Clustering (2) Initially, each sample is assigned to its own cluster, and then the hclust algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster (see ?hclust for details).
        my.hc <- hclust(my.dist, "ward.D")
    69. 69. Clustering (3) We can plot the object called my.hc, and we specify that the leaves of the hierarchy are labeled by their plot numbers plot(my.hc, hang=-1) g <- rect.hclust(my.hc, k=4, border="red") Note: When the hang argument is set to -1 then all leaves end on one line and their labels hang down from 0. 69
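The clustering steps on the last three slides can be run end to end as follows. This is a sketch under a few assumptions: it builds d2 from mtcars with with() rather than attach(), uses the method name "ward.D" (the name used by current R versions for what older versions called "ward"), and adds a cutree() call, not shown on the slides, to extract the group membership that rect.hclust() only draws.

```r
# Rescaled variables, as in the PCA slides
d2 <- with(mtcars, cbind(wt, disp = disp / 10, hp = hp / 10, mpg, qsec))
rownames(d2) <- rownames(mtcars)

my.dist <- dist(d2, method = "euclidean")      # pairwise distances between cars
my.hc   <- hclust(my.dist, method = "ward.D")  # agglomerative clustering

# Cut the tree into 4 groups and count the cars in each
groups <- cutree(my.hc, k = 4)
table(groups)
```

groups is a named integer vector, so groups["Mazda RX4"] returns the cluster of a single car.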
    70. 70. Clustering (4) 70
    71. 71. Clustering (5) Partitioning into a number of clusters specified by the user.
        gr <- kmeans(cbind(disp, hp), 2)
        plot(disp, hp, col = gr$cluster, pch=19)
        points(gr$centers, col = 1:2, pch = 8, cex=2)
    72. 72. Clustering (6) 72
    73. 73. Clustering (7)
        K-means clustering with 2 clusters of sizes 18, 14
        Cluster means:
              disp        hp
        1 135.5389  98.05556
        2 353.1000 209.21429
        Clustering vector:
         [1] 1 1 1 1 2 1 2 1 1 1 1 2 2 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 2 1 2 1
        Within cluster sum of squares by cluster:
        [1] 58369.27 93490.74
         (between_SS / total_SS = 75.6 %)
    74. 74. Clustering (8)x <- as.matrix(mtcars)heatmap(x, scale="column") 74
    75. 75. Time Series (1) First, make the data variable into a time series object # create time-series objects beer <- ts(beer, start=1956, freq=12) It is useful to be able to turn a time series into components. The function stl performs seasonal decomposition of a time series into seasonal, trend and irregular components using loess. 75
    76. 76. Time Series (2) The remainder component is the residuals from the seasonal plus trend fit. The bars at the right-hand side are of equal heights (in user coordinates). # Decompose a time series into seasonal, # trend and irregular components using loess ts.comp <- stl(beer, s.window="periodic") plot(ts.comp) 76
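The beer data set from the slides is not shipped with R, so this sketch uses the built-in monthly co2 series as a stand-in to show what stl() returns. The names fit, parts and rebuilt are illustrative.

```r
# Seasonal decomposition of a built-in monthly series (stand-in for beer)
fit <- stl(co2, s.window = "periodic")

# The three components are stored as columns of a multivariate time series
parts <- fit$time.series
colnames(parts)

# The decomposition is additive: the components sum back to the original data
rebuilt <- rowSums(parts)
all.equal(as.numeric(rebuilt), as.numeric(co2))

plot(fit)
```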
    77. 77. Time Series (3) 77
    78. 78. Programming (1) We can extend the functionality of R by writing a function that estimates the standard error of the mean:
        SEM <- function(x, na.rm = FALSE) {
          if (na.rm == TRUE) VAR <- x[!is.na(x)]
          else VAR <- x
          SD <- sd(VAR)
          N <- length(VAR)
          SE <- SD/sqrt(N)
          return(SE)
        }
    79. 79. Programming (2) You can define your own operator of the form %any% using any text string in place of any. The function should be a function of two arguments. "%p%" <- function(x,y) paste(x,y,sep=" ") "Hi" %p% "Khaled" [1] "Hi Khaled" 79
    80. 80. Programming (3)
        setwd("path/to/folder")
        sink("output.txt")
        cat("Intercept\tSlope\n")
        a <- fit$coefficients[[1]]
        b <- fit$coefficients[[2]]
        cat(paste(a, b, sep="\t"))
        sink()
        jpeg(filename="graph.jpg", width=600, height=600)
        plot(wt, mpg); abline(fit)
        dev.off()
    81. 81. Programming (4) The code for R functions can be viewed, and in most cases modified if desired, using the fix() function. You can trigger garbage collection by calling gc(), which also reports a few memory usage statistics. The basic tool for code timing is system.time(commands). tempfile() gives a unique file name in a temporary writable directory that is deleted at the end of the session.
    82. 82. Programming (5) Take control of your R code! RStudio is a free and open source integrated development environment for R. You can run it on your desktop (Windows, Mac, or Linux):
        • Syntax highlighting, code completion, etc...
        • Execute R code directly from the source editor
        • Workspace browser and data viewer
        • Plot history, zooming, and flexible image & PDF export
        • Integrated R help and documentation
        • and more (http://www.rstudio.com/ide/)
    83. 83. Programming (6) 83
    84. 84. Programming (7) If we want to evaluate the quadratic x^2 − 2x + 4 many times, we can write a function that evaluates it for a specific value of x:
        my.f <- function(x) { x^2 - 2*x + 4 }
        my.f(3)
        [1] 7
        plot(my.f, -10, +10)
    85. 85. Programming (8) 85
    86. 86. Programming (9) We can find the minimum of the function using: optimize(my.f, lower = -10, upper = 10) $minimum [1] 1 $objective [1] 3 which says that the minimum occurs at x=1 and at that point the quadratic has value 3. 86
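The numerical minimum can be cross-checked against calculus: setting the derivative 2x − 2 to zero gives x = 1, where the quadratic equals 3. A short sketch, with res and x_star as illustrative names:

```r
my.f <- function(x) x^2 - 2*x + 4

# Numerical search for the minimum on [-10, 10]
res <- optimize(my.f, lower = -10, upper = 10)

# Analytical solution: f'(x) = 2x - 2 = 0  =>  x = 1
x_star <- 1

c(numerical = res$minimum, analytical = x_star)
c(numerical = res$objective, analytical = my.f(x_star))
```

optimize() returns an approximation, so the two values agree only up to its tolerance (the tol argument).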
    87. 87. Programming (10) We can integrate the function over the interval -10 to 10 using: integrate(my.f, lower = -10, upper = 10) 746.6667 with absolute error < 4.1e-12 which gives an answer together with an estimate of the absolute error. 87
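The same kind of check works for the integral, because the quadratic has a simple antiderivative, x^3/3 − x^2 + 4x. In this sketch the helper name antiderivative and the variables exact and numeric are illustrative.

```r
my.f <- function(x) x^2 - 2*x + 4

# Antiderivative of x^2 - 2x + 4
antiderivative <- function(x) x^3/3 - x^2 + 4*x

exact   <- antiderivative(10) - antiderivative(-10)   # 2240/3
numeric <- integrate(my.f, lower = -10, upper = 10)

c(exact = exact, numerical = numeric$value)
```

integrate() returns a list, so the estimate is in numeric$value and the error bound in numeric$abs.error.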
    88. 88. Programming (11)
        plot(my.f, -15, +15)
        v <- seq(-10, 10, 0.01)
        x <- c(-10, v, 10)
        y <- c(0, my.f(v), 0)
        polygon(x, y, col="gray")
    89. 89. Publication-Quality Output (1) Research doesn’t end when the last statistical analysis is completed. We need to include the results in a report. The xtable function converts an R object to an xtable object, which can then be printed as a LaTeX table. LaTeX is a document preparation system for high-quality typesetting (http://www.latex-project.org).
        library(xtable)
        print(xtable(model))
    90. 90. Publication-Quality Output (2)
        library(xtable)
        example(aov)
        print(xtable(npk.aov))
    91. 91. Publication-Quality Output (3) The ggplot2 package is an elegant alternative to the base graphics system. It has two complementary uses:
        • Producing publication quality graphics using very simple syntax that is similar to that of base graphics. ggplot2 tends to make smart default choices for color, scale etc.
        • Making more sophisticated/customized plots that go beyond the defaults.
    92. 92. Publication-Quality Output (4) 92
    93. 93. Final words! How Large is Your Family? How many brothers and sisters are there in your family including yourself? The average number of children in families was about 2. Can you explain the difference between this value and the class average? Birthday Problem! The problem is to compute the approximate probability that in a room of n people, at least two have the same birthday. 93
    94. 94. Online Resources http://tryr.codeschool.com http://www.r-project.org http://www.statmethods.net http://www.r-bloggers.com http://www.r-tutor.com http://blog.revolutionanalytics.com/r 94
    95. 95. Thank You 95