Upcoming SlideShare
×

# Data manipulation on r

610 views

Published on

Manipulating data with E

Published in: Education
0 Likes
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

• Be the first to like this

Views
Total views
610
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
17
0
Likes
0
Embeds 0
No embeds

No notes for slide

### Data manipulation on r

1. 1. Data Manipulation on R Factor Manipulations,subset,sorting and Reshape Abhik Seal Indiana University School of Informatics and Computing(dsdht.wikispaces.com)
2. 2. Basic Manipulating Data So far , we've covered how to read in data from various ways like from files, internet and databases and reading various formats of files. This session we are interested to manipulate data after reading in the file for easy data processing. 2/35
3. 3. Sorting and Ordering data sort(x,decreasing=FALSE) : 'sort (or order) a vector or factor (partially) into ascending or descending order.' order(...,decreasing=FALSE):'returns a permutation which rearranges its first argument into ascending or descending order, breaking ties by further arguments.' x <- c(1,5,7,8,3,12,34,2) sort(x) ## [1] 1 2 3 5 7 8 12 34 order(x) ## [1] 1 8 5 2 3 4 6 7 3/35
4. 4. Some examples of sorting and ordering # sort by mpg newdata <- mtcars[order(mpg),] head(newdata,3) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 ## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 ## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 # sort by mpg and cyl newdata <- mtcars[order(mpg, cyl),] head(newdata,3) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 ## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 ## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 4/35
5. 5. Ordering with plyr library(plyr) head(arrange(mtcars,mpg),3) ## mpg cyl disp hp drat wt qsec vs am gear carb ## 1 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 ## 2 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 ## 3 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 head(arrange(mtcars,desc(mpg)),3) ## mpg cyl disp hp drat wt qsec vs am gear carb ## 1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 ## 2 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 ## 3 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 5/35
6. 6. Subsetting data set.seed(12345) #create a dataframe X<-data.frame("A"=sample(1:10),"B"=sample(11:20),"C"=sample(21:30)) # Add NA VALUES X<-X[sample(1:10),];X\$B[c(1,6,10)]=NA head(X) ## A B C ## 8 4 NA 27 ## 1 8 11 25 ## 2 10 12 23 ## 5 3 13 24 ## 3 7 16 28 ## 10 5 NA 26 6/35
7. 7. Basic data subsetting # Accessing only first row X[1,] ## A B C ## 8 4 NA 27 # accessing only first column X[,1] ## [1] 4 8 10 3 7 5 9 1 2 6 # accessing first row and first column X[1,1] ## [1] 4 7/35
8. 8. And/OR's head(X[(X\$A <=6 & X\$C > 24),],3) ## A B C ## 8 4 NA 27 ## 10 5 NA 26 ## 7 2 19 29 head(X[(X\$A <=6 | X\$C > 24),],3) ## A B C ## 8 4 NA 27 ## 1 8 11 25 ## 5 3 13 24 8/35
9. 9. select Non NA values Data Frame # select the dataframe without NA values in B column head(X[which(X\$B!='NA'),],4) ## A B C ## 1 8 11 25 ## 2 10 12 23 ## 5 3 13 24 ## 3 7 16 28 # select those which have values > 14 head(X[which(X\$B>11),],4) ## A B C ## 2 10 12 23 ## 5 3 13 24 ## 3 7 16 28 ## 4 9 20 30 9/35
10. 10. # creating a data frame with 2 variables data <- data.frame(x1=c(2,3,4,5,6),x2=c(5,6,7,8,1)) list_data<-list(dat=data,vec.obj=c(1,2,3)) list_data ## \$dat ## x1 x2 ## 1 2 5 ## 2 3 6 ## 3 4 7 ## 4 5 8 ## 5 6 1 ## ## \$vec.obj ## [1] 1 2 3 # accessing second element of the list_obj objects list_data[[2]] ## [1] 1 2 3 10/35
11. 11. Factors Factors are used to represent categorical data, and can also be used for ordinal data (ie categories have an intrinsic ordering) Note that R reads in character strings as factors by default in functions like read.table()'The function factor is used to encode a vector as a factor (the terms 'category' and 'enumerated type' are also used for factors). If argument ordered is TRUE, the factor levels are assumed to be ordered. For compatibility with S there is also a function ordered.'is.factor, is.ordered, as.factor and as.ordered are the membership and coercion functions for these classes. 11/35
12. 12. Factors Suppose we have a vector of case-control status cc=factor(c("case","case","case","control","control","control")) cc ## [1] case case case control control control ## Levels: case control levels(cc)=c("control","case") cc ## [1] control control control case case case ## Levels: control case 12/35
13. 13. Factors Factors can be converted to numericor charactervery easily x=factor(c("case","case","case","control","control","control"),levels=c("control","case")) as.character(x) ## [1] "case" "case" "case" "control" "control" "control" as.numeric(x) ## [1] 2 2 2 1 1 1 13/35
14. 14. Cut Now that we know more about factors, cut()will make more sense: x=1:100 cx=cut(x,breaks=c(0,10,25,50,100)) head(cx) ## [1] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10] ## Levels: (0,10] (10,25] (25,50] (50,100] table(cx) ## cx ## (0,10] (10,25] (25,50] (50,100] ## 10 15 25 50 14/35
15. 15. Cut We can also leave off the labels cx=cut(x,breaks=c(0,10,25,50,100),labels=FALSE) head(cx) ## [1] 1 1 1 1 1 1 table(cx) ## cx ## 1 2 3 4 ## 10 15 25 50 15/35
16. 16. Cut cx=cut(x,breaks=c(10,25,50),labels=FALSE) head(cx) ## [1] NA NA NA NA NA NA table(cx) ## cx ## 1 2 ## 15 25 table(cx,useNA="ifany") ## cx ## 1 2 <NA> ## 15 25 60 16/35
17. 17. Adding to data frames m1=matrix(1:9,nrow=3,ncol=3,byrow=FALSE) m1 ## [,1] [,2] [,3] ## [1,] 1 4 7 ## [2,] 2 5 8 ## [3,] 3 6 9 m2=matrix(1:9,nrow=3,ncol=3,byrow=TRUE) m2 ## [,1] [,2] [,3] ## [1,] 1 2 3 ## [2,] 4 5 6 ## [3,] 7 8 9 17/35
18. 18. Adding using cbind You can add columns (or another matrix/data frame) to a data frame or matrix using cbind()('column bind'). You can also add rows (or another matrix/data frame) using rbind()('row bind'). Note that the vector you are adding has to have the same length as the number of rows (for cbind()) or the number of columns (rbind()) cbind(m1,m2) ## [,1] [,2] [,3] [,4] [,5] [,6] ## [1,] 1 4 7 1 2 3 ## [2,] 2 5 8 4 5 6 ## [3,] 3 6 9 7 8 9 18/35
19. 19. Reshape data Datasets layout could be long or wide. In long-layout, multiple rows represent a single subject's record, whereas in wide-layout, a single row represents a single subject's record. In doing some statistical analysis sometimes we require wide data and sometimes long data, so that we can easily reshape the data to meet the requirements of statistical analysis. Data reshaping is just a rearrangement of the form of the data—it does not change the content of the dataset. This section mainly focuses the melt and cast paradigm of reshaping datasets, which is implemented in the reshape contributed package. Later on, this same package is reimplemented with a new name, reshape2, which is much more time and memory efficient (the Reshaping Data with the reshape Package paper, by Wickham, which can be found at (http://www.jstatsoft.org/v21/i12/paper)) 19/35
20. 20. Wide data has a column for each variable. For example, this is wide-format data: # ozone wind temp # 1 23.62 11.623 65.55 # 2 29.44 10.267 79.10 # 3 59.12 8.942 83.90 # 4 59.96 8.794 83.97 Data in long format # variable value # 1 ozone 23.615 # 2 ozone 29.444 # 3 ozone 59.115 # 4 ozone 59.962 # 5 wind 11.623 # 6 wind 10.267 # 7 wind 8.942 # 8 wind 8.794 # 9 temp 65.548 # 10 temp 79.100 # 11 temp 83.903 # 12 temp 83.968 20/35
21. 21. reshape 2 Package "In reality, you need long-format data much more commonly than wide-format data. For example, ggplot2 requires long-format data plyr requires long-format data, and most modelling functions (such as lm(), glm(), and gam()) require long-format data. But people often find it easier to record their data in wide format." reshape2 is based around two key functions: melt and cast: melt takes wide-format data and melts it into long-format data. cast takes long-format data and casts it into wide-format data. 21/35
22. 22. Melt library(reshape2) head(airquality,2) ## ozone solar.r wind temp month day ## 1 41 190 7.4 67 5 1 ## 2 36 118 8.0 72 5 2 aql <- melt(airquality) # [a]ir [q]uality [l]ong format head(aql,5) ## variable value ## 1 ozone 41 ## 2 ozone 36 ## 3 ozone 12 ## 4 ozone 18 ## 5 ozone NA 22/35
23. 23. By default, melt has assumed that all columns with numeric values are variables with values. Maybe here we want to know the values of ozone, solar.r, wind, and temp for each month and day. We can do that with melt by telling it that we want month and day to be “ID variables”. ID variables are the variables that identify individual rows of data. m <- melt(airquality, id.vars = c("month", "day")) head(m,4) ## month day variable value ## 1 5 1 ozone 41 ## 2 5 2 ozone 36 ## 3 5 3 ozone 12 ## 4 5 4 ozone 18 23/35
24. 24. Melt also allow us to control the column names in long data format m <- melt(airquality, id.vars = c("month", "day"), variable.name = "climate_variable", value.name = "climate_value") head(m) ## month day climate_variable climate_value ## 1 5 1 ozone 41 ## 2 5 2 ozone 36 ## 3 5 3 ozone 12 ## 4 5 4 ozone 18 ## 5 5 5 ozone NA ## 6 5 6 ozone 28 24/35
25. 25. Long- to wide-format data: the cast functions In reshape2 there are multiple cast functions. Since you will most commonly work with data.frame objects, we’ll explore the dcast function. (There is also acast to return a vector, matrix, or array.) dcast uses a formula to describe the shape of the data. m <- melt(airquality, id.vars = c("month", "day")) aqw <- dcast(m, month + day ~ variable) head(aqw) ## month day ozone solar.r wind temp ## 1 5 1 41 190 7.4 67 ## 2 5 2 36 118 8.0 72 ## 3 5 3 12 149 12.6 74 ## 4 5 4 18 313 11.5 62 ## 5 5 5 NA NA 14.3 56 ## 6 5 6 28 NA 14.9 66 Here, we need to tell dcast that month and day are the ID variables. Besides re-arranging the columns, we’ve recovered our original data. 25/35
26. 26. Data Manipulation Using plyr For large-scale data, we can split the dataset, perform the manipulation or analysis, and then combine it into a single output again. This type of split using default R is not much efficient, and to overcome this limitation, Wickham, in 2011, developed an R package called plyr in which he efficiently implemented the split-apply-combine strategy. We can compare this strategy to map-reduce strategy for processing large amount of data. In the coming slides i will give example of the split-apply-combine strategy using · Without Loops · With Loops · Using plyr package 26/35
27. 27. Without loops I am using the iris dataset here 1. Split the iris dataset into three parts. 2. Remove the species name variable from the data. 3. Calculate the mean of each variable for the three different parts separately. 4. Combine the output into a single data frame. iris.set <- iris[iris\$Species=="setosa",-5] iris.versi <- iris[iris\$Species=="versicolor",-5] iris.virg <- iris[iris\$Species=="virginica",-5] # calculating mean for each piece (The apply step) mean.set <- colMeans(iris.set) mean.versi <- colMeans(iris.versi) mean.virg <- colMeans(iris.virg) # combining the output (The combine step) mean.iris <- rbind(mean.set,mean.versi,mean.virg) # giving row names so that the output could be easily understood rownames(mean.iris) <- c("setosa","versicolor","virginica") 27/35
28. 28. With Loops mean.iris.loop <- NULL for(species in unique(iris\$Species)) { iris_sub <- iris[iris\$Species==species,] column_means <- colMeans(iris_sub[,-5]) mean.iris.loop <- rbind(mean.iris.loop,column_means) } # giving row names so that the output could be easily understood rownames(mean.iris.loop) <- unique(iris\$Species) NB: In the split-apply-combine strategy is that each piece should be independent of the other. The strategy wont work if one piece is dependent upon one another. 28/35
29. 29. Using plyr library (plyr) ddply(iris,~Species,function(x) colMeans(x[,- which(colnames(x)=="Species")])) ## Species Sepal.Length Sepal.Width Petal.Length Petal.Width ## 1 setosa 5.006 3.428 1.462 0.246 ## 2 versicolor 5.936 2.770 4.260 1.326 ## 3 virginica 6.588 2.974 5.552 2.026 mean.iris.loop ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## setosa 5.006 3.428 1.462 0.246 ## versicolor 5.936 2.770 4.260 1.326 ## virginica 6.588 2.974 5.552 2.026 29/35
30. 30. Merging data frames # Make a data frame mapping story numbers to titles stories <- read.table(header=T, text=' storyid title 1 lions 2 tigers 3 bears ') # Make another data frame with the data and story numbers (no titles) data <- read.table(header=T, text=' subject storyid rating 1 1 6.7 1 2 4.5 1 3 3.7 2 2 3.3 2 3 4.1 2 1 5.2 ') 30/35
31. 31. Merge the two data frames merge(stories, data, "storyid") ## storyid title subject rating ## 1 1 lions 1 6.7 ## 2 1 lions 2 5.2 ## 3 2 tigers 1 4.5 ## 4 2 tigers 2 3.3 ## 5 3 bears 1 3.7 ## 6 3 bears 2 4.1 If the two data frames have different names for the columns you want to match on, the names can be specified: # In this case, the column is named 'id' instead of storyid stories2 <- read.table(header=T, text=' id title 1 lions 2 tigers 3 bears ') merge(x=stories2, y=data, by.x="id", by.y="storyid") 31/35
32. 32. Resources and Materials used · Data Manipulation with R by Phil Spector · Getting and Cleaning data Coursera Course · plyr by Hadley Wickham · Andrew Jaffe Notes · R cookbok 32/35