Your SlideShare is downloading. ×
0
BASICS OF DATA MUNGING IN R      &
DATA MUNGING
OUR GOAL••    –    –    –    –    –
SOURCES:LET’S GET DATA
DATA LOOKS LIKE THIS
TEXT INTO R       > read.table()•      > read.csv()       # Read a table       > url <-       http://robjhyndman.com/tsdld...
PLAIN TEXT
DATA LOOKS LIKE THIS
MARKUP INTO R                                                        •> library(XML)> url2 <-‘http://www.faa.gov/data_rese...
HTML
FROM OTHER LANGUAGES> library(xlsx)> library(foreign)                      •                      •# SAS> read.xport(file)...
CUSTOM PACKAGES              >   library(quantmod)•             >   library(twitteR)              >   library(RNYTimes)   ...
BUILDING ANDEXPLORING DATA
TYPES OF DATA•           >            >                a                b                    <-                    <-     ...
CONVERSION AND COERCION              > as.data.frame(a)•               a              1 1    –         2 2              3 ...
VARIABLE INTERROGATION                 > Y <- runif(200)•                > str(Y)                 num [1:200] 0.5053 0.356...
INDEXESA[m,n]A[ ,n]
USING INDEXES•                                                       > b                                                  ...
USE NAMES INSTEAD: $                                           > X <- 1•                                          > X$name...
MORE STRUCTUREcbind(a,b)rbind(a,b)
TRANSFORMING DATA
SPEAK LIKE A NATIVE•                  > mymatrix <-                   matrix(rep(seq(2,6,by=2), 3),                   ncol...
LAPPLY•        > lapply(mymatrix[,1],sum)         [[1]]         [1] 2         [[2]]•        [1] 4         [[3]]         [1...
LONG TO WIDELanguage Skill Users1        R High     102        R   Med     103        R   Low     104      SAS High       ...
RESHAPE GYMNASTICS                                         > df <-•                                        data.frame(c("R...
NEW VARIABLES IN-PLACE                 > head(mtcars)[1]•                Mazda RX4                                    mpg ...
MASH UP> head(unemp)      region rank Aug. 2012 Sept. 201 change DEV state_code State Abbreviation1    alabama   14       ...
THANKS••
Upcoming SlideShare
Loading in...5
×

Data Munging in R - Chicago R User Group

3,875

Published on

It was beginner's night at the Chicago R User Group. I presented on the basics of Data Munging---the skill of obtaining data and reworking it into more useful formats. This is a huge topic you can't possibly do justice to in a single presentation, but was intended to give the audience an overview to use help() for in the future, and motivate some R explorations

From Adam Hogan at Design & Analytics

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
3,875
On Slideshare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
70
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Transcript of "Data Munging in R - Chicago R User Group"

  1. 1. BASICS OF DATA MUNGING IN R &
  2. 2. DATA MUNGING
  3. 3. OUR GOAL•• – – – – –
  4. 4. SOURCES:LET’S GET DATA
  5. 5. DATA LOOKS LIKE THIS
  6. 6. TEXT INTO R > read.table()• > read.csv() # Read a table > url <- http://robjhyndman.com/tsdldata/robert s/beards.dat > read.table(url, header=FALSE,skip=4) V1 1 20• 2 24 3 10 4 21 5 28 … … # Read a CSV file > Y <- read.csv(filename, header=F)
  7. 7. PLAIN TEXT
  8. 8. DATA LOOKS LIKE THIS
  9. 9. MARKUP INTO R •> library(XML)> url2 <-‘http://www.faa.gov/data_research/passengers_cargo/unruly_passengers/’> X <- readHTMLTable(url2, header=T,stringsAsFactors=FALSE)[[1]]> X Year Total •1 1995 1462 1996 1843 1997 2354 19985 1999 200 226 –6 2000 2277 2001 3008 2002 3069 2003 302 –10 2004 33011 200512 2006 226 156 –13 200714 2008 176 134 –15 2009 17616 2010 14817 2011 13118 2012 12 as of April 10, 2012 –
  10. 10. HTML
  11. 11. FROM OTHER LANGUAGES> library(xlsx)> library(foreign) • •# SAS> read.xport(file)# Stata> read.dta(file) •# SPSS> read.spss(file) – –# Matlab> read.octave(file) – –# minitab> read.mtp(file) – – –
  12. 12. CUSTOM PACKAGES > library(quantmod)• > library(twitteR) > library(RNYTimes) > library(RClimate) > getSymbols("GOOG") [1] "GOOG"• > searchTwitter(#ilovestatistics – ,n=10)[[2]] [1]"Statistics: the best kind of homework #ilovestatistics #nerd #shouldhavebeenastatistician – #gradschoolproblems"
  13. 13. BUILDING ANDEXPLORING DATA
  14. 14. TYPES OF DATA• > > a b <- <- c(1,2,3,4) matrix(c(1,2,3,4), nrow=2) – > c <- list("a"="fred", "b"="bill") – > d <- data.frame(b) – > a # VECTOR – [1] 1 2 3 4• > b # MATRIX [,1] [,2] [1,] 1 3 [2,] 2 4 > c # LIST (Note the key-value structure)• $a [1] "fred" $b [1] "bill" – > d # DATA FRAME X1 X2 1 1 3 2 2 4
  15. 15. CONVERSION AND COERCION > as.data.frame(a)• a 1 1 – 2 2 3 3 4 4 > data.frame(a) a – 1 1 2 2 3 3 • 4 4 > as.matrix(a, nrow=2) # WATCH OUT • [,1] [1,] 1 [2,] 2 – [3,] 3 [4,] 4 > matrix(a, nrow=2) # THIS INSTEAD! [,1] [,2] [1,] 1 3 [2,] 2 4
  16. 16. VARIABLE INTERROGATION > Y <- runif(200)• > str(Y) num [1:200] 0.5053 0.3564 0.0359 0.7377 0.0302 ... > head(Y) # GIVE ME THE FIRST 5 [1] 0.50525553 0.35636648 0.03589792• 0.73766891 0.03020607 0.50628327 – > tail(Y) # GIVE ME THE LAST 5 [1] 0.6612501 0.9930194 0.8392855 – 0.5459498 0.2587155 0.3704778 – > dim(Y) # NOPE! HE IS A VECTOR NULL – > length(Y) [1] 200
  17. 17. INDEXESA[m,n]A[ ,n]
  18. 18. USING INDEXES• > b [,1] [,2]• [1,] 1 3• [2,] 2 4 > b[,1] # ALL ROWS, FIRST COLUMN• [1] 1 2• > b[2,] # SECOND ROW, ALL COLUMNS [1] 2 4 > head(unemp) rank region Aug. 2012 Sept. 201 change > b[-1,] # ALL ROWS EXCEPT 1 14 14 alabama 8.5 8.3 -0.2 [1] 2 4 15 14 alaska 7.7 7.5 -0.2 27 27 arizona 8.3 8.2 -0.1 16 14 arkansas 7.3 7.1 -0.2 > b>3 2 2 california 10.6 10.2 -0.4 [,1] [,2] 17 14 colorado 8.2 8.0 -0.2 [1,] FALSE FALSE [2,] FALSE TRUE > unemp[unemp[4]>10,] rank region Aug. 2012 Sept. 201 change 2 2 california 10.6 10.2 -0.4 > GOOG[GOOG[,6]>768.00,6] 11 6 nevada 12.1 11.8 -0.3 GOOG.Adjusted 24 14 rhode island 10.7 10.5 -0.2 2012-10-04 768.05
  19. 19. USE NAMES INSTEAD: $ > X <- 1• > X$name <- "Fred" Warning message: In X$name <- "Fred" : Coercing LHS to a list > X$occupation <- "Doctor" > X$age <- 21 –> name <- c("Fred", "Bill") > X> occupation <- c("Doctor", "Dancer") [[1]]> people <- data.frame(name, occupation) [1] 1> people $name name occupation [1] "Fred"1 Fred Doctor2 Bill Dancer $occupation> people$age <- 35 [1] "Doctor"> people name occupation age $age1 Fred Doctor 35 [1] 212 Bill Dancer 35people[people$name=="Fred",]$age=40 > X$name == X[2]
  20. 20. MORE STRUCTUREcbind(a,b)rbind(a,b)
  21. 21. TRANSFORMING DATA
  22. 22. SPEAK LIKE A NATIVE• > mymatrix <- matrix(rep(seq(2,6,by=2), 3), ncol = 3) – • > mymatrix [,1] [,2] [,3] – [1,] 2 2 2 • [2,] 4 4 4 [3,] 6 6 6 – • > apply(mymatrix, 1, sum) [1] 6 12 18 – • > apply(mymatrix, 2, sum) [1] 12 12 12
  23. 23. LAPPLY• > lapply(mymatrix[,1],sum) [[1]] [1] 2 [[2]]• [1] 4 [[3]] [1] 6 – – > sapply(mymatrix[,1],sum) – [1] 2 4 6
  24. 24. LONG TO WIDELanguage Skill Users1 R High 102 R Med 103 R Low 104 SAS High 15 SAS Med 256 SAS Low 20Language Users.High Users.Med Users.Low1 R 10 10 104 SAS 1 25 20
  25. 25. RESHAPE GYMNASTICS > df <-• data.frame(c("R","R","R","SAS","SAS","SAS"), c("High","Med","Low","High","Med","Low"), c(10,5,10,1,25,20)); colnames(df) <- – c("Language","Skill","Users") > df Language Skill Users 1 R High 10 2 R Med 10 – 3 R Low 10 4 SAS High 1• 5 6 SAS SAS Med Low 25 20 > reshape(df, idvar="Language", timevar="Skill", direction="wide") > reshape(df2, direction="long") Language Users.High Users.Med Users.Low Language Skill Users.High R.High R High 10 1 R 10 10 10 SAS.High SAS High 1 4 SAS 1 25 20 R.Med R Med 10 SAS.Med SAS Med 25 R.Low R Low 10 SAS.Low SAS Low 20 > df3[order(df3$Language),]
  26. 26. NEW VARIABLES IN-PLACE > head(mtcars)[1]• Mazda RX4 mpg 21.0 Mazda RX4 Wag 21.0 Datsun 710 22.8 – Hornet 4 Drive 21.4 Hornet Sportabout 18.7 Valiant 18.1 – > head(with(mtcars, mpg*10)) # NEW VECTOR• [1] 210 210 228 214 187 181 – > head(transform(mtcars, electricdreams=mpg*10))[c(1,12)] mpg electricdreams – Mazda RX4 21.0 210 Mazda RX4 Wag 21.0 210 – Datsun 710 22.8 228 Hornet 4 Drive 21.4 214 Hornet Sportabout 18.7 187 Valiant 18.1 181
  27. 27. MASH UP> head(unemp) region rank Aug. 2012 Sept. 201 change DEV state_code State Abbreviation1 alabama 14 8.5 8.3 -0.2 0.4 01 AL2 alaska 14 7.7 7.5 -0.2 -0.4 02 AK3 arizona 27 8.3 8.2 -0.1 0.3 04 AZ4 arkansas 14 7.3 7.1 -0.2 -0.8 05 AR5 california 2 10.6 10.2 -0.4 2.3 06 CA6 colorado 14 8.2 8.0 -0.2 0.1 08 CO
  28. 28. THANKS••
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×