Stat405   Subsetting & shortcuts


                            Hadley Wickham
Tuesday, 7 September 2010
Roadmap
                   • Lectures 1-3: basic graphics
                   • Lectures 4-6: basic data handling
                   • Lectures 7-9: basic functions


                   • The absolutely most essential tools. The rest
                     of the course is about building your vocabulary
                     and learning how to use the tools together.


               1. Character subsetting
               2. Sorting
               3. Shortcuts
               4. Iteration
               5. (Optional extra: command line tips)



Subsetting

Your turn

                   In pairs, try and recall the five types of
                   subsetting we talked about last week.
                   You have one minute!




                   Index type   Effect
                   blank        include all
                   integer      +ve: include; -ve: exclude
                   logical      include TRUEs
                   character    look up by name
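
As a quick recap, here is a minimal sketch of the five kinds of subsetting applied to one small named vector. The vector `x` and the result names are made up for illustration; they are not from the lecture.

```r
# A small named vector (hypothetical example data)
x <- c(a = 10, b = 20, c = 30, d = 40)

all_values <- x[]              # blank: include all
first_two  <- x[c(1, 2)]       # positive integers: include these positions
drop_first <- x[-1]            # negative integers: exclude these positions
big_values <- x[x > 25]        # logical: keep the elements where TRUE
by_name    <- x[c("d", "a")]   # character: look up by name
```

The same five index types work in either position of `df[rows, cols]` for a data frame.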


# Matches by names
     diamonds[1:5, c("carat", "cut", "color")]

     # Useful technique: change labelling
     c("Fair" = "C", "Good" = "B", "Very Good" = "B+",
     "Premium" = "A", "Ideal" = "A+")[diamonds$cut]

     # Can also be used to collapse levels
     table(c("Fair" = "C", "Good" = "B", "Very Good" =
     "B", "Premium" = "A", "Ideal" = "A")[diamonds$cut])

     # (see ?cut for continuous to discrete equivalent)
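
One caveat worth checking when you reuse this lookup trick: R indexes by a factor's integer codes, not by its labels. The relabelling above gives the intended result presumably because the names are listed in the same order as the levels of diamonds$cut. The sketch below uses a made-up factor `cut_f` and a lookup vector `grades` that is deliberately out of level order to show the difference; converting with as.character() forces a lookup by name.

```r
# Hypothetical lookup table, deliberately NOT in level order
grades <- c("Ideal" = "A+", "Fair" = "C", "Good" = "B")
cut_f  <- factor(c("Fair", "Ideal", "Good"),
                 levels = c("Fair", "Good", "Ideal"))

# Indexing by the factor uses its integer codes (1, 3, 2): positions, not names
by_code <- unname(grades[cut_f])

# Converting to character first looks each value up by name
by_name <- unname(grades[as.character(cut_f)])
```

Here `by_name` is the intended relabelling, while `by_code` silently pairs values with the wrong grades. For the mpg exercise, fl is a character vector, so the lookup is by name directly.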



Sorting a data frame
               x <- c(2, 4, 3, 1)
               order(x)
               # means: to get x in order, put 4th in
               # 1st, 1st in 2nd, 3rd in 3rd and 2nd in 4th
               x[order(x)]

               # What does this do?
               diamonds[order(diamonds$price), ]
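
To check the answer by eye, here is a small sketch that applies the same pattern to a four-row data frame. `gems` is a hypothetical stand-in for diamonds, and its values are invented.

```r
x <- c(2, 4, 3, 1)
ord    <- order(x)   # positions of the smallest, next smallest, ... value
sorted <- x[ord]     # x rearranged into increasing order

# Hypothetical stand-in for diamonds, small enough to inspect
gems <- data.frame(id = c("a", "b", "c", "d"),
                   price = c(500, 326, 2757, 400))

# Use the ordering of one column as the row index to sort the whole data frame
by_price <- gems[order(gems$price), ]
```

The rows of `by_price` run from cheapest to most expensive, and every column moves with its row.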


# Order by x, then y, then z
     order(diamonds$x, diamonds$y, diamonds$z)

     # Put in order of quality
     order(diamonds$color, desc(diamonds$cut),
       desc(diamonds$clarity))

     # desc() sorts in descending order;
     # it is found in the plyr package
     x[order(x)]
     x[order(desc(x))]
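
If plyr is not available, base R can produce the same descending orderings. This is a sketch, assuming that desc() amounts to negating the sort key; the `size` factor is invented for illustration.

```r
x <- c(2, 4, 3, 1)

# Two base R ways to sort a numeric vector in descending order
desc_a <- x[order(x, decreasing = TRUE)]
desc_b <- x[order(-x)]

# Factors cannot be negated directly; xtfrm() returns a numeric sort key
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"))
desc_size <- size[order(-xtfrm(size))]
```

Negating the key per variable matters when mixing directions, for example ascending on one column and descending on another inside a single order() call.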



Your turn
                   Reorder the mpg dataset from most to
                   least efficient.
                   The fl variable gives the type of fuel (r =
                   regular, d = diesel, p = premium, c = cng,
                   e = ethanol). Modify fl to spell out the
                   fuel type explicitly, collapsing c, d, and e
                   into a single "other" category.


Short cuts

Short cuts
                   You’ve been typing diamonds many, many
                   times. The following shortcuts save
                   typing, but may be a little harder to
                   understand, and will not work in some
                   situations. (Don’t forget the basics!)
                   Four are specific to data frames; one is
                   more generic.


                   Function      Package
                   subset        base
                   summarise     plyr
                   transform     base
                   arrange       plyr

                   plyr is loaded automatically with ggplot2, or
                   load it explicitly with library(plyr).
                   base is always loaded automatically.

# subset: short cut for subsetting
     zero_dim <- diamonds$x == 0 | diamonds$y == 0 |
       diamonds$z == 0
     diamonds[zero_dim & !is.na(zero_dim), ]

     subset(diamonds, x == 0 | y == 0 | z == 0)

     # summarise/summarize: short cut for creating summary
     biggest <- data.frame(
       price.max = max(diamonds$price),
       carat.max = max(diamonds$carat))

     biggest <- summarise(diamonds,
       price.max = max(price),
       carat.max = max(carat))
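
The `& !is.na(zero_dim)` guard in the long form matters as soon as the data contain missing values. Below is a minimal sketch on a made-up data frame `df` (not the diamonds data) showing what each version returns.

```r
# Hypothetical data with one missing x value
df <- data.frame(x = c(0, 1, NA, 2), y = c(5, 0, 3, 4))
zero_dim <- df$x == 0 | df$y == 0   # TRUE TRUE NA FALSE

naive   <- df[zero_dim, ]                      # the NA index yields a row of NAs
guarded <- df[zero_dim & !is.na(zero_dim), ]   # the long form from the slide
short   <- subset(df, x == 0 | y == 0)         # subset() treats NA as FALSE
```

`naive` has three rows, one of them entirely NA, while `guarded` and `short` both keep only the two rows that really contain a zero.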

# transform: short cut for adding new variables
     diamonds$volume <- diamonds$x * diamonds$y * diamonds$z
     diamonds$density <- diamonds$volume / diamonds$carat

     diamonds <- transform(diamonds, volume = x * y * z)
     diamonds <- transform(diamonds,
       density = volume / carat)

     # arrange: short cut for reordering
     diamonds <- diamonds[order(diamonds$price,
       desc(diamonds$carat)), ]

     diamonds <- arrange(diamonds, price, desc(carat))
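
The slide uses two separate transform() calls for a reason: base transform() evaluates all of its expressions against the original data frame, so a column created in a call is not visible to the other expressions in that same call. A sketch on a made-up two-row data frame, assuming no object named `volume` already exists in your workspace:

```r
# Hypothetical stand-in with the same column names as diamonds
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6), carat = c(0.5, 1))

# One call: density cannot see the volume column being created alongside it
one_call <- try(
  transform(df, volume = x * y * z, density = volume / carat),
  silent = TRUE
)

# Two calls: the second call sees the volume column added by the first
two_calls <- transform(df, volume = x * y * z)
two_calls <- transform(two_calls, density = volume / carat)
```

`one_call` holds a "try-error" object, while `two_calls` contains both new columns.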

# They all have similar syntax. The first argument
     # is a data frame, and all other arguments are
     # interpreted in the context of that data frame
     # (so you don't need to use data$ all the time)

     subset(df, subset)
     transform(df, var1 = expr1, ...)
     summarise(df, var1 = expr1, ...)
     arrange(df, var1, ...)

     # They all return a modified data frame. You still
     # have to save that to a variable if you want to
     # keep it


Your turn
                   Use summarise, transform, subset and arrange
                   to:
                   Find all diamonds bigger than 3 carats and
                   order from most expensive to cheapest.
                   Add a new variable that estimates the
                   diameter of the diamond (average of x and y).
                   Compute depth (z / diameter * 100) yourself.
                   How does it compare to the depth in the data?


Aside: never use attach!
                   Non-local effects; not symmetric; implicit,
                   not explicit.
                   Makes it very easy to make mistakes.
                   Use with() instead:
                   with(bnames, table(year, length))



# with is more general. Use in concert with other
     # functions, particularly those that don't have a data
     # argument

     diamonds$volume <- with(diamonds, x * y * z)

     # This won't work:
     with(diamonds, volume <- x * y * z)
     # with only changes lookup, not assignment
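
A small sketch of that last point, using a made-up data frame `df`: the assignment inside with() happens in a temporary environment built from the data frame, so the data frame itself is left unchanged.

```r
# Hypothetical stand-in with the same column names as diamonds
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))

# The assignment runs, but only inside with()'s temporary environment
with(df, volume <- x * y * z)
has_volume_before <- "volume" %in% names(df)   # still no volume column

# Assign the value that with() returns instead
df$volume <- with(df, x * y * z)
has_volume_after <- "volume" %in% names(df)
```

Only the second form adds the column, because with() returns the value of the expression and the assignment happens outside it.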




Iteration

Stories
                   The best data analyses tell a story, with a
                   natural flow from beginning to end.
                   For homework, try to come up with
                   three plots that tell a story.
                   Stories about a small sample of the data
                   can work well.



qplot(x, y, data = diamonds)
     qplot(x, z, data = diamonds)

     # Start by fixing incorrect values

     y_big <- diamonds$y > 10
     z_big <- diamonds$z > 6

     x_zero <- diamonds$x == 0
     y_zero <- diamonds$y == 0
     z_zero <- diamonds$z == 0

     diamonds$x[x_zero] <- NA
     diamonds$y[y_zero | y_big] <- NA
     diamonds$z[z_zero | z_big] <- NA
qplot(x, y, data = diamonds)
     # How can I get rid of those outliers?

     qplot(x, x - y, data = diamonds)
     qplot(x - y, data = diamonds, binwidth = 0.01)
     last_plot() + xlim(-0.5, 0.5)
     last_plot() + xlim(-0.2, 0.2)

     asym <- abs(diamonds$x - diamonds$y) > 0.2
     # x and y may contain NAs, so guard the index against NA
     diamonds_sym <- diamonds[!asym & !is.na(asym), ]

     # Did it work?
     qplot(x, y, data = diamonds_sym)
     qplot(x, x - y, data = diamonds_sym)
     # Something interesting is going on there!
     qplot(x, x - y, data = diamonds_sym,
       geom = "bin2d", binwidth = c(0.1, 0.01))

# What about x and z?
     qplot(x, z, data = diamonds_sym)
     qplot(x, x - z, data = diamonds_sym)
     # Subtracting doesn't work - z smaller than x and y
     qplot(x, x / z, data = diamonds_sym)
     # But better to log transform to make symmetrical
     qplot(x, log10(x / z), data = diamonds_sym)

     # and so on...




# How does symmetry relate to price?
     qplot(abs(x - y), price, data = diamonds_sym) +
       geom_smooth()

     qplot(abs(x - y), price, data = diamonds_sym,
       geom = "boxplot", group = round(abs(x - y) * 10))

     diamonds_sym$sym <- zapsmall(abs(diamonds_sym$x -
       diamonds_sym$y))
     qplot(sym, price, data = diamonds_sym,
       geom = "boxplot", group = sym)
     # Are asymmetric diamonds worth more?

     qplot(carat, price, data = diamonds_sym, colour = sym)
     qplot(log10(carat), log10(price), data = diamonds_sym,
       colour = sym, group = sym) +
       geom_smooth(method = lm, se = FALSE)


# Modelling

     summary(lm(log10(price) ~ log10(carat) + sym,
       data = diamonds_sym))
     # But statistical significance != practical
     # significance

     sd(diamonds_sym$sym, na.rm = TRUE)
     # [1] 0.02368828

     # So a 1 sd increase in sym changes log10(price)
     # by about -0.0104 (= 0.0237 * -0.44)
     # 10 ^ -0.0104 = 0.976
     # So a 1 sd increase in sym decreases price by ~2%
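
The back-of-the-envelope calculation can be reproduced in a few lines. This sketch takes the coefficient (-0.44) and the standard deviation straight from the slide rather than refitting the model; the variable names are invented.

```r
coef_sym <- -0.44        # coefficient for sym, as quoted on the slide
sd_sym   <- 0.02368828   # sd(diamonds_sym$sym, na.rm = TRUE), from the slide

delta_log10 <- coef_sym * sd_sym         # change in log10(price) per 1 sd of sym
price_ratio <- 10 ^ delta_log10          # multiplicative change in price
pct_change  <- (price_ratio - 1) * 100   # the same change as a percentage
```

Because the model is fitted on log10(price), an additive change on the log scale becomes a multiplicative change in price, which is why the result is a ratio just under 1 rather than a dollar amount.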


Command line
Why?

                   Provenance & reproducibility.
                   Working with remote servers.
                   Automation & scripting.
                   Common tools.




Basics
                   pwd: the location of the current directory
                   ls: the files in the current directory
                   cd: change to another directory
                   cd ..: change to parent directory
                   cd ~: change to home directory
                   mkdir: create a new directory
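
A short session that strings these commands together might look like the sketch below. It works inside a throwaway directory created with mktemp so that nothing in your real home directory changes; the directory names `stat405` and `hw2` are only examples.

```shell
# Create a scratch directory to practise in (assumes mktemp is available)
sandbox=$(mktemp -d)
cd "$sandbox"
pwd            # show where we are now

mkdir stat405  # create a new directory
cd stat405     # move into it
mkdir hw2
cd hw2
ls             # prints nothing: the new directory is empty

cd ..          # back up to stat405
cd ~           # jump to the home directory
pwd
```

Running pwd after each cd is a cheap way to confirm that you ended up where you expected.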


Your turn
                   Create a directory for stat405.
                   Inside that directory, create a directory for
                   homework 2.
                   Confirm that there are no files in that
                   directory.
                   Navigate back to your home directory.
                   What other files are there?

