03 extensions
Usage rights: CC Attribution-ShareAlike License

03 extensions: Presentation Transcript

  • Stat405 Graphical extensions & missing values Hadley Wickham Tuesday, 31 August 2010
  • Outline
    1. Graphical extensions (1d & 2d)
    2. Subsetting
  • 1d extensions
  • [Figure: histograms of price (x-axis, 0 to 15000) against count (y-axis, 0 to 6000), one panel per cut: Fair, Good, Very Good, Premium, Ideal]
    qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)
  • [Figure: the same histograms of price by cut, one panel per cut]
    What makes it difficult to compare the distributions? Brainstorm for 1 minute.
    qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)
  • Problems
    Each histogram is far away from the others, but we know stacking is hard to read → use another way of displaying densities.
    Varying relative abundance makes comparisons difficult → rescale to ensure constant area.
  • # Large distances make comparisons hard
    qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)
    # Stacked heights hard to compare
    qplot(price, data = diamonds, binwidth = 500, fill = cut)
    # Much better - but still have differing relative abundance
    qplot(price, data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut)
    # Instead of displaying count on y-axis, display density
    # .. indicates that variable isn't in original data
    qplot(price, ..density.., data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut)
    # To use with histogram, you need to be explicit
    qplot(price, ..density.., data = diamonds, binwidth = 500, geom = "histogram") + facet_wrap(~ cut)
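
The rescaling from count to density can be checked by hand. The following is a minimal base R sketch, not taken from the slides: the two simulated groups and the bin width are made-up assumptions, and hist() stands in for the binning that qplot performs. It shows that density is count divided by (number of observations × bin width), so every histogram ends up with total area 1 regardless of group size.

```r
# Hypothetical toy data: two groups of very different size
set.seed(1)
small_group <- rnorm(50, mean = 10)
large_group <- rnorm(5000, mean = 12)

# Shared bins so that both groups are counted the same way
binwidth <- 0.5
breaks <- seq(0, 22, by = binwidth)

h_small <- hist(small_group, breaks = breaks, plot = FALSE)
h_large <- hist(large_group, breaks = breaks, plot = FALSE)

# Total counts differ by a factor of 100, so raw bar heights are hard to compare
sum(h_small$counts)
sum(h_large$counts)

# Density = count / (n * binwidth); this matches what hist() reports
dens_small <- h_small$counts / (sum(h_small$counts) * binwidth)
all.equal(dens_small, h_small$density)

# After rescaling, each histogram has total area 1
area_small <- sum(h_small$density * binwidth)
area_large <- sum(h_large$density * binwidth)
```

With both areas equal to 1, differences between the curves reflect the shape of each distribution rather than how many observations each group has.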
  • Your turn
    Use this technique to explore the relationship between price and clarity, and carat and clarity.
  • 2d extensions
  • Idea              ggplot
    Small points      shape = I(".")
    Transparency      alpha = I(1/50)
    Jittering         geom = "jitter"
    Smooth curve      geom = "smooth"
    2d bins           geom = "bin2d" or geom = "hex"
    Density contours  geom = "density2d"
  • # There are two ways to add additional geoms
    # 1) A vector of geom names:
    qplot(price, carat, data = diamonds, geom = c("point", "smooth"))
    # 2) Add on extra geoms
    qplot(price, carat, data = diamonds) + geom_smooth()
    # This is how you get help about a specific geom:
    ?geom_smooth
    # or go to http://had.co.nz/ggplot2/geom_smooth.html
  • # To set aesthetics to a particular value, you need
    # to wrap that value in I()
    qplot(price, carat, data = diamonds, colour = "blue")
    qplot(price, carat, data = diamonds, colour = I("blue"))
    # Practical application: varying alpha
    qplot(price, carat, data = diamonds, alpha = I(1/10))
    qplot(price, carat, data = diamonds, alpha = I(1/50))
    qplot(price, carat, data = diamonds, alpha = I(1/100))
    qplot(price, carat, data = diamonds, alpha = I(1/250))
  • Your turn
    Explore the relationship between carat, price and clarity, using these techniques. Which did you find most useful?
  • Subsetting
  • Motivation
    Look at histograms and scatterplots of x, y, z from the diamonds dataset.
    Which values are clearly incorrect?
    Which values might we be able to correct?
    (Remember measurements are in millimetres, 1 inch ≈ 25 mm)
  • Plots
    qplot(x, data = diamonds, binwidth = 0.1)
    qplot(y, data = diamonds, binwidth = 0.1)
    qplot(z, data = diamonds, binwidth = 0.1)
    qplot(x, y, data = diamonds)
    qplot(x, z, data = diamonds)
    qplot(y, z, data = diamonds)
  • Modifying data
    To modify data, you must first know how to extract, or subset, it.
    Many different methods are available in R. We'll start with the most explicit, then learn some shortcuts next time.
    Basic structure:
    df$varname
    df[row index, column index]
  • $
    Remember str(diamonds)? That hints at how to extract individual variables:
    diamonds$carat
    diamonds$price
  • Index type   Effect
    blank        include all
    integer      +ve: include; -ve: exclude
    logical      include TRUEs
    character    lookup by name
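
The four index types can be tried side by side on a small data frame. This is a minimal sketch: the data frame df and its values are hypothetical stand-ins for diamonds, chosen only to keep the output short.

```r
# Hypothetical toy data frame standing in for diamonds
df <- data.frame(
  carat = c(0.2, 0.3, 1.1, 0.7),
  price = c(350, 400, 5200, 2100),
  cut   = c("Fair", "Good", "Ideal", "Good")
)

all_rows  <- df[, ]                     # blank: include all rows and columns
kept      <- df[c(1, 3), ]              # positive integers: include rows 1 and 3
dropped   <- df[-c(1, 3), ]             # negative integers: exclude rows 1 and 3
expensive <- df[df$price > 1000, ]      # logical: include rows where the test is TRUE
by_name   <- df[, c("carat", "price")]  # character: look up columns by name
```

The same index types work in either position of df[row index, column index], so they can be mixed, for example a logical row index with a character column index.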
  • Integer subsetting
  • # Nothing
    str(diamonds[, ])
    # Positive integers & nothing
    diamonds[1:6, ] # same as head(diamonds)
    diamonds[, 1:4] # watch out!
    # Two positive integers in rows & columns
    diamonds[1:10, 1:4]
    # Repeating input repeats output
    diamonds[c(1,1,1,2,2), 1:4]
    # Negative integers drop values
    diamonds[-(1:53900), -1]
  • # Useful technique: Order by one or more columns
    diamonds <- diamonds[order(diamonds$price), ]
    # Useful technique: Combine two tables
    carats <- data.frame(table(carat = diamonds$carat))
    mtch <- match(diamonds$carat, carats$carat)
    diamonds$carat_count <- carats$Freq[mtch]
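
Both techniques rely on integer subsetting: order() and match() each return positions, which are then used as an index. A minimal sketch on hypothetical data follows; the lookup table counts is written out by hand here (an assumption made for readability) instead of being built with table().

```r
# Hypothetical toy data
df <- data.frame(
  carat = c(0.3, 0.2, 0.3, 0.7),
  price = c(400, 350, 420, 2100)
)

# order() returns the row positions that would sort the column
ord <- order(df$price)   # 2 1 3 4
df <- df[ord, ]

# Hand-written lookup table: one row per distinct carat value
counts <- data.frame(carat = c(0.2, 0.3, 0.7), n = c(1, 2, 1))

# match() gives, for each row of df, the position of its carat in counts
pos <- match(df$carat, counts$carat)   # 1 2 2 3

# Repeated positions repeat the looked-up value, one per row of df
df$carat_count <- counts$n[pos]
```

Because pos has one entry per row of df, the lookup result lines up with the rows without any explicit loop.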
  • Logical subsetting
  • # The most complicated to understand, but
    # the most powerful. Lets you extract a
    # subset defined by some characteristic of
    # the data
    x_big <- diamonds$x > 10
    head(x_big)
    sum(x_big)
    mean(x_big)
    table(x_big)
    diamonds$x[x_big]
    diamonds[x_big, ]
  • small <- diamonds[diamonds$carat < 1, ]
    lowqual <- diamonds[diamonds$clarity %in% c("I1", "SI2", "SI1"), ]
    # Comparison functions:
    # < > <= >= != == %in%
    # Boolean operators: & | !
    small <- diamonds$carat < 1 & diamonds$price > 500
    lowqual <- diamonds$color == "D" | diamonds$cut == "Fair"
    [Diagram: combinations of two logical vectors a and b: a, b, a | b, a & b, a & !b, xor(a, b)]
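
The combinations in the diagram can be tabulated directly. A minimal sketch, where a and b are hypothetical vectors covering all four TRUE/FALSE pairings:

```r
# All four combinations of two logical values
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

# One column per Boolean expression, evaluated element by element
truth <- data.frame(
  a, b,
  a_or_b  = a | b,
  a_and_b = a & b,
  a_not_b = a & !b,
  a_xor_b = xor(a, b)
)
truth
```

Each operator works element by element, which is why the same expressions can be applied to whole columns of a data frame.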
  • Useful functions for logical vectors
    table(zeros)
    sum(zeros)
    mean(zeros)
    TRUE = 1; FALSE = 0
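
Because TRUE counts as 1 and FALSE as 0, sum() gives the number of matches and mean() gives the proportion. A minimal sketch with made-up measurements (the vector x is hypothetical):

```r
# Hypothetical measurements in mm; zeros are suspicious values
x <- c(0, 4.2, 5.1, 0, 6.3)
zeros <- x == 0

table(zeros)                  # counts of FALSE and TRUE
n_zero <- sum(zeros)          # number of TRUEs: 2
prop_zero <- mean(zeros)      # proportion of TRUEs: 0.4
```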
  • Your turn
    Select the diamonds that have:
    Equal x and y dimensions.
    Depth between 55 and 70.
    Carat smaller than the mean.
    Cost more than $10,000 per carat.
    Are of good quality or better.
  • Saving results
    # Prints to screen
    diamonds[diamonds$x > 10, ]
    # Saves to new data frame
    big <- diamonds[diamonds$x > 10, ]
    # Overwrites existing data frame. Dangerous!
    diamonds <- diamonds[diamonds$x < 10,]
  • diamonds <- diamonds[1, 1]
    diamonds
    # Uh oh!
    rm(diamonds)
    str(diamonds)
    # Phew!
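
The recovery works because the assignment only creates a copy in the workspace that hides the dataset supplied by the package; removing the copy makes the original visible again. A minimal sketch of the same effect, using mtcars from the datasets package as a stand-in for diamonds (an assumption made so that the example needs no extra packages):

```r
# mtcars comes from the datasets package, which is attached by default
mtcars <- mtcars[1, 1]              # workspace copy now hides the dataset
damaged <- is.data.frame(mtcars)    # FALSE: only a single number is left

rm(mtcars)                          # removes the workspace copy only
restored <- is.data.frame(mtcars)   # TRUE: the packaged dataset is found again
str(mtcars)
```

This safety net only applies to data that lives in a package. A data frame read from a file has no packaged original behind it, so overwriting it loses the data until the file is read again.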
  • Your turn
    Create a logical vector that selects diamonds with equal x & y. Create a new dataset that only contains these values.
    Create a logical vector that selects diamonds with incorrect/unusual x, y, or z values. Create a new dataset that omits these values. (Hint: do this one variable at a time)
  • equal_dim <- diamonds$x == diamonds$y
    equal <- diamonds[equal_dim, ]
    y_big <- diamonds$y > 10
    z_big <- diamonds$z > 6
    x_zero <- diamonds$x == 0
    y_zero <- diamonds$y == 0
    z_zero <- diamonds$z == 0
    zeros <- x_zero | y_zero | z_zero
    bad <- y_big | z_big | zeros
    good <- diamonds[!bad, ]