Stat405: Graphical extensions & missing values
Hadley Wickham
Tuesday, 31 August 2010
  1. Stat405: Graphical extensions & missing values. Hadley Wickham, Tuesday, 31 August 2010
  2. Outline: (1) Graphical extensions (1d & 2d); (2) Subsetting
  3. 1d extensions
  4. [Figure: histograms of price (x-axis, 0 to 15000) against count (y-axis, 0 to 6000), one panel per cut: Fair, Good, Very Good, Premium, Ideal]
     qplot(price, data = diamonds, binwidth = 500) +
       facet_wrap(~ cut)
  5. [Same figure and code as slide 4] What makes it difficult to compare the distributions? Brainstorm for 1 minute.
  6. Problems
     Each histogram is far away from the others, but we know stacking is hard to read
       → use another way of displaying densities
     Varying relative abundance makes comparisons difficult
       → rescale to ensure constant area
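As a rough sketch of what "rescale to ensure constant area" means: each bin count is divided by (total count × bin width), so the bars of every group sum to an area of 1 regardless of group size. The two groups below are made-up toy vectors, not the diamonds data, and the variable names are only illustrative.

```r
# Two hypothetical groups of very different size
small_group <- c(1, 2, 2, 3)
large_group <- rep(c(1, 2, 3, 3, 4), times = 100)

binwidth <- 1
breaks <- seq(0.5, 4.5, by = binwidth)

# Raw counts per bin: not comparable, because the groups differ in size
small_counts <- hist(small_group, breaks = breaks, plot = FALSE)$counts
large_counts <- hist(large_group, breaks = breaks, plot = FALSE)$counts

# density = count / (total count * bin width)
small_density <- small_counts / (sum(small_counts) * binwidth)
large_density <- large_counts / (sum(large_counts) * binwidth)

# Each rescaled histogram now has total area 1
small_area <- sum(small_density * binwidth)
large_area <- sum(large_density * binwidth)
```

After rescaling, the heights of the two groups are on the same scale, so their shapes can be compared directly.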
  7. # Large distances make comparisons hard
     qplot(price, data = diamonds, binwidth = 500) +
       facet_wrap(~ cut)

     # Stacked heights hard to compare
     qplot(price, data = diamonds, binwidth = 500, fill = cut)

     # Much better - but still have differing relative abundance
     qplot(price, data = diamonds, binwidth = 500,
       geom = "freqpoly", colour = cut)

     # Instead of displaying count on y-axis, display density
     # .. indicates that variable isn't in original data
     qplot(price, ..density.., data = diamonds, binwidth = 500,
       geom = "freqpoly", colour = cut)

     # To use with histogram, you need to be explicit
     qplot(price, ..density.., data = diamonds, binwidth = 500,
       geom = "histogram") + facet_wrap(~ cut)
  8. Your turn
     Use this technique to explore the relationship between price and clarity, and carat and clarity.
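One possible starting point for this exercise, written with the ggplot() interface rather than qplot(). Recent ggplot2 releases appear to mark both qplot() and the ..density.. notation as deprecated in favour of after_stat(), so this sketch assumes ggplot2 3.3.0 or later. The carat bin width of 0.1 is an arbitrary choice for illustration, not a value from the slides.

```r
library(ggplot2)

# Price by clarity: one frequency polygon per clarity level,
# each rescaled to unit area so the shapes are comparable
p_price <- ggplot(diamonds, aes(price, colour = clarity)) +
  geom_freqpoly(aes(y = after_stat(density)), binwidth = 500)

# Carat by clarity: same idea (binwidth = 0.1 is an assumed value)
p_carat <- ggplot(diamonds, aes(carat, colour = clarity)) +
  geom_freqpoly(aes(y = after_stat(density)), binwidth = 0.1)

print(p_price)
print(p_carat)
```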
  9. 2d extensions
  10. Idea              ggplot
      Small points      shape = I(".")
      Transparency      alpha = I(1/50)
      Jittering         geom = "jitter"
      Smooth curve      geom = "smooth"
      2d bins           geom = "bin2d" or geom = "hex"
      Density contours  geom = "density2d"
  11. # There are two ways to add additional geoms
      # 1) A vector of geom names:
      qplot(price, carat, data = diamonds,
        geom = c("point", "smooth"))

      # 2) Add on extra geoms
      qplot(price, carat, data = diamonds) + geom_smooth()

      # This is how you get help about a specific geom:
      ?geom_smooth
      # or go to http://had.co.nz/ggplot2/geom_smooth.html
  12. # To set aesthetics to a particular value, you need
      # to wrap that value in I()
      qplot(price, carat, data = diamonds, colour = "blue")
      qplot(price, carat, data = diamonds, colour = I("blue"))

      # Practical application: varying alpha
      qplot(price, carat, data = diamonds, alpha = I(1/10))
      qplot(price, carat, data = diamonds, alpha = I(1/50))
      qplot(price, carat, data = diamonds, alpha = I(1/100))
      qplot(price, carat, data = diamonds, alpha = I(1/250))
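The difference between the two colour = "blue" calls may be easier to see in the ggplot() interface, where mapping (inside aes()) and setting (outside aes()) are written differently. This is a sketch of the same distinction, not code from the slides. The object names `mapped` and `fixed` are illustrative.

```r
library(ggplot2)

# Mapping: "blue" is treated as a one-level variable, so the points take
# the first colour of the default scale and a legend is drawn
mapped <- ggplot(diamonds, aes(price, carat)) +
  geom_point(aes(colour = "blue"))

# Setting: the points are actually drawn in blue, with no legend;
# alpha is set the same way
fixed <- ggplot(diamonds, aes(price, carat)) +
  geom_point(colour = "blue", alpha = 1/50)
```

In qplot(), every argument is treated as a mapping by default, which is why I() is needed there to request a fixed value.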
  13. Your turn
      Explore the relationship between carat, price and clarity, using these techniques. Which did you find most useful?
  14. Subsetting
  15. Motivation
      Look at histograms and scatterplots of x, y, z from the diamonds dataset.
      Which values are clearly incorrect? Which values might we be able to correct?
      (Remember measurements are in millimetres, 1 inch ≈ 25 mm)
  16. Plots
      qplot(x, data = diamonds, binwidth = 0.1)
      qplot(y, data = diamonds, binwidth = 0.1)
      qplot(z, data = diamonds, binwidth = 0.1)
      qplot(x, y, data = diamonds)
      qplot(x, z, data = diamonds)
      qplot(y, z, data = diamonds)
  17. Modifying data
      To modify data, you must first know how to extract, or subset, it. Many different methods are available in R. We'll start with the most explicit, then learn some shortcuts next time.
      Basic structure:
        df$varname
        df[row index, column index]
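A minimal sketch of the two basic forms, using a small made-up data frame (`gems` and its values are hypothetical stand-ins for diamonds):

```r
gems <- data.frame(
  carat = c(0.23, 0.31, 1.10, 0.90),
  cut   = c("Ideal", "Good", "Fair", "Premium"),
  price = c(326, 335, 5200, 3900)
)

prices    <- gems$price                      # df$varname: one variable, as a vector
second    <- gems[2, ]                       # row 2, all columns
carats    <- gems[, "carat"]                 # all rows, one column
two_by_two <- gems[2:3, c("carat", "price")] # rows 2-3, two named columns
```

Leaving the row or column position blank keeps everything along that dimension.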
  18. $
      Remember str(diamonds)? That hints at how to extract individual variables:
        diamonds$carat
        diamonds$price
  19. Index type   Effect
      blank        include all
      integer      +ve: include; -ve: exclude
      logical      include TRUEs
      character    lookup by name
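The index types in this table can be tried side by side on a toy data frame. The data and names below are invented for illustration.

```r
gems <- data.frame(
  carat = c(0.2, 0.5, 1.1, 2.0),
  price = c(300, 900, 5000, 15000)
)

all_rows  <- gems[, ]                 # blank: include all rows and columns
first_two <- gems[c(1, 2), ]          # positive integers: include rows 1 and 2
without_1 <- gems[-1, ]               # negative integer: exclude row 1
big       <- gems[gems$carat > 1, ]   # logical: include rows where TRUE
prices    <- gems[, "price"]          # character: look up a column by name
```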
  20. Integer subsetting
  21. # Nothing
      str(diamonds[, ])

      # Positive integers & nothing
      diamonds[1:6, ] # same as head(diamonds)
      diamonds[, 1:4] # watch out!

      # Two positive integers in rows & columns
      diamonds[1:10, 1:4]

      # Repeating input repeats output
      diamonds[c(1,1,1,2,2), 1:4]

      # Negative integers drop values
      diamonds[-(1:53900), -1]
  22. # Useful technique: Order by one or more columns
      diamonds <- diamonds[order(diamonds$price), ]

      # Useful technique: Combine two tables
      carats <- data.frame(table(carat = diamonds$carat))
      mtch <- match(diamonds$carat, carats$carat)
      diamonds$carat_count <- carats$Freq[mtch]
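The same two techniques on a six-row toy data frame, where the result is small enough to check by eye. The data frame `gems` is hypothetical. The steps mirror the slide: order() returns row positions sorted by price, and match() finds, for each carat value, its row in the lookup table.

```r
gems <- data.frame(
  carat = c(0.5, 0.3, 0.5, 1.0, 0.3, 0.5),
  price = c(900, 400, 950, 5000, 350, 1000)
)

# Order rows by price (ascending)
gems <- gems[order(gems$price), ]

# Lookup table: one row per distinct carat value, with its frequency
carats <- data.frame(table(carat = gems$carat))

# For each row of gems, the position of its carat value in the lookup table
mtch <- match(gems$carat, carats$carat)

# Pull the matching frequency back into the main table
gems$carat_count <- carats$Freq[mtch]
```

After this runs, each row carries the number of rows that share its carat value.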
  23. Logical subsetting
  24. # The most complicated to understand, but
      # the most powerful. Lets you extract a
      # subset defined by some characteristic of
      # the data
      x_big <- diamonds$x > 10

      head(x_big)
      sum(x_big)
      mean(x_big)
      table(x_big)

      diamonds$x[x_big]
      diamonds[x_big, ]
  25. small <- diamonds[diamonds$carat < 1, ]
      lowqual <- diamonds[diamonds$clarity
        %in% c("I1", "SI2", "SI1"), ]

      # Comparison functions:
      # < > <= >= != == %in%

      # Boolean operators: & | !
      small <- diamonds$carat < 1 &
        diamonds$price > 500
      lowqual <- diamonds$color == "D" |
        diamonds$cut == "Fair"

      # Combinations of two logical vectors a and b (slide diagram labels):
      # a, b, a | b, a & b, a & !b, xor(a, b)
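A small truth table makes the Boolean combinations concrete. The vectors `a` and `b` below cover all four TRUE/FALSE pairings, and the column names are illustrative only.

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

truth <- data.frame(
  a, b,
  a_or_b  = a | b,      # TRUE if either is TRUE
  a_and_b = a & b,      # TRUE only if both are TRUE
  a_not_b = a & !b,     # TRUE if a is TRUE and b is FALSE
  xor_ab  = xor(a, b)   # TRUE if exactly one is TRUE
)
print(truth)

# %in% tests membership element by element
in_set <- c("I1", "VS1") %in% c("I1", "SI2", "SI1")
```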
  26. Useful functions for logical vectors
        table(zeros)
        sum(zeros)
        mean(zeros)
      TRUE = 1; FALSE = 0
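Because TRUE counts as 1 and FALSE as 0, sum() gives the number of TRUEs and mean() gives their proportion. A toy illustration with an invented measurement vector `x`:

```r
x <- c(0, 3.9, 4.2, 0, 5.1)
zeros <- x == 0

n_zero    <- sum(zeros)     # how many values are zero
prop_zero <- mean(zeros)    # what fraction of values are zero
counts    <- table(zeros)   # counts of FALSE and TRUE
```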
  27. Your turn
      Select the diamonds that:
        Have equal x and y dimensions.
        Have depth between 55 and 70.
        Have carat smaller than the mean.
        Cost more than $10,000 per carat.
        Are of good quality or better.
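One way these selections might look, sketched on a made-up five-row data frame that borrows the diamonds column names. Two assumptions are made here: "between 55 and 70" is read as inclusive, and "good quality or better" is read as a cut of Good, Very Good, Premium or Ideal.

```r
gems <- data.frame(
  carat = c(0.30, 0.70, 1.00, 1.50, 0.40),
  cut   = c("Fair", "Good", "Ideal", "Premium", "Very Good"),
  depth = c(54.0, 61.5, 62.0, 71.0, 60.0),
  price = c(400, 2500, 11000, 9000, 800),
  x     = c(4.3, 5.7, 6.4, 7.3, 4.7),
  y     = c(4.3, 5.6, 6.4, 7.2, 4.8)
)

equal_xy  <- gems[gems$x == gems$y, ]
mid_depth <- gems[gems$depth >= 55 & gems$depth <= 70, ]   # inclusive bounds assumed
light     <- gems[gems$carat < mean(gems$carat), ]
pricey    <- gems[gems$price / gems$carat > 10000, ]
good_plus <- gems[gems$cut %in% c("Good", "Very Good", "Premium", "Ideal"), ]
```

The same expressions should carry over to diamonds by replacing `gems` with `diamonds`.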
  28. Saving results
      # Prints to screen
      diamonds[diamonds$x > 10, ]

      # Saves to new data frame
      big <- diamonds[diamonds$x > 10, ]

      # Overwrites existing data frame. Dangerous!
      diamonds <- diamonds[diamonds$x < 10, ]
  29. diamonds <- diamonds[1, 1]
      diamonds

      # Uh oh!

      rm(diamonds)
      str(diamonds)

      # Phew!
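The recovery works because the assignment creates a copy in your workspace that masks the dataset shipped with the package. rm() deletes only that copy, so the packaged version becomes visible again. The same behaviour can be reproduced without ggplot2 using `mtcars` from base R's datasets package, which is chosen here purely for illustration.

```r
mtcars <- mtcars[1, 1]   # workspace copy now masks datasets::mtcars
masked <- mtcars         # a single number, not a data frame

rm(mtcars)               # removes only the workspace copy
str(mtcars)              # the packaged data frame is visible again
```

This safety net applies only to datasets that live in a package. A data frame you loaded or built yourself is gone once overwritten.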
  30. Your turn
      Create a logical vector that selects diamonds with equal x & y. Create a new dataset that only contains these values.
      Create a logical vector that selects diamonds with incorrect/unusual x, y, or z values. Create a new dataset that omits these values. (Hint: do this one variable at a time)
  31. equal_dim <- diamonds$x == diamonds$y
      equal <- diamonds[equal_dim, ]

      y_big <- diamonds$y > 10
      z_big <- diamonds$z > 6

      x_zero <- diamonds$x == 0
      y_zero <- diamonds$y == 0
      z_zero <- diamonds$z == 0
      zeros <- x_zero | y_zero | z_zero

      bad <- y_big | z_big | zeros
      good <- diamonds[!bad, ]