03 extensions


1. Stat405: Graphical extensions & missing values
    Hadley Wickham
    Tuesday, 31 August 2010
2.
    1. Graphical extensions (1d & 2d)
    2. Subsetting
3. 1d extensions
4. [Figure: histograms of price (x-axis, 0 to 15000) against count (y-axis, 0 to 6000), one panel per cut: Fair, Good, Very Good, Premium, Ideal]
    qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)
5. [Figure: histograms of price against count, one panel per cut: Fair, Good, Very Good, Premium, Ideal]
    What makes it difficult to compare the distributions?
    Brainstorm for 1 minute.
    qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)
6. Problems
    Each histogram is far away from the others, but we know stacking is hard to read → use another way of displaying densities.
    Varying relative abundance makes comparisons difficult → rescale to ensure constant area.
7.
    # Large distances make comparisons hard
    qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)

    # Stacked heights hard to compare
    qplot(price, data = diamonds, binwidth = 500, fill = cut)

    # Much better - but still have differing relative abundance
    qplot(price, data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut)

    # Instead of displaying count on y-axis, display density
    # .. indicates that variable isn't in original data
    qplot(price, ..density.., data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut)

    # To use with histogram, you need to be explicit
    qplot(price, ..density.., data = diamonds, binwidth = 500, geom = "histogram") + facet_wrap(~ cut)
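To see what rescaling to constant area means, here is a minimal base R sketch that does not depend on ggplot2. It assumes density is the count divided by (number of observations × bin width), so the bars of each group cover a total area of 1 however large the group is. The helper `bin_density()` and the simulated values are hypothetical and only for illustration.

```r
# Hypothetical data: two groups of very different sizes
set.seed(1)
few  <- rexp(50,   rate = 1 / 2000)
many <- rexp(5000, rate = 1 / 4000)

# Illustrative helper (not part of ggplot2): bin x, then rescale the counts
# so that the bars of the histogram have total area 1
bin_density <- function(x, binwidth) {
  breaks <- seq(0, ceiling(max(x) / binwidth) * binwidth, by = binwidth)
  counts <- as.numeric(table(cut(x, breaks, include.lowest = TRUE)))
  counts / (length(x) * binwidth)
}

binwidth <- 500
d_few  <- bin_density(few, binwidth)
d_many <- bin_density(many, binwidth)

# Group sizes differ by a factor of 100, but both areas equal 1
sum(d_few * binwidth)
sum(d_many * binwidth)
```

Because both curves enclose the same area, their shapes can be compared directly even though one group is far more abundant.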
8. Your turn
    Use this technique to explore the relationship between price and clarity, and carat and clarity.
9. 2d extensions
10.
    Idea               ggplot
    Small points       shape = I(".")
    Transparency       alpha = I(1/50)
    Jittering          geom = "jitter"
    Smooth curve       geom = "smooth"
    2d bins            geom = "bin2d" or geom = "hex"
    Density contours   geom = "density2d"
11.
    # There are two ways to add additional geoms
    # 1) A vector of geom names:
    qplot(price, carat, data = diamonds, geom = c("point", "smooth"))

    # 2) Add on extra geoms
    qplot(price, carat, data = diamonds) + geom_smooth()

    # This is how you get help about a specific geom:
    ?geom_smooth
    # or go to http://had.co.nz/ggplot2/geom_smooth.html
12.
    # To set aesthetics to a particular value, you need
    # to wrap that value in I()
    # Without I(), "blue" is treated as a variable to map, not as a colour
    qplot(price, carat, data = diamonds, colour = "blue")
    # With I(), the points are drawn in blue
    qplot(price, carat, data = diamonds, colour = I("blue"))

    # Practical application: varying alpha
    qplot(price, carat, data = diamonds, alpha = I(1/10))
    qplot(price, carat, data = diamonds, alpha = I(1/50))
    qplot(price, carat, data = diamonds, alpha = I(1/100))
    qplot(price, carat, data = diamonds, alpha = I(1/250))
13. Your turn
    Explore the relationship between carat, price and clarity, using these techniques.
    Which did you find most useful?
14. Subsetting
15. Motivation
    Look at histograms and scatterplots of x, y, z from the diamonds dataset.
    Which values are clearly incorrect?
    Which values might we be able to correct?
    (Remember measurements are in millimetres, 1 inch ≈ 25 mm)
16. Plots
    qplot(x, data = diamonds, binwidth = 0.1)
    qplot(y, data = diamonds, binwidth = 0.1)
    qplot(z, data = diamonds, binwidth = 0.1)

    qplot(x, y, data = diamonds)
    qplot(x, z, data = diamonds)
    qplot(y, z, data = diamonds)
17. Modifying data
    To modify data, you must first know how to extract, or subset, it.
    Many different methods are available in R. We'll start with the most explicit, then learn some shortcuts next time.
    Basic structure:
    df$varname
    df[row index, column index]
18. $
    Remember str(diamonds)?
    That hints at how to extract individual variables:
    diamonds$carat
    diamonds$price
19.
    blank       include all
    integer     +ve: include; -ve: exclude
    logical     include TRUEs
    character   lookup by name
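The four index types in the table above can be tried on a small data frame. This is a minimal sketch; the toy data frame `df` is hypothetical and merely stands in for diamonds.

```r
# Hypothetical toy data frame standing in for diamonds
df <- data.frame(
  carat = c(0.2, 0.5, 1.1, 1.5),
  price = c(300, 900, 5200, 9800),
  cut   = c("Fair", "Good", "Ideal", "Good")
)

all_rows  <- df[, ]                             # blank: include everything
first_two <- df[1:2, ]                          # positive integers: include rows 1 and 2
drop_one  <- df[-1, ]                           # negative integers: exclude row 1
over_one  <- df[c(FALSE, FALSE, TRUE, TRUE), ]  # logical: include the TRUE rows
by_name   <- df[, c("carat", "price")]          # character: look up columns by name
```

The same index types work in either position, so rows and columns can be selected in one call, for example `df[1:2, c("carat", "price")]`.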
20. Integer subsetting
21.
    # Nothing
    str(diamonds[, ])

    # Positive integers & nothing
    diamonds[1:6, ] # same as head(diamonds)
    diamonds[, 1:4] # watch out!

    # Two positive integers in rows & columns
    diamonds[1:10, 1:4]

    # Repeating input repeats output
    diamonds[c(1,1,1,2,2), 1:4]

    # Negative integers drop values
    diamonds[-(1:53900), -1]
22.
    # Useful technique: Order by one or more columns
    diamonds <- diamonds[order(diamonds$price), ]

    # Useful technique: Combine two tables
    carats <- data.frame(table(carat = diamonds$carat))
    mtch <- match(diamonds$carat, carats$carat)
    diamonds$carat_count <- carats$Freq[mtch]
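Both techniques are integer subsetting in disguise: order() and match() each return a vector of positions that is then used as an index. A minimal sketch on a hypothetical toy data frame (the names `df`, `size` and `size_count` are made up for illustration):

```r
# Hypothetical toy data standing in for diamonds
df <- data.frame(id = 1:6, size = c(0.3, 0.5, 0.3, 0.7, 0.5, 0.3))

# order() returns the row positions that would sort the column
ord <- order(df$size)   # 1 3 6 2 5 4
df <- df[ord, ]

# Build a lookup table of counts per size, then find, for each row of df,
# the matching row of the lookup table
counts <- data.frame(table(size = df$size))
idx <- match(df$size, counts$size)
df$size_count <- counts$Freq[idx]

df$size_count           # 3 3 3 2 2 1
```

Indexing `counts$Freq` with `idx` repeats each count once per matching row, which is the "repeating input repeats output" rule from integer subsetting.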
23. Logical subsetting
24.
    # The most complicated to understand, but
    # the most powerful. Lets you extract a
    # subset defined by some characteristic of
    # the data
    x_big <- diamonds$x > 10

    head(x_big)
    sum(x_big)
    mean(x_big)
    table(x_big)

    diamonds$x[x_big]
    diamonds[x_big, ]
25.
    small <- diamonds[diamonds$carat < 1, ]
    lowqual <- diamonds[diamonds$clarity %in% c("I1", "SI2", "SI1"), ]

    # Comparison functions:
    # < > <= >= != == %in%

    # Boolean operators: & | !
    small <- diamonds$carat < 1 &
      diamonds$price > 500
    lowqual <- diamonds$color == "D" |
      diamonds$cut == "Fair"

    [Figure: Venn diagrams of two sets a and b, illustrating a | b, a & b, a & !b and xor(a, b)]
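The Boolean operators work element by element, so their behaviour can be checked on two short logical vectors covering every combination of TRUE and FALSE. A minimal sketch (the vectors `a`, `b` and `clarity` are toy values, not taken from diamonds):

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

a | b      # TRUE  TRUE  TRUE  FALSE
a & b      # TRUE  FALSE FALSE FALSE
a & !b     # FALSE TRUE  FALSE FALSE
xor(a, b)  # FALSE TRUE  TRUE  FALSE

# %in% tests membership in a set of values
clarity <- c("I1", "VS1", "SI2", "IF")
clarity %in% c("I1", "SI2", "SI1")  # TRUE FALSE TRUE FALSE
```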
26. Useful functions for logical vectors
    table(zeros)
    sum(zeros)
    mean(zeros)
    TRUE = 1; FALSE = 0
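Because TRUE counts as 1 and FALSE as 0, sum() gives the number of TRUEs and mean() gives their proportion. A minimal sketch with hypothetical measurements:

```r
# Hypothetical measurements; two of the five are recorded as zero
x <- c(3.9, 0, 4.3, 0, 5.1)
zeros <- x == 0

table(zeros)  # counts of FALSE and TRUE
sum(zeros)    # 2: number of TRUEs
mean(zeros)   # 0.4: proportion of TRUEs
```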
27. Your turn
    Select the diamonds that:
    Have equal x and y dimensions.
    Have depth between 55 and 70.
    Have carat smaller than the mean.
    Cost more than $10,000 per carat.
    Are of good quality or better.
28. Saving results
    # Prints to screen
    diamonds[diamonds$x > 10, ]

    # Saves to new data frame
    big <- diamonds[diamonds$x > 10, ]

    # Overwrites existing data frame. Dangerous!
    diamonds <- diamonds[diamonds$x < 10, ]
29.
    diamonds <- diamonds[1, 1]
    diamonds

    # Uh oh!

    rm(diamonds)
    str(diamonds)

    # Phew!
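The recovery works because the assignment creates a copy in your workspace that masks the dataset supplied by the package; removing the copy makes the packaged version visible again. A minimal sketch of the same mechanism, using the built-in `mtcars` dataset as a stand-in (an assumption made here so the example runs without ggplot2):

```r
mtcars <- mtcars[1, 1]   # overwrite: mtcars is now a single number
mtcars                   # 21

rm(mtcars)               # remove the workspace copy
str(mtcars)              # the original 32-row data frame is visible again
```

This only rescues data that lives in a package. A data frame you created or read in yourself is gone once it has been overwritten.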
30. Your turn
    Create a logical vector that selects diamonds with equal x & y. Create a new dataset that only contains these values.
    Create a logical vector that selects diamonds with incorrect/unusual x, y, or z values. Create a new dataset that omits these values. (Hint: do this one variable at a time)
31.
    equal_dim <- diamonds$x == diamonds$y
    equal <- diamonds[equal_dim, ]

    y_big <- diamonds$y > 10
    z_big <- diamonds$z > 6

    x_zero <- diamonds$x == 0
    y_zero <- diamonds$y == 0
    z_zero <- diamonds$z == 0
    zeros <- x_zero | y_zero | z_zero

    bad <- y_big | z_big | zeros
    good <- diamonds[!bad, ]