# 02 large


1. Stat405: Graphics for large data. Hadley Wickham. Thursday, 26 August 2010
2. Majoring in Stat: declare early (even if you’re not sure); weekly lunches; summer opportunities (research & internships).
3. (1) Leftovers from last lecture. (2) The diamonds data. (3) Histograms and bar charts. (4) More boxplots and scatterplots. (5) Homework.
4. Remember: start with `library(ggplot2)`. `qplot(reorder(class, hwy), hwy, data = mpg, geom = "jitter")` [Figure: jittered points of hwy against reorder(class, hwy); classes in the order pickup, suv, minivan, 2seater, midsize, subcompact, compact.]
5. `qplot(reorder(class, hwy), hwy, data = mpg, geom = "boxplot")` [Figure: boxplots of hwy for each class, same ordering.]
6. `qplot(reorder(class, hwy), hwy, data = mpg, geom = c("jitter", "boxplot"))` [Figure: jittered points of hwy for each class with boxplots drawn over them.]
7. Your turn: Read the help for `reorder`. Redraw the previous plots with class ordered by median hwy. How would you put the jittered points on top of the boxplots?
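
One possible answer to the exercise above, sketched with the `ggplot()` interface rather than `qplot()` (which recent ggplot2 releases mark as deprecated). The names `mpg2` and `class_by_median` and the jitter width are illustrative choices, not taken from the slides.

```r
library(ggplot2)

# Work on a copy so the packaged mpg data is left untouched
mpg2 <- as.data.frame(mpg)

# reorder() sorts factor levels by a summary of a second variable;
# the default summary is the mean, so pass FUN = median explicitly
mpg2$class_by_median <- reorder(mpg2$class, mpg2$hwy, FUN = median)

# Layers are drawn in the order they are added, so adding the
# jitter layer after the boxplot layer puts the points on top
p <- ggplot(mpg2, aes(class_by_median, hwy)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, height = 0)

# Type p at the console to draw the plot
```

With `qplot()` the same ordering idea applies to the `geom` vector: `geom = c("boxplot", "jitter")` draws the boxplots first and the points second.
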
8. Diamonds
9. Diamonds data: ~54,000 round diamonds from http://www.diamondse.info/. Variables: carat, colour, clarity, cut; total depth, table, depth, width, height; price.
10. [Diagram of a diamond labelled x, table width, and z.] depth = z / diameter; table = table width / x * 100.
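
As a rough check of the depth formula, the sketch below recomputes it from the measurements in the `diamonds` data shipped with ggplot2. It assumes that "diameter" means the average of `x` and `y` and that the stored `depth` is a percentage, which is how the ggplot2 help page for `diamonds` describes the variable. The names `d`, `depth_check`, and `ok` are made up for illustration.

```r
library(ggplot2)

d <- as.data.frame(diamonds)

# Assumed reading of the slide: diameter = mean of x and y, result in percent
depth_check <- 100 * d$z / ((d$x + d$y) / 2)

# Some rows record zero dimensions, which can give non-finite values; skip them
ok <- is.finite(depth_check)

# Typical absolute gap between the recomputed and the stored depth
typical_gap <- median(abs(depth_check[ok] - d$depth[ok]))
typical_gap
```

A small typical gap suggests the assumed reading matches how the column was computed; large gaps on individual rows usually point to data entry problems worth inspecting.
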
11. Recall: Write down five ways to inspect the diamonds dataset. You have one minute!
12. Your turn: Inspect the data and familiarise yourself with the variables. If you don’t know what they mean, look them up on Wikipedia.
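
The slides do not list the five ways, so the following is only one plausible set, using standard base R functions on the `diamonds` data from ggplot2.

```r
library(ggplot2)

dim(diamonds)      # number of rows and columns
names(diamonds)    # variable names
head(diamonds)     # first six rows
str(diamonds)      # type and a preview of each variable
summary(diamonds)  # per-variable summaries
```

`?diamonds` opens the help page, which describes what each variable measures.
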
13. Histograms & bar charts
14. Histograms and bar charts: used to display the distribution of a variable. Categorical variable → bar chart. Continuous variable → histogram.
15. Always experiment with the bin width!
16. Examples. With only one variable, qplot guesses that you want a bar chart or histogram:
    `qplot(cut, data = diamonds)`
    `qplot(carat, data = diamonds)`
    `qplot(carat, data = diamonds, binwidth = 1)`
    `qplot(carat, data = diamonds, binwidth = 0.1)`
    `qplot(carat, data = diamonds, binwidth = 0.01)`
    `resolution(diamonds$carat)`
    `last_plot() + xlim(0, 3)`
17. Examples (same code as slide 16, with an annotation on the final line): common ggplot2 technique: adding together plot components.
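
The bin width experiments above can also be written with `ggplot()` and `geom_histogram()`. This is a sketch of the same idea rather than code from the slides; the object names are illustrative.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(carat))

# Same data, three bin widths: coarse, medium, and fine
h_coarse <- base + geom_histogram(binwidth = 1)
h_medium <- base + geom_histogram(binwidth = 0.1)
h_fine   <- base + geom_histogram(binwidth = 0.01)

# resolution() reports the smallest non-zero gap between adjacent values,
# a sensible lower bound for the bin width
res <- resolution(diamonds$carat)

# Plot components are combined with +, here restricting the x axis
h_zoom <- h_fine + xlim(0, 3)
```

Printing `h_coarse`, `h_medium`, and `h_fine` in turn shows how much of the structure in carat only appears at the finer bin widths.
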
18. `qplot(table, data = diamonds, binwidth = 1)`
    To zoom in on a plot region use `xlim()` and `ylim()`:
    `qplot(table, data = diamonds, binwidth = 1) + xlim(50, 70)`
    `qplot(table, data = diamonds, binwidth = 0.1) + xlim(50, 70)`
    `qplot(table, data = diamonds, binwidth = 0.1) + xlim(50, 70) + ylim(0, 50)`
    Note that this type of zooming discards data outside of the plot regions. See `coord_cartesian()` for an alternative.
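
The difference between the two kinds of zooming can be sketched as follows. The object names are illustrative, and `layer_data()` is used here only to inspect what the histogram layer computed.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(table)) + geom_histogram(binwidth = 1)

# Scale limits: values outside 50-70 are dropped before binning
p_limits <- p + xlim(50, 70)

# Coordinate limits: all data are binned, then the view is cropped
p_zoom <- p + coord_cartesian(xlim = c(50, 70))

# Total count across bins under each approach
# (the scale-limit version warns about the rows it removes)
n_limits <- sum(suppressWarnings(layer_data(p_limits, 1))$count, na.rm = TRUE)
n_zoom   <- sum(layer_data(p_zoom, 1)$count, na.rm = TRUE)

c(limits = n_limits, zoom = n_zoom, rows = nrow(diamonds))
```

For a histogram the two approaches usually look similar, but for statistics that summarise the whole sample, such as smooths or boxplots, dropping data first can change the result.
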
19. Additional variables: As with scatterplots, we can use aesthetics or faceting. Using aesthetics creates pretty, but ineffective, plots. The following examples show the difference when investigating the relationship between cut and depth.
20. `qplot(depth, data = diamonds, binwidth = 0.2)` [Figure: histogram of depth (x axis, 56 to 70) against count (y axis, 0 to 4000).]
21. `qplot(depth, data = diamonds, binwidth = 0.2, fill = cut) + xlim(55, 70)` [Figure: the same histogram with bars filled by cut: Fair, Good, Very Good, Premium, Ideal.]
22. Same plot as slide 21, annotated: fill is the aesthetic for fill colour.
23. `qplot(depth, data = diamonds, binwidth = 0.2) + xlim(55, 70) + facet_wrap(~ cut)` [Figure: one histogram of depth per cut (Fair, Good, Very Good, Premium, Ideal), count from 0 to 2500.]
24. Your turn: Explore the distribution of price. How does it vary with colour, cut, or clarity? Practice zooming in on regions of interest.
25. Box and whisker plots
26. Boxplots: Less information than a histogram, but they take up much less space. We have already seen them used with discrete x values. They can also be used with continuous x values, by specifying how we want the data grouped.
27. `qplot(table, price, data = diamonds)`
28. `qplot(table, price, data = diamonds, geom = "boxplot")` [Figure: boxplot of price (y axis) against table (x axis, 50 to 90).]
29. `qplot(table, price, data = diamonds, geom = "boxplot", group = round(table))` [Figure: boxplots of price for each rounded value of table.]
30. Same plot as slide 29, annotated at `group`: one boxplot for each unique value of this aesthetic.
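
The grouping idea above, sketched with `ggplot()`. The second plot uses `cut_width()`, a ggplot2 helper for binning a continuous variable; it does not appear in the slides and is shown only as one alternative to `round()`. Object names are illustrative.

```r
library(ggplot2)

# One boxplot per rounded value of table
p_round <- ggplot(diamonds, aes(table, price, group = round(table))) +
  geom_boxplot()

# cut_width() bins a continuous variable, here into intervals
# two units wide, giving fewer and wider boxes
p_width <- ggplot(diamonds, aes(table, price, group = cut_width(table, 2))) +
  geom_boxplot()

# Number of groups the first plot will draw
n_groups <- length(unique(round(diamonds$table)))
n_groups
```

Any expression that maps rows to a discrete set of values can serve as the `group` aesthetic, so the choice of rounding or bin width is a judgment call much like the bin width of a histogram.
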
31. Scatterplots
32. Interpreting a scatterplot: global patterns; local patterns; deviations.
33. [No text on slide.]
34. Strong linear relationship. A number of outliers.
35. [No text on slide.]
36. Unusual striations. Two groups? Little relationship between table and price?
37. [No text on slide.]
38. Curved (exponential?) relationship. Outliers mostly cheaper than expected.
39. But what’s the problem with all these plots? `qplot(carat, price, data = diamonds)`
40. But what’s the problem with all these plots? In pairs, brainstorm solutions for 2 minutes. `qplot(carat, price, data = diamonds)`
41. Ideas and the corresponding ggplot arguments:

    | Idea | ggplot |
    | --- | --- |
    | Small points | `shape = I(".")` |
    | Transparency | `alpha = I(1/50)` |
    | Jittering | `geom = "jitter"` |
    | Smooth curve | `geom = "smooth"` |
    | 2d bins | `geom = "bin2d"` or `geom = "hex"` |
    | Density contours | `geom = "density2d"` |
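
The ideas in the table above can be sketched with the `ggplot()` interface as follows. This is an illustration, not code from the slides: in `qplot()` the `I()` wrapper sets a fixed value, whereas in a `ggplot()` layer the same effect comes from passing the value outside `aes()`. Object names are made up.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(carat, price))

p_small   <- base + geom_point(shape = ".")     # small points
p_alpha   <- base + geom_point(alpha = 1 / 50)  # transparency
p_jitter  <- base + geom_jitter()               # jittering
p_smooth  <- base + geom_smooth()               # smooth curve
p_bin2d   <- base + geom_bin2d()                # rectangular 2d bins
p_density <- base + geom_density_2d()           # density contours

# geom_hex() gives hexagonal bins but needs the hexbin package installed,
# so it is left out of this sketch
```

These options can be combined, for example `p_alpha + geom_smooth()`, since each is just another layer added to the same base plot.
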
42. Your turn: Practice doing these plots yourself. Read the online documentation for each plot type: http://had.co.nz/ggplot2
43. Homework: Practice your graphics/data exploration skills with the diamonds or mpg data. Due in one week. Make sure to read the grading rubric, and find a colour printer.
44. Asking questions: You have two minutes to write down as many questions as you can come up with that you might want to answer about the diamonds data. Write your best question on a piece of paper and turn it in.