02 large
Transcript

  • 1. Stat405 Graphics for large data Hadley Wickham Thursday, 26 August 2010
  • 2. Majoring in Stat
      • Declare early (even if you’re not sure)
      • Weekly lunches
      • Summer opportunities (research & internships)
  • 3.
      1. Leftovers from last lecture
      2. The diamonds data
      3. Histograms and bar charts
      4. More boxplots and scatterplots
      5. Homework
  • 4. # Remember: start with
      library(ggplot2)
      qplot(reorder(class, hwy), hwy, data = mpg, geom = "jitter")
      [Plot: jittered points of hwy (about 15 to 40) against reorder(class, hwy); classes in the order pickup, suv, minivan, 2seater, midsize, subcompact, compact]
  • 5. qplot(reorder(class, hwy), hwy, data = mpg, geom = "boxplot")
      [Plot: boxplots of hwy against reorder(class, hwy), same class order]
  • 6. qplot(reorder(class, hwy), hwy, data = mpg, geom = c("jitter", "boxplot"))
      [Plot: jittered points and boxplots combined, same axes]
  • 7. Your turn: Read the help for reorder. Redraw the previous plots with class ordered by median hwy. How would you put the jittered points on top of the boxplots?
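One possible answer to the exercise on slide 7, offered as a sketch rather than the lecture's own solution. `reorder()` accepts a `FUN` argument (the default is `mean`), so passing `median` changes the ordering statistic. The sketch uses the `ggplot()` form instead of `qplot()` because layers are added explicitly there and are drawn in the order they are added. The `outlier.shape = NA`, `width` and `height` settings are illustrative choices, not something the slides specify.

```r
library(ggplot2)

# Order vehicle classes by median (not mean) highway mileage.
p <- ggplot(mpg, aes(reorder(class, hwy, FUN = median), hwy)) +
  # Boxplots first; outliers hidden because the jitter layer shows every point.
  geom_boxplot(outlier.shape = NA) +
  # Added second, so the points are drawn on top of the boxes.
  geom_jitter(width = 0.2, height = 0) +
  xlab("class (ordered by median hwy)")

p
```

Swapping the two `geom_` lines puts the boxplots back on top of the points.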
  • 8. Diamonds
  • 9. Diamonds data
      ~54,000 round diamonds from http://www.diamondse.info/
      Carat, colour, clarity, cut
      Total depth, table, depth, width, height
      Price
  • 10. [Diagram of a diamond with x, table width and z labelled]
      depth = z / diameter
      table = table width / x * 100
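The two formulas on slide 10 can be tried on made-up measurements. This is a minimal sketch under stated assumptions: `depth_pct()` and `table_pct()` are hypothetical helper names, "diameter" is taken to be the mean of x and y, and depth is scaled by 100 so that both results read as percentages (the slide shows the factor of 100 only for table).

```r
# Hypothetical helpers, not part of ggplot2.
# Assumption: diameter = mean of x and y; both results are percentages.
depth_pct <- function(x, y, z) 100 * z / ((x + y) / 2)
table_pct <- function(table_width, x) 100 * table_width / x

# Made-up measurements in millimetres, for illustration only.
depth_pct(x = 4, y = 4, z = 2.4)       # depth relative to diameter
table_pct(table_width = 2.3, x = 4)    # table width relative to x
```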
  • 11. Recall: Write down five ways to inspect the diamonds dataset. You have one minute!
  • 12. Your turn: Inspect the data and familiarise yourself with the variables. If you don’t know what they mean, look them up on Wikipedia.
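Five base R calls that could serve as an answer to the recall question on slide 11. This is one reasonable set, not necessarily the five the lecturer had in mind. Note that the dataset spells the colour column `color`.

```r
library(ggplot2)    # provides the diamonds data

head(diamonds)      # first six rows
str(diamonds)       # column types and a preview of the values
summary(diamonds)   # per-column summaries
dim(diamonds)       # number of rows and columns
names(diamonds)     # column names
```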
  • 13. Histogram & bar charts
  • 14. Histograms and bar charts
      Used to display the distribution of a variable
      Categorical variable → bar chart
      Continuous variable → histogram
  • 15. Always experiment with the bin width!
  • 16. Examples
      # With only one variable, qplot guesses that
      # you want a bar chart or histogram
      qplot(cut, data = diamonds)
      qplot(carat, data = diamonds)
      qplot(carat, data = diamonds, binwidth = 1)
      qplot(carat, data = diamonds, binwidth = 0.1)
      qplot(carat, data = diamonds, binwidth = 0.01)
      resolution(diamonds$carat)
      last_plot() + xlim(0, 3)
  • 17. Examples (same code as slide 16, with an added callout): Common ggplot2 technique: adding together plot components
  • 18. qplot(table, data = diamonds, binwidth = 1)
      # To zoom in on a plot region use xlim() and ylim()
      qplot(table, data = diamonds, binwidth = 1) + xlim(50, 70)
      qplot(table, data = diamonds, binwidth = 0.1) + xlim(50, 70)
      qplot(table, data = diamonds, binwidth = 0.1) + xlim(50, 70) + ylim(0, 50)
      # Note that this type of zooming discards data outside of the plot regions
      # See coord_cartesian() for an alternative
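The closing comment on slide 18 can be made concrete with a short sketch. It uses the `ggplot()` form, and the object names (`base`, `p_limits`, `p_zoom`, `n_outside`) are illustrative only.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(table)) + geom_histogram(binwidth = 1)

# xlim() sets scale limits: rows outside 50-70 are dropped (with a warning)
# before the histogram is computed.
p_limits <- base + xlim(50, 70)

# coord_cartesian() only changes the visible window, so every row still
# contributes to the bins.
p_zoom <- base + coord_cartesian(xlim = c(50, 70))

# Number of rows that xlim(50, 70) would discard.
n_outside <- sum(diamonds$table < 50 | diamonds$table > 70)
n_outside
```

For a histogram the distinction matters little, since bins inside the window keep the same counts. It matters more for statistics fitted to all the data, such as smooths or boxplots.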
  • 19. Additional variables: As with scatterplots, you can use aesthetics or faceting. Using aesthetics creates pretty, but ineffective, plots. The following examples show the difference when investigating the relationship between cut and depth.
  • 20. qplot(depth, data = diamonds, binwidth = 0.2)
      [Plot: histogram of count (0 to 4000) against depth (56 to 70)]
  • 21. qplot(depth, data = diamonds, binwidth = 0.2, fill = cut) + xlim(55, 70)
      [Plot: histogram of count against depth, bars filled by cut: Fair, Good, Very Good, Premium, Ideal]
  • 22. Same plot and code as slide 21, with an added callout: Fill is the aesthetic for fill colour
  • 23. qplot(depth, data = diamonds, binwidth = 0.2) + xlim(55, 70) + facet_wrap(~ cut)
      [Plot: one histogram panel of count against depth per cut: Fair, Good, Very Good, Premium, Ideal]
  • 24. Your turn: Explore the distribution of price. How does it vary with colour, cut, and clarity? Practice zooming in on regions of interest.
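A possible starting point for the exercise on slide 24, as a sketch only. The bin widths, the `scales = "free_y"` setting and the 0 to 3000 window are arbitrary choices to experiment with, and the colour column is spelled `color` in the dataset.

```r
library(ggplot2)

# Overall distribution of price; the bin width is in the units of price.
p_all <- ggplot(diamonds, aes(price)) + geom_histogram(binwidth = 500)

# One panel per cut; free y scales keep the small groups readable.
p_cut <- p_all + facet_wrap(~ cut, scales = "free_y")

# Zoom in on the cheaper diamonds, one panel per colour, without
# discarding any data.
p_zoom <- ggplot(diamonds, aes(price)) +
  geom_histogram(binwidth = 50) +
  facet_wrap(~ color) +
  coord_cartesian(xlim = c(0, 3000))
```

Replacing `~ color` with `~ clarity` covers the third variable named in the exercise.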
  • 25. Box and whisker plots
  • 26. Boxplots
      Less information than a histogram, but take up much less space.
      Already seen them used with discrete x values.
      Can also use with continuous x values, by specifying how we want the data grouped.
  • 27. qplot(table, price, data = diamonds)
  • 28. qplot(table, price, data = diamonds, geom = "boxplot")
      [Plot: boxplot of price (up to about 15000) against table (50 to 90)]
  • 29. qplot(table, price, data = diamonds, geom = "boxplot", group = round(table))
      [Plot: boxplots of price against table (50 to 90), grouped by round(table)]
  • 30. Same plot and code as slide 29, with an added callout: One boxplot for each unique value of this aesthetic
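The grouping idea from slides 29 and 30 can be sketched with `ggplot()` as follows. The variable names `groups` and `n_boxes` are illustrative, and `n_boxes` simply counts the distinct values that the `group` aesthetic sees.

```r
library(ggplot2)

# Each distinct value of round(table) becomes its own boxplot.
groups  <- round(diamonds$table)
n_boxes <- length(unique(groups))
n_boxes

p <- ggplot(diamonds, aes(table, price, group = round(table))) +
  geom_boxplot()
```

Rounding to a coarser unit, for example `group = round(table / 5) * 5`, would give fewer, wider boxes.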
  • 31. Scatterplots
  • 32. Interpreting a scatterplot
      • Global patterns
      • Local patterns
      • Deviations
  • 33. [Scatterplot, image not transcribed]
  • 34. Strong linear relationship. A number of outliers.
  • 35. [Scatterplot, image not transcribed]
  • 36. Unusual striations. Two groups? Little relationship between table and price?
  • 37. [Scatterplot, image not transcribed]
  • 38. Curved (exponential?) relationship. Outliers mostly cheaper than expected.
  • 39. But what’s the problem with all these plots?
      qplot(carat, price, data = diamonds)
  • 40. But what’s the problem with all these plots? In pairs, brainstorm solutions for 2 minutes.
      qplot(carat, price, data = diamonds)
  • 41. Idea               ggplot
      Small points       shape = I(".")
      Transparency       alpha = I(1/50)
      Jittering          geom = "jitter"
      Smooth curve       geom = "smooth"
      2d bins            geom = "bin2d" or geom = "hex"
      Density contours   geom = "density2d"
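The ideas in the table on slide 41 might look like this when written with `ggplot()` layers. This is a sketch, not the lecture's code, and the `p_*` names are illustrative. In the `qplot()` form the `I()` wrapper marks a value as a fixed setting rather than a variable to map; in the layer form the same effect comes from passing the value outside `aes()`.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(carat, price))

p_small   <- base + geom_point(shape = ".")      # small points
p_alpha   <- base + geom_point(alpha = 1 / 50)   # transparency
p_jitter  <- base + geom_jitter()                # jittering
p_smooth  <- base + geom_smooth()                # smooth curve
p_bin2d   <- base + geom_bin2d()                 # rectangular 2d bins
p_density <- base + geom_density_2d()            # density contours

# geom_hex() gives hexagonal bins but needs the hexbin package installed.
```

Printing any of these objects, for example `p_alpha`, draws the plot. The ideas also combine, such as small points together with transparency.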
  • 42. Your turn: Practice doing these plots yourself. Read the online documentation for each plot type: http://had.co.nz/ggplot2
  • 43. Homework: Practice your graphics/data exploration skills with the diamonds or mpg data. Due in one week. Make sure to read the grading rubric, and find a colour printer.
  • 44. Asking questions: You have two minutes to write down as many questions as you can come up with that you might want to answer about the diamonds data. Write your best question on a piece of paper and turn it in.
