02 Large
Upcoming SlideShare
Loading in...5
×
 

02 Large

on

  • 1,795 views

 

Statistics

Views

Total Views
1,795
Views on SlideShare
1,793
Embed Views
2

Actions

Likes
0
Downloads
43
Comments
0

1 Embed 2

http://www.slideshare.net 2

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

02 Large 02 Large Presentation Transcript

  • Stat405 Graphics for large data Hadley Wickham Wednesday, 26 August 2009
  • Quick notes Registration now open. Will get class access once everyone is signed up. Please stack up the chairs at the end of class. Books are recommended, not required. Class mailing list: http:// groups.google.com/group/stat405 Wednesday, 26 August 2009
  • 1. Leftovers from last lecture 2. Introduction to the diamonds data 3. Histograms and bar charts 4. More boxplots 5. Scatterplots for large data 6. Homework Wednesday, 26 August 2009
  • What’s the ● ● 40 problem with ● this plot? ● ● ● 35 ● ● ● ● ● ● ● ● ● ● ● 30 ● ● ● ● ● ● ● ● ● hwy ● ● ● ● ● ● ● ● ● ● ● ● ● ● 25 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 15 ● ● ● 10 15 20 25 30 35 qplot(cty, hwy, data = mpg) cty Wednesday, 26 August 2009
  • ● ● ● 40 ● ● ● ●● 35 ● ● ● ● ● ● ● ● ● ●● ● ● ● ● 30 ● ● ● ● ●● ●● ● ● ●● ● ●● ● ● ●● ●● ● ● ●● ● hwy ● ●● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● 25 ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● 20 ●● ● ● ●● ● ● ● ● ●● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ●●●●●● ● ● ●● ● ● ● ● ●●● ● ● ● ●●● ● ● 15 ●● ●● ●● ● ● ● ● ● ●● ●● 10 15 20 25 30 35 qplot(cty, hwy, data = mpg, geom = "jitter") cty Wednesday, 26 August 2009
  • ● ● ● 40 ● ● ● ●● 35 ● ● ● ● ● ● ● ● ● ●● ● ● ● ● 30 ● ● ● ● ●● ●● ● ● ●● ● ●● ● ● ●● ●● ● ● ●● ● hwy ● ●● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● 25 ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● 20 ●● ● ● ●● ● ● ● ● ●● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ●●●●●● ● ● ●● ● ● ● ● ●●● ● ● ● ●●● ● ● 15 ●● ●● ●● ● ● ● ● ● ●● geom controls ●● “type” of plot 10 15 20 25 30 35 qplot(cty, hwy, data = mpg, geom = "jitter") cty Wednesday, 26 August 2009
  • ● ● ● 40 ● ● 35 ● ● ● ● ● ● ● ● 30 ● ● ● ● ● hwy ● ● ● ● ● ● ● ● ● ● ● ● 25 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20 ● ● ● ● ● ● ● ● ● ● ● ● 15 ● ● ● ● ● 2seater compact midsize minivan pickup subcompact suv qplot(class, hwy, data = mpg) class Wednesday, 26 August 2009
  • How could ● ● we improve ● 40 this plot? ● ● 35 ● ● Brainstorm ● ● ● ● ● ● 30 for 1 minute. ● ● ● ● ● hwy ● ● ● ● ● ● ● ● ● ● ● ● 25 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20 ● ● ● ● ● ● ● ● ● ● ● ● 15 ● ● ● ● ● 2seater compact midsize minivan pickup subcompact suv qplot(class, hwy, data = mpg) class Wednesday, 26 August 2009
  • ● ● ● 40 ● ● 35 ● ● ● ● ● ● ● ● 30 ● ● ● ● ● hwy ● ● ● ● ● ● ● ● ● ● ● ● 25 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20 ● ● ● ● ● ● ● ● ● ● ● ● 15 ● ● ● ● ● pickup suv minivan 2seater midsize subcompact compact reorder(class, hwy) Wednesday, 26 August 2009
  • ● ● ● 40 ● ● 35 ● ● ● ● ● ● ● ● 30 ● ● ● ● ● hwy ● ● ● ● ● ● ● ● ● ● ● ● 25 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20 ● ● ● ● ● ● ● ● ● ● ● ● 15 ● ● Incredibly useful ● technique! ● ● pickup suv minivan 2seater midsize subcompact compact qplot(reorder(class, hwy), hwy, data = mpg) reorder(class, hwy) Wednesday, 26 August 2009
  • ● ● ● 40 ● ● ● ● ● 35 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 30 ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● hwy ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● 25 ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20 ● ● ● ● ● ●● ● ● ● ●● ●● ●● ● ● ●● ● ● ● ● ● ●● ● ●●● ● ● ● ● ● ● ● ●● ● ● ● ●●● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● 15 ● ● ● ● ● ● ● ● ● ● ● ● ● ● pickup suv minivan 2seater midsize subcompact compact qplot(reorder(class, hwy),reorder(class, hwy) = mpg, geom = "jitter") hwy, data Wednesday, 26 August 2009
  • ● ● ● 40 ● 35 ● 30 hwy ● ● 25 ● ● ● ● 20 ● 15 ● ● pickup suv minivan 2seater midsize subcompact compact qplot(reorder(class, hwy), hwy, data hwy)mpg, geom = "boxplot") reorder(class, = Wednesday, 26 August 2009
  • ● ● ● ● ● ● 40 ● ● ● ● 35 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● 30 ● ● ● ● ●● ●● ● ● ● ● ●● ● ●●● ●● ●● ● ●● ● ● ● ● ● hwy ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●● ● ● ●● ● ●● ● ●● ● ● ● ● ● ● ● ●● ● ● 25 ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20 ● ● ●● ● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ●●●●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 15 ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● qplot(reorder(class,minivan pickup suv hwy), 2seater data = subcompact hwy, midsize mpg, compact geom = c("jitter", "boxplot")) reorder(class, hwy) Wednesday, 26 August 2009
  • Your turn Read the help for reorder. Redraw the previously plots with class ordered by median hwy. How would you put the jittered points on top of the boxplots? Wednesday, 26 August 2009
  • Aside: coding strategy At the end of each interactive session, you want a summary of everything you did. Two options: 1. Save everything you did with savehistory() then remove the unimportant bits. 2. Build up the important bits as you go. (this is how I work) Wednesday, 26 August 2009
  • Diamonds data ~54,000 round diamonds from http://www.diamondse.info/ Carat, colour, clarity, cut Total depth, table, depth, width, height Price Wednesday, 26 August 2009
  • x table width z depth = z / diameter table = table width / x * 100 Wednesday, 26 August 2009
  • Recall Write down five ways to inspect the diamonds dataset. You have one minute! Wednesday, 26 August 2009
  • Your turn Inspect the data and familiarise yourself with the variables. If you don’t know what they mean, look them up on wikipedia. Wednesday, 26 August 2009
  • Histogram & bar charts Wednesday, 26 August 2009
  • Histograms and barcharts Used to display the distribution of a variable Categorical variable → bar chart Continuous variable → histogram Wednesday, 26 August 2009
  • Always experiment with the bin width! Wednesday, 26 August 2009
  • Examples # With only one variable, qplot guesses that # you want a bar chart or histogram qplot(cut, data = diamonds) qplot(carat, data = diamonds) qplot(carat, data = diamonds, binwidth = 1) qplot(carat, data = diamonds, binwidth = 0.1) qplot(carat, data = diamonds, binwidth = 0.01) resolution(diamonds$carat) last_plot() + xlim(0, 3) Wednesday, 26 August 2009
  • Examples # With only one variable, qplot guesses that # you want a bar chart or histogram qplot(cut, data = diamonds) qplot(carat, data = diamonds) qplot(carat, data = diamonds, binwidth = 1) Common ggplot2 qplot(carat, data = diamonds, binwidth = 0.1) technique: adding qplot(carat, data = diamonds, binwidth = 0.01) together plot resolution(diamonds$carat) components last_plot() + xlim(0, 3) Wednesday, 26 August 2009
  • qplot(table, data = diamonds, binwidth = 1) # To zoom in on a plot region use xlim() and ylim() qplot(table, data = diamonds, binwidth = 1) + xlim(50, 70) qplot(table, data = diamonds, binwidth = 0.1) + xlim(50, 70) qplot(table, data = diamonds, binwidth = 0.1) + xlim(50, 70) + ylim(0, 50) # Note that this type of zooming discards data outside of the plot regions # See coord_cartesian() for an alternative Wednesday, 26 August 2009
  • Additional variables As with scatterplots can use aesthetics or faceting. Using aesthetics creates pretty, but ineffective, plots. The following examples show the difference, when investigation the relationship between cut and depth. Wednesday, 26 August 2009
  • 4000 3000 count 2000 1000 0 56 58 60 62 64 66 68 70 qplot(depth, data = diamonds, binwidth = 0.2) depth Wednesday, 26 August 2009
  • 4000 3000 cut Fair Good count 2000 Very Good Premium Ideal 1000 0 qplot(depth, data = diamonds, binwidth = 0.2, 56 58 60 62 64 66 68 70 fill = cut) + xlim(55, 70) depth Wednesday, 26 August 2009
  • 4000 3000 cut Fair Good count 2000 Very Good Premium Ideal 1000 Fill is the aesthetic 0 for fill colour qplot(depth, data = diamonds, binwidth = 0.2, 56 58 60 62 64 66 68 70 fill = cut) + xlim(55, 70) depth Wednesday, 26 August 2009
  • Fair Good Very Good 2500 2000 1500 1000 500 0 count Premium Ideal 2500 2000 1500 1000 500 0 qplot(depth, 62 64 66= 68 70 56 58 60 binwidth = 0.2) + 56 58 60 data diamonds, 62 64 66 68 70 56 58 60 62 64 66 68 70 xlim(55, 70) + facet_wrap(~depth cut) Wednesday, 26 August 2009
  • Your turn Explore the distribution of price. How does it vary with colour, or cut, and clarity? Practice zooming in on regions of interest. Wednesday, 26 August 2009
  • Weighting qplot(cut, data = diamonds, weight = carat) qplot(cut, data = diamonds, weight = price) # Also useful for pretabulated data cuts <- as.data.frame(table( cut = diamonds$cut)) qplot(cut, weight = Freq, data = cuts) Wednesday, 26 August 2009
  • Box and whisker plots Wednesday, 26 August 2009
  • Boxplots Less information than a histogram, but take up much less space. Already seen them used with discrete x values. Can also use with continuous x values, by specifying how we want the data grouped. Wednesday, 26 August 2009
  • qplot(table, price, data = diamonds) Wednesday, 26 August 2009
  • ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 15000 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 10000 price 5000 50 60 70 80 90 qplot(table, price, data = diamonds, geom = "boxplot") table Wednesday, 26 August 2009
  • ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 15000 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 10000 ● ● ● ● ● ● ● ● ● ● price ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 5000 qplot(table, price, data = diamonds, geom 80 "boxplot", 50 60 70 = 90 group = round(table)) table Wednesday, 26 August 2009
  • ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 15000 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 10000 ● ● ● ● ● ● ● ● ● ● price ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 5000 One boxplot for each unique value of this aesthetic qplot(table, price, data = diamonds, geom 80 "boxplot", 50 60 70 = 90 group = round(table)) table Wednesday, 26 August 2009
  • Scatterplots Wednesday, 26 August 2009
  • Revision: Interpreting a scatterplot • Big patterns • Small patterns • Deviations from the pattern • Strange patterns Wednesday, 26 August 2009
  • Wednesday, 26 August 2009
  • Strong linear relationship. A number of outliers. Wednesday, 26 August 2009
  • Wednesday, 26 August 2009
  • Unusual striations. Two groups? Little relationship between table and price? Wednesday, 26 August 2009
  • Wednesday, 26 August 2009
  • Curved (exponential?) relationship. Outliers mostly cheaper than expected. Wednesday, 26 August 2009
  • But what’s the problem with all these plots? qplot(carat, price, data = diamonds) Wednesday, 26 August 2009
  • But what’s the problem with all these plots? In pairs, brainstorm solutions for 2 minutes. qplot(carat, price, data = diamonds) Wednesday, 26 August 2009
  • Ideas If x discrete, use boxplots. Use semi-transparent points. Divide into bins and count number of points in each bin (2d histogram). Display statistical summary. Wednesday, 26 August 2009
  • qplot(carat, price, data = diamonds, colour = I(alpha("black", 1/255))) qplot(carat, price, data = diamonds, geom = "bin2d") qplot(carat, price, data = diamonds, geom = "bin2d", bins = 100) qplot(carat, price, data = diamonds, geom = "hex") qplot(carat, price, data = diamonds) + geom_smooth() Wednesday, 26 August 2009
  • Your turn Practice doing these plots yourself. Read the online documentation for each plot type: http://had.co.nz/ggplot2 Wednesday, 26 August 2009
  • Homework Practice your graphics/data exploration skills with the diamonds or mpg data. Due in one week. Make sure to read the grading rubric, and find a colour printer. Wednesday, 26 August 2009