02 Large
1. Stat405 Graphics for large data
Hadley Wickham
Thursday, 26 August 2010
2. Majoring in Stat
• Declare early (even if you’re not sure)
• Weekly lunches
• Summer opportunities
(research & internships)
3. Outline
1. Leftovers from last lecture
2. The diamonds data
3. Histograms and bar charts
4. More boxplots and scatterplots
5. Homework
7. Your turn
Read the help for reorder. Redraw the
previous plots with class ordered by
median hwy.
How would you put the jittered points on
top of the boxplots?
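One possible approach to this exercise, as a minimal sketch: it assumes the "previous plots" were boxplots of hwy by class from the mpg data, and it uses ggplot() rather than qplot() because qplot() is deprecated in recent ggplot2 releases. The object names class_by_hwy and p are made up for illustration.

```r
library(ggplot2)  # provides the mpg data

# reorder(class, hwy, FUN = median) sorts the levels of class by median hwy.
class_by_hwy <- reorder(mpg$class, mpg$hwy, FUN = median)
levels(class_by_hwy)

# Layers are drawn in the order they are added, so adding the jitter layer
# after the boxplot layer puts the points on top of the boxes.
p <- ggplot(mpg, aes(reorder(class, hwy, FUN = median), hwy)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, height = 0)
p
```

Swapping the order of the two geoms would draw the boxplots over the points instead.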
9. Diamonds data
~54,000 round diamonds from
http://www.diamondse.info/
Carat, colour, clarity, cut
Total depth, table, depth,
width, height
Price
10. [Diagram: a diamond labelled with x, z, and table width]
depth = z / diameter * 100
table = table width / x * 100
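The depth formula can be checked against the data. This is a rough sketch, assuming "diameter" means the mean of x and y (which is how the ggplot2 help page for diamonds defines total depth) and that depth is stored as a percentage. The table formula cannot be checked the same way because table width is not a column in the data. The names diameter, depth_calc, ok, and depth_diff are illustrative only.

```r
library(ggplot2)  # provides the diamonds data

# Assumption: diameter is the mean of the x and y measurements.
diameter <- (diamonds$x + diamonds$y) / 2
depth_calc <- 100 * diamonds$z / diameter

# Some rows record 0 for a dimension, which gives Inf or NaN; drop those.
ok <- is.finite(depth_calc)
depth_diff <- abs(depth_calc[ok] - diamonds$depth[ok])

# Distribution of the absolute difference between computed and recorded depth.
summary(depth_diff)
```

Large differences in the summary point to rows worth inspecting by hand.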
11. Recall
Write down five ways to inspect the
diamonds dataset.
You have one minute!
12. Your turn
Inspect the data and familiarise yourself
with the variables. If you don’t know what
they mean, look them up on Wikipedia.
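A few standard ways to inspect a data frame, as a sketch; these are common base R calls, not necessarily the five the lecture had in mind.

```r
library(ggplot2)  # provides the diamonds data

head(diamonds)     # first six rows
str(diamonds)      # column types and a preview of the values
summary(diamonds)  # per-column summaries
dim(diamonds)      # number of rows and columns
names(diamonds)    # variable names

# In an interactive session, ?diamonds opens the help page that
# describes each variable.
```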
13. Histogram &
bar charts
14. Histograms and
bar charts
Used to display the distribution of a
variable
Categorical variable → bar chart
Continuous variable → histogram
15. Always
experiment with
the bin width!
16. Examples
# With only one variable, qplot guesses that
# you want a bar chart or histogram
qplot(cut, data = diamonds)
qplot(carat, data = diamonds)
qplot(carat, data = diamonds, binwidth = 1)
qplot(carat, data = diamonds, binwidth = 0.1)
qplot(carat, data = diamonds, binwidth = 0.01)
resolution(diamonds$carat)
last_plot() + xlim(0, 3)
17. Common ggplot2 technique: adding together plot components
last_plot() + xlim(0, 3)
18. qplot(table, data = diamonds, binwidth = 1)
# To zoom in on a plot region use xlim() and ylim()
qplot(table, data = diamonds, binwidth = 1) +
  xlim(50, 70)
qplot(table, data = diamonds, binwidth = 0.1) +
  xlim(50, 70)
qplot(table, data = diamonds, binwidth = 0.1) +
  xlim(50, 70) + ylim(0, 50)
# Note that this type of zooming discards data
# outside of the plot regions
# See coord_cartesian() for an alternative
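The difference between the two kinds of zooming can be seen by counting how many observations end up in the histogram bins. This is a sketch written with ggplot() (qplot() is deprecated in recent ggplot2 releases); the names p_xlim, p_coord, n_xlim, and n_coord are made up for illustration.

```r
library(ggplot2)

# xlim() sets scale limits: values of table outside 50-70 are dropped
# (with a warning) before the histogram is binned.
p_xlim <- ggplot(diamonds, aes(table)) +
  geom_histogram(binwidth = 1) +
  xlim(50, 70)

# coord_cartesian() only changes the visible window; every row is binned.
p_coord <- ggplot(diamonds, aes(table)) +
  geom_histogram(binwidth = 1) +
  coord_cartesian(xlim = c(50, 70))

# Total count across all bins in each plot, next to the number of rows.
n_xlim  <- sum(suppressWarnings(layer_data(p_xlim, 1))$count)
n_coord <- sum(layer_data(p_coord, 1)$count)
c(xlim = n_xlim, coord_cartesian = n_coord, rows = nrow(diamonds))
```

For a histogram the two versions look much alike, but the distinction matters for summaries such as boxplots or smooths, where dropping data changes the statistic itself.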
19. Additional variables
As with scatterplots, we can use aesthetics
or faceting. Using aesthetics creates
pretty, but ineffective, plots.
The following examples show the
difference when investigating the
relationship between cut and depth.
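The two approaches might be sketched as follows, using ggplot() in place of qplot(). The choice of facet_wrap() with free y scales is an assumption made for illustration, not necessarily the layout used in the lecture; base, p_fill, and p_facet are illustrative names.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(depth)) + xlim(55, 70)

# Aesthetic: a single panel, with bars filled (and stacked) by cut.
p_fill <- base + geom_histogram(aes(fill = cut), binwidth = 0.2)

# Faceting: one panel per cut. Free y scales keep small groups such as
# Fair readable next to large ones such as Ideal.
p_facet <- base +
  geom_histogram(binwidth = 0.2) +
  facet_wrap(~ cut, scales = "free_y")

p_fill
p_facet
```

In the stacked version only the bottom group sits on a common baseline, which is why the shapes of the other distributions are hard to compare.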
21. [Figure: histogram of depth (x axis, 56 to 70) against count (y axis, 0 to 4000), with bars filled by cut: Fair, Good, Very Good, Premium, Ideal]
qplot(depth, data = diamonds, binwidth = 0.2,
  fill = cut) + xlim(55, 70)
22. Fill is the aesthetic for fill colour (here, fill = cut).
24. Your turn
Explore the distribution of price.
How does it vary with colour, cut, and
clarity?
Practice zooming in on regions of interest.
25. Box and
whisker plots
26. Boxplots
Show less information than a histogram, but
take up much less space.
Already seen them used with discrete x
values. Can also use with continuous x
values, by specifying how we want the
data grouped.
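Grouping a continuous x variable might look like this. The sketch uses ggplot2's cut_width() to form the groups, which is an assumption; the lecture may have used a different rounding helper. The 0.1 carat width and the name p_box are also illustrative choices.

```r
library(ggplot2)

# cut_width() chops carat into intervals 0.1 wide; the group aesthetic
# tells geom_boxplot() to draw one boxplot of price per interval.
p_box <- ggplot(diamonds, aes(carat, price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)))
p_box
```

Without the group aesthetic, a continuous x gives a single boxplot for the whole dataset.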
38. Curved (exponential?)
relationship. Outliers mostly
cheaper than expected.
39. But what’s the
problem with
all these plots?
qplot(carat, price, data = diamonds)
40. In pairs, brainstorm
solutions for 2 minutes.
41. Idea                ggplot
Small points            shape = I(".")
Transparency            alpha = I(1/50)
Jittering               geom = "jitter"
Smooth curve            geom = "smooth"
2d bins                 geom = "bin2d" or geom = "hex"
Density contours        geom = "density2d"
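Several of these ideas, applied to the carat and price scatterplot, might be written as below. This is a sketch using ggplot() layers rather than the qplot() arguments in the table, since qplot() is deprecated in recent ggplot2 releases. The hex and smooth versions are left out because they depend on extra packages or a slower model fit; the list name plots is illustrative.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(carat, price))

plots <- list(
  small_points = base + geom_point(shape = "."),     # one pixel per point
  transparency = base + geom_point(alpha = 1 / 50),  # 50 overlaps = opaque
  jittering    = base + geom_jitter(),               # small random offsets
  bins_2d      = base + geom_bin2d(),                # counts in rectangles
  contours     = base + geom_density_2d()            # 2d density estimate
)

# Print one of them, for example the 2d bins.
plots$bins_2d
```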
42. Your turn
Practice doing these plots yourself.
Read the online documentation for each
plot type: http://had.co.nz/ggplot2
43. Homework
Practice your graphics/data exploration
skills with the diamonds or mpg data.
Due in one week.
Make sure to read the grading rubric, and
find a colour printer.
44. Asking questions
You have two minutes to write down as
many questions as you can come up with
that you might want to answer about the
diamonds data.
Write your best question on a piece of
paper and turn it in.