The document provides instructions for a statistics course. It discusses installing LaTeX, office hours, lab access, subsetting data, missing values, creating statistical reports with data, code, graphics and a written report in LaTeX, and presenting homework assignments.
1. If you’re using a laptop, start
installing LaTeX, following the
instructions on the website
Thursday, 2 September 2010
2. Office hours: before class.
Lab access: you should now
have it
3. Stat405 Statistical reports
Hadley Wickham
4. 1. More subsetting.
2. Missing values.
3. Statistical reports: data, code,
graphics & written report
5. Saving results
# Prints to screen
diamonds[diamonds$x > 10, ]
# Saves to new data frame
big <- diamonds[diamonds$x > 10, ]
# Overwrites existing data frame. Dangerous!
diamonds <- diamonds[diamonds$x < 10, ]
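The difference between printing, saving and overwriting can be checked on a small example. This is only a sketch: `toy` is a hypothetical stand-in for diamonds with made-up values, and `small` is an illustrative name for a safer alternative to overwriting.

```r
# Toy stand-in for diamonds (hypothetical values, not real data)
toy <- data.frame(
  x = c(4.1, 10.7, 5.3, 10.2),
  y = c(4.2, 10.5, 5.3, 10.1)
)

# Subsetting alone only prints the result; toy itself is unchanged
toy[toy$x > 10, ]
nrow(toy)

# Saving to a new name keeps the original intact
big <- toy[toy$x > 10, ]

# Instead of overwriting toy, store the other subset under its own name
small <- toy[toy$x < 10, ]
```

Keeping the original data frame untouched means a wrong condition can be corrected without reloading the data.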
7. Your turn
Create a logical vector that selects
diamonds with equal x & y. Create a new
dataset that only contains these values.
Create a logical vector that selects
diamonds with incorrect/unusual x, y, or z
values. Create a new dataset that omits
these values. (Hint: do this one variable
at a time)
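One possible shape for this exercise, sketched on a hypothetical data frame `toy` rather than on diamonds. It assumes "unusual" means a value of zero; other definitions are equally reasonable.

```r
# Toy stand-in for diamonds (hypothetical values, not real data)
toy <- data.frame(
  x = c(4.1, 5.3, 0, 6.0),
  y = c(4.1, 5.4, 0, 6.0),
  z = c(2.5, 3.2, 0, 3.7)
)

# Logical vector: TRUE where x and y are equal
equal_xy <- toy$x == toy$y
equal <- toy[equal_xy, ]

# Build the "unusual" vector one variable at a time, then combine with |
x_zero <- toy$x == 0
y_zero <- toy$y == 0
z_zero <- toy$z == 0
unusual <- x_zero | y_zero | z_zero

# ! negates the vector, so this keeps only the rows that are not unusual
clean <- toy[!unusual, ]
```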
10. Data errors
Typically removing the entire row because
of one error is overkill. Better to
selectively replace problem values with
missing values.
In R, missing values are indicated by NA
11. Expression Guess Actual
5 + NA
NA / 2
sum(c(5, NA))
mean(c(5, NA))
NA < 3
NA == 3
NA == NA
12. NA behaviour
Missing values propagate
Use is.na() to check for missing values
Many functions (e.g. sum and mean) have
na.rm argument to remove missing values
prior to computation.
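A minimal sketch of these three points on a made-up vector `x` (the variable names are illustrative only):

```r
x <- c(5, NA, 3)

# Missing values propagate: the sum of anything with NA is NA
total <- sum(x)

# is.na() returns a logical vector marking the missing entries
missing <- is.na(x)

# na.rm = TRUE drops the missing values before computing
total_rm <- sum(x, na.rm = TRUE)
mean_rm <- mean(x, na.rm = TRUE)

# == cannot be used to find missing values: the comparison is itself NA
cmp <- x[2] == NA
```

This is why is.na() is used instead of a comparison with == when looking for missing values.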
13. # Can use subsetting + <- to change individual
# values
diamonds$x[diamonds$x == 0] <- NA
diamonds$y[diamonds$y == 0] <- NA
diamonds$z[diamonds$z == 0] <- NA
y_big <- !is.na(diamonds$y) & diamonds$y > 10
diamonds$y[y_big] <- diamonds$y[y_big] / 10
z_big <- !is.na(diamonds$z) & diamonds$z > 6
diamonds$z[z_big] <- diamonds$z[z_big] / 10
14. Your turn
What happens if you don’t remove
missing values? Why?
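One way to explore this question is to try the assignment without the !is.na() guard. The sketch below uses a made-up vector `y` and wraps the unguarded version in tryCatch() so the script keeps running; it assumes the question refers to the !is.na() checks used when rescaling y and z.

```r
y <- c(5.2, NA, 58.9)

# Without the guard, the logical index itself contains an NA
y_big_unguarded <- y > 10

# Capture the error message instead of stopping the script
result <- tryCatch({
  y[y_big_unguarded] <- y[y_big_unguarded] / 10
  "no error"
}, error = function(e) conditionMessage(e))

# With the guard, NA positions become FALSE and the assignment works
y_big <- !is.na(y) & y > 10
y[y_big] <- y[y_big] / 10
```

R cannot tell whether the NA position should be overwritten, so it refuses the unguarded assignment; `result` holds the error message.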
15. Statistical
reports
16. Statistical reports
Regardless of whether you go into academia
or industry, you need to be able to present
your findings.
And you should be able to do more than just
present them: you should be able to
reproduce them.
17. In one directory:
Data (.csv)
+
Code (.r)
+
Graphics (.png, .pdf)
+
Written report (.tex)
18. Working directory
Set your working directory to specify
where files will be saved by default.
From the terminal (Linux or Mac): the
working directory is the directory you’re in
when you start R
On Windows: File | Change dir.
On the Mac: ⌘-D
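The working directory can also be inspected and changed from within R using getwd() and setwd(). A small sketch, using tempdir() purely for illustration (the file name wd-demo.txt is hypothetical):

```r
# Remember where we started so it can be restored afterwards
old <- getwd()

# Switch to a scratch directory
setwd(tempdir())

# Files written with a relative path now land in the working directory
writeLines("hello", "wd-demo.txt")
saved_here <- file.exists(file.path(getwd(), "wd-demo.txt"))

# Clean up and restore the original working directory
unlink("wd-demo.txt")
setwd(old)
```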
19. Data
So far we’ve just used built in datasets
Next week we’ll learn how to use external data
21. Workflow
At the end of each interactive session, you
want a summary of everything you did
Two options:
Save everything that you did with
savehistory("filename.r") then remove the
unimportant bits
Build up the important bits as you go
Up to you - I prefer the second
22. R editor
Linux: gedit
(copy and paste - see website)
Windows: File | New Script
(press F5 to send line)
Mac: File | New document
(press command-enter to send)
23. Code is
communication!
24. Code presentation
Use comments (#) to describe what you are
doing and to create scannable headings in
your code
Every comma should be followed by a space,
and every mathematical operator (+, -, =, *, /
etc) should be surrounded by spaces.
Parentheses do not need spaces
Lines should be at most 80 characters. If you
have to break up a line, indent the following
piece
26. # Table and depth -------------------------
qplot(table, depth, data = diamonds)
qplot(table, depth, data = diamonds) +
  xlim(50, 70) + ylim(50, 70)
# Is there a linear relationship?
qplot(table - depth, data = diamonds,
  geom = "histogram")
# This bin width seems the most revealing
qplot(table / depth, data = diamonds,
  geom = "histogram", binwidth = 0.01) +
  xlim(0.8, 1.2)
# Also tried: 0.05, 0.005, 0.002
29. Saving graphics
# Uses size on screen:
ggsave("my-plot.pdf")
ggsave("my-plot.png")
# Specify size
ggsave("my-plot.pdf",
  width = 6, height = 6)
# Saves file in working directory
# (where you started R from)
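ggsave() can also be given the plot explicitly rather than relying on the last plot displayed. A hedged sketch, assuming the ggplot2 package is installed; `toy`, `p`, `out_pdf` and `out_png` are illustrative names, and tempdir() is used only so the demo does not clutter the working directory.

```r
library(ggplot2)

# Toy data (hypothetical values) standing in for diamonds
toy <- data.frame(table = c(55, 57, 60, 62), depth = c(61, 62, 60, 63))
p <- ggplot(toy, aes(table, depth)) + geom_point()

# Passing plot = p avoids depending on the last plot displayed
out_pdf <- file.path(tempdir(), "my-plot.pdf")
out_png <- file.path(tempdir(), "my-plot.png")

# width and height are in inches by default; dpi applies to raster output
ggsave(out_pdf, plot = p, width = 6, height = 6)
ggsave(out_png, plot = p, width = 6, height = 6, dpi = 150)
```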
30. PDF                        PNG
Vector based                   Raster based
(can zoom in infinitely)       (made up of pixels)
Good for most plots            Good for plots with thousands of points
31. Your turn
Recreate some of the graphics from
previous lectures and save them.
Experiment with the scale and height and
width settings.
Modify the template to include them.
33. LaTeX
We are going to use the open source
document typesetting system called LaTeX to
produce our reports.
This is widespread in statistics: if you ever
write a journal article, you will probably write
it in LaTeX.
(Not so useful if you’re not in grad school)
34. Edit-Compile-Preview
Edit: a text document with special
formatting
Compile: to produce a pdf
Preview: with a pdf viewer
See web page for system specifics.
35. LaTeX
Template
Sections
Images
Figures and cross-references
Verbatim input (for code)
36. Your turn
# Get the sample report
wget http://had.co.nz/stat405/resources/sample-report.zip
unzip sample-report.zip
cd sample-report
gedit template.tex &
pdflatex template.tex
evince template.pdf
# Experiment!
37. Your turn
If not on Linux, follow the instructions on
the class website.
If you feel comfortable, start on
homework 2.