04 Reports

Published in: Sports, Technology

1. If you’re using a laptop, start installing LaTeX now, following the instructions on the website. (Thursday, 2 September 2010)
2. Stat405: Statistical reports
    Hadley Wickham
3. (1) More subsetting. (2) Missing values. (3) Statistical reports: data, code, graphics & written report.
4. Office hours
    Me: before class, DH 2056
    Garrett: Wednesday, 3pm, DH 1041
    Lab access: you should now have it
5. Saving results
        # Prints to screen
        diamonds[diamonds$x > 10, ]
        # Saves to new data frame
        big <- diamonds[diamonds$x > 10, ]
        # Overwrites existing data frame. Dangerous!
        diamonds <- diamonds[diamonds$x < 10, ]
6.
        diamonds <- diamonds[1, 1]
        diamonds
        # Uh oh!
        # rm() deletes your overwritten copy, so the original dataset
        # from the package becomes visible again
        rm(diamonds)
        str(diamonds)
        # Phew!
7. Your turn
    Create a logical vector that selects diamonds with equal x & y. Create a new dataset that only contains these values.
    Create a logical vector that selects diamonds with incorrect/unusual x, y, or z values. Create a new dataset that omits these values. (Hint: do this one variable at a time.)
8.
        equal_dim <- diamonds$x == diamonds$y
        equal <- diamonds[equal_dim, ]

        y_big <- diamonds$y > 10
        z_big <- diamonds$z > 6

        x_zero <- diamonds$x == 0
        y_zero <- diamonds$y == 0
        z_zero <- diamonds$z == 0
        zeros <- x_zero | y_zero | z_zero

        bad <- y_big | z_big | zeros
        good <- diamonds[!bad, ]
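The logic of the solution above is easier to check by hand on a tiny example. The following is a minimal sketch on a made-up data frame: `toy` and its values are hypothetical and are not taken from the diamonds data; only the pattern of combining logical vectors comes from the slide. Because TRUE counts as 1, sum() on a logical vector reports how many rows a condition flags.

```r
# Hypothetical toy data standing in for diamonds (values are made up)
toy <- data.frame(
  x = c(4.0, 0.0, 5.1, 6.2),
  y = c(4.0, 3.9, 31.8, 6.2),
  z = c(2.5, 2.4, 3.1, 0.0)
)

# One logical vector per problem, as in the slide
y_big <- toy$y > 10
zeros <- toy$x == 0 | toy$y == 0 | toy$z == 0

# Combine with | (or) and invert with ! (not)
bad <- y_big | zeros
good <- toy[!bad, ]

# TRUE counts as 1, so sum() gives the number of flagged rows
sum(bad)     # 3
nrow(good)   # 1
```

Checking these counts before overwriting anything is a cheap way to catch a condition that flags far more rows than expected.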
9. Missing values
10. Data errors
    Typically removing the entire row because of one error is overkill.
    Better to selectively replace problem values with missing values.
    In R, missing values are indicated by NA.
11. Expression          Guess    Actual
    5 + NA
    NA / 2
    sum(c(5, NA))
    mean(c(5, NA))
    NA < 3
    NA == 3
    NA == NA
12. NA behaviour
    Missing values propagate.
    Use is.na() to check for missing values.
    Many functions (e.g. sum and mean) have an na.rm argument to remove missing values prior to computation.
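The three rules above can be tried directly at the console. This is a small illustrative sketch; the vector `v` and the result names are made up for the example.

```r
# Missing values propagate through arithmetic and comparisons
5 + NA     # NA
NA == 3    # NA: we don't know whether the missing value is 3
NA == NA   # NA: two unknown values may or may not be equal

# Use is.na() rather than == to find missing values
v <- c(5, NA, 7)
is.na(v)   # FALSE TRUE FALSE

# na.rm = TRUE drops missing values before computing
total_all <- sum(v)                 # NA
total_obs <- sum(v, na.rm = TRUE)   # 12
mean_obs <- mean(v, na.rm = TRUE)   # 6
```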
13.
        # Can use subsetting + <- to change individual
        # values
        diamonds$x[diamonds$x == 0] <- NA
        diamonds$y[diamonds$y == 0] <- NA
        diamonds$z[diamonds$z == 0] <- NA

        y_big <- !is.na(diamonds$y) & diamonds$y > 10
        diamonds$y[y_big] <- diamonds$y[y_big] / 10

        z_big <- !is.na(diamonds$z) & diamonds$z > 6
        diamonds$z[z_big] <- diamonds$z[z_big] / 10
14. Your turn
    What happens if you don’t remove the missing values during the subsetting replacement? Why?
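One way to explore this question is on a tiny vector rather than the full dataset. The sketch below uses a made-up vector `v` (hypothetical, not from the diamonds data) and wraps the risky assignment in tryCatch() so the script keeps running. It relies on a documented rule of R's subset assignment: an NA in the index is only allowed when the replacement value has length one.

```r
# Hypothetical toy vector with one missing value (not the diamonds data)
v <- c(5, NA, 31.8)

# The comparison propagates the NA into the logical index
v > 10   # FALSE NA TRUE

# With an NA in the index and more than one replacement value,
# R cannot tell which value belongs where and stops with an error
result <- tryCatch({
  v[v > 10] <- v[v > 10] / 10
  "no error"
}, error = function(e) conditionMessage(e))
result

# Excluding the NAs first gives a clean TRUE/FALSE index
big <- !is.na(v) & v > 10   # FALSE FALSE TRUE
v[big] <- v[big] / 10
v                           # 5, NA, 3.18
```

This is also why the earlier `diamonds$x[diamonds$x == 0] <- NA` style of assignment is safe to repeat: its replacement value has length one.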
15. Statistical reports
16. Statistical reports
    Regardless of whether you go into academia or industry, you need to be able to present your findings.
    And you should be able to do more than just present them: you should be able to reproduce them.
17. In one directory:
    Data (.csv) + Code (.r) + Graphics (.png, .pdf) + Written report (.tex)
18. Working directory
    Set your working directory to specify where files will be loaded from and saved to.
    From the terminal (Linux or Mac): the working directory is the directory you’re in when you start R.
    On Windows: File | Change dir.
    On the Mac: ⌘-D
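The working directory can also be inspected and changed from R code with getwd() and setwd(). The sketch below is illustrative only: tempdir() stands in for a real project folder, and the file name `example.csv` and its contents are made up.

```r
# Show the current working directory
old_dir <- getwd()
old_dir

# Switch to another directory
# (tempdir() is a stand-in for your own project folder)
setwd(tempdir())

# Relative file names are now resolved inside that directory
writeLines("carat,price", "example.csv")
file.exists(file.path(tempdir(), "example.csv"))   # TRUE

# Restore the original working directory
setwd(old_dir)
```

Setting the directory once at the start of a session means every later file name in your code can stay relative.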
19. Data
    So far we’ve just used built-in datasets.
    Next week we’ll learn how to use external data.
20. Code
21. Workflow
    At the end of each interactive session, you want a summary of everything you did. Two options:
    (1) Save everything that you did with savehistory("filename.r"), then remove the unimportant bits.
    (2) Build up the important bits as you go.
    Up to you - I prefer the second.
22. R editor
    Linux: gedit (copy and paste - see website)
    Windows: File | New Script (press F5 to send line)
    Mac: File | New document (press command-enter to send)
23. Code is communication!
24. Code presentation
    Use comments (#) to describe what you are doing and to create scannable headings in your code.
    Every comma should be followed by a space, and every mathematical operator (+, -, =, *, / etc.) should be surrounded by spaces. Parentheses do not need spaces.
    Lines should be at most 80 characters. If you have to break up a line, indent the following piece.
25.
        qplot(table,depth,data=diamonds)
        qplot(table,depth,data=diamonds)+xlim(50,70)+ylim(50,70)
        qplot(table-depth,data=diamonds,geom="histogram")
        qplot(table/depth,data=diamonds,geom="histogram",binwidth=0.01)+xlim(0.8,1.2)
26.
        # Table and depth -------------------------
        qplot(table, depth, data = diamonds)
        qplot(table, depth, data = diamonds) +
          xlim(50, 70) + ylim(50, 70)

        # Is there a linear relationship?
        qplot(table - depth, data = diamonds,
          geom = "histogram")

        # This bin width seems the most revealing
        qplot(table / depth, data = diamonds,
          geom = "histogram", binwidth = 0.01) +
          xlim(0.8, 1.2)
        # Also tried: 0.05, 0.005, 0.002
28. Graphics
29. Saving graphics
        # Uses size on screen:
        ggsave("my-plot.pdf")
        ggsave("my-plot.png")

        # Specify size
        ggsave("my-plot.pdf", width = 6, height = 6)

        # Remember to set your working
        # directory!
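By default ggsave() saves the last plot displayed, which is easy to get wrong in a script. The following is a minimal sketch, assuming ggplot2 is installed, that passes the plot explicitly instead. It builds the plot with ggplot() rather than the qplot() calls used in these slides, and the output file name and temporary location are hypothetical choices for illustration.

```r
library(ggplot2)

# Build a plot object explicitly instead of relying on the last plot shown
p <- ggplot(diamonds, aes(table, depth)) + geom_point()

# Hypothetical output location; in a report you would use a path
# relative to your working directory instead
out_file <- file.path(tempdir(), "table-depth.pdf")

# width and height are in inches by default
ggsave(out_file, plot = p, width = 6, height = 6)

file.exists(out_file)
```

Giving width and height explicitly makes the saved figure independent of however large the plot window happened to be on screen.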
30. PDF                           PNG
    Vector based                  Raster based
    (can zoom in infinitely)      (made up of pixels)
    Good for most plots           Good for plots with thousands of points
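The trade-off in the table shows up in file sizes. This sketch uses base R's pdf() device rather than ggsave() so that it runs without extra packages; the helper name `pdf_size` and the point counts are arbitrary choices for illustration. A vector file stores every point as a separate drawing instruction, so it grows with the number of points, whereas a raster file's size depends mainly on its pixel dimensions.

```r
# Write a scatterplot of n random points to a temporary PDF
# and return the file size in bytes
pdf_size <- function(n) {
  path <- tempfile(fileext = ".pdf")
  pdf(path, width = 6, height = 6)
  plot(runif(n), runif(n))
  dev.off()
  file.size(path)
}

set.seed(405)
small <- pdf_size(100)
large <- pdf_size(50000)

# The vector file grows with the number of points drawn
large > small   # TRUE
```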
31. Your turn
    Recreate some of the graphics from previous lectures and save them. Experiment with the scale, height, and width settings.
    Modify the template to include them.
32. Written report
33. LaTeX
    We are going to use the open source document typesetting system called LaTeX to produce our reports.
    This is widespread in statistics: if you ever write a journal article, you will probably write it in LaTeX.
    (Not as useful if you’re not in grad school, but still an important skill.)
34. Edit-Compile-Preview
    Edit: a text document with special formatting
    Compile: to produce a pdf
    Preview: with a pdf viewer
    See web page for system specifics.
35. LaTeX
    Template
    Sections
    Images
    Figures and cross-references
    Verbatim input (for code)
36. Your turn
        # Get the sample report
        wget http://had.co.nz/stat405/resources/sample-report.zip
        unzip sample-report.zip
        cd sample-report
        gedit template.tex &
        pdflatex template.tex
        evince template.pdf
        # Experiment!
37. Your turn
    If not on Linux, follow the instructions on the class website.
    If you feel comfortable, start on homework 2.
38. Homework