04 Reports

  1. If you’re using a laptop, start installing LaTeX now, following the instructions on the class website. (Thursday, 2 September 2010)
  2. Stat405: Statistical reports. Hadley Wickham.
  3. Outline: (1) More subsetting. (2) Missing values. (3) Statistical reports: data, code, graphics and written report.
  4. Office hours. Me: before class, DH 2056. Garrett: Wednesday, 3pm, DH 1041. Lab access: you should now have it.
  5. Saving results

         # Prints to screen
         diamonds[diamonds$x > 10, ]

         # Saves to new data frame
         big <- diamonds[diamonds$x > 10, ]

         # Overwrites existing data frame. Dangerous!
         diamonds <- diamonds[diamonds$x < 10, ]

  6. Saving results (continued)

         diamonds <- diamonds[1, 1]
         diamonds
         # Uh oh!

         rm(diamonds)
         str(diamonds)
         # Phew!
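
     The recovery on slide 6 works because the assignment creates a copy in your workspace that masks the packaged dataset, and rm() deletes only that copy. A minimal sketch of the same idea, using base R's built-in mtcars in place of diamonds so that it runs without ggplot2 (that substitution is an assumption made for illustration):

     ```r
     # Assumption: mtcars (from the datasets package) stands in for diamonds.
     mtcars <- mtcars[1, 1]    # overwrite: the workspace copy is now a single number
     is.data.frame(mtcars)     # FALSE

     rm(mtcars)                # removes only the workspace copy
     is.data.frame(mtcars)     # TRUE: the packaged dataset is visible again
     ```

     This safety net only exists for datasets that ship with a package. Data you loaded or built yourself is gone once overwritten, so saving to a new name is the safer habit.
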
  7. Your turn. Create a logical vector that selects diamonds with equal x and y, then create a new dataset that contains only those rows. Create a logical vector that selects diamonds with incorrect or unusual x, y, or z values, then create a new dataset that omits those rows. (Hint: do this one variable at a time.)
  8. Your turn (solution)

         equal_dim <- diamonds$x == diamonds$y
         equal <- diamonds[equal_dim, ]

         y_big <- diamonds$y > 10
         z_big <- diamonds$z > 6

         x_zero <- diamonds$x == 0
         y_zero <- diamonds$y == 0
         z_zero <- diamonds$z == 0
         zeros <- x_zero | y_zero | z_zero

         bad <- y_big | z_big | zeros
         good <- diamonds[!bad, ]
  9. Missing values
  10. Data errors. Removing an entire row because of one error is typically overkill. It is better to selectively replace problem values with missing values. In R, missing values are indicated by NA.
  11. Guess the result of each expression, then check the actual value in R:

         Expression        Guess    Actual
         5 + NA
         NA / 2
         sum(c(5, NA))
         mean(c(5, NA))
         NA < 3
         NA == 3
         NA == NA
  12. NA behaviour. Missing values propagate. Use is.na() to check for missing values. Many functions (e.g. sum and mean) have an na.rm argument to remove missing values prior to computation.
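
     The three rules on slide 12 can be checked directly at the console. A short sketch in base R (the vector x is a made-up example):

     ```r
     # Arithmetic and comparisons involving NA return NA
     5 + NA                   # NA
     NA / 2                   # NA
     NA < 3                   # NA
     NA == NA                 # NA: two unknown values may or may not be equal

     # So test for missingness with is.na(), not with ==
     x <- c(5, NA)
     is.na(x)                 # FALSE TRUE

     # Summaries propagate NA unless na.rm = TRUE
     sum(x)                   # NA
     sum(x, na.rm = TRUE)     # 5
     mean(x, na.rm = TRUE)    # 5
     ```
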
  13. Missing values (continued)

         # Can use subsetting + <- to change individual values
         diamonds$x[diamonds$x == 0] <- NA
         diamonds$y[diamonds$y == 0] <- NA
         diamonds$z[diamonds$z == 0] <- NA

         y_big <- !is.na(diamonds$y) & diamonds$y > 10
         diamonds$y[y_big] <- diamonds$y[y_big] / 10

         z_big <- !is.na(diamonds$z) & diamonds$z > 6
         diamonds$z[z_big] <- diamonds$z[z_big] / 10
  14. Your turn. What happens if you don’t remove the missing values during the subsetting replacement? Why?
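
     One way to explore this question is on a tiny vector rather than the full dataset. The sketch below uses a hypothetical vector y (not the diamonds column) and wraps the unguarded replacement in tryCatch() so the script keeps running:

     ```r
     # Hypothetical measurements: 45 looks like a misplaced decimal, NA is missing
     y <- c(4.5, NA, 45)

     y > 10                       # FALSE NA TRUE: the comparison propagates NA
     big <- !is.na(y) & y > 10    # FALSE FALSE TRUE: the guard turns the NA into FALSE

     # Without the guard, the logical index contains NA and the assignment fails
     res <- tryCatch({
       tmp <- y
       tmp[tmp > 10] <- tmp[tmp > 10] / 10
       "no error"
     }, error = function(e) conditionMessage(e))
     res                          # an error message about NAs in the subscript

     # With the guard, only the flagged value changes
     y[big] <- y[big] / 10
     y                            # 4.5 NA 4.5
     ```

     R cannot tell whether an NA in the index means "replace this element" or "leave it alone", so it refuses to guess when the replacement has more than one value.
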
  15. Statistical reports
  16. Statistical reports. Regardless of whether you go into academia or industry, you need to be able to present your findings. You should be able to do more than just present them: you should be able to reproduce them.
  17. In one directory: Data (.csv) + Code (.r) + Graphics (.png, .pdf) + Written report (.tex)
  18. Working directory. Set your working directory to specify where files will be loaded from and saved to. From the terminal (Linux or Mac): the working directory is the directory you’re in when you start R. On Windows: File | Change dir. On the Mac: ⌘-D.
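
     The working directory can also be inspected and changed from code with base R's getwd() and setwd(). A minimal sketch, assuming a temporary directory stands in for your project directory (the file name example.csv is made up):

     ```r
     getwd()                      # where R currently reads and writes files

     # Assumption: tempdir() stands in for your real project directory
     old <- setwd(tempdir())      # setwd() returns the previous directory

     # A relative path is resolved against the working directory
     writeLines("x,y\n1,2", "example.csv")
     file.exists("example.csv")   # TRUE
     list.files(pattern = "\\.csv$")

     setwd(old)                   # restore the original working directory
     ```
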
  19. Data. So far we’ve just used built-in datasets. Next week we’ll learn how to use external data.
  20. Code
  21. Workflow. At the end of each interactive session, you want a summary of everything you did. Two options: (1) save everything that you did with savehistory("filename.r"), then remove the unimportant bits; (2) build up the important bits as you go. Up to you, but I prefer the second.
  22. R editor. Linux: gedit (copy and paste, see website). Windows: File | New Script (press F5 to send line). Mac: File | New document (press command-enter to send).
  23. Code is communication!
  24. Code presentation. Use comments (#) to describe what you are doing and to create scannable headings in your code. Every comma should be followed by a space, and every mathematical operator (+, -, =, *, / etc.) should be surrounded by spaces. Parentheses do not need spaces. Lines should be at most 80 characters. If you have to break up a line, indent the following piece.
  25. Code presentation (unformatted example)

         qplot(table,depth,data=diamonds)
         qplot(table,depth,data=diamonds)+xlim(50,70)+ylim(50,70)
         qplot(table-depth,data=diamonds,geom="histogram")
         qplot(table/depth,data=diamonds,geom="histogram",binwidth=0.01)+xlim(0.8,1.2)
  26. Code presentation (formatted example)

         # Table and depth -------------------------
         qplot(table, depth, data = diamonds)
         qplot(table, depth, data = diamonds) +
           xlim(50, 70) + ylim(50, 70)

         # Is there a linear relationship?
         qplot(table - depth, data = diamonds,
           geom = "histogram")

         # This bin width seems the most revealing
         qplot(table / depth, data = diamonds,
           geom = "histogram", binwidth = 0.01) +
           xlim(0.8, 1.2)
         # Also tried: 0.05, 0.005, 0.002
  28. Graphics
  29. Saving graphics

         # Uses size on screen:
         ggsave("my-plot.pdf")
         ggsave("my-plot.png")

         # Specify size
         ggsave("my-plot.pdf", width = 6, height = 6)

         # Remember to set your working directory!
  30. PDF vs PNG

         PDF                           PNG
         Vector based                  Raster based
         (can zoom in infinitely)      (made up of pixels)
         Good for most plots           Good for plots with thousands of points
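
     To see the trade-off in practice, the same plot can be saved in both formats and the files compared. This is a sketch, assuming ggplot2 is installed; the plot, the file names and the use of a temporary directory are choices made for illustration:

     ```r
     library(ggplot2)   # assumption: ggplot2 is installed; it also provides diamonds

     # A scatterplot with tens of thousands of points
     p <- ggplot(diamonds, aes(carat, price)) + geom_point()

     # Assumption: write into a temporary directory rather than your project
     out <- tempdir()
     pdf_file <- file.path(out, "carat-price.pdf")
     png_file <- file.path(out, "carat-price.png")

     # Passing plot = p avoids relying on whichever plot was displayed last
     ggsave(pdf_file, plot = p, width = 6, height = 6)   # inches by default
     ggsave(png_file, plot = p, width = 6, height = 6)

     file.size(c(pdf_file, png_file))   # compare the two file sizes in bytes
     ```

     A vector file stores every point as a separate object, while a raster file stores a fixed grid of pixels regardless of how many points were drawn.
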
  31. Your turn. Recreate some of the graphics from previous lectures and save them. Experiment with the scale, height and width settings. Modify the template to include them.
  32. Written report
  33. LaTeX. We are going to use the open source document typesetting system called LaTeX to produce our reports. It is widespread in statistics: if you ever write a journal article, you will probably write it in LaTeX. (Not as useful if you’re not in grad school, but still an important skill.)
  34. Edit-Compile-Preview. Edit: a text document with special formatting. Compile: to produce a PDF. Preview: with a PDF viewer. See the web page for system specifics.
  35. LaTeX. Template. Sections. Images. Figures and cross-references. Verbatim input (for code).
  36. Your turn

         # Get the sample report
         wget http://had.co.nz/stat405/resources/sample-report.zip
         unzip sample-report.zip
         cd sample-report
         gedit template.tex &
         pdflatex template.tex
         evince template.pdf
         # Experiment!
  37. Your turn. If not on Linux, follow the instructions on the class website. If you feel comfortable, start on homework 2.
  38. Homework
