If you’re using a laptop, start
               installing LaTeX, following the
               instructions on the website



Thursday, 2 September 2010
Office hours: before class.

               Lab access: you should now
               have it



Stat405              Statistical reports


                               Hadley Wickham
1. More subsetting.
               2. Missing values.
               3. Statistical reports: data, code,
                  graphics & written report




Saving results
               # Prints to screen
               diamonds[diamonds$x > 10, ]

               # Saves to new data frame
               big <- diamonds[diamonds$x > 10, ]

               # Overwrites existing data frame. Dangerous!
               diamonds <- diamonds[diamonds$x < 10,]



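     # Accidentally overwrite the data frame with a single value
     diamonds <- diamonds[1, 1]
     diamonds

     # Uh oh!

     # Remove the modified copy; the original dataset becomes visible again
     rm(diamonds)
     str(diamonds)

     # Phew!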




Your turn
                    Create a logical vector that selects
                    diamonds with equal x & y. Create a new
                    dataset that only contains these values.
                    Create a logical vector that selects
                    diamonds with incorrect/unusual x, y, or z
                    values. Create a new dataset that omits
                    these values. (Hint: do this one variable
                    at a time)


equal_dim <- diamonds$x == diamonds$y
     equal <- diamonds[equal_dim, ]

     y_big <- diamonds$y > 10
     z_big <- diamonds$z > 6

     x_zero <- diamonds$x == 0
     y_zero <- diamonds$y == 0
     z_zero <- diamonds$z == 0
     zeros <- x_zero | y_zero | z_zero

     bad <- y_big | z_big | zeros
     good <- diamonds[!bad, ]
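
A quick way to check logical vectors like these is to count how many rows each one flags, since sum() treats TRUE as 1. The sketch below uses a small made-up data frame (toy is hypothetical, standing in for diamonds) so the counts are easy to verify by eye.

```r
# Hypothetical toy data standing in for the diamonds measurements
toy <- data.frame(
  x = c(4.0, 0.0, 5.1, 6.2),
  y = c(4.0, 3.9, 31.8, 6.2),
  z = c(2.5, 2.4, 3.1, 0.0)
)

# Logical vectors built one condition at a time
equal_dim <- toy$x == toy$y
y_big <- toy$y > 10
zeros <- toy$x == 0 | toy$y == 0 | toy$z == 0
bad <- y_big | zeros

# sum() of a logical vector counts the TRUE values
sum(equal_dim)
sum(bad)

# Keep only the rows that were not flagged
good <- toy[!bad, ]
nrow(good)
```

Counting before subsetting makes it obvious when a condition flags far more (or fewer) rows than expected.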


Missing values
Data errors

                    Typically removing the entire row because
                    of one error is overkill. Better to
                    selectively replace problem values with
                    missing values.
                    In R, missing values are indicated by NA




                   Expression         Guess   Actual
                   5 + NA
                   NA / 2
                   sum(c(5, NA))
                   mean(c(5, NA))
                   NA < 3
                   NA == 3
                   NA == NA

NA behaviour

                    Missing values propagate
                    Use is.na() to check for missing values
                    Many functions (e.g. sum and mean) have an
                    na.rm argument to remove missing values
                    prior to computation.
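
The three points above can be tried on a two-element vector. This is a minimal sketch; the variable name x is just for illustration.

```r
x <- c(5, NA)

# Missing values propagate through arithmetic and summaries
5 + NA
sum(x)
mean(x)

# Comparisons with NA are also NA, even NA == NA
NA == NA

# is.na() is the way to test for missingness
is.na(x)

# na.rm = TRUE drops missing values before computing
sum(x, na.rm = TRUE)
mean(x, na.rm = TRUE)
```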



# Can use subsetting + <- to change individual
     # values

     diamonds$x[diamonds$x == 0] <- NA
     diamonds$y[diamonds$y == 0] <- NA
     diamonds$z[diamonds$z == 0] <- NA

     y_big <- !is.na(diamonds$y) & diamonds$y > 10
     diamonds$y[y_big] <- diamonds$y[y_big] / 10
     z_big <- !is.na(diamonds$z) & diamonds$z > 6
     diamonds$z[z_big] <- diamonds$z[z_big] / 10
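
The !is.na() guard matters because a comparison against NA yields NA, and R refuses a subsetting assignment whose logical index contains NA when the replacement has more than one value. A minimal sketch on a made-up vector (y here is hypothetical, not the diamonds column):

```r
y <- c(3.9, NA, 31.8)

# Without the guard, the index itself contains NA
unguarded <- y > 10
unguarded

# With the guard, NA positions become FALSE
guarded <- !is.na(y) & y > 10
guarded

# Try the unguarded assignment on a copy and capture any error message
result <- tryCatch({
  y2 <- y
  y2[unguarded] <- y2[unguarded] / 10
  "no error"
}, error = function(e) conditionMessage(e))
result

# The guarded version rescales only the value that really is too big
y[guarded] <- y[guarded] / 10
y
```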




Your turn


                    What happens if you don’t remove
                    missing values? Why?




Statistical reports
Statistical reports

                    Regardless of whether you go into academia
                    or industry, you need to be able to present
                    your findings.
                    And you should be able to do more than just
                    present them, you should be able to
                    reproduce them.




In one directory:
                             Data (.csv)
                                  +
                             Code (.r)
                                  +
                        Graphics (.png, .pdf)
                                  +
                        Written report (.tex)
Working directory
                    Set your working directory to specify
                    where files will be saved by default.
                    From the terminal (Linux or Mac): the
                    working directory is the directory you’re in
                    when you start R
                    On Windows: File | Change dir.
                    On the Mac: ⌘-D
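
The working directory can also be inspected and changed from R code with getwd() and setwd(). The sketch below is only an illustration: the directory name report-demo and the file example.csv are made up, and a temporary directory is used so nothing is written into your real project.

```r
# Remember where we started so we can return afterwards
old <- getwd()

# Hypothetical project directory, created under the session's temp dir
report_dir <- file.path(tempdir(), "report-demo")
dir.create(report_dir, showWarnings = FALSE)

# Switch to it; relative paths are now resolved against this directory
setwd(report_dir)
writeLines("x,y", "example.csv")
list.files()

# Go back to the original working directory
setwd(old)
```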


Data
              So far we’ve just used built-in datasets
           Next week we’ll learn how to use external data



Code

Workflow
                    At the end of each interactive session, you
                    want a summary of everything you did
                    Two options:
                             Save everything that you did with
                             savehistory("filename.r") then remove the
                             unimportant bits
                             Build up the important bits as you go
                    Up to you - I prefer the second

R editor

                    Linux: gedit
                    (copy and paste - see website)

                    Windows: File | New Script
                    (press F5 to send line)

                    Mac: File | New document
                    (press command-enter to send)




Code is communication!
Code presentation
                    Use comments (#) to describe what you are
                    doing and to create scannable headings in
                    your code
                    Every comma should be followed by a space,
                    and every mathematical operator (+, -, =, *, /
                    etc) should be surrounded by spaces.
                    Parentheses do not need spaces
                    Lines should be at most 80 characters. If you
                    have to break up a line, indent the following
                    piece
                   qplot(table,depth,data=diamonds)
                   qplot(table,depth,data=diamonds)+xlim(50,70)+ylim(50,70)
                   qplot(table-depth,data=diamonds,geom="histogram")
                   qplot(table/depth,data=diamonds,geom="histogram",binwidth=0.01)+xlim(0.8,1.2)




# Table and depth -------------------------

                  qplot(table, depth, data = diamonds)
                  qplot(table, depth, data = diamonds) +
                    xlim(50, 70) + ylim(50, 70)

                  # Is there a linear relationship?
                  qplot(table - depth, data = diamonds,
                    geom = "histogram")

                  # This bin width seems the most revealing
                  qplot(table / depth, data = diamonds,
                    geom = "histogram", binwidth = 0.01) +
                    xlim(0.8, 1.2)
                  # Also tried: 0.05, 0.005, 0.002


Graphics

Saving graphics
                     # Uses size on screen:
                     ggsave("my-plot.pdf")
                     ggsave("my-plot.png")

                     # Specify size
                     ggsave("my-plot.pdf",
                       width = 6, height = 6)

                     # Saves file in working directory
                     # (where you started R from)
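
By default ggsave() saves the last plot that was displayed. A slightly more explicit pattern is to store the plot in a variable and pass it through the plot argument. The sketch below assumes ggplot2 is installed; it builds the scatterplot with ggplot() rather than qplot(), and the object name p and the temporary output path are made up for the example.

```r
library(ggplot2)

# Build the plot and keep it in a variable instead of only printing it
p <- ggplot(diamonds, aes(table, depth)) +
  geom_point()

# Hypothetical output location in the session's temp dir
out <- file.path(tempdir(), "table-depth.pdf")

# Save that specific plot at a fixed size (inches by default)
ggsave(out, plot = p, width = 6, height = 6)
file.exists(out)
```

Passing the plot explicitly means the saved file does not depend on which plot happened to be drawn last.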

                 PDF                         PNG

                 Vector based                Raster based
                 (can zoom in infinitely)    (made up of pixels)

                 Good for most plots         Good for plots with
                                             thousands of points


Your turn

                    Recreate some of the graphics from
                    previous lectures and save them.
                    Experiment with the scale and height and
                    width settings.
                    Modify the template to include them.



Written report
LaTeX
                    We are going to use LaTeX, an open source
                    document typesetting system, to produce
                    our reports.
                    It is widespread in statistics: if you ever
                    write a journal article, you will probably write
                    it in LaTeX.
                    (Not so useful if you’re not in grad school)



Edit-Compile-Preview
                    Edit: a text document with special
                    formatting
                    Compile: to produce a pdf
                    Preview: with a pdf viewer


                    See web page for system specifics.


LaTeX
                    Template
                    Sections
                    Images
                    Figures and cross-references
                    Verbatim input (for code)



Your turn
                    # Get the sample report
                    wget http://had.co.nz/stat405/resources/sample-report.zip
                    unzip sample-report.zip

                    cd sample-report
                    gedit template.tex &
                    pdflatex template.tex
                    evince template.pdf
                    # Experiment!


Your turn

                    If not on linux, follow the instructions on
                    the class website.
                    If you feel comfortable, start on
                    homework 2.




Homework


