03 Cleaning

  1. Data cleaning. Stat405, Hadley Wickham. Monday, 31 August 2009
  2. Outline:
       1. Intro to data cleaning
       2. Missing values
       3. Subsetting
       4. Modifying
       5. Short cuts
  3. Clean data is:
       - Columnar (rectangular, observations in rows, variables in columns)
       - Consistent
       - Concise
       - Complete
       - Correct
  4. Correct: we can’t restore correct values without the original data, but we can remove clearly incorrect values. Options:
       - Remove the entire row
       - Mark the incorrect value as missing
  5. What is a missing value? In R, it is written as NA and has special behaviour:
       NA + 3 = ?
       NA > 2 = ?
       mean(c(2, 7, 10, NA)) = ?
       NA == NA ?
     Use is.na() to see if a value is NA. Many functions have an na.rm argument.
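The questions on this slide can be checked directly at the R prompt. The following is a minimal sketch in base R; the vector `v` is a hypothetical example, not data from the course.

```r
# Arithmetic and comparisons involving NA give NA back.
NA + 3      # NA
NA > 2      # NA
NA == NA    # NA, so == cannot be used to test for missingness

v <- c(2, 7, 10, NA)        # hypothetical example vector
mean(v)                     # NA, because one element is missing
is.na(v)                    # FALSE FALSE FALSE TRUE
mean(v, na.rm = TRUE)       # mean of the three non-missing values
```

The design point is that NA means "unknown", so any result that depends on it is also unknown unless the function is told to drop missing values first.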
  6. Your turn: Look at histograms and scatterplots of x, y, z from the diamonds dataset. Which values are clearly incorrect? Which values might we be able to correct? (Remember that measurements are in millimetres; 1 inch ≈ 25 mm.)
  7. Plots
       qplot(x, data = diamonds, binwidth = 0.1)
       qplot(y, data = diamonds, binwidth = 0.1)
       qplot(z, data = diamonds, binwidth = 0.1)

       qplot(x, y, data = diamonds)
       qplot(x, z, data = diamonds)
       qplot(y, z, data = diamonds)
  8. Modifying data: To modify data, you must first know how to extract, or subset, it. Many different methods are available in R. We’ll start with the most explicit, then learn some shortcuts. Basic structure:
       df$varname
       df[row index, column index]
  9. $: Remember str(diamonds)? That hints at how to extract individual variables:
       diamonds$carat
       diamonds$price
  10. [ accepts five kinds of index:
        positive integers   select specified
        negative integers   omit specified
        characters          extract named items
        nothing             include everything
        logicals            select TRUE, omit FALSE
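The five index types are easiest to compare on a small object. This sketch uses a hypothetical named vector `v` (not part of the course data) so that each result can be read at a glance.

```r
# Hypothetical named vector, used only for illustration.
v <- c(a = 10, b = 20, c = 30, d = 40)

pos        <- v[c(1, 3)]                      # positive integers: elements 1 and 3
neg        <- v[-c(1, 3)]                     # negative integers: all except 1 and 3
chr        <- v[c("b", "d")]                  # characters: elements named "b" and "d"
everything <- v[]                             # nothing: every element
lgl        <- v[c(TRUE, FALSE, FALSE, TRUE)]  # logicals: keep the TRUE positions
```

The same rules apply to each of the two positions in `df[row index, column index]`.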
  11. Challenge: There is an equivalency between logical (boolean) and numerical (set) indexing. How do you change a logical index to a numeric index? And vice versa? What are the equivalents of the boolean operations for numerical indices?
  12. # Nothing
      str(diamonds[, ])

      # Positive integers & nothing
      diamonds[1:6, ] # same as head(diamonds)
      diamonds[, 1:4] # watch out!

      # Positive integers * 2
      diamonds[1:10, 1:4]
      diamonds$carat[1:100]

      # Negative integers
      diamonds[-(1:53900), -1]

      # Character vector
      diamonds[, c("depth", "table")]
      diamonds[1:100, "carat"]
  13. [ + logical vectors
      # The most complicated to understand, but
      # the most powerful. Lets you extract a
      # subset defined by some characteristic of
      # the data
      x_big <- diamonds$x > 10

      head(x_big)
      tail(x_big)
      sum(x_big)

      diamonds$x[x_big]
      diamonds[x_big, ]
  14. Useful functions for logical vectors (TRUE = 1; FALSE = 0):
        table(zeros)
        sum(zeros)
        mean(zeros)
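Because TRUE counts as 1 and FALSE as 0, these three functions summarise a logical vector in different ways. A minimal sketch, assuming a hypothetical measurement vector `x` in place of the diamonds columns:

```r
x <- c(0, 4.2, 0, 5.1, 3.9)   # hypothetical measurements
zeros <- x == 0               # logical vector: TRUE where x is zero

sum(zeros)     # number of TRUE values
mean(zeros)    # proportion of TRUE values
table(zeros)   # counts of FALSE and TRUE
```

Here `sum(zeros)` answers "how many?" and `mean(zeros)` answers "what fraction?".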
  15. x_big <- diamonds$x > 10
      diamonds[x_big, ]
      diamonds[x_big, "x"]
      diamonds[x_big, c("x", "y", "z")]

      small <- diamonds[diamonds$carat < 1, ]
      lowqual <- diamonds[diamonds$clarity %in% c("I1", "SI2", "SI1"), ]

      # Comparison functions:
      # < > <= >= != == %in%

      # Boolean operators
      small <- diamonds$carat < 1 & diamonds$price > 500
      # Note: the column in diamonds is spelled color
      lowqual <- diamonds$color == "D" | diamonds$cut == "Fair"
  16. Boolean operators:
        And   a & b
        Or    a | b
        Not   !b
        Xor   xor(a, b)
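The four operators work element by element on logical vectors. The sketch below runs them over every combination of TRUE and FALSE; the result names (`and_ab` and so on) are hypothetical.

```r
# All four combinations of TRUE and FALSE.
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

and_ab <- a & b       # TRUE only where both are TRUE
or_ab  <- a | b       # TRUE where at least one is TRUE
not_b  <- !b          # flips each element of b
xor_ab <- xor(a, b)   # TRUE where exactly one is TRUE
```

For subsetting, use the single-character forms `&` and `|`; the doubled forms `&&` and `||` are meant for single values, as in an `if` condition.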
  17. Saving results
      # Prints to screen
      diamonds[diamonds$x > 10, ]

      # Saves to new data frame
      big <- diamonds[diamonds$x > 10, ]

      # Overwrites existing data frame. Dangerous!
      diamonds <- diamonds[diamonds$x < 10, ]
  18. diamonds <- diamonds[1, 1]
      diamonds

      # Uh oh!

      # rm() deletes only your modified copy; the
      # original in the package is still there
      rm(diamonds)
      str(diamonds)

      # Phew!
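The same recovery trick works for any dataset that ships in an attached package. This sketch substitutes `mtcars`, from base R's datasets package, for `diamonds` so that it runs without ggplot2; the names `masked` and `restored` are hypothetical.

```r
# mtcars stands in for diamonds here (an assumption made so that the
# example needs only base R).
mtcars <- mtcars[1, 1]             # creates a copy in your workspace that hides the original
masked <- is.data.frame(mtcars)    # FALSE: mtcars is now a single number

rm(mtcars)                         # removes only the workspace copy
restored <- is.data.frame(mtcars)  # TRUE: the package version is visible again
```

This only rescues data that came from a package. A data frame you loaded yourself and then overwrote has to be reloaded from its source.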
  19. Your turn: Extract diamonds with equal x & y. Extract diamonds with incorrect/unusual x, y, or z values.
  20. equal <- diamonds[diamonds$x == diamonds$y, ]

      y_big <- diamonds$y > 10
      z_big <- diamonds$z > 6

      x_zero <- diamonds$x == 0
      y_zero <- diamonds$y == 0
      z_zero <- diamonds$z == 0
      zeros <- x_zero | y_zero | z_zero

      bad <- y_big | z_big | zeros
      dbad <- diamonds[bad, ]
  21. Aside: strategy. The biggest mistake I see new programmers make is trying to do too much at once. Break the problem into pieces and solve the smallest piece first. Then check each piece before moving on to the next.
  22. Making new variables
      diamonds$pricepc <- diamonds$price / diamonds$carat

      diamonds$volume <- diamonds$x * diamonds$y * diamonds$z

      qplot(pricepc, carat, data = diamonds)
      qplot(carat, volume, data = diamonds)
  23. Modifying values: a combination of subsetting and making new variables:
        diamonds$x[x_zero] <- NA
        diamonds$z[z_big] <- diamonds$z[z_big] / 10
      These modify the data in place. Be careful!
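The pattern is: build a logical index, then assign into only the selected elements. A minimal sketch on a hypothetical three-row data frame, with the thresholds copied from the slides:

```r
# Hypothetical stand-in for diamonds with one zero and one too-large value.
df <- data.frame(x = c(4.1, 0, 5.3), z = c(2.5, 31.8, 3.2))
original <- df                    # keep an untouched copy before modifying in place

x_zero <- df$x == 0               # compute the indices first...
z_big  <- df$z > 6

df$x[x_zero] <- NA                # ...then mark impossible zeros as missing
df$z[z_big] <- df$z[z_big] / 10   # and rescale the value that looks 10 times too big
```

Keeping `original` around is one way to act on the "Be careful!" warning: if a fix turns out to be wrong, the unmodified values are still available.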
  24. diamonds$volume <- diamonds$x * diamonds$y * diamonds$z
      qplot(carat, volume, data = diamonds)

      # Fix problems & replot
      diamonds$x[x_zero] <- NA
      diamonds$y[y_zero] <- NA
      diamonds$z[z_zero] <- NA
      diamonds$y[y_big] <- diamonds$y[y_big] / 10
      diamonds$z[z_big] <- diamonds$z[z_big] / 10

      diamonds$volume <- diamonds$x * diamonds$y * diamonds$z
      qplot(carat, volume, data = diamonds)
  25. Your turn: Fix the incorrect values and replot scatterplots of x, y, and z. Are all the unusual values gone? Correct any other strange values. Hint: If qplot(a, b) is a straight line through the origin, qplot(a, a / b) will be a flat line. Makes selecting strange values much easier!
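The hint can be tried on made-up numbers. In this sketch `a` and `b` are hypothetical: `b` is proportional to `a` except for one planted error, and the 0.1 cutoff is an arbitrary choice for illustration.

```r
a <- 1:10
b <- 2 * a          # b is proportional to a, so a / b is constant
b[7] <- 25          # plant one hypothetical strange value

ratio <- a / b      # flat at 0.5 except at position 7
strange <- abs(ratio - median(ratio)) > 0.1   # arbitrary cutoff around the typical ratio
which(strange)      # position of the value that breaks the pattern
```

On the ratio scale a single horizontal band separates good from bad values, which is why the next slide selects on volume / carat instead of on volume directly.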
  26. qplot(carat, volume, data = diamonds)
      qplot(carat, volume / carat, data = diamonds)

      weird_density <-
        (diamonds$volume / diamonds$carat) < 140 |
        (diamonds$volume / diamonds$carat) > 180
      weird_density <- weird_density & !is.na(weird_density)

      diamonds[weird_density, c("x", "y", "z", "volume")] <- NA
  27. Short cuts: You’ve been typing diamonds many, many times. There are three shortcuts: with, subset and transform. These save typing, but may be a little harder to understand, and will not work in some situations. Useful tools, but don’t forget the basics.
  28. weird_density <- (diamonds$volume / diamonds$carat) < 140 |
        (diamonds$volume / diamonds$carat) > 180
      weird_density <- with(diamonds, (volume / carat) < 140 |
        (volume / carat) > 180)

      diamonds[diamonds$carat < 1, ]
      subset(diamonds, carat < 1)

      equal <- diamonds[diamonds$x == diamonds$y, ]
      equal <- subset(diamonds, x == y)
  29. diamonds$volume <- diamonds$x * diamonds$y * diamonds$z
      diamonds$pricepc <- diamonds$price / diamonds$carat

      diamonds <- transform(diamonds,
        volume = x * y * z,
        pricepc = price / carat)
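The three shortcuts can be compared with their long forms on a small data frame. This is a sketch under stated assumptions: `df` is hypothetical, and one `carat` value is deliberately missing to show a case where the long and short forms are not interchangeable.

```r
# Hypothetical stand-in for diamonds; the third carat is missing on purpose.
df <- data.frame(carat = c(0.5, 1.2, NA), price = c(1000, 6000, 3000),
                 x = c(5, 7, 6), y = c(5, 7.1, 6), z = c(3, 4, 3.5))

# with(): same logical vector, less typing
long_idx  <- df$price / df$carat > 3000
short_idx <- with(df, price / carat > 3000)

# subset(): drops rows where the condition is NA, unlike [
small_bracket <- df[df$carat < 1, ]      # 2 rows: the match plus an all-NA row
small_subset  <- subset(df, carat < 1)   # 1 row: only the match

# transform(): adds several variables in one call
df2 <- transform(df, volume = x * y * z, pricepc = price / carat)
```

The difference between `small_bracket` and `small_subset` is one reason to keep the explicit forms in mind: with missing values, `[` and `subset` do not return the same rows.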
  30. Your turn: Try to convert your previous statements to use with, subset and transform. Which ones convert easily? Which are hard? When is the shortcut actually a longcut?
  31. Next time: Learning how to use LaTeX, a scientific publishing program. If you’re using a laptop, please install LaTeX from the links on the course webpage.
  32. Equivalents between logical and numerical (set) indexing:
        Logical      Numerical
        a & b        intersect(c, d)
        a | b        union(c, d)
        !b           setdiff(U, d)
        xor(a, b)    union(setdiff(c, d), setdiff(d, c))

        U = seq_along(a)
        c = which(a)     a = U %in% c
        d = which(b)     b = U %in% d
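These equivalences can be checked on small vectors. In this sketch the logical vectors are hypothetical, and the slide's `c` and `d` are renamed `c_idx` and `d_idx` to avoid reusing the name of the `c()` function. The set functions do not promise sorted output, so results are sorted before comparing.

```r
a <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

U     <- seq_along(a)   # every position
c_idx <- which(a)       # logical -> numeric
d_idx <- which(b)

and_ok <- identical(which(a & b), sort(intersect(c_idx, d_idx)))
or_ok  <- identical(which(a | b), sort(union(c_idx, d_idx)))
not_ok <- identical(which(!b), sort(setdiff(U, d_idx)))
xor_ok <- identical(which(xor(a, b)),
                    sort(union(setdiff(c_idx, d_idx), setdiff(d_idx, c_idx))))
back_ok <- identical(U %in% c_idx, a)   # numeric -> logical
```

All five checks come out TRUE for these vectors, which is the answer to the challenge on slide 11: `which()` converts one way and `%in%` converts back.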
