  1. 1. Garrett Grolemund, PhD Student, Rice University, Department of Statistics. Data cleaning
  2. 2. 1. Intro to data cleaning 2. What you can’t fix 3. What you can fix 4. Intro to reshape
  3. 3. Your turn Do you think men or women leave a larger tip when dining out? What data would you collect to test this belief? What would prompt you to change your belief?
  4. 4. Data Analysis Data Residuals Model Compare Visualize Transform
  10. 10. 10 - 20% of an analysis
  11. 11. Data Cleaning Data Residuals Model Compare Visualize Transform
  12. 12. Data cleaning
  13. 13. “Happy families are all alike; every unhappy family is unhappy in its own way.” —Leo Tolstoy
  14. 14. “Clean datasets are all alike; every messy dataset is messy in its own way.” —Hadley Wickham
  15. 15. Clean data is: Complete Correct (factual and internally consistent) Concise Compatible (required variables: observations in rows, one column per variable)
  16. 16. What you can’t fix:
  17. 17. Complete Correct
  18. 18. Correct: You can’t restore incorrect values without the original data, but you can remove clearly incorrect values. Options: (1) remove the entire row; (2) mark the incorrect value as missing (NA).
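The two options on slide 18 can be sketched in base R. This is a minimal illustration on made-up data, not the tipping data set: the `bills` data frame, its values, and the rule "a negative tip is clearly incorrect" are all assumptions chosen for the example.

```r
# Hypothetical toy data (not from tipping.csv): one tip is negative,
# which we treat as clearly incorrect for this illustration.
bills <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68),
  tip        = c(1.01, 1.66, -3.50, 3.31)
)

# Flag the clearly incorrect values.
bad <- bills$tip < 0

# Option 1: remove the entire row.
dropped <- bills[!bad, ]

# Option 2: keep the row but mark the incorrect value as missing (NA).
marked <- bills
marked$tip[bad] <- NA

# Later summaries then need to skip the NA explicitly.
mean_tip <- mean(marked$tip, na.rm = TRUE)
```

Option 1 loses the valid `total_bill` in that row; option 2 keeps it, at the cost of handling `NA` in later steps.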
  19. 19. When two rows present the same information with different values, at least one row is wrong. Whenever there is inconsistency, you are going to have to make some tradeoff to ensure concision. Detecting inconsistency is not always easy. Inconsistency = incorrect
  20. 20. General strategy To find incorrect values you need to be creative, combining graphics and data processing.
  21. 21. Tipping data: One waiter recorded information about each tip he received over a period of a few months (244 records). Do men or women tip more?
  22. 22. Your turn Subset the tipping data to include only rows without NA’s. Judge whether you think all of the data points are correct. How will you make your decision?
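  23. 23. library(ggplot2)  # provides qplot()
      tips <- read.csv("tipping.csv", stringsAsFactors = FALSE)
      summary(tips)
      # keep only rows where both smoking columns are recorded
      tips <- subset(tips, !is.na(smoker) & !is.na(non_smoker))
      # look for implausible values in each variable, then jointly
      qplot(tip, data = tips, binwidth = .5)
      qplot(total_bill, data = tips, binwidth = 2)
      qplot(total_bill, tip, data = tips)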
  24. 24. nrow(tips)
      sum(tips$male)
      sum(tips$female)
      # rows where the male and female columns differ
      subset(tips, male != female)
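The checks on slide 24 can be tried on a small example. The sketch below assumes that `male` and `female` are 0/1 indicator columns (slide 42 filters on `value == 1`, which suggests this, but the file itself is not shown). The `tips_toy` data frame is hypothetical.

```r
# Hypothetical toy data: male and female are assumed to be 0/1 indicators.
# Customer 3 is flagged as both, customer 4 as neither.
tips_toy <- data.frame(
  customer_ID = 1:5,
  male        = c(1, 0, 1, 0, 1),
  female      = c(0, 1, 1, 0, 0)
)

# Totals check: here the two sums add up to the number of rows,
# even though two rows are inconsistent.
totals_match <- sum(tips_toy$male) + sum(tips_toy$female) == nrow(tips_toy)

# Row-level check: a consistent row has exactly one indicator set.
consistent   <- subset(tips_toy, male != female)
inconsistent <- subset(tips_toy, male + female != 1)
```

In this toy case the totals agree by accident, so only the row-level comparison reveals the two inconsistent records.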
  25. 25. What you can fix:
  26. 26. Concise (each fact represented once) Repeating facts: 1. wastes memory 2. creates opportunities for inconsistency
  27. 27. Compatible (Data is compatible with your analysis in both form and fact) 1. Do you have the relevant variables for your analysis?
  28. 28. This often requires some type of calculation. For example: proportion = successes / attempts; avg score per game per team = ? join(), transform(), summarise(), ddply(), and plyr in general address this need.
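A minimal sketch of the two calculations on slide 28, using made-up data. The `games` data frame and its column names are assumptions, and base R's `aggregate()` stands in here for plyr's `ddply()` so that the example runs without extra packages. "Average score per game per team" is read as the mean score over each team's games, which is only one possible interpretation.

```r
# Hypothetical toy data: one row per game played by a team.
games <- data.frame(
  team      = c("A", "A", "B", "B"),
  successes = c(3, 5, 2, 6),
  attempts  = c(10, 10, 8, 12),
  score     = c(70, 90, 60, 80)
)

# Add a derived variable to every row: proportion = successes / attempts.
games <- transform(games, proportion = successes / attempts)

# Summarise by group: mean score over each team's games.
avg_score <- aggregate(score ~ team, data = games, FUN = mean)
```

`transform()` adds a column without changing the number of rows, while `aggregate()` returns one row per team.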
  29. 29. Compatible (Data is compatible with your analysis in both form and fact) 2. Is the data in the right form for your analysis and visualization tools? (reshape)
  30. 30. Rectangular
  31. 31. Observations in rows
  32. 32. Variables in columns (1 column per variable)
  33. 33. Your turn What are the variables in tipping.csv? How are they arranged in rows and columns? Can you form the variables into two groups?
  34. 34. Reshape
  35. 35. install.packages("reshape")
      library(reshape)
      library(stringr)
      head(tips)
  36. 36. Molten data We can use melt to put each variable into its own column. “Protect” the good columns. “Melt” the offending columns. Then subset.
  37. 37. Two types of variables: 1. ID variables, which identify the object that measurements will take place on (we know these before the experiment). 2. Measured variables, the features of the object that will be measured (we have to do an experiment to observe these).
  38. 38. object ID Variables Bruce Wayne Batman SSN: 555-89-3000 Measured Var. Height (6’1’’) IQ (180) Age (71)
  39. 39. ID Variables Gotham City + male + Top 1% tax bracket
  40. 40. Identifier variable      | Measured variable
      Index of random variable | Random variable
      Dimension                | Measure
      Experimental design      | Measurement
      Predictors (Xi)          | Response (Y)
  41. 41. Molten data: Molten data collapses all the measured variables into two columns: 1) the variable being measured and 2) the value. This is sometimes called “long” form. To protect a column from being melted, label it as an id variable: reshape::melt(data, id)
  42. 42. tips1 <- melt(tips, id = c("customer_ID", "total_bill", "tip", "smoker", "non_smoker"))
      # assign an appropriate variable name
      names(tips1)[6] <- "sex"
      # subset out unwanted rows
      tips1 <- subset(tips1, value == 1)
      # reorder the columns and drop value
      tips1 <- tips1[ , c(1,2,6,4,5,3)]
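To see what the steps on slide 42 do without tipping.csv or the reshape package, the sketch below performs the same melt by hand in base R on a toy data frame. This is a hand-written stand-in for `melt()`, not its implementation. The `wide` data frame, its values, and the assumption that `male` and `female` are 0/1 indicators are all hypothetical.

```r
# Hypothetical toy data in the "wide" layout: sex is spread over two
# 0/1 indicator columns (an assumption about how tipping.csv is coded).
wide <- data.frame(
  customer_ID = 1:4,
  total_bill  = c(16.99, 10.34, 21.01, 23.68),
  tip         = c(1.01, 1.66, 3.50, 3.31),
  male        = c(1, 0, 1, 0),
  female      = c(0, 1, 0, 1)
)

# "Protect" the id columns; everything else gets melted.
id_cols  <- c("customer_ID", "total_bill", "tip")
measured <- setdiff(names(wide), id_cols)

# Hand-rolled melt: one block of rows per measured column, stacked,
# with the column name in `variable` and its contents in `value`.
molten <- do.call(rbind, lapply(measured, function(v) {
  data.frame(wide[id_cols], variable = v, value = wide[[v]],
             stringsAsFactors = FALSE)
}))

# Same follow-up steps as on the slide: rename, subset, drop `value`.
names(molten)[names(molten) == "variable"] <- "sex"
tidy <- subset(molten, value == 1)
tidy <- tidy[order(tidy$customer_ID),
             c("customer_ID", "total_bill", "tip", "sex")]
```

The molten table has one row per customer per melted column (8 rows here); keeping only `value == 1` brings it back to one row per customer, with sex recorded in a single column.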
  43. 43. Use melt to fix the smoking variable. One column should be enough to record whether a person smokes or not. Your turn
  44. 44. Rectangular data are much easier to work with!
      qplot(total_bill, tip, data = tips1, color = sex)
      # vs.
      qplot(total_bill, tip, data = tips, colour = ?)
  45. 45. qplot(total_bill, tip, data = tips1, color = sex) + geom_smooth(method = lm)
  46. 46. Clean data is: Complete Correct (factual and internally consistent) Concise Compatible (required variables: observations in rows, one column per variable)
  47. 47. Resource: Wickham, H. (2007). Reshaping data with the reshape package. Journal of Statistical Software, 21(12). http://www.jstatsoft.org/v21/i12
  48. 48. Summary
  49. 49. Clean data is: Rectangular (observations in rows, one column per variable) Consistent Concise Complete Correct
  50. 50. Data Residuals Model Compare Visualize Transform
  51. 51. Data Residuals Model Compare Visualize Transform ggplot2
  52. 52. Data Residuals Model Compare Visualize Transform ggplot2 plyr
  53. 53. Data Residuals Model Compare Visualize Transform ggplot2 plyr reshape
  54. 54. Data Residuals Model Compare Visualize Transform most statistics classes
  55. 55. This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
