Can’t restore incorrect values without
original data, but can remove clearly
incorrect values:
Remove entire row
Mark incorrect value as missing (NA)
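The two options above can be sketched in base R. The data frame `bills` and the rule that a negative tip is clearly incorrect are assumptions made up for illustration, not part of the tipping data.

```r
# Hypothetical data: a negative tip is treated as clearly incorrect
bills <- data.frame(
  customer_ID = 1:4,
  total_bill  = c(16.99, 10.34, 21.01, 23.68),
  tip         = c(1.01, -1.66, 3.50, 3.31)
)

# Flag the rows that hold a clearly incorrect value
bad <- bills$tip < 0

# Option 1: remove the entire row
bills_removed <- bills[!bad, ]

# Option 2: keep the row, but mark the incorrect value as missing
bills_marked <- bills
bills_marked$tip[bad] <- NA
```

Option 2 keeps the other values in the row available for analysis, at the cost of having to handle NA later.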
When two rows present the same
information with different values, at least
one row is wrong.
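One way to surface such conflicts, sketched in base R: count the distinct values recorded for each identifier. The data frame and column names here are hypothetical.

```r
# Hypothetical data: customer 2 appears twice with different tips
bills <- data.frame(
  customer_ID = c(1, 2, 2, 3),
  tip         = c(1.01, 1.66, 2.66, 3.50)
)

# Count the distinct tip values recorded for each customer
n_values <- tapply(bills$tip, bills$customer_ID,
                   function(x) length(unique(x)))

# Customers with more than one distinct value are inconsistent
inconsistent_ids <- names(n_values)[n_values > 1]
bills[bills$customer_ID %in% inconsistent_ids, ]
```

This only says that at least one of the flagged rows is wrong. It does not say which one.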
Whenever there is inconsistency, you are
going to have to make some tradeoff to
resolve it.
Detecting inconsistency is not always easy.
Inconsistency = incorrect
To find incorrect values you need to be
creative, combining graphics and data
manipulation.
One waiter recorded information
about each tip he received over a
period of a few months
Do men or women tip more?
Subset the tipping data to include only
rows without NAs. Judge whether you
think all of the data points are correct.
How will you make your decision?
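A minimal sketch of the subsetting step, using a small hypothetical stand-in (`tips_small`) because the real tipping data is not shown here:

```r
# Hypothetical stand-in for the tipping data
tips_small <- data.frame(
  total_bill = c(16.99, NA, 21.01, 23.68),
  tip        = c(1.01, 1.66, NA, 3.31)
)

# Keep only the rows that have no NA in any column
tips_complete <- tips_small[complete.cases(tips_small), ]

# na.omit() keeps the same rows
same_rows <- nrow(na.omit(tips_small)) == nrow(tips_complete)
```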
We can use melt to put each
variable into its own column.
“Protect” the good columns.
“Melt” the offending columns.
Two types of variables
1. ID variables - identify the object that
measurements will take place on (we
know these before the experiment)
2. Measured variables - the features of
the object that will be measured (we have
to do an experiment to observe these)
Gotham City +
Top 1% tax
Identifier variable      Measured variable
Index of random
Experimental design      Measurement
predictors (Xi)          response (Y)
Molten data collapses all the
measured variables into two
columns: 1) the variable being
measured and 2) the value.
Sometimes called “long” form.
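To make the long form concrete, here is a small sketch. It assumes the reshape2 package is installed (the slides use `melt` from the reshape family), and the data frame `wide` is hypothetical.

```r
library(reshape2)  # assumption: reshape2 is installed

# Hypothetical wide data: one id column, two measured columns
wide <- data.frame(
  customer_ID = 1:2,
  total_bill  = c(16.99, 10.34),
  tip         = c(1.01, 1.66)
)

# Every column not named in id.vars is melted into variable/value pairs
long <- melt(wide, id.vars = "customer_ID")
long
```

Each customer now occupies one row per measured variable, with the variable's name in `variable` and its number in `value`.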
To protect a column from being
melted, label it as an id variable.
tips1 <- melt(tips, id =
c("customer_ID", "total_bill", "tip",
# assign an appropriate variable name
names(tips1)[names(tips1) == "variable"] <- "sex"
# subset out unwanted rows
tips1 <- subset(tips1, value == 1)
tips1 <- tips1[ , c(1,2,6,4,5,3)]
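The same steps can be run end to end on a miniature example. This is a sketch under assumptions: the data frame `tips_demo` and its `male` and `female` indicator columns are hypothetical stand-ins for the offending columns, and reshape2 is assumed to be installed.

```r
library(reshape2)  # assumption: reshape2 is installed

# Hypothetical miniature of the tipping data: sex is spread over two columns
tips_demo <- data.frame(
  customer_ID = 1:3,
  total_bill  = c(16.99, 10.34, 21.01),
  tip         = c(1.01, 1.66, 3.50),
  male        = c(0, 1, 1),
  female      = c(1, 0, 0)
)

# Protect the good columns as id variables; male and female get melted
tips_long <- melt(tips_demo,
                  id.vars = c("customer_ID", "total_bill", "tip"))

# The melted column names become the values of a single "sex" variable
names(tips_long)[names(tips_long) == "variable"] <- "sex"

# Keep only the rows where the indicator was 1, then drop the indicator
tips_long <- subset(tips_long, value == 1, select = -value)
tips_long
```

Renaming by matching on "variable" rather than by position keeps the step correct even if the number of id columns changes.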
Use melt to fix the smoking variable. One
column should be enough to record
whether a person smokes or not.
Rectangular data are
much easier to work with!
qplot(total_bill, tip, data = tips1,
color = sex)
qplot(total_bill, tip, data = tips,
colour = ?)
qplot(total_bill, tip, data = tips1, color = sex) +
geom_smooth(method = lm)
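Recent ggplot2 releases mark `qplot()` as deprecated, so the same plot can also be written with `ggplot()`. This is a sketch, assuming ggplot2 is installed; `tips_demo` is a hypothetical stand-in for the cleaned `tips1` data.

```r
library(ggplot2)  # assumption: ggplot2 is installed

# Hypothetical stand-in for the cleaned tips1 data
tips_demo <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex        = c("female", "male", "male", "female", "female", "male")
)

# Points coloured by sex, with one linear fit per group
p <- ggplot(tips_demo, aes(total_bill, tip, colour = sex)) +
  geom_point() +
  geom_smooth(method = "lm")
p
```

Because `sex` is a single column, one aesthetic mapping is enough to split both the points and the fitted lines by group.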
Clean data is:
(factual and internally consistent)
(required variables: observations in rows, one column per variable)
Wickham, H. (2007). Reshaping data with
the reshape package. Journal of
Statistical Software, 21(12).
Clean data is:
(observations in rows, one column per variable)
This work is licensed under the Creative
Commons Attribution-Noncommercial 3.0 United
States License. To view a copy of this license,
3.0/us/ or send a letter to Creative Commons,
171 Second Street, Suite 300, San Francisco,
California, 94105, USA.