20150814 Wrangling Data From Raw to Tidy vs

Wrangling Data From Raw to Tidy:
Preparation, Organization, Quality Control, and Communication
August 14, 2015
Professional Education and Knowledge Seminar

Purpose
• The purpose of this presentation is to identify and demonstrate
best practices for processing raw data into tidy data sets.
• This presentation will demonstrate these practices using Stata and
R.
• The examples are taken from a Department of Agriculture project
and an assignment from Johns Hopkins University’s Getting and
Cleaning Data course on Coursera [1].
[1] https://www.coursera.org/specialization/jhudatascience/1/certificate

Agenda
I. Preparation
II. Organization
III. Quality Control
IV. Communication

Cheat Sheet
• Preparation
– Do you have a codebook?
– Have you reviewed and validated the variables?
• Organization
– Plan out the final product and the steps necessary to create that output.
– Label variables and values.
• Quality Control
– Create reproducible cleaning code.
– Rerun code from start to finish in a clean directory.
• Communication
– Comment, Comment, Comment.
– Create a codebook and provide raw and tidy datasets.

Preparation
• Do you have a codebook for the raw data?
• Do you have the ability to read through and validate (review,
summarize, graph, etc.) every variable?
• If the answer is no, return to your client (or data provider) and ask!
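One way to review every variable is sketched below in base R. The data frame raw and its columns are hypothetical stand-ins for a real raw file, and the checks shown (type, missing count, distinct values, summaries, frequency tables) are only a starting point, not a complete validation.

```r
# Hypothetical raw data; in practice this would come from read.csv() or similar.
raw <- data.frame(
  loan_id = c(101, 102, 103, 104),
  amount  = c(2500, NA, 4000, -50),
  status  = c("open", "closed", "open", "OPEN"),
  stringsAsFactors = FALSE
)

# One row per variable: type, missing count, and number of distinct values.
variable_review <- data.frame(
  variable   = names(raw),
  class      = vapply(raw, function(x) class(x)[1], character(1)),
  n_missing  = vapply(raw, function(x) sum(is.na(x)), integer(1)),
  n_distinct = vapply(raw, function(x) length(unique(x)), integer(1)),
  row.names  = NULL
)
print(variable_review)

# Numeric summaries can reveal impossible values (here, a negative amount).
print(summary(raw$amount))

# Frequency tables can reveal inconsistent codes (here, "open" vs. "OPEN").
print(table(raw$status, useNA = "ifany"))
```

Anything the review cannot explain (a negative amount, two spellings of one status) is a question for the codebook or the data provider.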

Organization
• Plan out steps for investigating the data.
• Create a code shell to include basic commands necessary for
organized and logical coding.
• Add variable and value labels to make data understandable.

Organizing Your Thoughts
• What form do you want your data set to be in?
• Should the data be long or wide?
• Who will be using this data and how will it be used?
• Will the output require a particular format employed for specific
projects or tasks?
[Figure: raw data alongside the desired format]

Set Up a “Shell” Document with a Header
• Outline the major functions performed in the code in the header.
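A shell for an R script might look like the sketch below. The file name, directory layout, and step list are placeholders chosen for illustration; adapt them to the project's own conventions.

```r
################################################################################
# Project:  <project name>
# Script:   01_clean_raw_data.R   (hypothetical file name)
# Author:   <name>
# Date:     <date>
# Purpose:  Read raw data, label variables, reshape, and save a tidy data set.
# Inputs:   data/raw/<raw file>
# Outputs:  data/tidy/<tidy file>, codebook
# Steps:
#   1. Setup (paths, packages)
#   2. Import raw data
#   3. Review and validate variables
#   4. Clean and reshape
#   5. Export tidy data and codebook
################################################################################

# 1. Setup ---------------------------------------------------------------------
# Relative paths (assumed layout) keep the script portable across machines.
raw_dir  <- file.path("data", "raw")
tidy_dir <- file.path("data", "tidy")

# 2. Import raw data -----------------------------------------------------------

# 3. Review and validate variables ---------------------------------------------

# 4. Clean and reshape ---------------------------------------------------------

# 5. Export tidy data and codebook ---------------------------------------------
```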

Plan Out How to Move from One Step to Another
• You may not know exactly how to get from one step to another.
• Take a step back and think about where you are going.
• Write out small sections of your code in the text editor the way you
think it would produce your expected results.
• Run the code and test that it produces those outputs.
• Think critically about your inputs and the desired form you want
your data to be in.

Data Cleaning Example: Installment Data
• Challenge: Generate a wide-format loan payment schedule given
the loan tenor, installment amounts, and equal payment indicator.
• Rushed Solution: Use complex code to extend the wide-format
schedule by replacing missing installment values in place.
• Planned Solution: Reshape the data into long format, replace the
values, and reshape back to wide.
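The planned approach can be sketched in base R as follows. This is not the presentation's Stata code. The data layout is assumed: one row per loan, columns inst1 to inst4 for known installments, and an equal_pay flag read as "every installment equals the first one". All column names are hypothetical.

```r
# Hypothetical wide data: only some installments are recorded.
loans <- data.frame(
  loan_id   = c(1, 2),
  tenor     = c(4, 3),      # number of years in the schedule
  equal_pay = c(1, 0),      # 1 = all payments equal the first installment
  inst1     = c(100, 50),
  inst2     = c(NA, 75),
  inst3     = c(NA, 80),
  inst4     = c(NA, NA)
)

# Wide -> long: one row per loan and year.
long <- reshape(
  loans,
  direction = "long",
  varying   = paste0("inst", 1:4),
  v.names   = "installment",
  timevar   = "year",
  times     = 1:4,
  idvar     = "loan_id"
)

# Replace missing installments for equal-payment loans, within the tenor only.
first_inst <- loans$inst1[match(long$loan_id, loans$loan_id)]
fill <- long$equal_pay == 1 & long$year <= long$tenor & is.na(long$installment)
long$installment[fill] <- first_inst[fill]

# Long -> wide: the extended payment schedule, one row per loan.
schedule <- reshape(
  long,
  direction = "wide",
  idvar     = "loan_id",
  timevar   = "year",
  v.names   = "installment",
  sep       = ""
)
print(schedule)
```

In long format the replacement is a single vectorized assignment, rather than one rule per installment column.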

• Stata code to reshape data and extend loan schedule:

• Sample of observations after reshaping data to long format:
[Table columns: Tenor, Installments, “Year”]

• Sample of final dataset after reshaping loan schedule into wide
format (i.e., the desired loan schedule):

Label Variable Names and Values
• Why?
• It helps you!
– You can find variables quickly and easily without having to memorize
their descriptions.
• It helps your peers!
– If you spend a good amount of time with the data, variables become second
nature, but they will not necessarily be obvious to your colleagues.
• Try to interpret these variables: flp_asst_type_cd,
dir_loan_pgm_cd, tBodyGyro-arCoeff, tBodyAccJerkMag-mean.

Label Variable Names and Values in Stata
• label variable varname "variable label"
• label define valuename number "value label"
• label values varname valuename

Label Variable Names in R
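• Functions - names() and sub():
• names(dataset) <- sub("find", "replace", names(dataset))
• Optional sub() arguments:
– perl = TRUE (use Perl-compatible regular expressions)
– ignore.case = TRUE (match regardless of letter case)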
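A minimal sketch of renaming with names() and sub(), using two variable names mentioned earlier in this presentation. The expansions (for example, "Acc" to "Accelerometer") are assumed readings of the abbreviations, not taken from a codebook.

```r
# check.names = FALSE keeps the hyphens in the original column names.
dataset <- data.frame(
  "tBodyGyro-arCoeff"    = c(0.10, 0.12),
  "tBodyAccJerkMag-mean" = c(-0.95, -0.97),
  check.names = FALSE
)

# Each sub() call replaces the first match of the pattern in every name.
names(dataset) <- sub("^t", "Time", names(dataset))
names(dataset) <- sub("acc", "Accelerometer", names(dataset), ignore.case = TRUE)
names(dataset) <- sub("gyro", "Gyroscope", names(dataset), ignore.case = TRUE)

# perl = TRUE enables lookahead: match "Mag" only when a hyphen follows it.
names(dataset) <- sub("Mag(?=-)", "Magnitude", names(dataset), perl = TRUE)

print(names(dataset))
```

Because sub() replaces only the first match, gsub() is the better choice when a pattern can occur more than once in a name.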

Label Values in R
• Function - mutate() from the dplyr package:
• mutate(dataset, varname = factor(varname, levels =
c(levels), labels = c(labels)))
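As an illustration, the sketch below labels an integer-coded variable. The data frame, the codes, and the label text are hypothetical. It assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical data: activity stored as an integer code.
dataset <- data.frame(subject = c(1, 1, 2), activity = c(1, 3, 2))

# levels lists the codes found in the data; labels gives the text for each
# code, in the same order.
dataset <- mutate(
  dataset,
  activity = factor(activity,
                    levels = c(1, 2, 3),
                    labels = c("Walking", "Sitting", "Standing"))
)

print(levels(dataset$activity))
print(table(dataset$activity))
```

Any code that does not appear in levels becomes NA, so a quick table() after labeling is a useful check.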

Limit Each Column to One Type of Data
• Example of data with multiple variables in each column:

Data Cleaning Example: Phone Health Data
• R code to reshape phone health data and generate variables to
separate out each observation’s characteristics:
# Reshape Data Long and Group Related Variables #
reshape_tidy_data <- tidy_data %>%
  gather(Variable, Average_Value, 3:ncol(tidy_data))
Note: This code uses the tidyr and dplyr packages.

Data Cleaning Example: Phone Health Data
# Reshape Data Long and Group Related Variables #
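# Code the axis from the last character of the name: 1 = X, 2 = Y, 3 = otherwise #
reshape_tidy_data <- reshape_tidy_data %>%
  mutate(Axis = ifelse(grepl("X$", Variable, perl = TRUE), 1,
                ifelse(grepl("Y$", Variable, perl = TRUE), 2, 3)) ... )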
Note: This code uses the tidyr and dplyr packages.
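The two steps can be run end to end on a small made-up data set, as sketched here. The column names and values are hypothetical, and only the Axis variable is generated. It assumes the tidyr and dplyr packages are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical averages: two identifier columns, then one column per measure.
tidy_data <- data.frame(
  Subject  = c(1, 2),
  Activity = c("Walking", "Sitting"),
  "tBodyAcc-mean-X" = c(0.28, 0.27),
  "tBodyAcc-mean-Y" = c(-0.02, -0.01),
  "tBodyAcc-mean-Z" = c(-0.11, -0.10),
  check.names = FALSE
)

reshape_tidy_data <- tidy_data %>%
  # Stack columns 3 onward into name/value pairs: one row per measurement.
  gather(Variable, Average_Value, 3:ncol(tidy_data)) %>%
  # Pull the axis out of the variable name: 1 = X, 2 = Y, 3 = otherwise.
  mutate(Axis = ifelse(grepl("X$", Variable), 1,
                ifelse(grepl("Y$", Variable), 2, 3)))

print(reshape_tidy_data)
```

After this step the Axis column holds one type of information, instead of it being buried in the variable name.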

Quality Control
• Create reproducible cleaning code.
• Rerun code from start to finish in a clean directory.
– Running code line by line can result in accidental manual errors, or
unexpected results.
– Also start in a completely blank directory to ensure the code produces the
results without referencing files you previously generated.
• Provide code, documentation, and expected output to a colleague
for feedback.
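A clean rerun can be approximated from within R, as in the sketch below. The three-line cleaning script is a hypothetical stand-in written to a temporary file so that the example is self-contained. It assumes Rscript is available in the running R installation.

```r
# Hypothetical cleaning script, saved to a temporary file.
script <- tempfile(fileext = ".R")
writeLines(c(
  "raw  <- data.frame(id = 1:3, value = c(10, NA, 30))",
  "tidy <- raw[!is.na(raw$value), ]",
  "write.csv(tidy, 'tidy.csv', row.names = FALSE)"
), script)

# Create a brand-new, empty working directory.
clean_dir <- tempfile("clean_run_")
dir.create(clean_dir)
stopifnot(length(list.files(clean_dir)) == 0)

# Run the whole script in a fresh R process started from the empty directory.
# --vanilla skips saved workspaces and user profiles.
old_wd <- setwd(clean_dir)
status <- system2(file.path(R.home("bin"), "Rscript"),
                  c("--vanilla", shQuote(script)))
setwd(old_wd)

# A non-zero status means the script failed somewhere along the way.
stopifnot(status == 0)
print(list.files(clean_dir))
```

If the script depends on a file or object left over from an earlier session, this run fails instead of silently reusing it.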

Communication
In almost every circumstance, commenting is superior to not
commenting.
• If your code will be reviewed by a colleague (and I hope it will),
the time you spend up front will save them time during review.
• If anyone, including yourself, will ever use this code again, it allows
you and them to understand it more easily.
• If you require multiple validations or cleanings of similar data sets,
commenting your code can indicate what code to reuse.

Comment, Comment, Comment…
• Heavily commented R code:
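The slide's own code is not reproduced in this text. As a hypothetical illustration of the style, the base R sketch below labels each step and explains why it is done, not just what it does. The data and variable names are made up.

```r
# ---- Step 1: Build example data ----------------------------------------------
# Hypothetical sensor readings; one row per measurement.
readings <- data.frame(
  subject  = c(1, 1, 2, 2),
  activity = c("Walking", "Walking", "Sitting", "Sitting"),
  value    = c(0.28, 0.30, 0.10, NA)
)

# ---- Step 2: Remove incomplete records ---------------------------------------
# A missing value would turn its group mean into NA, so drop those rows first.
# Keep the count so the loss of records is documented.
n_dropped <- sum(is.na(readings$value))
complete  <- readings[!is.na(readings$value), ]

# ---- Step 3: Average by subject and activity ---------------------------------
# aggregate() returns one row per subject-activity pair.
averages <- aggregate(value ~ subject + activity, data = complete, FUN = mean)

print(n_dropped)
print(averages)
```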

Provide a Processed Dataset
• Create a data dictionary containing the variable names, values,
labels, and units.
• Components of a Processed Dataset [1]:
– The raw data set.
– A tidy data set.
– A codebook describing each variable and its values in the tidy data.
– An explicit and exact recipe used to get from the raw to tidy data.
• If there are steps that cannot be coded, they need to be explicitly
described.
[1] Taken from the Getting and Cleaning Data course from Johns Hopkins University on Coursera.
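A starting data dictionary can be generated from the tidy data itself, as in this base R sketch. The data set, labels, and units are hypothetical. Labels and units have to be written by the analyst, since they cannot be derived from the data.

```r
# Hypothetical tidy data set.
tidy <- data.frame(
  subject   = c(1, 2, 3),
  activity  = factor(c("Walking", "Sitting", "Walking")),
  avg_accel = c(0.28, 0.10, 0.31)
)

# Analyst-supplied descriptions (assumed text, keyed by variable name).
var_labels <- c(subject   = "Subject identifier",
                activity  = "Activity performed",
                avg_accel = "Average body acceleration")
var_units  <- c(subject = "none", activity = "none", avg_accel = "g")

# One row per variable: name, label, type, observed values, and units.
codebook <- data.frame(
  variable = names(tidy),
  label    = unname(var_labels[names(tidy)]),
  class    = vapply(tidy, function(x) class(x)[1], character(1)),
  values   = vapply(tidy, function(x) {
    if (is.factor(x)) paste(levels(x), collapse = ", ")
    else paste(range(x, na.rm = TRUE), collapse = " to ")
  }, character(1)),
  units    = unname(var_units[names(tidy)]),
  row.names = NULL
)

# Save the codebook next to the tidy data (a temporary directory here).
write.csv(codebook, file.path(tempdir(), "codebook.csv"), row.names = FALSE)
print(codebook)
```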

Summary
• Preparation
– Do you have a codebook?
– Have you reviewed and validated the variables?
• Organization
– Plan out the final product and the steps necessary to create that output.
– Label variables and values.
• Quality Control
– Create reproducible cleaning code.
– Rerun code from start to finish in a clean directory.
• Communication
– Comment, Comment, Comment.
– Create a codebook and provide raw and tidy datasets.
