Wrangling Data From Raw to
Tidy:
Preparation, Organization,
Quality Control, and
Communication
August 14, 2015
Professional Education and Knowledge Seminar
Purpose
• The purpose of this presentation is to identify and demonstrate
best practices for processing raw data into tidy data sets.
• This presentation will demonstrate these practices using Stata and
R.
• The examples are taken from a Department of Agriculture project
and an assignment from Johns Hopkins University’s Getting and
Cleaning Data course on Coursera.¹
¹ https://www.coursera.org/specialization/jhudatascience/1/certificate
Agenda
I. Preparation
II. Organization
III. Quality Control
IV. Communication
Cheat Sheet
• Preparation
– Do you have a codebook?
– Have you reviewed and validated the variables?
• Organization
– Plan out the final product and the steps necessary to create that output.
– Label variables and values.
• Quality Control
– Create reproducible cleaning code.
– Rerun code from start to finish in a clean directory.
• Communication
– Comment, Comment, Comment.
– Create a codebook and provide raw and tidy datasets.
Preparation
• Do you have a codebook for the raw data?
• Do you have the ability to read through and validate (review,
summarize, graph, etc.) every variable?
• If the answer is no, return to your client (or data provider) and ask!
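One way to act on the second question is to profile every column before any cleaning begins. The sketch below uses base R only. The data frame `raw`, its column names, and the helper `profile_variable()` are made up for illustration and do not come from the projects in this presentation.

```r
# Hypothetical raw data; column names and values are illustrative only.
raw <- data.frame(
  loan_id  = c(101, 102, 103, 104),
  tenor    = c(5, 7, NA, 5),
  pgm_code = c("A", "B", "B", "Z"),
  stringsAsFactors = FALSE
)

# Summarize one variable: type, missing count, distinct values, and either
# the numeric range or the observed categories.
profile_variable <- function(x) {
  out <- list(
    class      = class(x)[1],
    n_missing  = sum(is.na(x)),
    n_distinct = length(unique(x[!is.na(x)]))
  )
  if (is.numeric(x)) {
    out$range <- range(x, na.rm = TRUE)
  } else {
    out$values <- sort(unique(x[!is.na(x)]))
  }
  out
}

# Apply the profile to every column so that no variable is skipped.
profiles <- lapply(raw, profile_variable)
str(profiles)
```

Any value that the codebook does not explain (here, the made-up code "Z") is a question to take back to the data provider.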
Organization
• Plan out steps for investigating the data.
• Create a code shell to include basic commands necessary for
organized and logical coding.
• Add variable and value labels to make data understandable.
Organizing Your Thoughts
• What form do you want your data set to be in?
• Should the data be long or wide?
• Who will be using this data and how will it be used?
• Will the output require a particular format employed for specific
projects or tasks?
[Figure with panels labeled “Raw Data” and “Desired Format”]
Set Up a “Shell” Document with a Header
• Outline the major functions performed in the code in the header.
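A shell of this kind might look like the following R script skeleton. It is only a sketch: the header fields, the placeholder values in angle brackets, and the list of steps are assumptions chosen for illustration, not the header shown in the original slide.

```r
# ==============================================================================
# Project : <project name>
# Script  : <script file name>
# Author  : <name>
# Date    : <date written or last revised>
# Purpose : Convert the raw data into a tidy data set.
# Inputs  : <raw data files>
# Outputs : <tidy data set and codebook>
#
# Major functions performed in this script:
#   1. Set up the session
#   2. Import raw data
#   3. Clean and reshape
#   4. Label variables and values
#   5. Export tidy data and codebook
# ==============================================================================

# 1. Set up the session --------------------------------------------------------
started_at <- Sys.time()                      # record when the run began
message("Run started: ", format(started_at))

# 2. Import raw data -----------------------------------------------------------

# 3. Clean and reshape ---------------------------------------------------------

# 4. Label variables and values ------------------------------------------------

# 5. Export tidy data and codebook ---------------------------------------------
```

Filling in the numbered sections one at a time keeps the code in the same order as the outline in the header.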
Plan Out How to Move from One Step to Another
• You may not know exactly how to get from one step to another.
• Take a step back and think about where you are going.
• Write out small sections of your code in the text editor the way you
think it would produce your expected results.
• Run the code and test that it produces those outputs.
• Think critically about your inputs and the desired form you want
your data to be in.
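The write, run, and test loop above can be sketched with a small step and a hand-built input whose answer is known in advance. The function `pct_to_number()` and its inputs are hypothetical and are not taken from the presentation's data.

```r
# Hypothetical small step: convert a percentage stored as text to a proportion.
pct_to_number <- function(x) {
  as.numeric(sub("%$", "", x)) / 100
}

# Hand-built input with a known answer, including a missing value.
input    <- c("12.5%", "0%", NA)
expected <- c(0.125, 0, NA)

result <- pct_to_number(input)

# Stop immediately if the step does not produce what was planned.
stopifnot(isTRUE(all.equal(result, expected)))
```

Once a small section passes its check, it can be moved into the main script and the next section drafted the same way.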
Data Cleaning Example: Installment Data
• Challenge: Generate wide-formatted loan payment schedule given
the loan tenor, installment amounts, and equal payment indicator.
• Rushed Solution: Use sophisticated and complex coding to extend
wide-formatted schedule by replacing missing installment values.
• Planned Solution: Reshape data into long format and replace values.
Data Cleaning Example: Installment Data
• Stata code to reshape data and extend loan schedule:
Data Cleaning Example: Installment Data
• Sample of observations after reshaping data to long-format:
[Figure labels: Tenor, Installments, “Year”]
Data Cleaning Example: Installment Data
• Sample of final dataset after reshaping loan schedule into wide
format (i.e., the desired loan schedule):
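The planned solution (reshape long, replace values, reshape wide) can be sketched in base R as follows. This is not the Stata code from the slide. The data frame `loans`, its column names, and the rule that an equal-payment loan repeats its first installment through the end of its tenor are all assumptions made for illustration.

```r
# Hypothetical wide data: only the first installment is recorded for
# equal-payment loans (equal_pay == 1).
loans <- data.frame(
  loan_id   = c(1, 2),
  tenor     = c(3, 2),
  equal_pay = c(1, 0),
  inst1     = c(100, 50),
  inst2     = c(NA, 70),
  inst3     = c(NA, NA)
)

# Wide -> long: one row per loan per year.
long <- reshape(
  loans,
  direction = "long",
  varying   = c("inst1", "inst2", "inst3"),
  v.names   = "installment",
  timevar   = "year",
  times     = 1:3,
  idvar     = "loan_id"
)

# Replace missing installments for equal-payment loans, up to the tenor.
first_inst <- loans$inst1[match(long$loan_id, loans$loan_id)]
fill <- long$equal_pay == 1 & long$year <= long$tenor & is.na(long$installment)
long$installment[fill] <- first_inst[fill]

# Long -> wide: back to one row per loan with the extended schedule.
schedule <- reshape(
  long,
  direction = "wide",
  idvar     = "loan_id",
  timevar   = "year",
  v.names   = "installment",
  sep       = ""
)
print(schedule)
```

In long format the replacement is a single vectorized assignment, which is the simplification the planned solution is after.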
Label Variable Names and Values
• Why?
• It helps you!
– You can find variables quickly and easily without having to memorize
their descriptions.
• It helps your peers!
– If you spend a good amount of time with the data, variables become second
nature, but they will not necessarily be obvious to your colleagues.
• Try to interpret these variables: flp_asst_type_cd,
dir_loan_pgm_cd, tBodyGyro-arCoeff, tBodyAccJerkMag-mean.
Label Variable Names and Values in Stata
• label variable varname "variable label"
• label define valuename number "value label"
• label values varname valuename
Label Variable Names in R
• Functions: names() and sub()
• names(dataset) <- sub("find", "replace", names(dataset))
• Optional arguments to sub(): perl = TRUE, ignore.case = TRUE
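As a concrete illustration of the names() and sub() pattern, the sketch below renames two terse columns. The data frame and the chosen replacements are hypothetical. In particular, reading the leading "t" as "Time" is an assumption about what these variable names mean.

```r
# Hypothetical data set with terse variable names.
dataset <- data.frame(
  "tBodyAcc-mean-X"  = c(0.28, 0.27),
  "tBodyGyro-mean-X" = c(-0.03, -0.02),
  check.names = FALSE   # keep the hyphens in the names as written
)

# Expand the leading "t" into a readable prefix (assumed meaning).
names(dataset) <- sub("^t", "Time", names(dataset), perl = TRUE)

# Expand abbreviations; ignore.case = TRUE also matches "ACC" or "acc".
names(dataset) <- sub("acc", "Accelerometer", names(dataset), ignore.case = TRUE)
names(dataset) <- sub("gyro", "Gyroscope", names(dataset), ignore.case = TRUE)

names(dataset)
```

Note that sub() replaces only the first match in each name; gsub() takes the same arguments and replaces every match.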
Label Values in R
• Function: mutate() from the dplyr package
• dataset %>% mutate(varname = factor(varname, levels = c(levels),
labels = c(labels)))
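A small worked instance of this pattern is sketched below. It assumes the dplyr package is installed. The data frame `activities`, the numeric codes, and the label text are invented for illustration.

```r
library(dplyr)

# Hypothetical coded variable; codes and labels are illustrative only.
activities <- data.frame(subject = c(1, 1, 2), activity = c(1, 3, 2))

# Attach a descriptive label to each numeric code.
activities <- activities %>%
  mutate(activity = factor(activity,
                           levels = c(1, 2, 3),
                           labels = c("Walking", "Sitting", "Standing")))

levels(activities$activity)
table(activities$activity)
```

Any code that is not listed in `levels` becomes NA, so tabulating the result is a quick check that the codebook covered every value.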
Limit Each Column to One Type of Data
• Example of data with multiple variables in each column:
Data Cleaning Example: Phone Health Data
• R code to reshape phone health data and generate variables to
separate out each observation’s characteristics:
# Reshape Data Long and Group Related Variables #
reshape_tidy_data <- tidy_data %>%
  gather(Variable, Average_Value, 3:ncol(tidy_data))
Note: This code uses the tidyr and dplyr packages.
Data Cleaning Example: Phone Health Data
# Reshape Data Long and Group Related Variables #
reshape_tidy_data <- reshape_tidy_data %>%
  mutate(Axis = ifelse(grepl("X$", Variable, perl = TRUE), 1,
                ifelse(grepl("Y$", Variable, perl = TRUE), 2, 3)) ... )
Note: This code uses the tidyr and dplyr packages.
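For readers who want something to run, here is a self-contained sketch of the same reshape-then-split idea. It assumes the tidyr (version 1.0.0 or later) and dplyr packages are installed. The small `tidy_data` frame is a hypothetical stand-in for the real data. The sketch uses pivot_longer(), which supersedes gather() in current tidyr, and codes the axis as a letter rather than the 1/2/3 coding in the slide.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in: two identifier columns, then measurement columns.
tidy_data <- data.frame(
  Subject  = c(1, 2),
  Activity = c("Walking", "Sitting"),
  "tBodyAcc-mean-X" = c(0.28, 0.27),
  "tBodyAcc-mean-Y" = c(-0.02, -0.01),
  "tBodyAcc-mean-Z" = c(-0.11, -0.10),
  check.names = FALSE
)

reshape_tidy_data <- tidy_data %>%
  # Stack every measurement column into name/value pairs.
  pivot_longer(cols = -c(Subject, Activity),
               names_to = "Variable", values_to = "Average_Value") %>%
  # Pull the axis out of the variable name so each column holds one kind
  # of information; names with no axis suffix get NA.
  mutate(Axis = case_when(
    grepl("X$", Variable) ~ "X",
    grepl("Y$", Variable) ~ "Y",
    grepl("Z$", Variable) ~ "Z"
  ))

print(reshape_tidy_data)
```

The same approach extends to the other characteristics packed into each variable name, with one new column per characteristic.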
Quality Control
• Create reproducible cleaning code.
• Rerun code from start to finish in a clean directory.
– Running code line by line can introduce accidental manual errors or
unexpected results.
– Also start in a completely blank directory to ensure the code produces the
results without referencing files you previously generated.
• Provide code, documentation, and expected output to a colleague
for feedback.
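The clean-directory check can be rehearsed on a small scale as sketched below, using base R only. The function `clean_data()`, the file names, and the toy data are hypothetical. The point is that only the raw file exists before the run.

```r
# Hypothetical cleaning step: reads a raw CSV and writes a tidy CSV.
clean_data <- function(raw_path, tidy_path) {
  raw  <- read.csv(raw_path, stringsAsFactors = FALSE)
  tidy <- raw[!is.na(raw$value), ]            # drop rows with missing values
  write.csv(tidy, tidy_path, row.names = FALSE)
  invisible(tidy_path)
}

# Build a brand-new, empty working directory.
clean_dir <- file.path(tempdir(), "clean_run")
unlink(clean_dir, recursive = TRUE)           # remove leftovers from earlier runs
dir.create(clean_dir)

# Only the raw data goes in; nothing generated earlier is available.
raw_path <- file.path(clean_dir, "raw.csv")
write.csv(data.frame(id = 1:3, value = c(10, NA, 30)),
          raw_path, row.names = FALSE)

# Run the whole pipeline from the top and confirm the expected output appears.
tidy_path <- file.path(clean_dir, "tidy.csv")
clean_data(raw_path, tidy_path)
stopifnot(file.exists(tidy_path))
list.files(clean_dir)
```

For a full script, running it non-interactively (for example with Rscript) in a fresh session gives the same assurance that no step depends on objects left over from earlier work.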
Communication
In almost every circumstance, commenting is superior to not
commenting.
• If your code will be reviewed by a colleague (and I hope it will)
then your time upfront will save them time during review.
• If anyone, including yourself, will ever use this code again, it allows
you and them to understand it more easily.
• If you require multiple validations or cleanings of similar data sets,
commenting your code can indicate what code to reuse.
Comment, Comment, Comment….
• Heavily commented R code:
Provide a Processed Dataset
• Create a data dictionary containing the variable names, values,
labels, and units.
• Components of a Processed Dataset¹:
– The raw data set.
– A tidy data set.
– A codebook describing each variable and its values in the tidy data.
– An explicit and exact recipe used to get from the raw to the tidy data.
• If there are steps that cannot be coded, they need to be explicitly
described.
¹ Taken from the Getting and Cleaning Data course on Coursera from Johns Hopkins University.
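Part of a data dictionary can be generated from the tidy data itself, as in the base R sketch below. The data frame `tidy`, the descriptions, and the units are hypothetical. Descriptions and units always have to be written by hand because they cannot be inferred from the values.

```r
# Hypothetical tidy data set.
tidy <- data.frame(
  subject  = c(1, 2, 3),
  activity = factor(c("Walking", "Sitting", "Walking")),
  avg_acc  = c(0.28, 0.27, 0.30)
)

# Hand-written descriptions and units (illustrative values only).
descriptions <- c(
  subject  = "Participant identifier",
  activity = "Activity performed during measurement",
  avg_acc  = "Average accelerometer reading"
)
units <- c(subject = "none", activity = "none", avg_acc = "g")

# One row per variable: name, type, description, units, and observed values.
codebook <- data.frame(
  variable    = names(tidy),
  type        = vapply(tidy, function(x) class(x)[1], character(1)),
  description = unname(descriptions[names(tidy)]),
  units       = unname(units[names(tidy)]),
  values      = vapply(tidy, function(x) {
    if (is.factor(x)) {
      paste(levels(x), collapse = ", ")       # list the categories
    } else {
      paste(range(x), collapse = " to ")      # show the numeric range
    }
  }, character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(codebook)
```

Writing the result out with write.csv() alongside the raw and tidy files keeps the codebook in step with the data each time the cleaning code is rerun.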
Summary
• Preparation
– Do you have a codebook?
– Have you reviewed and validated the variables?
• Organization
– Plan out the final product and the steps necessary to create that output.
– Label variables and values.
• Quality Control
– Create reproducible cleaning code.
– Rerun code from start to finish in a clean directory.
• Communication
– Comment, Comment, Comment.
– Create a codebook and provide raw and tidy datasets.
