
The life changing magic of tidying up your data: The art and science of making data usable



Webinar presentation by John Spencer in August 2017

Published in: Data & Analytics


  1. 1. The life changing magic of tidying up your data: The art and science of making data usable. John Spencer, MEASURE Evaluation, University of North Carolina at Chapel Hill. Webinar, August 24, 2017.
  2. 2. Keep only those things that bring a “spark of joy.” (Marie Kondo, The Life-Changing Magic of Tidying Up)
  3. 3.
  4. 4. Outline: What is tidy data? Why is it important in GIS? What tools exist to help?
  5. 5. Image from: R for Data Science (Grolemund, Wickham)
  6. 6. File format vs file structure: File format is the specific way information is encoded for storage in a computer file. (Wikipedia)
  7. 7. File format vs file structure: File structure is how the data is stored in the file.
  8. 8. You’ve found a great new data repository and you can’t wait to get data from it and start doing stuff like this
  9. 9. Messy Data: Once you get the data, it will need to be cleaned up before using it.
  10. 10. Making Messy Data Tidy: Messy data needs to be tidied up before it can be used.
  11. 11. Tidy Data: An organized structure for data. 1. Each variable forms a column. 2. Each observation forms a row. 3. Each type of observational unit forms a table. Source: Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10).
  12. 12. Untidy Data: 1. Column names represent data values instead of variable names. 2. A single column contains data on multiple variables instead of a single variable. 3. Variables are contained in both rows and columns instead of just columns. 4. A single table contains more than one observational unit. 5. Data about an observational unit is spread across multiple data sets. Source: Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10).
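
As a minimal sketch of the first problem on slide 12 (column names that are really data values), the Python below reshapes a small wide table into one row per observation. The table contents, the column names `facility`, `year`, and `visits`, and the helper `melt` are hypothetical and made up for illustration; they do not come from the presentation. Only the standard library is used.

```python
# Hypothetical wide table: the columns "2015" and "2016" are data values
# (years), not variable names.
wide_rows = [
    {"facility": "Clinic A", "2015": 120, "2016": 135},
    {"facility": "Clinic B", "2015": 80, "2016": 95},
]


def melt(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own row (wide to long)."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append({
                id_column: row[id_column],
                variable_name: column,   # the old column name becomes a value
                value_name: value,
            })
    return tidy


tidy_rows = melt(wide_rows, "facility", "year", "visits")
for record in tidy_rows:
    print(record)
```

After reshaping, each row holds one observation (one facility in one year) and each column holds one variable, which matches the first two tidy data rules on slide 11.
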
  13. 13. “Happy families are all alike; every unhappy family is unhappy in its own way.” (Leo Tolstoy) “Tidy datasets are all alike, but every messy dataset is messy in its own way.” (Hadley Wickham)
  14. 14. Let’s get messy
  15. 15. Messy table:

          Class      Number of feet
          Mammal
          Horse      4
          Dog        4
          Cat        4

          Reptile
          Snake      0
          Turtle     4

          Bird
          Eagle      2
          Ostrich    2

      Problems: multiple data classes and species mixed in the same column; blank rows. Easy for a human to read, hard for a computer.
  16. 16. Tidy data:

          Animal     Number of feet   Class
          Horse      4                Mammal
          Eagle      2                Bird
          Turtle     4                Reptile
          Dog        4                Mammal
          Snake      0                Reptile
          Ostrich    2                Bird
          Cat        4                Mammal
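
The conversion from the messy table on slide 15 to the tidy table on slide 16 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `tidy_animals` is hypothetical, blank rows are assumed to sit between the class groups, and a row with a name but no foot count is assumed to be a class heading.

```python
# Rows as read from the messy table: class names share a column with
# species names, and blank rows separate the groups (assumed layout).
messy_rows = [
    ("Mammal", ""),
    ("Horse", "4"),
    ("Dog", "4"),
    ("Cat", "4"),
    ("", ""),
    ("Reptile", ""),
    ("Snake", "0"),
    ("Turtle", "4"),
    ("", ""),
    ("Bird", ""),
    ("Eagle", "2"),
    ("Ostrich", "2"),
]


def tidy_animals(rows):
    """Return one record per animal, with its class carried down."""
    records = []
    current_class = None
    for name, feet in rows:
        if not name:      # blank separator row: skip it
            continue
        if not feet:      # name without a count: treat as a class heading
            current_class = name
            continue
        records.append({
            "animal": name,
            "number_of_feet": int(feet),
            "class": current_class,
        })
    return records


animals = tidy_animals(messy_rows)
for record in animals:
    print(record)
```

The result has the same three columns as the tidy table (animal, number of feet, class), although the rows come out grouped by class rather than in the order shown on the slide.
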
  17. 17. Make it easier for computer programs to read data
  18. 18. Often found on websites or in reports
  19. 19. Tidy untidy United Nations’ migration data with tidyr (Kan Nashida)
  20. 20. Messy data and GIS
  21. 21. GIS wants to see well structured data:

          Facility ID   Name                     Latitude    Longitude   Number of staff
          3K4R200       Eastern Health Clinic    -47.48516   61.69449    13
          27LS611       Southern Health Clinic   -6.05422    19.66357    4
          1N291B2       Western Health Clinic    -48.36875   109.76463   9
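
To illustrate why a table like the one on slide 21 is easy for software to consume, here is a minimal Python sketch that reads it as CSV and checks the coordinates before they would be handed to a GIS. The CSV column names (`facility_id`, `number_of_staff`, and so on) and the helper `load_facilities` are assumptions chosen for this example, not part of the presentation.

```python
import csv
import io

# The facility table from slide 21, written out as CSV text.
CSV_TEXT = """facility_id,name,latitude,longitude,number_of_staff
3K4R200,Eastern Health Clinic,-47.48516,61.69449,13
27LS611,Southern Health Clinic,-6.05422,19.66357,4
1N291B2,Western Health Clinic,-48.36875,109.76463,9
"""


def load_facilities(text):
    """Parse CSV text into typed records and check coordinate ranges."""
    facilities = []
    for row in csv.DictReader(io.StringIO(text)):
        lat = float(row["latitude"])
        lon = float(row["longitude"])
        # Latitude must lie in [-90, 90] and longitude in [-180, 180].
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            raise ValueError(f"bad coordinates for {row['facility_id']}")
        facilities.append({
            "facility_id": row["facility_id"],
            "name": row["name"],
            "latitude": lat,
            "longitude": lon,
            "number_of_staff": int(row["number_of_staff"]),
        })
    return facilities


facilities = load_facilities(CSV_TEXT)
print(len(facilities), "facilities loaded")
```

Because each row is one facility and each column is one variable, no special-case parsing is needed; the only extra work is converting text to numbers and validating ranges.
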
  22. 22. Applicable beyond GIS data
  23. 23. Following basic tidy data protocols will make analysis with many other software programs easier to do.
  24. 24. R: Hadley Wickham has an R package, tidyr, that can be very helpful in tidying data.
  25. 25. Python: Nicholas Hould has an overview of tools in the Python programming language, “Tidy Data in Python.”
  26. 26. Stata: Stata provides tools; an overview of some of them is available via the Carolina Population Center website.
  27. 27. Excel: Excel is not necessarily the best tool for changing untidy data into tidy data, but there are some things it can do. Microsoft has a page describing how to clean data and offers some plugins that could be helpful.
  28. 28. Excel: A good overview of some useful Excel functions can be found here:
  29. 29. Other Data Formats. XML: Extensible Markup Language; designed to store and transport data; well defined schema. JSON: JavaScript Object Notation; increasingly common; GeoJSON variation for geographic data. By definition the data is “tidy.”
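
As a rough sketch of the GeoJSON variation mentioned on slide 29, the Python below turns the tidy facility records from slide 21 into a GeoJSON FeatureCollection using only the standard library. The helper `to_geojson` and the property names are assumptions made for this example.

```python
import json

# One tidy record per facility, taken from the table on slide 21.
facilities = [
    {"facility_id": "3K4R200", "name": "Eastern Health Clinic",
     "latitude": -47.48516, "longitude": 61.69449, "number_of_staff": 13},
    {"facility_id": "27LS611", "name": "Southern Health Clinic",
     "latitude": -6.05422, "longitude": 19.66357, "number_of_staff": 4},
    {"facility_id": "1N291B2", "name": "Western Health Clinic",
     "latitude": -48.36875, "longitude": 109.76463, "number_of_staff": 9},
]


def to_geojson(records):
    """Build a GeoJSON FeatureCollection with one Point per record."""
    features = []
    for rec in records:
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON positions are ordered [longitude, latitude].
                "coordinates": [rec["longitude"], rec["latitude"]],
            },
            "properties": {
                "facility_id": rec["facility_id"],
                "name": rec["name"],
                "number_of_staff": rec["number_of_staff"],
            },
        })
    return {"type": "FeatureCollection", "features": features}


collection = to_geojson(facilities)
print(json.dumps(collection, indent=2))
```

Each feature corresponds to one row of the tidy table, so the mapping is one-to-one. The main thing to watch is the coordinate order, since GeoJSON puts longitude first.
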
  30. 30. Advice
  31. 31. Advice for data producers: Include tidy data download options. Think about potential users of your data and what they need to use data effectively.
  32. 32. Advice for data users: Look for tools that make the job easier. Look for alternative download sources that provide the data in tidy format. Share tools that you create.
  33. 33.
  34. 34. This presentation was produced with the support of the United States Agency for International Development (USAID) under the terms of MEASURE Evaluation cooperative agreement AID-OAA-L-14-00004. MEASURE Evaluation is implemented by the Carolina Population Center, University of North Carolina at Chapel Hill in partnership with ICF International; John Snow, Inc.; Management Sciences for Health; Palladium; and Tulane University. Views expressed are not necessarily those of USAID or the United States government.