MEASURE Evaluation works to improve collection, analysis and presentation of data to promote better use of data in planning, policymaking, managing, monitoring and evaluating population, health and nutrition programs.
The life changing magic of tidying up your data: The art and science of making data usable
Webinar presentation by John Spencer in August 2017
1. John Spencer
MEASURE Evaluation
University of North Carolina at Chapel Hill
Webinar
August 24, 2017
The life changing magic of tidying up your data
The art and science of making data usable
2. Keep only those things that bring a “spark of joy”
The life changing magic of tidying up
Marie Kondo
11. Tidy Data
Organized structure for data.
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10).
12. Untidy Data
1. Column names represent data values instead of variable names
2. A single column contains data on multiple variables instead of a single variable
3. Variables are contained in both rows and columns instead of just columns
4. A single table contains more than one observational unit
5. Data about an observational unit is spread across multiple data sets
Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10).
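As a minimal sketch of fixing the first problem in the list above (column names that hold data values), the Python below reshapes a small wide table into one row per observation. The clinic names, years, and visit counts are hypothetical, and `melt` is an illustrative helper written for this example, not a library function.

```python
# Hypothetical wide table: the "2015" and "2016" column names are really
# values of a "year" variable, which is the first untidy pattern listed.
wide_rows = [
    {"clinic": "Eastern", "2015": 120, "2016": 135},
    {"clinic": "Western", "2015": 80, "2016": 95},
]


def melt(rows, id_column, var_name, value_name):
    """Return one output row per (input row, non-id column) pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row
            long_rows.append(
                {id_column: row[id_column], var_name: column, value_name: value}
            )
    return long_rows


tidy_rows = melt(wide_rows, id_column="clinic", var_name="year", value_name="visits")
for row in tidy_rows:
    print(row)
```

After reshaping, each variable (clinic, year, visits) has its own column and each clinic-year observation has its own row.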
13. “Happy families are all alike; every unhappy family is unhappy in its own way.”
–– Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.”
–– Hadley Wickham
15. Class      Number of feet
    Mammal
    Horse      4
    Dog        4
    Cat        4
    Reptile
    Snake      0
    Turtle     4
    Bird
    Eagle      2
    Ostrich    2
• Multiple data classes and species mixed in the same column
• Blank rows
• Easy for a human to read, hard for a computer
16. Animal     Number of feet   Class
    Horse      4                Mammal
    Eagle      2                Bird
    Turtle     4                Reptile
    Dog        4                Mammal
    Snake      0                Reptile
    Ostrich    2                Bird
    Cat        4                Mammal
Tidy data
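The step from the grouped animal table to this tidy one could be sketched in Python roughly as follows. This is one possible approach, not the presenter's method: it assumes a class heading can be recognized by its missing number of feet, and the positions of the blank rows are a guess, since the slide only says that blank rows exist.

```python
# Rows in the order of the untidy table. A row with no number of feet is
# treated as a class heading; an empty label is treated as a blank row.
# The placement of the blank rows is an assumption made for illustration.
untidy_rows = [
    ("Mammal", None),
    ("Horse", 4),
    ("Dog", 4),
    ("Cat", 4),
    ("", None),
    ("Reptile", None),
    ("Snake", 0),
    ("Turtle", 4),
    ("", None),
    ("Bird", None),
    ("Eagle", 2),
    ("Ostrich", 2),
]


def tidy_animals(rows):
    """Carry each class heading down onto the animal rows beneath it."""
    tidy = []
    current_class = None
    for label, feet in rows:
        if not label:
            continue  # skip blank separator rows
        if feet is None:
            current_class = label  # heading row: remember the class
            continue
        # Compare against None rather than truthiness so that the snake's
        # 0 feet is kept as a real value.
        tidy.append({"Animal": label, "Number of feet": feet, "Class": current_class})
    return tidy


tidy = tidy_animals(untidy_rows)
for row in tidy:
    print(row)
```

The rows come out in a different order than on the slide, which does not matter: in a tidy table each row is a complete observation, so it can be sorted or filtered freely.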
23. GIS wants to see well-structured data
Facility ID   Name                     Latitude    Longitude   Number of staff
3K4R200       Eastern Health Clinic    -47.48516   61.69449    13
27LS611       Southern Health Clinic   -6.05422    19.66357    4
1N291B2       Western Health Clinic    -48.36875   109.76463   9
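To illustrate why this layout suits GIS software, the sketch below converts the facility table into GeoJSON point features using only Python's standard library. The dictionary keys (`facility_id`, `staff`, and so on) are names chosen for this example; the one fixed convention is that GeoJSON lists coordinates as longitude first, then latitude.

```python
import json

# The three facilities from the table, one dictionary per row.
# Key names are illustrative choices, not part of the original data.
facilities = [
    {"facility_id": "3K4R200", "name": "Eastern Health Clinic",
     "latitude": -47.48516, "longitude": 61.69449, "staff": 13},
    {"facility_id": "27LS611", "name": "Southern Health Clinic",
     "latitude": -6.05422, "longitude": 19.66357, "staff": 4},
    {"facility_id": "1N291B2", "name": "Western Health Clinic",
     "latitude": -48.36875, "longitude": 109.76463, "staff": 9},
]


def to_feature(row):
    """Build one GeoJSON Feature from one tidy row."""
    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            # GeoJSON order is [longitude, latitude].
            "coordinates": [row["longitude"], row["latitude"]],
        },
        "properties": {
            "facility_id": row["facility_id"],
            "name": row["name"],
            "staff": row["staff"],
        },
    }


feature_collection = {
    "type": "FeatureCollection",
    "features": [to_feature(row) for row in facilities],
}
print(json.dumps(feature_collection, indent=2))
```

Because every row is one facility and every column is one variable, the conversion is a single pass with no special cases.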
25. Following basic tidy data protocols makes analysis easier in many other software programs.
26. Hadley Wickham has an R package, tidyr, that can be very helpful in tidying data.
R
https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
27. Nicholas Hould has an overview of tools in the Python programming language: Tidy data in Python.
Python
http://www.jeannicholashould.com/tidy-data-in-python.html
28. Stata provides tools; an overview of some of them is available via the Carolina Population Center website.
Stata
http://www.cpc.unc.edu/research/tools/data_analysis/statatutorial
29. Excel is not necessarily the best tool to change untidy data into tidy data, but there are some things it can do. Microsoft has a page describing how to clean data and offers some plugins that could be helpful:
Excel
https://goo.gl/WGiUvp
30. A good overview of some useful Excel functions can be found here:
Excel
http://myexcelonline.com/blog/top-excel-data-cleansing-techniques/
31. Other Data Formats
XML
• Extensible Markup Language
• Designed to store and transport data
• Well-defined schema
JSON
• JavaScript Object Notation
• Increasingly common
• GeoJSON variation for geographic data
By definition the data is “tidy”
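When a tool expects rows and columns rather than nested records, structured JSON can be turned into a table mechanically. The sketch below is a hedged example using Python's standard `json` and `csv` modules; the payload, its field names, and the `flatten` helper are all hypothetical and exist only for illustration.

```python
import csv
import io
import json

# Hypothetical JSON payload: each record nests staff counts by role.
payload = """
[
  {"facility": "Clinic A", "staff": {"nurses": 8, "doctors": 5}},
  {"facility": "Clinic B", "staff": {"nurses": 3, "doctors": 1}}
]
"""


def flatten(records):
    """Emit one row per facility and role, so each variable gets a column."""
    rows = []
    for record in records:
        for role, count in record["staff"].items():
            rows.append({"facility": record["facility"], "role": role, "count": count})
    return rows


rows = flatten(json.loads(payload))

# Write the rows as CSV text; a real script would likely write to a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["facility", "role", "count"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The consistent structure of the JSON is what makes this a short loop instead of a cleaning project.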
33. Advice for data producers
• Include tidy data download options
• Think about potential users of your data and what they need to use data effectively
34. Advice for data users
• Look for tools that make the job easier
• Look for alternative download sources that provide the data in tidy format
• Share tools that you create
36. This presentation was produced with the support of the United States Agency for International Development (USAID) under the terms of MEASURE Evaluation cooperative agreement AID-OAA-L-14-00004. MEASURE Evaluation is implemented by the Carolina Population Center, University of North Carolina at Chapel Hill in partnership with ICF International; John Snow, Inc.; Management Sciences for Health; Palladium; and Tulane University. Views expressed are not necessarily those of USAID or the United States government.
www.measureevaluation.org