Data Management - Basic Concepts

Overview of the basic reasons for managing data, and the principles of Tidy Data. For an introductory class in Storytelling with Data.

Published in: Education, Technology

  1. DATA MANAGEMENT
     SCI 2777 • Storytelling with Data • Spring 2014
     Sister Edith Bogue • The College of St Scholastica
  2. DISPOSABLE DATA MANAGEMENT
     • Researchers know they need clean, reliable data.
     • The analysis is what really interests them.
     • When data arrive, they do a quick manual clean-up of any problems they see.
     • Often they cut and paste in spreadsheets.
     • They look for and fix anomalies.
     • If no errors crop up in the analysis, they make a clean archive copy and forget about the data.
     Source: The Perils of Disposable Data Management, Prometheus Research blog, https://www.prometheusresearch.com/the-perils-of-disposable-data-management/
  3. DISPOSABLE DATA MANAGEMENT
     • PROBLEM #1: More data arrive, and the researchers have to do the same cut-and-paste / sorting / combining operations all over again.
     • PROBLEM #2: An anomaly appears in a later data set. The researcher has to check all the earlier data to find out if it is there too. It was a cut-and-paste error.
     • PROBLEM #3: The results look peculiar, or are opposite to the prediction. Was it the data handling, or is it real?
     Source: The Perils of Disposable Data Management, Prometheus Research blog, https://www.prometheusresearch.com/the-perils-of-disposable-data-management/
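One way around Problem #1 is to write the clean-up steps down as a small script, so that each new batch gets exactly the same treatment. The sketch below is only an illustration: the column names (`id`, `score`), the cleaning rules, and the two batches are invented for the example, not taken from the blog post.

```python
import csv
import io

def clean_batch(raw_csv_text):
    """Apply the same cleaning steps to one batch of raw CSV text."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv_text)):
        # Step 1: strip stray whitespace from every field.
        row = {key.strip(): value.strip() for key, value in row.items()}
        # Step 2: skip rows with no id (a hypothetical rule for this example).
        if not row["id"]:
            continue
        # Step 3: convert the score to a number, keeping blanks as None.
        row["score"] = float(row["score"]) if row["score"] else None
        cleaned.append(row)
    return cleaned

# Two hypothetical batches that arrive at different times.
batch_1 = "id,score\n a1 , 3.5\n,9.9\na2,\n"
batch_2 = "id,score\nb1,4.0\n"

# The second batch needs no new manual work: just run the same function again.
all_rows = clean_batch(batch_1) + clean_batch(batch_2)
for row in all_rows:
    print(row)
```

Because the steps live in one function, an anomaly found later (Problem #2) can be fixed in one place and every earlier batch re-cleaned by re-running the script.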
  4. GOOD DATA PRACTICES
     • “It’s common to spend many tedious and frustrating hours cleaning and wrangling your data into a usable format, followed by careful exploration to provide context and reveal potential problems with the analyses you want to run.”
     • “Data cleaning and data transformation are two major bottlenecks in data analysis.”
     Source: Good Data Management Practices for Data Analysis, Prometheus Research blog, https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-introduction-part-1/
  5. DATA CLEANING
     It should be no surprise that it takes longer to clean messier data. Unfortunately, there are many ways that data can be messy. Powerful tools and practices can help you turn messy data into clean data.
     Source: Good Data Management Practices for Data Analysis, Prometheus Research blog, https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-introduction-part-1/
  6. DATA TRANSFORMATION
     “This is more subtle. It’s often important to visualize and model the data in various ways when conducting an analysis. I’m not talking about going on fishing expeditions, but rather about familiarizing yourself with the data… The point is that frequent data transformations are required to mediate changes between these representations, introducing an underappreciated amount of friction in analysis.”
     Source: Good Data Management Practices for Data Analysis, Prometheus Research blog, https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-introduction-part-1/
  7. TIDY DATA
     • Each variable forms a column
     • Each observation forms a row
     • Each data set contains information on only one observational unit of analysis (e.g., families, participants, participant visits)
     Source: Good Data Management Practices for Data Analysis: Tidy Data (Part 2), Prometheus Research blog, https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
  8. MESSY DATA
     • Column names represent data values instead of variable names
     • A single column contains data on multiple variables instead of a single variable
     • Variables are contained in both rows and columns instead of just columns
     • A single table contains more than one observational unit
     • Data about an observational unit is spread across multiple data sets
     Source: Good Data Management Practices for Data Analysis: Tidy Data (Part 2), Prometheus Research blog, https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
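The first kind of mess on this slide (column names that are really data values) is the easiest to show in a few lines. This is a minimal sketch, assuming a made-up table of growth figures with one column per year; the `melt` helper and all the values are hypothetical and only illustrate the reshaping idea.

```python
# Messy (hypothetical) table: the columns "2010" and "2011" are data values
# (years), not variable names.
wide_rows = [
    {"country": "A", "2010": 1.2, "2011": 1.5},
    {"country": "B", "2010": 0.8, "2011": 0.9},
]

def melt(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own row: one observation per row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append({
                id_column: row[id_column],
                variable_name: column,   # the old column name becomes a value
                value_name: value,
            })
    return tidy

# Tidy result: each variable (country, year, growth) is a column,
# and each country-year observation is a row.
tidy_rows = melt(wide_rows, "country", "year", "growth")
for row in tidy_rows:
    print(row)
```

The tidy version is longer but easier to filter, sort, and graph, because "year" is now an ordinary variable rather than part of the table layout.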
  9. TIDY TOOLS
     • Tidy tools are those that accept, manipulate, and return tidy data.
     • Tidy tools are like Lego blocks: individually simple, but flexible and powerful in combination.
     • What tools are tidy?
         • Most functions in R
         • Most transformations in SPSS or SAS
         • Relational databases (an entire skill of its own)
     • Spreadsheets are not tidy tools
     Source: Good Data Management Practices for Data Analysis: Tidy Data (Part 2), Prometheus Research blog, https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
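The "Lego blocks" idea can be sketched with three tiny functions that each take tidy rows and return tidy rows, so they snap together in any order. The function names, the participant-visit data, and the BMI calculation are assumptions made up for this illustration; the course itself plans to use R, and Python is used here only to keep the sketch self-contained.

```python
def keep_rows(rows, predicate):
    """Tidy in, tidy out: keep only the observations that pass a test."""
    return [row for row in rows if predicate(row)]

def add_column(rows, name, compute):
    """Tidy in, tidy out: add one variable computed from each observation."""
    return [{**row, name: compute(row)} for row in rows]

def sort_rows(rows, column):
    """Tidy in, tidy out: order the observations by one variable."""
    return sorted(rows, key=lambda row: row[column])

# Hypothetical observations: one row per participant visit.
visits = [
    {"participant": "p2", "height_cm": 180.0, "weight_kg": 81.0},
    {"participant": "p1", "height_cm": 160.0, "weight_kg": 64.0},
    {"participant": "p3", "height_cm": 0.0, "weight_kg": 70.0},  # entry error
]

# Combine the blocks: drop the bad row, compute a new variable, then sort.
result = sort_rows(
    add_column(
        keep_rows(visits, lambda row: row["height_cm"] > 0),
        "bmi",
        lambda row: row["weight_kg"] / (row["height_cm"] / 100) ** 2,
    ),
    "participant",
)
for row in result:
    print(row)
```

None of the functions changes the original `visits` list, which is part of what makes the steps easy to repeat and to check.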
  10. SCI 2777
     • We will learn about cleaning data first with untidy tools: spreadsheets and the like.
     • They are more familiar and easy to use right away.
     • We will learn how to track the provenance even with our untidy tools.
     • Soon, we will use R for some tasks, and get some basic skills for using a tidy tool for cleaning data.
     Source: Good Data Management Practices for Data Analysis: Tidy Data (Part 2), Prometheus Research blog, https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
  11. A CAUTIONARY EXAMPLE
  12. THOMAS HERNDON
     • Third-year economics grad student at UMass-Amherst (age 28)
     • Class assignment: replicate the findings of a published study.
     • Growth in a Time of Debt by Reinhart & Rogoff in American Economic Review
     • Finding: Growth drops off sharply if debt is high
     • Basis for austerity economics
     • Could not replicate
     • Found 3-4 errors.
     Photo: The 28-Year-Old Who Caught the Excel Error Heard Round the World. In These Times, http://bit.ly/Lz2eDm
     Source: Herndon et al. (2013) Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff. PERI Working Papers Number 322. http://www.peri.umass.edu/fileadmin/pdf/working_papers/working_papers_301-350/WP322.pdf
  13. “There were actually four errors all together. Any one error by itself would not have been enough to cause the negative average. It was the combined effect of all four of them: They interacted with each other and amplified each other, almost like a perfect storm of errors.”
     Quote from: The 28-Year-Old Who Caught the Excel Error Heard Round the World. In These Times, http://bit.ly/Lz2eDm
     See also: Researchers Finally Replicated Reinhart-Rogoff, and There Are Serious Problems, Next New Deal, http://bit.ly/1f1XUHG
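The article title calls this an "Excel error"; one widely reported part of it was an averaging formula whose range left out some rows. The sketch below shows how that kind of slip changes a result without any visible warning. The country names and growth figures are invented for illustration and are NOT the Reinhart-Rogoff data.

```python
from statistics import mean

# Invented growth figures for illustration only; not the real study data.
growth_by_country = [
    ("Country 1", 2.6),
    ("Country 2", 3.0),
    ("Country 3", -0.3),
    ("Country 4", 1.0),
    ("Country 5", 2.2),
    ("Country 6", 2.5),
]

values = [growth for _, growth in growth_by_country]

# Like AVERAGE() over every row of the column.
full_average = mean(values)

# Like AVERAGE() over a range that was dragged across too few rows.
partial_average = mean(values[2:4])

print(f"all {len(values)} rows: {full_average:.2f}")
print(f"rows 3-4 only: {partial_average:.2f}")

# A cheap guard: confirm the calculation saw every row you expected.
expected_rows = len(growth_by_country)
assert len(values) == expected_rows
```

Both numbers look equally plausible on screen, which is why a recorded, repeatable calculation is easier to audit than a formula hidden in a cell.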
  14. DATA PROVENANCE
     • Main goals
         • Keep a record
         • Be able to replicate your steps
         • Facilitate collaboration (most data work uses a team)
     • Versioning
         • Some software automatically keeps old versions of files
         • Google Docs (online files) does this
         • Dropbox also syncs files across all your devices and keeps a local copy on computers (i.e., one you can use when there is no internet)
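"Keep a record" can be as simple as a log that notes each step and a checksum of the data before and after it. This is a minimal sketch under stated assumptions: the `log_step` helper, the record fields, and the sample file contents are all hypothetical, and a real project might simply rely on the version history in its file-sharing tool.

```python
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(data: bytes) -> str:
    """Shortened SHA-256 checksum identifying one exact version of the data."""
    return hashlib.sha256(data).hexdigest()[:12]

def log_step(log, description, before: bytes, after: bytes) -> bytes:
    """Append one provenance record: what was done, to which version, and when."""
    log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "step": description,
        "input_checksum": fingerprint(before),
        "output_checksum": fingerprint(after),
    })
    return after

provenance = []
raw = b"country,gdp\nA, 10\nB,20\n"  # hypothetical raw file contents

# Each step records what it did and which version of the data it touched.
step_1 = log_step(provenance, "strip spaces", raw, raw.replace(b" ", b""))
step_2 = log_step(provenance, "rename gdp to gdp_usd",
                  step_1, step_1.replace(b"gdp", b"gdp_usd"))

print(json.dumps(provenance, indent=2))
```

Because each record's output checksum matches the next record's input checksum, a collaborator can confirm that no unrecorded edit slipped in between steps.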
  15. TODAY
     • Look at the World Bank Data visually: what do we notice?
     • World Bank Data: computing variables in a spreadsheet using the School of Data instructions.
     • Getting your first look at graphs using the School of Data instructions.
     • Seeing versions of files in Google Drive
  16. GOALS BY JANUARY 29
     • Clean data from the World Bank
     • First graphs of variables
     • Practice in dreaming up analyses
     • Beginning to find our own data
     • Basic Descriptive Statistics in ALEKS
     • Basic Graphics in ALEKS
     • FUN with Design
     • First thoughts about your projects
  17. DATA MANAGEMENT
     SCI 2777 • Storytelling with Data • Spring 2014
     Sister Edith Bogue • The College of St Scholastica
