Data Management - Basic Concepts
 

Overview of basic reasons for managing data, and principles of Tidy Data. For introductory class in Storytelling with Data.

Usage Rights: CC Attribution-NonCommercial-NoDerivs License

    Data Management - Basic Concepts: Presentation Transcript

    • DATA MANAGEMENT
      • SCI 2777: Storytelling with Data
      • Spring 2014
      • Sister Edith Bogue
      • The College of St Scholastica
    • DISPOSABLE DATA MANAGEMENT
      • Researchers know they need clean, reliable data.
      • The analysis is what really interests them.
      • When data arrive, they do a quick manual clean-up of any problems they see.
      • Often they cut and paste in spreadsheets.
      • They look for and fix anomalies.
      • If no errors crop up in the analysis, they make a clean archive copy and forget about the data.
      • Source: The Perils of Disposable Data Management, from the Prometheus Research blog at https://www.prometheusresearch.com/the-perils-of-disposable-data-management/
    • DISPOSABLE DATA MANAGEMENT
      • PROBLEM #1: More data arrive, and the researchers have to do the same cut-and-paste / sorting / combining operations all over again.
      • PROBLEM #2: An anomaly appears in a later data set. The researcher has to check all the earlier data to find out whether it is there too. It turns out to be a cut-and-paste error.
      • PROBLEM #3: The results look peculiar, or are opposite to the prediction. Was it the data handling, or is it real?
      • Source: The Perils of Disposable Data Management, from the Prometheus Research blog at https://www.prometheusresearch.com/the-perils-of-disposable-data-management/
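
One way to avoid repeating manual clean-up each time new data arrive is to write the steps down as a small script and rerun it on every batch. The following is a minimal sketch in Python, not the method used in the course or in the cited blog post. The column names (`country`, `value`), the sample rows, and the cleaning rules are all hypothetical.

```python
import csv
import io

def clean_rows(rows):
    """Apply the same clean-up steps to every batch of raw rows."""
    cleaned = []
    for row in rows:
        # Strip stray whitespace from every column name and value.
        row = {key.strip(): value.strip() for key, value in row.items()}
        # Skip rows with no country name (a hypothetical rule for this sketch).
        if not row["country"]:
            continue
        # Standardize capitalization so "canada" and "Canada" match.
        row["country"] = row["country"].title()
        # Convert the numeric field; record None when it cannot be parsed.
        try:
            row["value"] = float(row["value"])
        except ValueError:
            row["value"] = None
        cleaned.append(row)
    return cleaned

# Two invented batches arriving at different times get identical treatment.
batch_1 = io.StringIO("country,value\n canada ,1.5\nFRANCE,n/a\n")
batch_2 = io.StringIO("country,value\nitaly,2.0\n,3.0\n")

for batch in (batch_1, batch_2):
    print(clean_rows(csv.DictReader(batch)))
```

Because the rules live in one function, an anomaly found in a later batch can be checked against earlier batches by rerunning the same code on them.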
    • GOOD DATA PRACTICES
      • “It’s common to spend many tedious and frustrating hours cleaning and wrangling your data into a usable format, followed by careful exploration to provide context and reveal potential problems with the analyses you want to run.”
      • “Data cleaning and data transformation are two major bottlenecks in data analysis.”
      • Source: Good Data Management Practices for Data Analysis, from the Prometheus Research blog at https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-introduction-part-1/
    • DATA CLEANING
      • It should be no surprise that it takes longer to clean messier data.
      • Unfortunately, there are many ways that data can be messy.
      • Powerful tools and practices can help you turn messy data into clean data.
      • Source: Good Data Management Practices for Data Analysis, from the Prometheus Research blog at https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-introduction-part-1/
    • DATA TRANSFORMATION
      • “This is more subtle. It’s often important to visualize and model the data in various ways when conducting an analysis. I’m not talking about going on fishing expeditions, but rather about familiarizing yourself with the data… The point is that frequent data transformations are required to mediate changes between these representations, introducing an underappreciated amount of friction in analysis.”
      • Source: Good Data Management Practices for Data Analysis, from the Prometheus Research blog at https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-introduction-part-1/
    • TIDY DATA
      • Each variable forms a column
      • Each observation forms a row
      • Each data set contains information on only one observational unit of analysis (e.g., families, participants, participant visits)
      • Source: Good Data Management Practices for Data Analysis: Tidy Data (Part 2), from the Prometheus Research blog at https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
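
The third principle, one observational unit per data set, can be illustrated with a small Python sketch. The table, its column names, and its values are invented for illustration. The sketch splits a mixed table into one table of families and one table of participants.

```python
# Hypothetical mixed table: family facts are repeated on every participant row.
mixed = [
    {"family_id": 1, "family_size": 4, "participant_id": "a", "age": 34},
    {"family_id": 1, "family_size": 4, "participant_id": "b", "age": 9},
    {"family_id": 2, "family_size": 2, "participant_id": "c", "age": 51},
]

# Table 1: one row per family (the first observational unit).
families_by_id = {}
for row in mixed:
    families_by_id[row["family_id"]] = {
        "family_id": row["family_id"],
        "family_size": row["family_size"],
    }
families = list(families_by_id.values())

# Table 2: one row per participant, linked back to a family by family_id.
participants = [
    {"participant_id": r["participant_id"], "family_id": r["family_id"], "age": r["age"]}
    for r in mixed
]

print(families)
print(participants)
```

In each resulting table every column is a variable and every row is an observation of a single kind of unit.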
    • MESSY DATA
      • Column names represent data values instead of variable names
      • A single column contains data on multiple variables instead of a single variable
      • Variables are contained in both rows and columns instead of just columns
      • A single table contains more than one observational unit
      • Data about an observational unit is spread across multiple data sets
      • Source: Good Data Management Practices for Data Analysis: Tidy Data (Part 2), from the Prometheus Research blog at https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
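
The first kind of messiness, column names that are really data values, is common in country-by-year tables. The sketch below uses a hypothetical helper `melt_years` and invented numbers to show one way such a table might be reshaped so that the year becomes an ordinary column. It assumes every column other than the identifier is named with a year.

```python
def melt_years(wide_rows, id_column="country"):
    """Turn one column per year into one row per (country, year) pair."""
    tidy = []
    for row in wide_rows:
        for column, value in row.items():
            if column == id_column:
                continue
            # The former column name becomes a value of the "year" variable.
            tidy.append({id_column: row[id_column], "year": int(column), "value": value})
    return tidy

# Invented wide table: the columns "2010" and "2011" are values, not variables.
wide = [
    {"country": "A", "2010": 1.0, "2011": 1.2},
    {"country": "B", "2010": 2.0, "2011": 2.1},
]

for tidy_row in melt_years(wide):
    print(tidy_row)
```

After reshaping, each row is one observation (a country in a year) and each column is one variable.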
    • TIDY TOOLS
      • Tidy tools are those that accept, manipulate, and return tidy data.
      • Tidy tools are like Lego blocks: individually simple but flexible and powerful in combination.
      • What tools are tidy?
        • Most functions in R
        • Most transformations in SPSS or SAS
        • Relational databases (an entire skill of its own)
      • Spreadsheets are not tidy tools
      • Source: Good Data Management Practices for Data Analysis: Tidy Data (Part 2), from the Prometheus Research blog at https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
    • SCI 2777
      • We will learn about cleaning data first with untidy tools: spreadsheets and the like.
      • They are more familiar and easy to use right away.
      • We will learn how to track the provenance even with our untidy tools.
      • Soon, we will use R for some tasks, and get some basic skills for using a tidy tool for cleaning data.
      • Source: Good Data Management Practices for Data Analysis: Tidy Data (Part 2), from the Prometheus Research blog at https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
    • A CAUTIONARY EXAMPLE
    • THOMAS HERNDON
      • Third-year economics grad student at UMass-Amherst (age 28)
      • Class assignment: replicate the findings of a published study.
      • Growth in a Time of Debt by Reinhart & Rogoff in American Economic Review
      • Finding: Growth drops off sharply if debt is high
      • Basis for austerity economics
      • Could not replicate
      • Found 3-4 errors.
      • Photo: The 28-Year-Old Who Caught the Excel Error Heard Round the World. In These Times http://bit.ly/Lz2eDm
      • Herndon et al. (2013) Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff. PERI Working Papers Number 322. http://www.peri.umass.edu/fileadmin/pdf/working_papers/working_papers_301-350/WP322.pdf
    • “There were actually four errors all together. Any one error by itself would not have been enough to cause the negative average. It was the combined effect of all four of them: They interacted with each other and amplified each other—almost like a perfect storm of errors.” Quote from: The 28-Year-Old Who Caught the Excel Error Heard Round the World. In These Times http://bit.ly/Lz2eDm Researchers Finally Replicated Reinhart-Rogoff, and There Are Serious Problems from Next New Deal at http://bit.ly/1f1XUHG
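
A spreadsheet formula that averages the wrong range gives no warning. The Python sketch below uses invented growth figures, not data from either paper, to show how leaving rows out changes an average. The helper `checked_mean` is hypothetical and shows one simple guard against this.

```python
from statistics import mean

# Invented growth rates for eight hypothetical countries (not real data).
growth = [3.0, 2.5, -1.0, 2.0, 4.0, 1.5, 3.5, 2.5]

full_average = mean(growth)           # uses all eight rows
truncated_average = mean(growth[:5])  # last three rows left out by mistake

def checked_mean(values, expected_count):
    """Average the values, but refuse to run if rows are missing."""
    if len(values) != expected_count:
        raise ValueError(
            f"expected {expected_count} rows, got {len(values)}"
        )
    return mean(values)

print(full_average, truncated_average)
print(checked_mean(growth, expected_count=8))
```

Stating the expected number of rows in the code turns a silent range mistake into a visible error.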
    • DATA PROVENANCE
      • Main goals
        • Keep a record
        • Be able to replicate your steps
        • Facilitate collaboration (most data work uses a team)
      • Versioning
        • Some software automatically keeps old versions of files
        • Google Docs (online files) does this
        • Dropbox also syncs files across all your devices and keeps a local copy on computers (i.e., one you can use when there is no internet connection)
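
Keeping a record of each step can be as simple as a log that stores what was done, when, and a fingerprint of the data at that point. The following Python sketch is one possible approach under stated assumptions: the helper `provenance_entry`, the log layout, and the sample data are all hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_entry(step, data_bytes, note=""):
    """Describe one processing step and fingerprint the data it produced."""
    return {
        "step": step,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # The SHA-256 digest changes whenever the data change.
        "sha256": hashlib.sha256(data_bytes).hexdigest(),
        "note": note,
    }

log = []

# Invented raw data standing in for a downloaded file.
raw = b"country,value\ncanada,1.5\n"
log.append(provenance_entry("obtained raw data", raw, "hypothetical example"))

# One cleaning step, recorded immediately after it is applied.
cleaned = raw.replace(b"canada", b"Canada")
log.append(provenance_entry("capitalized country names", cleaned))

print(json.dumps(log, indent=2))
```

A collaborator can compare the digests to confirm they are working from the same version of the data.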
    • TODAY
      • Look at the World Bank Data visually: what do we notice?
      • World Bank Data: computing variables in a spreadsheet using the School of Data instructions.
      • Getting your first look at graphs using the School of Data instructions.
      • Seeing versions of files in Google Drive
    • GOALS BY JANUARY 29
      • Clean data from the World Bank
      • First graphs of variables
      • Practice in dreaming up analyses
      • Beginning to find our own data
      • Basic Descriptive Statistics in ALEKS
      • Basic Graphics in ALEKS
      • FUN with Design
      • First thoughts about your projects