Analyst’s Nightmare or Laundering Massive Spreadsheets

By Feyzi Bagirov
PyData New York City 2017

Poor data quality frequently invalidates data analysis when performed on Excel data that underwent transformations, imputations, and manual manipulations. In this talk we will use Pandas to walk through Excel data analysis and illustrate several common pitfalls that make this analysis invalid.

  1. 1. ANALYST’S NIGHTMARE OR LAUNDERING MASSIVE SPREADSHEETS: An example of how an analysis that overlooks data quality issues may go completely wrong. By Feyzi Bagirov and Tanya Yarmola
  2. 2. Agenda ■ About us ■ Dirty data types ■ Fitbit dataset insights (pre-impute) ■ Fitbit dataset insights (post-impute) ■ Q&A
  3. 3. About us ■ Vice President in Model Governance and Review at JP Morgan ■ Faculty of Analytics at Harrisburg University of Science and Technology ■ Data Science Advisor at
  4. 4. According to Gartner, Excel is still the most popular BI tool in the world ■ More and more powerful tools are available on the market ■ The spreadsheet, however, lives on: – Excel is the most widely used analytics tool in the world
  5. 5. Dirty Data ■ Significant quantities of data are stored and passed around in spreadsheet formats. ■ Analysis is also frequently performed without leaving Excel. ■ This aggravates data quality issues: – duplicates and nulls are overlooked – copy-pastes and manual imputations create additional errors – VLOOKUPs do not take duplicates into account: an exact-match lookup returns only the first matching row ■ When the data turns out to be less clean than you hoped, serious errors occur and propagate through the spreadsheet work cycle.
  6. 6. According to IDG, cleaning and organizing data takes up to 60% of data scientists’ time
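The duplicate and null pitfalls on this slide can be made visible in pandas. The sketch below is only an illustration: the two tables, their column names, and the values are hypothetical and do not come from the talk.

```python
import pandas as pd

# Hypothetical lookup table with a duplicated key (Id 2 appears twice) and a null.
heights = pd.DataFrame({"Id": [1, 2, 2, 3], "height_cm": [170.0, 165.0, 180.0, None]})
steps = pd.DataFrame({"Id": [1, 2, 3], "steps": [8000, 12000, 5000]})

# Checks that are easy to skip in a spreadsheet.
n_duplicate_keys = int(heights.duplicated(subset="Id").sum())  # rows repeating an earlier Id
n_nulls = int(heights["height_cm"].isna().sum())

# VLOOKUP-like behaviour: silently keep only the first match per key.
first_match = heights.drop_duplicates(subset="Id", keep="first")
looked_up = steps.merge(first_match, on="Id", how="left")

# A stricter join refuses to proceed when the lookup key is not unique.
try:
    steps.merge(heights, on="Id", how="left", validate="many_to_one")
    duplicate_error = None
except pd.errors.MergeError as exc:
    duplicate_error = str(exc)
```

Here the lookup quietly picks the first height for Id 2, while the validated merge raises an error that points at the duplicated key.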
  7. 7. Common types of dirty data ■ Missing data – Missing Completely At Random (MCAR) – Missing At Random (MAR) – Missing Not At Random (MNAR) ■ Duplicates ■ Outliers ■ Multiple comma-separated (or not) values that are stored in one column (common symptom) ■ Column headers are values, not variable names
  8. 8. Handling Dirty Data ■ You can handle dirty data on two levels: – Database level (manual cleaning inside the database) – not efficient, does not scale well – Application level – the recommended way, whenever possible • Identify the commonly occurring problems with your data and the tasks that fix them. • Once you have identified the most common cleanup tasks for your data, create scripts that you will run on every new dataset. • Whenever a new dataset contains a new type of error, add the code that fixes it to your scripts.
  9. 9. Concept of tidy data ■ “Tidy Data” by Hadley Wickham, “Journal of Statistical Software”, Aug 2014 ■ Principles of tidy data: – Observations as rows – Variables as columns – One type of observational unit per table (if a table that is supposed to contain characteristics of people also contains information about their pets, it holds more than one observational unit).
  10. 10. Objectives ■ To provide a simple example that illustrates how data quality issues may visibly affect the results of an analysis ■ To estimate a customer’s height based on average stride length and see whether the results fall within expected ranges
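As a minimal sketch of the application-level approach, the known fixes can live in one function that is rerun on every new dataset and extended whenever a new kind of error turns up. The function name `clean`, the column names, and the specific fixes are assumptions chosen for illustration.

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the known fixes, in order, to a freshly received dataset."""
    out = df.copy()
    out.columns = [c.strip() for c in out.columns]          # stray spaces in headers
    out = out.drop_duplicates()                             # exact duplicate rows
    out["Id"] = pd.to_numeric(out["Id"], errors="coerce")   # non-numeric Ids become NaN
    out = out.dropna(subset=["Id"])                         # drop rows whose Id could not be parsed
    out["Id"] = out["Id"].astype("int64")
    return out.reset_index(drop=True)

# Hypothetical raw input: padded headers, a duplicate row, and a malformed Id.
raw = pd.DataFrame({" Id": ["1", "1", "x2", "3"], "Steps ": [10, 10, 5, 7]})
cleaned = clean(raw)
```

Dropping rows with unparseable Ids is only one possible policy; repairing them may be preferable when the pattern of corruption is known.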
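To illustrate the "variables as columns" principle, here is a small sketch that reshapes a table whose column headers are values (dates) into tidy form with `DataFrame.melt`. The table and its values are hypothetical.

```python
import pandas as pd

# Hypothetical untidy table: the date columns are values, not variable names.
wide = pd.DataFrame({
    "Id": [1, 2],
    "2016-04-12": [8000, 12000],
    "2016-04-13": [7500, 11000],
})

# One row per (Id, date) observation, one column per variable.
tidy = wide.melt(id_vars="Id", var_name="date", value_name="steps")
tidy["date"] = pd.to_datetime(tidy["date"])
tidy = tidy.sort_values(["Id", "date"]).reset_index(drop=True)
```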
  11. 11. Tools
  12. 12. Data ■ A publicly available Fitbit dataset that contains records on 33 customers, with: – minute-by-minute records on steps and intensities – daily distances travelled (Fitbit estimate) ■ Data quality issues were introduced for illustration purposes; this also allows comparison with the original.
  13. 13. Data
  14. 14. Quick and dirty height calculation
  15. 15. Quick and Dirty Calculation Results
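The code on these slides is not part of the transcript, so the following is only a sketch of what a quick height estimate from stride length could look like. It assumes an often-quoted rule of thumb that stride length is roughly 0.41 to 0.42 of body height; the ratio, the column names, and the numbers are assumptions, not the authors' values.

```python
import pandas as pd

STRIDE_TO_HEIGHT = 0.413  # assumed rule-of-thumb ratio of stride length to height

# Hypothetical daily totals for two customers.
daily = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "steps": [10000, 8000, 12000, 0],
    "distance_km": [7.0, 5.6, 9.0, 0.0],
})

# Average stride length = total distance / total steps, per customer.
totals = daily.groupby("Id")[["steps", "distance_km"]].sum()
stride_m = totals["distance_km"] * 1000 / totals["steps"]
height_cm = stride_m / STRIDE_TO_HEIGHT * 100
```

Because the estimate divides by the step count, any missing or inflated step records feed straight into the implied height.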
  16. 16. Let’s take a closer look at the data to see if we can correct for outlier mistakes
  17. 17. Initial observations ■ minuteSteps and minuteIntensities have different numbers of records - there may be duplicates. ■ Most values for Steps and Intensities are zeroes. ■ There are Nulls in minuteSteps. ■ The numbers of unique user Ids differ between the tables. ■ Id in minuteSteps has an object datatype. ■ Max number of Steps per minute is 500 - this is over 8 steps per second - seems too high, potential outlier issue
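The observations above correspond to a handful of routine pandas checks. The sketch below runs them on tiny hypothetical stand-ins for minuteSteps and minuteIntensities; the values are invented and only the column names Id and Steps follow the slides.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-ins for the two minute-level tables.
minute_steps = pd.DataFrame({
    "Id": ["1", "1", "1", "2x", "3"],
    "Steps": [0, 0, np.nan, 500, 12],
})
minute_intensities = pd.DataFrame({"Id": [1, 1, 3], "Intensity": [0, 0, 1]})

checks = {
    "row_counts": (len(minute_steps), len(minute_intensities)),
    "duplicate_rows": int(minute_steps.duplicated().sum()),
    "null_steps": int(minute_steps["Steps"].isna().sum()),
    "unique_ids": (minute_steps["Id"].nunique(), minute_intensities["Id"].nunique()),
    "id_dtype": str(minute_steps["Id"].dtype),            # not an integer dtype
    "max_steps": float(minute_steps["Steps"].max()),
    "share_zero_steps": float((minute_steps["Steps"] == 0).mean()),
}
```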
  18. 18. Daily Distances observations More observations • Number of unique Ids matches minuteIntensities • SedentaryActiveDistance is mostly zero – exclusion should be OK
  19. 19. Analysis with Data Checks • Ids are a mix of integers and malformed strings • All Ids should be converted to integers to match the other datasets
  20. 20. Analysis with Data Checks (cont’d)
  21. 21. Analysis with Data Checks (cont’d)
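The transcript does not show what the problematic Ids look like, so this sketch assumes one plausible case: the digits are intact but wrapped in whitespace or a text prefix. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical Id column mixing integers with malformed strings.
ids = pd.Series([1001, "1002", " 1003 ", "id_1004"], dtype="object")

# Pull out the first run of digits from each value.
digits = ids.astype(str).str.extract(r"(\d+)", expand=False)

# Values with no digits at all are left for manual inspection rather than guessed.
unparsed = ids[digits.isna()]

clean_ids = pd.to_numeric(digits).astype("Int64")
```

Using the nullable `Int64` dtype keeps the column integer-typed even if some Ids cannot be parsed.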
  22. 22. Nulls and outliers • There are Nulls in minuteSteps • Max number of Steps per minute is 500 - this is over 8 steps per second - seems too high, potential outlier issue
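One way to treat the implausible values is to turn them into missing values so that nulls and outliers can be imputed together. In this sketch the 250 steps-per-minute cap is an assumed threshold, not a figure from the talk, and the series is made up.

```python
import numpy as np
import pandas as pd

MAX_PLAUSIBLE_STEPS = 250  # assumed cap; choose one that suits your data

steps = pd.Series([0, 95, np.nan, 500, 110], name="Steps")

per_second = steps.max() / 60              # 500 / 60 is about 8.3 steps per second
is_outlier = steps > MAX_PLAUSIBLE_STEPS   # NaN compares as False, so nulls are not flagged here
steps_flagged = steps.mask(is_outlier)     # outliers become NaN, like the existing nulls
```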
  23. 23. Missing Values - Imputations Imputation is used when the data analysis technique is not robust to missing values. It can be done in several ways, but multiple imputation is recommended and is a relatively standard method: - Single imputation - Multiple imputation
  24. 24. Single Imputations ■ Mean substitution – replacing a missing value with the mean of that variable over all other cases. This does not change the sample mean for that variable, but it attenuates any correlations involving the imputed variable, because there is no guaranteed relationship between the imputed and the measured variables. ■ Interpolation – a method of constructing new data points within the range of a discrete set of known data points.
  25. 25. Single Imputations (cont’d) ■ Partial deletion (listwise or casewise deletion) – the most common way of dealing with missing data is listwise deletion (complete-case analysis), in which every case with a missing value is deleted. If the data are MCAR, this does not add any bias, but it decreases the power of the analysis (smaller sample size). ■ Pairwise deletion – a case is dropped only from the analyses that need a variable it is missing, and is still included in every analysis for which all required variables are present. The main advantage of this method is that it is straightforward and easy to implement.
  26. 26. Single Imputations (cont’d) ■ Hot-deck – a missing value is imputed from a randomly selected similar record. ■ Cold-deck – selects donors from another dataset. Due to advances in computing power, more sophisticated methods have superseded the original random and sorted hot-deck imputation techniques. ■ Regression imputation – a regression model is fitted on the available data to predict the variable with missing values from the other variables, and the fitted values are then used to impute the missing values. It has the opposite problem of mean imputation: the imputed values carry no error term, so they fall exactly on the regression line without any residual variance. This overstates relationships and suggests greater precision in the imputed values than is warranted.
  27. 27. Multiple Imputations ■ Multiple imputation was developed by Rubin (1987) to deal with the problem of increased noise due to imputation. There are several methods of multiple imputation. ■ The primary method, Multiple Imputation by Chained Equations (MICE), should be used only when the missing data follow the missing-at-random mechanism.
  28. 28. Multiple Imputations (cont’d) ■ Advantages of Multiple Imputation: – An advantage over single imputation is that MI is flexible and can be used in cases where the data is MCAR, MAR, and even MNAR. – By imputing multiple times, multiple imputation accounts for the uncertainty and the range of values that the true value could have taken. – Not difficult to implement ■ Disadvantages of Multiple Imputation: – Can be computationally expensive, and the gain does not always justify the cost.
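The difference between listwise and pairwise deletion is easy to see in pandas: `dropna()` removes every incomplete row, while `DataFrame.corr()` excludes missing values pair by pair. The data frame below is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "steps": [10.0, 20.0, np.nan, 40.0, 50.0],
    "intensity": [1.0, np.nan, 2.0, 3.0, 3.0],
    "distance": [0.1, 0.2, 0.3, np.nan, 0.5],
})

# Listwise deletion: drop every row that has any missing value.
listwise = df.dropna()

# Pairwise deletion: each statistic uses all rows complete for the two columns involved.
pairs = [("steps", "intensity"), ("steps", "distance"), ("intensity", "distance")]
pairwise_counts = {(a, b): len(df[[a, b]].dropna()) for a, b in pairs}
pairwise_corr = df.corr()  # pandas excludes missing values pair by pair
```

In this example listwise deletion keeps two of five rows, while each pairwise correlation is computed from three.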
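To show how imputing several times captures uncertainty, the sketch below builds m completed datasets and pools the estimate of the mean with Rubin's rules (total variance = within-imputation variance + (1 + 1/m) × between-imputation variance). The imputation step here is a simple random draw from the observed values, used as a stand-in; it is not MICE, and the data and the choice of m are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.array([10.0, 12.0, np.nan, 11.0, np.nan, 13.0, 9.0])  # hypothetical sample
observed = values[~np.isnan(values)]
m = 20  # number of imputed datasets

estimates, variances = [], []
for _ in range(m):
    completed = values.copy()
    missing = np.isnan(completed)
    # Stand-in imputation: random draws from the observed values.
    completed[missing] = rng.choice(observed, size=int(missing.sum()))
    estimates.append(completed.mean())
    variances.append(completed.var(ddof=1) / completed.size)  # variance of the mean

q_bar = np.mean(estimates)             # pooled estimate
within = np.mean(variances)            # average within-imputation variance
between = np.var(estimates, ddof=1)    # between-imputation variance
total_variance = within + (1 + 1 / m) * between
```

The between-imputation term is what a single imputation cannot provide: it measures how much the answer moves when the missing values are filled in differently.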
  29. 29. Steps distributions per intensity ■ Single imputations: impute nulls and outliers using three different methods: 1. mean value 2. interpolation between existing values 3. random draw from the distribution of existing values (per customer)
  30. 30. Single imputation - Impute using mean
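The slide's code is not in the transcript; a minimal sketch of mean imputation in pandas might look like this. Computing the mean per customer rather than over the whole table is an assumption, as are the column names and values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 2],
    "Steps": [10.0, np.nan, 30.0, 100.0, 120.0, np.nan],
})

# Mean of the observed values for each customer, aligned to the original rows.
customer_mean = df.groupby("Id")["Steps"].transform("mean")
df["Steps_mean"] = df["Steps"].fillna(customer_mean)
```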
  31. 31. Single imputation - impute using interpolation
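A comparable sketch for interpolation uses `Series.interpolate` within each customer, so that values are never interpolated across two different people. It assumes rows are already in time order; the data is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 1, 1, 1, 2, 2, 2],
    "Steps": [10.0, np.nan, np.nan, 40.0, 5.0, np.nan, 9.0],
})

# Linear interpolation between neighbouring observed values, per customer.
df["Steps_interp"] = df.groupby("Id")["Steps"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)
```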
  32. 32. Impute using transform with random choice (hot-deck)
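A hot-deck style imputation with `groupby(...).transform` and a random choice could be sketched as follows. The helper name `hot_deck`, the seed, and the data are illustrative assumptions; donors are drawn from the same customer's observed values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def hot_deck(s: pd.Series) -> pd.Series:
    """Fill missing entries with random draws from the same group's observed values."""
    donors = s.dropna().to_numpy()
    out = s.copy()
    missing = out.isna()
    if missing.any() and donors.size:
        out.loc[missing] = rng.choice(donors, size=int(missing.sum()))
    return out

df = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 2],
    "Steps": [10.0, np.nan, 30.0, 100.0, np.nan, 120.0],
})
df["Steps_hot_deck"] = df.groupby("Id")["Steps"].transform(hot_deck)
```

Unlike mean imputation, the filled values come from the observed distribution, so its spread is preserved.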
  33. 33. Calculate height function
  34. 34. Calculate height for different imputation versions and compare results
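The comparison can be sketched by running one height function over several versions of the same step series. Everything here is assumed for illustration: the 0.413 stride-to-height ratio, the 250 steps-per-minute cap, the five-minute series, and the distance.

```python
import numpy as np
import pandas as pd

STRIDE_TO_HEIGHT = 0.413  # assumed rule-of-thumb ratio of stride length to height

def estimate_height_cm(total_steps: float, distance_km: float) -> float:
    """Height implied by the average stride length (distance divided by steps)."""
    stride_m = distance_km * 1000 / total_steps
    return stride_m / STRIDE_TO_HEIGHT * 100

raw = pd.Series([100.0, np.nan, 110.0, 500.0, 90.0])  # one null, one outlier
steps = raw.where(raw <= 250)                          # the outlier becomes NaN too

versions = {
    "drop_missing": steps.dropna(),
    "mean": steps.fillna(steps.mean()),
    "interpolate": steps.interpolate(limit_direction="both"),
}
distance_km = 0.35  # hypothetical distance covered over these five minutes
heights = {
    name: estimate_height_cm(v.sum(), distance_km) for name, v in versions.items()
}
```

In this toy case, simply dropping the missing minutes undercounts steps and implies a height of about 2.8 m, while both imputed versions land near 1.7 m.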
  35. 35. Q&A
  36. 36. Thanks! Feyzi Bagirov, @FeyziBagirov; Tanya Yarmola, @TanyaYarmola