Data Munging: the good, the bad and the ugly

Daniel D. Gutierrez' talk at the LA R User Group meetup on Nov. 14, 2013

  1. DATA MUNGING: The Good, the Bad, and the Ugly
     Presented by: Daniel D. Gutierrez
  2. DATA MUNGING CAN TAKE WORK
       • I kept getting burned by the data munging phase!
       • The importance of data munging to the success of a data science project must be understood
       • The level of difficulty depends on the quality of the data
       • More work is required for dirty, inconsistent, malformed data
       • Can often amount to 70% of overall project time & budget
       • Need to work with the person delivering the data: the ETL engineer
  3. GIVE DATA MUNGING SOME RESPECT
       • The data munging phase is often trivialized
       • New data scientists are not always informed about the complexity of data munging: Coursera
       • Example: the amount of data munging work behind the winning entries for a Kaggle competition, the Heritage Health Prize. Much of the data munging was done in SQL
  4. USE CASE EXAMPLE
       • I was given a data set by a client domain “expert”
       • She clearly wanted me to read her mind!
       • The data was awful: inconsistent data types, loads of missing values, poor structure, outliers
       • Delivered in Excel
       • Took many meetings with department staff to iron out BEFORE the data munging could even commence
       • Feature engineering can become “social engineering”: traveling up the corporate food chain to get answers
  5. A DATA MUNGING RESOURCE
       • Here is an outline from Hadley Wickham’s Ph.D. thesis
       • “First, you get the data in a form that you can work with ... Second, you plot the data to get a feel for what is going on ... Third, you iterate between graphics and models to build a succinct quantitative summary of the data ... Finally, you look back at what you have done, and contemplate what tools you need to do better in the future”
       • http://had.co.nz/thesis/practical-tools-hadley-wickham.pdf
       • In Chapter 2 he talks a lot about data munging using the reshape package: melting and casting
  6. THANK YOU!
       • Web: www.amuletanalytics.com
       • Twitter: @AMULETAnalytics
       • Email: dan@amuletc.com
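
The problems listed in the use case slide (inconsistent data types, missing values, outliers) can often be surfaced with a quick audit before any munging starts. The following is a minimal sketch in standard-library Python, not code from the talk. The sample rows, the column names, the helper names `to_number` and `audit_column`, and the median-absolute-deviation cutoff are all assumptions made for illustration.

```python
from statistics import median

# Hypothetical rows, as they might come out of a spreadsheet export:
# every cell is a string, and some cells are blank or malformed.
rows = [
    {"id": "1", "age": "34", "income": "52000"},
    {"id": "2", "age": "", "income": "48000"},
    {"id": "3", "age": "forty", "income": "51000"},
    {"id": "4", "age": "29", "income": "5000000"},
    {"id": "5", "age": "41", "income": "47000"},
]


def to_number(cell):
    """Return the cell as a float, or None if it cannot be parsed."""
    try:
        return float(cell)
    except (TypeError, ValueError):
        return None


def audit_column(rows, column, mad_cutoff=5.0):
    """List row ids with missing, non-numeric, or outlying cells.

    A value counts as an outlier when its distance from the median is
    more than mad_cutoff times the median absolute deviation. That
    rule and its cutoff are assumptions, not something from the talk.
    """
    missing, malformed, values = [], [], {}
    for row in rows:
        cell = row.get(column)
        if cell is None or str(cell).strip() == "":
            missing.append(row["id"])
        elif to_number(cell) is None:
            malformed.append(row["id"])
        else:
            values[row["id"]] = to_number(cell)

    outliers = []
    if values:
        center = median(values.values())
        mad = median(abs(v - center) for v in values.values())
        if mad > 0:
            outliers = [rid for rid, v in values.items()
                        if abs(v - center) / mad > mad_cutoff]
    return {"missing": missing, "malformed": malformed, "outliers": outliers}


age_report = audit_column(rows, "age")
income_report = audit_column(rows, "income")
```

A report like this gives concrete row ids to bring to the meetings with department staff, rather than a general complaint that the data is dirty.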
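
The melting and casting mentioned in the resource slide can be illustrated roughly as follows. The reshape package is an R library; this is a plain-Python sketch of the general idea, not its API. The functions `melt` and `cast`, their parameters, and the sample table are hypothetical names chosen for the example.

```python
from collections import defaultdict

# Hypothetical wide table: one row per subject and visit,
# one column per measurement.
wide = [
    {"subject": "a", "visit": 1, "weight": 70.0, "pulse": 62},
    {"subject": "a", "visit": 2, "weight": 71.5, "pulse": 65},
    {"subject": "b", "visit": 1, "weight": 82.0, "pulse": 70},
]


def melt(rows, id_vars):
    """Turn wide rows into long rows: one record per measured cell."""
    long_rows = []
    for row in rows:
        ids = {key: row[key] for key in id_vars}
        for name, value in row.items():
            if name not in id_vars:
                long_rows.append({**ids, "variable": name, "value": value})
    return long_rows


def cast(long_rows, row_var, aggregate):
    """Rebuild a table keyed by row_var with one column per variable.

    Cells that collapse onto the same (row, variable) pair are
    combined with the aggregate function.
    """
    cells = defaultdict(lambda: defaultdict(list))
    for record in long_rows:
        cells[record[row_var]][record["variable"]].append(record["value"])
    return {key: {name: aggregate(vals) for name, vals in cols.items()}
            for key, cols in cells.items()}


def mean(values):
    return sum(values) / len(values)


molten = melt(wide, id_vars=["subject", "visit"])
per_subject = cast(molten, row_var="subject", aggregate=mean)
```

Melting first and casting second keeps the two concerns separate: the long form is a neutral intermediate, and each different summary is just another cast over the same molten rows.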
