Data Munging: the good, the bad and the ugly



Daniel D. Gutierrez' talk at the LA R User Group meetup on Nov. 14, 2013


Published in: Technology



  • 1. DATA MUNGING: The Good, the Bad, and the Ugly
      • Presented by: Daniel D. Gutierrez
  • 2. DATA MUNGING CAN TAKE WORK
      • I kept getting burned by the data munging phase!
      • The importance of data munging to the success of a data science project must be understood
      • The level of difficulty depends on the quality of the data
      • More work is required for dirty, inconsistent, malformed data
      • Can often amount to 70% of overall project time & budget
      • Need to work with the person delivering the data: the ETL engineer
  • 3. GIVE DATA MUNGING SOME RESPECT
      • The data munging phase is often trivialized
      • New data scientists are not always informed about the complexity of data munging: Coursera
      • Example: the amount of data munging work behind the winning entries in the Kaggle Heritage Health Prize competition. Much of the data munging was done in SQL
  • 4. USE CASE EXAMPLE
      • I was given a data set by a client domain “expert”
      • She clearly wanted me to read her mind!
      • The data was awful: inconsistent data types, loads of missing values, poor structure, outliers
      • Delivered in Excel
      • Took many meetings with department staff to iron out BEFORE the data munging could even commence
      • Feature engineering can become “social engineering”: traveling up the corporate food chain to get answers
  • 5. A DATA MUNGING RESOURCE
      • Here is an outline from Hadley Wickham’s Ph.D. thesis
      • “First, you get the data in a form that you can work with ... Second, you plot the data to get a feel for what is going on ... Third, you iterate between graphics and models to build a succinct quantitative summary of the data ... Finally, you look back at what you have done, and contemplate what tools you need to do better in the future”
      • In Chapter 2 he talks a lot about data munging using the reshape package: melting and casting
  • 6. THANK YOU! • Web: • Twitter: @AMULETAnalytics • Email:
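
Slide 5 mentions melting and casting. As a rough illustration of the idea only, and not the reshape package's actual API, the sketch below uses plain Python: "melting" turns a wide table into long rows of identifier, variable, and value, and "casting" regroups those long rows into a new wide shape with an aggregate. The function names `melt` and `cast`, their parameters, and the patient data are all hypothetical and chosen for this example.

```python
from collections import defaultdict


def melt(rows, id_vars):
    """Wide -> long: emit one row per (identifiers, variable, value)."""
    long_rows = []
    for row in rows:
        ids = {key: row[key] for key in id_vars}
        for name, value in row.items():
            if name in id_vars or value is None:
                continue  # skip identifier columns and missing measurements
            long_rows.append({**ids, "variable": name, "value": value})
    return long_rows


def cast(long_rows, row_key, col_key, aggregate):
    """Long -> wide: group values by (row_key, col_key), then aggregate."""
    cells = defaultdict(list)
    for row in long_rows:
        cells[(row[row_key], row[col_key])].append(row["value"])
    table = defaultdict(dict)
    for (rk, ck), values in cells.items():
        table[rk][ck] = aggregate(values)
    return {rk: dict(cols) for rk, cols in table.items()}


def mean(values):
    return sum(values) / len(values)


# Hypothetical wide data: one row per patient visit, one column per measurement.
wide = [
    {"patient": "A", "visit": 1, "weight": 70.0, "height": 1.75},
    {"patient": "A", "visit": 2, "weight": 72.0, "height": None},
    {"patient": "B", "visit": 1, "weight": 80.0, "height": 1.80},
]

molten = melt(wide, id_vars=("patient", "visit"))
by_patient = cast(molten, "patient", "variable", aggregate=mean)
print(by_patient)
```

Once the data is molten, the same long table can be cast in several ways (for example by visit instead of by patient) without touching the original wide layout, which is the main appeal of the two-step approach.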