Data Munging: the good, the bad and the ugly
Daniel D. Gutierrez' talk at the LA R User Group meetup on Nov. 14, 2013

Published in: Technology


Transcript

  • 1. DATA MUNGING: The Good, the Bad, and the Ugly
      • Presented by: Daniel D. Gutierrez
  • 2. DATA MUNGING CAN TAKE WORK
      • I kept getting burned by the data munging phase!
      • The importance of data munging to the success of a data science project must be understood
      • The level of difficulty depends on the quality of the data
      • More work is required for dirty, inconsistent, malformed data
      • Can often amount to 70% of overall project time and budget
      • Need to work with the person delivering the data: the ETL engineer
  • 3. GIVE DATA MUNGING SOME RESPECT
      • The data munging phase is often trivialized
      • New data scientists are not always informed about the complexity of data munging (e.g. Coursera)
      • Example: the amount of data munging work behind the winning entries for the Kaggle competition Heritage Health Prize. Much of the data munging was done in SQL
  • 4. USE CASE EXAMPLE
      • I was given a data set by a client domain “expert”
      • She clearly wanted me to read her mind!
      • The data was awful: inconsistent data types, loads of missing values, poor structure, outliers
      • Delivered in Excel
      • Took many meetings with department staff to iron out BEFORE the data munging could even commence
      • Feature engineering can become “social engineering”: traveling up the corporate food chain to get answers
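The problems listed on this slide (inconsistent data types, missing values, outliers) can be surfaced with a quick audit before any meeting with the data owner. The following is a minimal sketch in plain Python, not something shown in the talk. It assumes the Excel sheet has been exported to CSV, and the names `audit_column`, `audit_table`, `MISSING` and `SAMPLE`, the missing-value markers, and the 1.5 × IQR outlier rule are all illustrative choices.

```python
import csv
import io
import statistics

# Assumed markers for a missing cell; adjust to whatever the data owner uses.
MISSING = {"", "NA", "N/A", "null"}


def audit_column(values):
    """Summarise one column: missing cells, numeric/text mix, crude outliers."""
    missing = 0
    numbers, texts = [], []
    for raw in values:
        cell = (raw or "").strip()  # short rows yield None from DictReader
        if cell in MISSING:
            missing += 1
            continue
        try:
            numbers.append(float(cell))
        except ValueError:
            texts.append(cell)

    outliers = []
    if len(numbers) >= 4:
        # Illustrative rule: flag values beyond 1.5 * IQR from the quartiles.
        q1, _, q3 = statistics.quantiles(numbers, n=4, method="inclusive")
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        outliers = [x for x in numbers if x < low or x > high]

    return {
        "missing": missing,
        "mixed_types": bool(numbers) and bool(texts),
        "outliers": outliers,
    }


def audit_table(csv_text):
    """Run audit_column over every column of a CSV document given as text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return {name: audit_column([row[name] for row in rows]) for name in columns}


# Made-up sample data, only to exercise the functions above.
SAMPLE = """id,age,dept
1,34,Sales
2,,Sales
3,thirty,HR
4,36,HR
5,35,NA
6,33,Sales
7,350,HR
"""

if __name__ == "__main__":
    for column, summary in audit_table(SAMPLE).items():
        print(column, summary)
```

A report like this gives the conversation with department staff something concrete to start from: each flagged column becomes a question for whoever owns the data.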
  • 5. A DATA MUNGING RESOURCE
      • Here is an outline from Hadley Wickham’s Ph.D. thesis:
      • “First, you get the data in a form that you can work with ... Second, you plot the data to get a feel for what is going on ... Third, you iterate between graphics and models to build a succinct quantitative summary of the data ... Finally, you look back at what you have done, and contemplate what tools you need to do better in the future”
      • http://had.co.nz/thesis/practical-tools-hadley-wickham.pdf
      • In Chapter 2 he talks a lot about data munging using the reshape package: melting and casting
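Melting and casting, mentioned on this slide, can be illustrated without the reshape package itself. The sketch below is plain Python and only mirrors the idea: melting turns a wide table into one row per (identifier, variable, value), and casting groups those rows back into a wide table while aggregating. The function names `melt` and `cast`, their signatures, the `variable`/`value` column names, and the sample records are assumptions made for this example, not the R package's API.

```python
import statistics
from collections import defaultdict


def melt(rows, id_vars):
    """Wide -> long: emit one row per (identifiers, variable, value)."""
    long_rows = []
    for row in rows:
        ids = {key: row[key] for key in id_vars}
        for name, value in row.items():
            if name not in id_vars:
                long_rows.append({**ids, "variable": name, "value": value})
    return long_rows


def cast(long_rows, row_key, aggregate=sum):
    """Long -> wide: one output row per row_key, one column per variable."""
    cells = defaultdict(list)
    for row in long_rows:
        cells[(row[row_key], row["variable"])].append(row["value"])

    wide = defaultdict(dict)
    for (key, variable), values in cells.items():
        wide[key][variable] = aggregate(values)
    return [{row_key: key, **columns} for key, columns in wide.items()]


# Made-up measurements: two visits for patient "a", one for patient "b".
wide_rows = [
    {"patient": "a", "visit": 1, "weight": 70, "pulse": 60},
    {"patient": "a", "visit": 2, "weight": 72, "pulse": 64},
    {"patient": "b", "visit": 1, "weight": 80, "pulse": 70},
]

long_rows = melt(wide_rows, id_vars=("patient", "visit"))
per_patient = cast(long_rows, "patient", aggregate=statistics.mean)

if __name__ == "__main__":
    for row in long_rows:
        print(row)
    for row in per_patient:
        print(row)
```

Once the data is in the long form, a different summary is just another `cast` call with a different key or aggregate, which is the appeal of separating the two steps.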
  • 6. THANK YOU!
      • Web: www.amuletanalytics.com
      • Twitter: @AMULETAnalytics
      • Email: dan@amuletc.com