2. DATA MUNGING CAN TAKE WORK
• I kept getting burned by the data munging phase!
• The importance of data munging to the success of a data
science project must be understood
• The level of difficulty depends on the quality of the data
• More work is required for dirty, inconsistent, malformed data
• Can often amount to 70% of overall project time & budget
• Need to work with the person delivering the data: the ETL engineer
3. GIVE DATA MUNGING SOME RESPECT
• Data munging phase is often trivialized
• New data scientists are not always told how complex data
munging is, e.g. in Coursera courses
• Example: the amount of data munging work behind the winning
entries for the Kaggle Heritage Health Prize competition.
Much of the data munging was done in SQL
4. USE CASE EXAMPLE
• I was given a data set by a client domain “expert”
• She clearly wanted me to read her mind!
• The data was awful: inconsistent data types, loads of
missing values, poor structure, outliers
• Delivered in Excel
• Took many meetings with department staff to iron out the issues
BEFORE the data munging could even commence
• Feature engineering can become “social engineering” –
traveling up the corporate food chain to get answers
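A quick profile of each column can show how bad a delivered data set is before the meetings start. The following is only a minimal sketch in plain Python, not the approach used on the project described here: the rows, the column names, the list of missing-value markers, and the helpers `classify` and `profile` are all hypothetical, and the outlier cutoff of 10 median absolute deviations is an arbitrary choice.

```python
from collections import Counter
from statistics import median

# Hypothetical rows, standing in for a messy spreadsheet export.
rows = [
    {"age": "34", "income": "52000"},
    {"age": "", "income": "48000"},
    {"age": "forty", "income": "51000"},
    {"age": "29", "income": "5100000"},
    {"age": "41", "income": None},
]

# Assumed markers for a missing cell; real data may use others.
MISSING = {"", "NA", "N/A"}


def classify(value):
    """Label a raw cell as 'missing', 'number' or 'text'."""
    if value is None or str(value).strip() in MISSING:
        return "missing"
    try:
        float(value)
        return "number"
    except (TypeError, ValueError):
        return "text"


def profile(rows, column):
    """Count cell kinds in one column and flag numeric outliers."""
    kinds = Counter(classify(r.get(column)) for r in rows)
    numbers = [float(r[column]) for r in rows
               if classify(r.get(column)) == "number"]
    outliers = []
    if numbers:
        med = median(numbers)
        # Median absolute deviation: a spread estimate that a few
        # extreme values barely move.
        mad = median([abs(x - med) for x in numbers])
        # Arbitrary cutoff for this sketch: more than 10 MADs from the median.
        outliers = [x for x in numbers if mad and abs(x - med) > 10 * mad]
    return {"kinds": dict(kinds), "outliers": outliers}


for column in ("age", "income"):
    print(column, profile(rows, column))
```

A column that reports both 'number' and 'text' kinds has inconsistent data types, and the counts of 'missing' cells and flagged outliers give concrete questions to take to the people who own the data.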
5. A DATA MUNGING RESOURCE
• Here is an outline from Hadley Wickham’s Ph.D. thesis:
• “First, you get the data in a form that you can work with ...
Second, you plot the data to get a feel for what is going on ...
Third, you iterate between graphics and models to build a
succinct quantitative summary of the data ... Finally, you look
back at what you have done, and contemplate what tools you
need to do better in the future”
• http://had.co.nz/thesis/practical-tools-hadley-wickham.pdf
• In Chapter 2 he talks a lot about data munging using the reshape
package: melting and casting
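To make melting and casting concrete, here is a minimal sketch of the idea in plain Python. It is not the reshape package's API (reshape is an R package): the functions `melt` and `cast`, their parameters, and the sample table are hypothetical stand-ins. Melting turns a wide table into long rows of identifier, variable, and value; casting spreads those rows back into columns.

```python
# Hypothetical wide table: one row per subject, one column per measurement.
wide = [
    {"subject": "a", "height": 170, "weight": 65},
    {"subject": "b", "height": 182, "weight": 80},
]


def melt(rows, id_vars):
    """Wide to long: one output row per (identifiers, variable, value)."""
    long_rows = []
    for row in rows:
        ids = {k: row[k] for k in id_vars}
        for key, value in row.items():
            if key not in id_vars:
                long_rows.append({**ids, "variable": key, "value": value})
    return long_rows


def cast(long_rows, id_vars):
    """Long to wide: spread each variable back into its own column."""
    wide_rows = {}
    for row in long_rows:
        key = tuple(row[k] for k in id_vars)
        # Start a new wide row the first time an identifier combination appears.
        out = wide_rows.setdefault(key, {k: row[k] for k in id_vars})
        out[row["variable"]] = row["value"]
    return list(wide_rows.values())


molten = melt(wide, ["subject"])
for row in molten:
    print(row)

# Casting the molten rows recovers the original wide table.
print(cast(molten, ["subject"]) == wide)
```

The long form is the useful intermediate: once every measurement is a row, filtering, grouping, and re-aggregating become the same operation regardless of how many measurement columns the original table had.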