DATA MUNGING
The Good, the Bad, and the Ugly

Presented by: Daniel D. Gutierrez
DATA MUNGING CAN TAKE WORK

• I kept getting burned by the data munging phase!
• The importance of data munging to the success of a data science project must be understood
• The level of difficulty depends on the quality of the data
• More work is required for dirty, inconsistent, malformed data
• Can often amount to 70% of overall project time & budget
• Need to work with the person delivering the data: the ETL engineer
GIVE DATA MUNGING SOME RESPECT

• Data munging phase is often trivialized
• New data scientists are not always informed about the complexity of data munging: Coursera
• Example: the amount of data munging work behind winning entries for a Kaggle competition: the Heritage Health Prize. Much of the data munging was done in SQL
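As a rough illustration of SQL-based munging, here is a minimal sketch using Python's built-in sqlite3 module. The table, its columns, and the rows are hypothetical and are not taken from any competition data set; the query normalizes a text column, treats empty strings as missing, and aggregates to one row per member and specialty.

```python
import sqlite3

# In-memory database with a hypothetical, deliberately messy claims table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (member_id TEXT, specialty TEXT, days TEXT)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        ("m1", " Internal ", "2"),
        ("m1", "internal", ""),      # empty string stands in for a missing value
        ("m2", "SURGERY", "5"),
        ("m2", "Surgery", "1"),
    ],
)

# Normalize case and whitespace, turn '' into NULL, then aggregate.
query = """
    SELECT member_id,
           LOWER(TRIM(specialty)) AS specialty,
           COUNT(*) AS n_claims,
           SUM(CAST(NULLIF(days, '') AS INTEGER)) AS total_days
    FROM claims
    GROUP BY member_id, LOWER(TRIM(specialty))
    ORDER BY member_id
"""
features = conn.execute(query).fetchall()
print(features)
```

Even in this toy case, most of the query is cleanup rather than analysis, which is the point of the slide.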
USE CASE EXAMPLE

• I was given a data set by a client domain “expert”
• She clearly wanted me to read her mind!
• The data was awful: inconsistent data types, loads of missing values, poor structure, outliers
• Delivered in Excel
• Took many meetings with department staff to iron out the issues BEFORE the data munging could even commence
• Feature engineering can become “social engineering”: traveling up the corporate food chain to get answers
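The problems listed above (inconsistent types, missing values, outliers) can be surfaced with a quick profiling pass before any meetings start. The following is a minimal sketch using only the Python standard library; the rows, the column names, the list of missing-value markers, and the 1.5 × IQR outlier rule are all assumptions chosen for illustration, not details from the client data set.

```python
import statistics

# Hypothetical rows, as they might look after exporting a spreadsheet to text.
rows = [
    {"age": "34", "income": "52000"},
    {"age": "n/a", "income": "48000"},
    {"age": "41", "income": ""},
    {"age": "29", "income": "51000"},
    {"age": "38", "income": "950000"},
    {"age": "thirty", "income": "50000"},
]

# Assumed spellings of "missing"; real data usually needs a longer list.
MISSING = {"", "na", "n/a", "null", "none"}

def to_number(text):
    """Return float(text), or None if the text is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None

def profile(rows, column):
    """Count missing and non-numeric cells and flag outliers in one column."""
    cells = [str(r.get(column, "")).strip() for r in rows]
    present = [c for c in cells if c.lower() not in MISSING]
    numbers = [n for n in map(to_number, present) if n is not None]
    outliers = []
    if len(numbers) >= 4:
        # Quartiles of the observed values; flag anything beyond 1.5 * IQR.
        q1, _, q3 = statistics.quantiles(numbers, n=4, method="inclusive")
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        outliers = [n for n in numbers if n < low or n > high]
    return {
        "missing": len(cells) - len(present),
        "non_numeric": len(present) - len(numbers),
        "outliers": outliers,
    }

for column in ("age", "income"):
    print(column, profile(rows, column))
```

A report like this does not fix anything, but it gives a concrete list of questions to take to the people who own the data.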
A DATA MUNGING RESOURCE

• Here is an outline from Hadley Wickham’s Ph.D. thesis
• “First, you get the data in a form that you can work with ... Second, you plot the data to get a feel for what is going on ... Third, you iterate between graphics and models to build a succinct quantitative summary of the data ... Finally, you look back at what you have done, and contemplate what tools you need to do better in the future”
• http://had.co.nz/thesis/practical-tools-hadley-wickham.pdf
• In Chapter 2 he talks a lot about data munging using the reshape package: melting and casting
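To give a feel for melting and casting, here is a minimal sketch in plain Python. It is an analogue of the idea only, not the reshape package's R API; the function names, the "variable"/"value" column names, and the sample measurements are assumptions made for illustration. Melting turns a wide table into one row per (identifier, variable, value); casting regroups those long rows into a new wide shape, aggregating where several values land in the same cell.

```python
from collections import defaultdict

def melt(rows, id_vars):
    """Wide to long: one output row per (identifiers, variable, value)."""
    long_rows = []
    for row in rows:
        ids = {k: row[k] for k in id_vars}
        for name, value in row.items():
            if name not in id_vars:
                long_rows.append({**ids, "variable": name, "value": value})
    return long_rows

def cast(long_rows, id_var, aggregate=sum):
    """Long to wide: one row per id_var value, one column per variable."""
    groups = defaultdict(lambda: defaultdict(list))
    for r in long_rows:
        groups[r[id_var]][r["variable"]].append(r["value"])
    return [
        {id_var: key, **{name: aggregate(vals) for name, vals in cols.items()}}
        for key, cols in groups.items()
    ]

# Hypothetical repeated measurements in wide form.
wide = [
    {"subject": "a", "time": 1, "height": 10, "weight": 5},
    {"subject": "a", "time": 2, "height": 12, "weight": 6},
    {"subject": "b", "time": 1, "height": 9, "weight": 4},
]

molten = melt(wide, id_vars=["subject", "time"])
by_subject = cast(molten, "subject", aggregate=max)
print(by_subject)
```

Once the data is in the long form, a different summary is just another call to `cast` with a different identifier or aggregate, which is what makes the melt-then-cast pattern convenient.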
THANK YOU!

• Web: www.amuletanalytics.com
• Twitter: @AMULETAnalytics
• Email: dan@amuletc.com
