Data Wrangling
Industry Overview and Business Applicability
Why, What and How
Ashwini Kuntamukkala
Enterprise Architect @ Vizient, Inc
Twitter: @akuntamukkala
Goal: Better Faster Cheaper!
[Chart: Product A, Product B and Product C, in million (USD), 2013 to 2016]
Insights
Better Marketing Campaign
* Typical Business End Game
My data are 100% accurate, but are they?
Vicious cycle
Bad Data -> Incorrect Analysis -> Invalid Insights -> Wrong Decisions -> Poor Outcomes
[Chart: Revenue (million), 2013 to 2016]
Data Quality is an issue…
Data Quality Issue
• Gartner Report
– By 2017, 33% of the largest global companies will experience an information crisis due to their inability to adequately value, govern and trust their enterprise information.
Cartoon made using http://www.toondoo.com/
If you torture the data long enough, it will confess to anything – Darrell Huff
Noise to Signal?
[Diagram: data sources such as DB, machine, sensor]
Data has a habit of replicating itself
Data Wrangling is …
transforming “raw” data so that it can be analyzed to yield insights
Data Wrangling: aka…
• Data Preprocessing
• Data Preparation
• Data Cleansing
• Data Scrubbing
• Data Munging
• Data Transformation
• Data Fold, Spindle, Mutilate…
[Diagram: separating signal from noise]
Data Wrangling Steps
[Diagram: Obtain, Understand, Transform, Augment, Shape]
An approximate answer to the right problem is worth a good deal more than an
exact answer to an approximate problem. – John Tukey
• Iterative process
• Understand
• Explore
• Transform
• Augment
• Visualize
• Share
Let’s take a PDF Invoice…for example
Let’s take an image…
Python + Textract + Tesseract
Understand your data
“Looks like my V8 Chevy is running low on fuel. Didn’t I fill up just the day before?”
DALDFWSFOEWRBOSDCALAXORDJFKMCO
Or
Owner  Vehicle  Type  Fuel Level  Engine  Last Fill
AK     Chevy    Gas   5%          V8      05/04/16
DAL DFW SFO EWR BOS DCA LAX ORD JFK MCO
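The run-together string on this slide only becomes readable once you know it is a sequence of airport codes. A minimal Python sketch of that step, assuming every code is exactly three characters long (an assumption suggested by the spaced-out version, not stated on the slide); `split_fixed_width` is a hypothetical helper name:

```python
def split_fixed_width(text: str, width: int = 3) -> list[str]:
    """Cut a run-together string into equal-width chunks."""
    if len(text) % width != 0:
        # A leftover fragment means the fixed-width assumption is wrong.
        raise ValueError(f"length {len(text)} is not a multiple of {width}")
    return [text[i:i + width] for i in range(0, len(text), width)]


raw = "DALDFWSFOEWRBOSDCALAXORDJFKMCO"
codes = split_fixed_width(raw)
print(" ".join(codes))
```

The length check is the useful part: if the input does not divide evenly, it is better to stop and look at the data than to emit a silently truncated last code.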
Outliers
Age (Years): 75, 80, 65, 55, 67, 78, 88, 90, 45, 58, 69, 80, 110 ???
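One common way to flag a value like 110 for review is the interquartile-range (IQR) rule. The slide does not say which method the author has in mind, so the sketch below is only one possible choice; the 1.5 multiplier and the "inclusive" quartile method are assumptions, and `find_outliers` is a hypothetical helper name.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # "inclusive" treats the data as the whole population when
    # interpolating quartiles (an assumption made for this sketch).
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


ages = [75, 80, 65, 55, 67, 78, 88, 90, 45, 58, 69, 80, 110]
suspects = find_outliers(ages)
print(suspects)
```

With a sample this small the result is sensitive to how the quartiles are computed, so a flagged value is a prompt to investigate (is an age of 110 plausible in this domain?), not a reason to delete the record automatically.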
Missing Values
• Missing with a bias
• Missing at random
• Missing completely
• Missing due to inapplicability
• Missing due to invalid data and ingestion
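Before deciding which kind of missingness applies, it helps to measure how much is missing and where. The sketch below counts missing cells per column; the list of placeholder tokens and the sample rows are hypothetical, and the code does not attempt to classify the cause of the gaps.

```python
# Placeholder strings treated as "missing" (an assumption; adjust per source).
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "-", "?"}


def is_missing(value) -> bool:
    """True for None and for strings that only stand in for a missing value."""
    if value is None:
        return True
    return str(value).strip().lower() in MISSING_TOKENS


def missing_report(rows):
    """Count missing cells per column for a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + int(is_missing(value))
    return counts


# Hypothetical records, loosely modeled on the vehicle example in this deck.
rows = [
    {"owner": "AK", "fuel_level": "5%", "last_fill": "05/04/16"},
    {"owner": "BZ", "fuel_level": "N/A", "last_fill": ""},
    {"owner": None, "fuel_level": "40%", "last_fill": "?"},
]
report = missing_report(rows)
print(report)
```

A column that is missing far more often than the others, or only for one group of records, is a hint that the gaps are biased rather than random.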
Types of data
• Qualitative
– Subjective
• Quantitative
– Discrete
– Continuous
• Categorical
Data Source Selection Criteria
• Credible
• Complete
• Verifiable
• Accurate
• Current
• Compliance
• Accessible
• Cost
• Legal
• Security
• Storage
• Provenance
Tidy Data: Not all tables are created equal
The values of the variable "year" are spread across the column headers:

School           2012  2013  2014
Good Samaritans  2321  4550  1293
Percy Grammar    1540  1400  2949

Each row is an observation and each column is a variable:

School           Year  Student Count
Good Samaritans  2012  2321
Good Samaritans  2013  4550
Good Samaritans  2014  1293
Percy Grammar    2012  1540
Percy Grammar    2013  1400
Percy Grammar    2014  2949
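The reshaping on this slide (year columns turned into one row per school and year) can be sketched in a few lines of plain Python. `melt_years` is a hypothetical helper written for this illustration; the deck's hands-on part uses R's tidyr for this kind of work, and pandas offers `melt` for the same purpose.

```python
def melt_years(rows, id_column="School"):
    """Turn one row per school (a column per year) into one row per school and year."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append({
                id_column: row[id_column],
                "Year": int(column),       # the header itself is data
                "Student Count": value,
            })
    return long_rows


wide = [
    {"School": "Good Samaritans", "2012": 2321, "2013": 4550, "2014": 1293},
    {"School": "Percy Grammar", "2012": 1540, "2013": 1400, "2014": 2949},
]
tidy = melt_years(wide)
for record in tidy:
    print(record)
```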
Tidy Data: Not all tables are created equal

Year  Comedy-Q1  Thriller-Q1  Action-Q1  …
2014  2          1            0
2015  0          3            2

Find total comedy movies in all of 2014? -> Not easy in current form

Category  Quarter  Year  #Hits
Comedy    Q1       2014  2
Thriller  Q1       2014  1
Action    Q1       2014  0
Comedy    Q1       2015  0
Thriller  Q1       2015  3
Action    Q1       2015  2

Find % of hit comedy movies in 2015?
Very easy to add a new column
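Here the column headers pack two variables (category and quarter) into one string. A minimal sketch of splitting them apart and then answering the two questions on the slide; only the Q1 columns visible on the slide are included (the elided "…" columns are left out), and `tidy_hits` is a hypothetical helper name.

```python
wide = [
    {"Year": 2014, "Comedy-Q1": 2, "Thriller-Q1": 1, "Action-Q1": 0},
    {"Year": 2015, "Comedy-Q1": 0, "Thriller-Q1": 3, "Action-Q1": 2},
]


def tidy_hits(rows):
    """Split 'Category-Quarter' headers into two columns, one row per cell."""
    tidy = []
    for row in rows:
        for header, hits in row.items():
            if header == "Year":
                continue
            category, quarter = header.split("-", 1)
            tidy.append({"Category": category, "Quarter": quarter,
                         "Year": row["Year"], "Hits": hits})
    return tidy


tidy = tidy_hits(wide)

# Total comedy hits in 2014 (over the quarters present in the data).
comedy_2014 = sum(r["Hits"] for r in tidy
                  if r["Category"] == "Comedy" and r["Year"] == 2014)

# Share of 2015 hits that were comedies.
total_2015 = sum(r["Hits"] for r in tidy if r["Year"] == 2015)
comedy_2015 = sum(r["Hits"] for r in tidy
                  if r["Category"] == "Comedy" and r["Year"] == 2015)
comedy_share_2015 = comedy_2015 / total_2015 if total_2015 else 0.0
print(comedy_2014, comedy_share_2015)
```

Once the table is tidy, both questions reduce to a filter plus a sum, which is the point the slide is making.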
Tidy Data: Not all tables are created equal
Very messy data: variables in both rows and columns

Category  Rating     Q1  Q2  Q3  …
Comedy    Excellent  1   0   1
Comedy    Good       2   0   2
Thriller  Excellent  0   1   1
Thriller  Good       1   0   3

Each row is a complete observation:

Category  Quarter  Excellent  Good
Comedy    Q1       1          2
Comedy    Q2       0          0
Comedy    Q3       1          2
Thriller  Q1       0          1
Thriller  Q2       1          0
Thriller  Q3       1          3
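When variables sit in both rows and columns, the fix takes two moves at once: the quarter headers become values, and the rating values become headers. A sketch under the assumption that only Q1 to Q3 are present (the slide elides further columns); `tidy_ratings` is a hypothetical helper name.

```python
messy = [
    {"Category": "Comedy", "Rating": "Excellent", "Q1": 1, "Q2": 0, "Q3": 1},
    {"Category": "Comedy", "Rating": "Good", "Q1": 2, "Q2": 0, "Q3": 2},
    {"Category": "Thriller", "Rating": "Excellent", "Q1": 0, "Q2": 1, "Q3": 1},
    {"Category": "Thriller", "Rating": "Good", "Q1": 1, "Q2": 0, "Q3": 3},
]


def tidy_ratings(rows, quarters=("Q1", "Q2", "Q3")):
    """Build one row per (Category, Quarter) with a column per rating."""
    observations = {}  # dicts preserve insertion order
    for row in rows:
        for quarter in quarters:
            key = (row["Category"], quarter)
            obs = observations.setdefault(
                key, {"Category": row["Category"], "Quarter": quarter})
            obs[row["Rating"]] = row[quarter]
    return list(observations.values())


tidy = tidy_ratings(messy)
for record in tidy:
    print(record)
```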
Tidy Data: Not all tables are created equal

Invoice  Bill To      Sales %  Total($)  SKU#  Item          Qty  Unit Price ($)
1        Jim Jones    8        8.03      A123  Hammer        1    3.55
1        Jim Jones    8        8.03      Q34   Screw Driver  2    2.05
2        Mike Z’Kale  8        97.20     W23   Hair Dryer    1    59.25
2        Mike Z’Kale  8        97.20     E452  Cologne       3    10.25

Normalize to avoid duplication:

Invoice  Bill To      Sales %  Total($)
1        Jim Jones    8        8.03
2        Mike Z’Kale  8        97.20

Invoice  SKU#  Item          Qty  Unit Price ($)
1        A123  Hammer        1    3.55
1        Q34   Screw Driver  2    2.05
2        W23   Hair Dryer    1    59.25
2        E452  Cologne       3    10.25
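The split into an invoice table and a line-item table can be sketched as follows. The field groupings mirror the two tables on the slide; `normalize` is a hypothetical helper, and the consistency check is an addition of this sketch rather than something the slide describes.

```python
flat = [
    {"Invoice": 1, "Bill To": "Jim Jones", "Sales %": 8, "Total($)": 8.03,
     "SKU#": "A123", "Item": "Hammer", "Qty": 1, "Unit Price ($)": 3.55},
    {"Invoice": 1, "Bill To": "Jim Jones", "Sales %": 8, "Total($)": 8.03,
     "SKU#": "Q34", "Item": "Screw Driver", "Qty": 2, "Unit Price ($)": 2.05},
    {"Invoice": 2, "Bill To": "Mike Z’Kale", "Sales %": 8, "Total($)": 97.20,
     "SKU#": "W23", "Item": "Hair Dryer", "Qty": 1, "Unit Price ($)": 59.25},
    {"Invoice": 2, "Bill To": "Mike Z’Kale", "Sales %": 8, "Total($)": 97.20,
     "SKU#": "E452", "Item": "Cologne", "Qty": 3, "Unit Price ($)": 10.25},
]

HEADER_FIELDS = ("Invoice", "Bill To", "Sales %", "Total($)")
ITEM_FIELDS = ("Invoice", "SKU#", "Item", "Qty", "Unit Price ($)")


def normalize(rows):
    """Split flat rows into one record per invoice plus one per line item."""
    invoices = {}
    items = []
    for row in rows:
        header = {field: row[field] for field in HEADER_FIELDS}
        # Keep the first header seen; repeated copies must agree with it.
        existing = invoices.setdefault(row["Invoice"], header)
        if existing != header:
            raise ValueError(f"conflicting data for invoice {row['Invoice']}")
        items.append({field: row[field] for field in ITEM_FIELDS})
    return list(invoices.values()), items


invoices, items = normalize(flat)
print(len(invoices), "invoices,", len(items), "line items")
```

The conflict check shows why duplication is a data-quality risk: as long as the same fact is stored on several rows, the copies can drift apart.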
Tidy Data: Not all tables are created equal
Multiple tables divided by time (the slide repeats this table once per period):

Category  Quarter  Year  #Hits
Comedy    Q1       2014  2
Thriller  Q1       2014  1
Action    Q1       2014  0

Combine all tables, accommodating varying formats:

Category  Quarter  Year  #Hits
Comedy    Q1       2014  2
Thriller  Q1       2014  1
Action    Q1       2014  0
Comedy    Q1       2014  2
Thriller  Q1       2014  1
Action    Q1       2014  0
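Combining per-period tables usually means mapping each table's headers and value types onto one agreed layout first. In the sketch below the second table's header spellings and string-typed numbers are hypothetical variations invented to illustrate "varying formats" (its values are borrowed from the 2015 rows shown earlier in the deck); `standardize` and `combine` are hypothetical helper names.

```python
# Two per-period tables; the second uses different headers and string values.
hits_2014 = [
    {"Category": "Comedy", "Quarter": "Q1", "Year": 2014, "#Hits": 2},
    {"Category": "Thriller", "Quarter": "Q1", "Year": 2014, "#Hits": 1},
    {"Category": "Action", "Quarter": "Q1", "Year": 2014, "#Hits": 0},
]
hits_2015 = [
    {"category": "Comedy", "qtr": "Q1", "year": "2015", "hits": "0"},
    {"category": "Thriller", "qtr": "Q1", "year": "2015", "hits": "3"},
    {"category": "Action", "qtr": "Q1", "year": "2015", "hits": "2"},
]

# Known header spellings mapped to the target column names (assumed list).
ALIASES = {
    "category": "Category",
    "quarter": "Quarter", "qtr": "Quarter",
    "year": "Year",
    "#hits": "#Hits", "hits": "#Hits",
}
NUMERIC = {"Year", "#Hits"}


def standardize(row):
    """Rename columns to the target layout and coerce numeric fields."""
    clean = {}
    for column, value in row.items():
        name = ALIASES[column.strip().lower()]  # KeyError flags an unknown header
        clean[name] = int(value) if name in NUMERIC else value
    return clean


def combine(*tables):
    """Stack any number of tables after standardizing every row."""
    return [standardize(row) for table in tables for row in table]


combined = combine(hits_2014, hits_2015)
for record in combined:
    print(record)
```

Failing loudly on an unknown header is deliberate: a new spelling in next period's file should be reviewed and added to the alias map, not silently dropped.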
Schema-On-Design Vs Schema-On-Read
Spoiled for Choice!
Popular Open Source Options
http://schoolofdata.org/
http://okfnlabs.org/
Commercial Vendors
Hands-On Exercises
Hands on Data Wrangling
• Data Ingestion
– CSV
– PDF
– API/JSON
– HTML Web Scraping
• Data Exploration
– Visual inspection
– Graphing
• Data Shaping
– Tidying Data
• Data Cleansing
– Missing values
– Format
– Outliers
– Data Errors Per Domain
– Fat Fingered Data
• Data Augmenting
– Aggregate data sources
– Fuzzy/Exact match
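For the "Fuzzy/Exact match" step, a minimal sketch using Python's standard-library `difflib` is shown here. The reference names come from the school example earlier in the deck, the misspelled inputs are hypothetical, and the 0.8 cutoff is an assumption that would need tuning on real data; `match_name` is a hypothetical helper name.

```python
import difflib


def match_name(name, canonical, cutoff=0.8):
    """Return the canonical spelling for name, or None if nothing is close."""
    if name in canonical:
        return name  # exact match, no fuzziness needed
    # Compare case-insensitively but return the canonical casing.
    lookup = {c.lower(): c for c in canonical}
    hits = difflib.get_close_matches(
        name.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


canonical = ["Good Samaritans", "Percy Grammar"]
for raw in ["Good Samaritens", "percy grammer", "Springfield High"]:
    print(repr(raw), "->", match_name(raw, canonical))
```

Trying the exact match first keeps the common case cheap and unambiguous; names that return `None` are better routed to manual review than forced onto the nearest candidate.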
R Basics
• Data Types
– Numeric
– Character
– Logical
– Categorical aka Factor
– Date
– List
– Matrix
– Data Frame
– Data Table
• Regular Expressions
• Libraries
– stringr
– dplyr
– tidyr
– readxl, xlsx
– lubridate
– gtools
– plyr
– rvest
• Control Statements
Trifacta Wrangler
OpenRefine (formerly Google Refine)
Why should you care?
• Better Outcomes
• Tooling Innovation
• Increased Productivity
• Ease of use
• Lessened skill gap
• Great skill to have, per Indeed.com
Thank you & See you @
Dallas May 13-15 2016
• Las Colinas Convention Center, 500 West Las Colinas Boulevard, Irving, TX 75039
Thank you for your participation

Data Wrangling

  • 1.
  • 2. Industry Overview and Business Applicability Why, What and How Data Wrangling Ashwini Kuntamukkala Enterprise Architect @ Vizient, Inc Twitter: @akuntamukkala
  • 3. Goal: Better Faster Cheaper! 0 1 2 3 4 5 2013 2014 2015 2016 Product A Product B Product C Insights Better Marketing Campaign * Typical Business End Game My data are 100% accurate but are they? Million(USD)
  • 5. Data Quality Issue • Gartner Report • By 2017, 33% of the largest global companies will experience an information crisis due to their inability to adequately value, govern and trust their enterprise information. Cartoonmadeusinghttp://www.toondoo.com/ If you torture the data long enough, it will confess to anything – Darrell Huff
  • 6. Noise to Signal? DB Machine sensor Data has a habit of replicating itself
  • 7. Data Wrangling is … transforming “raw” analyzed insights
  • 8. Data Wrangling: aka… • Data Preprocessing • Data Preparation • Data Cleansing • Data Scrubbing • Data Munging • Data Transformation • Data Fold, Spindle, Mutilate… signal noise
  • 9. Data Wrangling Steps Obtain Understand Transform Augment Shape An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem. – John Tukey • Iterative process • Understand • Explore • Transform • Augment • Visualize Share
  • 10. Let’s take a PDF Invoice…for example
  • 11. Let’s take an image… Python + Textract + Tesseract
  • 12. Understand your data
      “Looks like my V8 Chevy is running low on fuel. Didn’t I fill up just the day before?”

      Owner  Vehicle  Type  Fuel Level  Engine  Last Fill
      AK     Chevy    Gas   5%          V8      05/04/16

      DALDFWSFOEWRBOSDCALAXORDJFKMCO
      Or
      DAL DFW SFO EWR BOS DCA LAX ORD JFK MCO
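The airport-code example on slide 12 can be sketched in a few lines of Python. This is only an illustration and assumes that every code is exactly three letters wide; the function name `split_fixed_width` is made up for this sketch and does not come from the deck.

```python
def split_fixed_width(text, width=3):
    """Split a run-together string into equal-width chunks.

    Assumes every field has the same width (three letters for airport
    codes); raises if the input cannot be divided evenly.
    """
    if len(text) % width != 0:
        raise ValueError("length is not a multiple of the chunk width")
    return [text[i:i + width] for i in range(0, len(text), width)]


raw = "DALDFWSFOEWRBOSDCALAXORDJFKMCO"  # string shown on slide 12
codes = split_fixed_width(raw)
print(" ".join(codes))  # DAL DFW SFO EWR BOS DCA LAX ORD JFK MCO
```

Knowing the field width is the "understanding" step: without it, the same string could be cut in many other ways.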
  • 14. Missing Values • Missing with a bias • Missing @ Random • Missing completely • Missing due to inapplicability • Missing due to invalid data and ingestion
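A rough first check for the kinds of missingness listed on slide 14 is to count missing values per field and then compare the missing rate across groups. The sketch below uses plain Python; the records, the field names and the age bands are hypothetical and are not taken from the slides. A large gap between groups hints at "missing with a bias", though it does not prove it.

```python
from collections import Counter

# Hypothetical records (not from the slides); None marks a missing value.
records = [
    {"age": 75, "income": 42000},
    {"age": 80, "income": None},
    {"age": 45, "income": 51000},
    {"age": 88, "income": None},
    {"age": 58, "income": 39000},
    {"age": None, "income": 47000},
]


def missing_counts(rows):
    """Count missing (None) values per field."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                counts[field] += 1
    return dict(counts)


def missing_rate_by_group(rows, field, group_fn):
    """Share of rows missing `field` within each group returned by group_fn."""
    totals, missing = Counter(), Counter()
    for row in rows:
        group = group_fn(row)
        totals[group] += 1
        if row[field] is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


def age_band(row):
    """Illustrative grouping: split respondents at age 80."""
    if row["age"] is None:
        return "unknown"
    return "80+" if row["age"] >= 80 else "under 80"


print(missing_counts(records))
print(missing_rate_by_group(records, "income", age_band))
```

In this made-up sample, income is missing only for the oldest respondents, which is the pattern one would want to investigate before imputing or dropping rows.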
  • 15. Types of data • Qualitative – Subjective • Quantitative – Discrete – Continuous • Categorical
  • 16. Data Source Selection Criteria • Credible • Complete • Verifiable • Accurate • Current • Compliance • Accessible • Cost • Legal • Security • Storage • Provenance
  • 17. Tidy Data: Not all tables are created equal
      Wide form (the values of the variable "year" are spread across columns):
      School           2012  2013  2014
      Good Samaritans  2321  4550  1293
      Percy Grammar    1540  1400  2949

      Tidy form (each column is a variable, each row is an observation):
      School           Year  Student Count
      Good Samaritans  2012  2321
      Good Samaritans  2013  4550
      Good Samaritans  2014  1293
      Percy Grammar    2012  1540
      Percy Grammar    2013  1400
      Percy Grammar    2014  2949
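The wide-to-tidy reshaping on slide 17 can be sketched in plain Python as follows. The helper name `melt` is borrowed from common data-frame vocabulary and is an assumption of this sketch; the deck does not show code for this step.

```python
# School table from slide 17, one dict per row of the wide form.
wide = [
    {"School": "Good Samaritans", "2012": 2321, "2013": 4550, "2014": 1293},
    {"School": "Percy Grammar", "2012": 1540, "2013": 1400, "2014": 2949},
]


def melt(rows, id_field, var_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue
            long_rows.append(
                {id_field: row[id_field], var_name: column, value_name: value}
            )
    return long_rows


tidy = melt(wide, "School", "Year", "Student Count")
for record in tidy:
    print(record["School"], record["Year"], record["Student Count"])
```

Each output row is one observation (a school in a year), so adding another year means adding rows rather than another column.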
  • 18. Tidy Data: Not all tables are created equal
      Current form:
      Year  Comedy-Q1  Thriller-Q1  Action-Q1  …
      2014  2          1            0
      2015  0          3            2
      Find total comedy movies in all of 2014? -> Not easy in current form

      Tidy form:
      Category  Quarter  Year  #Hits
      Comedy    Q1       2014  2
      Thriller  Q1       2014  1
      Action    Q1       2014  0
      Comedy    Q1       2015  0
      Thriller  Q1       2015  3
      Action    Q1       2015  2
      Find % of hit comedy movies in 2015? Very easy to add a new column
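Slide 18 packs two variables (category and quarter) into each column header. A minimal Python sketch of pulling them apart, assuming the header always has the form `Category-Quarter`, is given below. Only the Q1 columns visible on the slide are included; the columns hidden behind "…" are not guessed at.

```python
# Rows from slide 18; only the columns visible on the slide are included.
wide = [
    {"Year": 2014, "Comedy-Q1": 2, "Thriller-Q1": 1, "Action-Q1": 0},
    {"Year": 2015, "Comedy-Q1": 0, "Thriller-Q1": 3, "Action-Q1": 2},
]


def tidy_hits(rows):
    """Split 'Category-Quarter' headers into two separate variables."""
    tidy = []
    for row in rows:
        for column, hits in row.items():
            if column == "Year":
                continue
            category, quarter = column.split("-", 1)
            tidy.append(
                {"Category": category, "Quarter": quarter,
                 "Year": row["Year"], "Hits": hits}
            )
    return tidy


hits = tidy_hits(wide)

# With tidy rows the slide's question becomes a simple filter and sum.
comedy_2014 = sum(
    r["Hits"] for r in hits if r["Category"] == "Comedy" and r["Year"] == 2014
)
print(comedy_2014)
```

Once further quarters are present as rows, the same filter answers "all of 2014" without any change to the code.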
  • 19. Tidy Data: Not all tables are created equal
      Very messy data (variables in both rows and columns):
      Category  Rating     Q1  Q2  Q3  …
      Comedy    Excellent  1   0   1
      Comedy    Good       2   0   2
      Thriller  Excellent  0   1   1
      Thriller  Good       1   0   3

      Tidy form (each row is a complete observation):
      Category  Quarter  Excellent  Good
      Comedy    Q1       1          2
      Comedy    Q2       0          0
      Comedy    Q3       1          2
      Thriller  Q1       0          1
      Thriller  Q2       1          0
      Thriller  Q3       1          3
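The reshaping on slide 19 needs two moves at once: the quarter columns become rows, and the rating values become columns. The following plain-Python sketch shows one way to do it; the function name `reshape` and the fixed tuple of quarters are assumptions made for illustration.

```python
# Messy table from slide 19 (only Q1 to Q3 are visible on the slide).
messy = [
    {"Category": "Comedy", "Rating": "Excellent", "Q1": 1, "Q2": 0, "Q3": 1},
    {"Category": "Comedy", "Rating": "Good", "Q1": 2, "Q2": 0, "Q3": 2},
    {"Category": "Thriller", "Rating": "Excellent", "Q1": 0, "Q2": 1, "Q3": 1},
    {"Category": "Thriller", "Rating": "Good", "Q1": 1, "Q2": 0, "Q3": 3},
]


def reshape(rows, quarters=("Q1", "Q2", "Q3")):
    """Move quarters into rows and ratings into columns."""
    cells = {}  # (category, quarter) -> tidy row being built
    for row in rows:
        for quarter in quarters:
            key = (row["Category"], quarter)
            tidy_row = cells.setdefault(
                key, {"Category": row["Category"], "Quarter": quarter}
            )
            tidy_row[row["Rating"]] = row[quarter]
    return list(cells.values())


tidy = reshape(messy)
for record in tidy:
    print(record)
```

Each resulting row describes one category in one quarter, which matches the "complete observation" layout on the slide.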
  • 20. Tidy Data: Not all tables are created equal
      Combined table (invoice details repeated on every line item):
      Invoice  Bill To      Sales %  Total($)  SKU#  Item          Qty  Unit Price ($)
      1        Jim Jones    8        8.03      A123  Hammer        1    3.55
      1        Jim Jones    8        8.03      Q34   Screw Driver  2    2.05
      2        Mike Z’Kale  8        97.20     W23   Hair Dryer    1    59.25
      2        Mike Z’Kale  8        97.20     E452  Cologne       3    10.25

      Normalize to avoid duplication:
      Invoice  Bill To      Sales %  Total($)
      1        Jim Jones    8        8.03
      2        Mike Z’Kale  8        97.20

      Invoice  SKU#  Item          Qty  Unit Price ($)
      1        A123  Hammer        1    3.55
      1        Q34   Screw Driver  2    2.05
      2        W23   Hair Dryer    1    59.25
      2        E452  Cologne       3    10.25
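The normalization on slide 20 can be sketched as splitting each flat row into an invoice-level part and a line-item part, keyed by the invoice number. The field groupings below follow the two tables on the slide; the function name and the consistency check are additions made for this sketch.

```python
# Flat invoice rows from slide 20.
flat = [
    {"Invoice": 1, "Bill To": "Jim Jones", "Sales %": 8, "Total($)": 8.03,
     "SKU#": "A123", "Item": "Hammer", "Qty": 1, "Unit Price ($)": 3.55},
    {"Invoice": 1, "Bill To": "Jim Jones", "Sales %": 8, "Total($)": 8.03,
     "SKU#": "Q34", "Item": "Screw Driver", "Qty": 2, "Unit Price ($)": 2.05},
    {"Invoice": 2, "Bill To": "Mike Z’Kale", "Sales %": 8, "Total($)": 97.20,
     "SKU#": "W23", "Item": "Hair Dryer", "Qty": 1, "Unit Price ($)": 59.25},
    {"Invoice": 2, "Bill To": "Mike Z’Kale", "Sales %": 8, "Total($)": 97.20,
     "SKU#": "E452", "Item": "Cologne", "Qty": 3, "Unit Price ($)": 10.25},
]

HEADER_FIELDS = ("Invoice", "Bill To", "Sales %", "Total($)")
LINE_FIELDS = ("Invoice", "SKU#", "Item", "Qty", "Unit Price ($)")


def normalize_invoices(rows):
    """Split flat rows into one invoice table and one line-item table."""
    invoices = {}  # invoice number -> invoice-level fields
    line_items = []
    for row in rows:
        header = {field: row[field] for field in HEADER_FIELDS}
        existing = invoices.setdefault(row["Invoice"], header)
        if existing != header:
            # Duplicated invoice data that disagrees is a data error.
            raise ValueError(f"conflicting data for invoice {row['Invoice']}")
        line_items.append({field: row[field] for field in LINE_FIELDS})
    return list(invoices.values()), line_items


invoices, line_items = normalize_invoices(flat)
print(len(invoices), "invoices,", len(line_items), "line items")
```

Storing the invoice-level fields once means a correction to "Bill To" or "Total($)" is made in a single place.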
  • 21. Tidy Data: Not all tables are created equal
      Multiple tables divided by time (the slide repeats a table of this shape several times):
      Category  Quarter  Year  #Hits
      Comedy    Q1       2014  2
      Thriller  Q1       2014  1
      Action    Q1       2014  0

      Combine all tables, accommodating varying formats:
      Category  Quarter  Year  #Hits
      Comedy    Q1       2014  2
      Thriller  Q1       2014  1
      Action    Q1       2014  0
      Comedy    Q1       2014  2
      Thriller  Q1       2014  1
      Action    Q1       2014  0
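Combining tables that arrive in slightly different formats, as slide 21 describes, usually starts with mapping each table's column names onto one canonical set. The slide does not show what the varying formats look like, so the alias names (`genre`, `qtr`, `hits`) and the string-typed numbers in `table_b` below are purely hypothetical.

```python
# Map of lower-cased column aliases to canonical names.
# The aliases other than the slide's own headers are hypothetical.
CANONICAL = {
    "category": "Category", "genre": "Category",
    "quarter": "Quarter", "qtr": "Quarter",
    "year": "Year",
    "#hits": "Hits", "hits": "Hits",
}


def standardize(row):
    """Rename columns to canonical names and coerce numeric fields."""
    out = {}
    for column, value in row.items():
        key = column.strip().lower()
        if key not in CANONICAL:
            raise KeyError(f"unknown column: {column!r}")
        out[CANONICAL[key]] = value
    out["Year"] = int(out["Year"])
    out["Hits"] = int(out["Hits"])
    return out


def combine(tables):
    """Concatenate tables after bringing every row to the same format."""
    return [standardize(row) for table in tables for row in table]


# Table in the layout shown on slide 21.
table_a = [
    {"Category": "Comedy", "Quarter": "Q1", "Year": 2014, "#Hits": 2},
    {"Category": "Thriller", "Quarter": "Q1", "Year": 2014, "#Hits": 1},
    {"Category": "Action", "Quarter": "Q1", "Year": 2014, "#Hits": 0},
]
# Same rows in a hypothetical variant format (different headers, text numbers).
table_b = [
    {"genre": "Comedy", "qtr": "Q1", "year": "2014", "hits": "2"},
    {"genre": "Thriller", "qtr": "Q1", "year": "2014", "hits": "1"},
    {"genre": "Action", "qtr": "Q1", "year": "2014", "hits": "0"},
]

combined = combine([table_a, table_b])
print(len(combined), "rows")
```

Failing loudly on an unknown column is a deliberate choice here: a silently dropped column is harder to notice than an error.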
  • 28. Hands-on Data Wrangling
      • Data Ingestion: CSV, PDF, API/JSON, HTML Web Scraping
      • Data Exploration: Visual inspection, Graphing
      • Data Shaping: Tidying Data
      • Data Cleansing: Missing values, Format, Outliers, Data Errors Per Domain, Fat Fingered Data
      • Data Augmenting: Aggregate data sources, Fuzzy/Exact match
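The "Fuzzy/Exact match" item under Data Augmenting can be illustrated with Python's standard `difflib` module: try an exact match on a normalized key first, then fall back to the closest candidate above a similarity cutoff. The customer names reuse slide 20; the misspelled inputs, the cutoff of 0.8 and the helper names are assumptions of this sketch.

```python
from difflib import get_close_matches


def normalize(name):
    """Lower-case and collapse whitespace so trivial differences vanish."""
    return " ".join(name.lower().split())


def match_name(name, known_names, cutoff=0.8):
    """Return (matched_name, kind), where kind is exact, fuzzy or unmatched."""
    lookup = {normalize(known): known for known in known_names}
    key = normalize(name)
    if key in lookup:
        return lookup[key], "exact"
    close = get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    if close:
        return lookup[close[0]], "fuzzy"
    return None, "unmatched"


customers = ["Jim Jones", "Mike Z’Kale"]  # names from slide 20

for raw in ["  jim   JONES", "Mike ZKale", "Ann Lee"]:  # hypothetical inputs
    print(repr(raw), "->", match_name(raw, customers))
```

Reporting whether a match was exact or fuzzy lets a reviewer check only the fuzzy ones by hand, which is often where "fat fingered" entries surface.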
  • 29. R Basics
      • Data Types: Numeric, Character, Logical, Categorical aka Factor, Date, List, Matrix, Data Frame, Data Table
      • Regular Expressions
      • Libraries: stringr, dplyr, tidyr, readxl, xlsx, lubridate, gtools, plyr, rvest
      • Control Statements
  • 32. Why should you care? • Better Outcomes • Tooling Innovation • Increased Productivity • Ease of use • Lessened skill gap • Great skill to have per Indeed.com
  • 33. Thank you & See you @ Dallas May 13-15 2016 • Las Colinas Convention Center 500 West Las Colinas Boulevard, Irving, TX 75039
  • 34. Thank you for your participation
