It's Time to Start Caring About
Data Quality
Data Quality at Scale
Ignacio Elola
Everyone is talking about how
useful data is
data can save your business
data can save your life
but...
all that is only true if you have the
right data
data tend to be dirty and
unstructured
especially web data!
Let’s start simple
I’ve created an extractor
I’ve passed it a bunch of queries (bulk)
and got a dataset
How can you QA this data?
eyeballing
by eyeballing we can find anomalies
without having domain expertise
Quick summary:
- create extractors
- combine extractors
- schedule data extraction
What if we need to scale up?
if you have:
- more than ~3 data sources
- more than ~2 extractors per data source
- big volume of queries
- pre- or post-processing
you will need:
- people to create and maintain
extractors
- process to clean and validate
data
Data Quality
think about it both before and after data
extraction!
tips and tricks to increase data
quality
XPaths
//div[@id="priceBlock"]/table/tbody/tr/td[b/@class="priceLarge"]/b
better than
//*[@id="priceBlock"]/table/tbody/tr[2]/td[2]/b[1]
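A minimal sketch of why the first XPath is the better one, using Python's standard-library xml.etree.ElementTree. The two page snippets are made up for illustration. ElementTree only supports a subset of XPath, so the class-based expression below is a close adaptation of the one above, not a literal copy.

```python
import xml.etree.ElementTree as ET

# Hypothetical product page (well-formed markup so ElementTree can parse it).
PAGE_V1 = """
<html><body><div id="priceBlock"><table><tbody>
<tr><td>List price</td><td><b>$25.00</b></td></tr>
<tr><td>Price</td><td><b class="priceLarge">$19.99</b></td></tr>
</tbody></table></div></body></html>
"""

# The same page after a layout change: a promo row was added on top.
PAGE_V2 = """
<html><body><div id="priceBlock"><table><tbody>
<tr><td>Promo</td><td><b>Free shipping</b></td></tr>
<tr><td>List price</td><td><b>$25.00</b></td></tr>
<tr><td>Price</td><td><b class="priceLarge">$19.99</b></td></tr>
</tbody></table></div></body></html>
"""

# Anchored on the id and class attributes (adapted to ElementTree's XPath subset).
SEMANTIC = ".//div[@id='priceBlock']/table/tbody/tr/td/b[@class='priceLarge']"
# Anchored on row and cell positions.
POSITIONAL = ".//*[@id='priceBlock']/table/tbody/tr[2]/td[2]/b[1]"


def extract(page, xpath):
    """Return the text of the first node matching xpath, or None."""
    node = ET.fromstring(page).find(xpath)
    return node.text if node is not None else None


for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    print(name, extract(page, SEMANTIC), extract(page, POSITIONAL))
# prints: v1 $19.99 $19.99
# prints: v2 $19.99 $25.00
```

The positional expression does not fail loudly after the layout change. It silently returns the wrong value, which is the harder kind of error to catch.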
Regex
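One way a regex can raise data quality is by pulling a clean, typed value out of noisy extracted text. The sketch below is only an illustration: the pattern, the supported currency symbols and the parse_price helper are assumptions, not something taken from the talk.

```python
import re

# Assumed format: a currency symbol, optional spaces, digits with optional
# thousands separators and an optional decimal part.
PRICE_RE = re.compile(r"(?P<currency>[$€£])\s*(?P<amount>\d+(?:,\d{3})*(?:\.\d+)?)")


def parse_price(raw):
    """Return (currency, amount) from raw text, or None if no price is found."""
    match = PRICE_RE.search(raw)
    if match is None:
        return None
    amount = float(match.group("amount").replace(",", ""))
    return match.group("currency"), amount


print(parse_price("Now only $1,299.00!"))
print(parse_price("Out of stock"))
```

Returning None for text that does not match keeps bad values out of the dataset instead of storing them as strings.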
More at:
http://support.import.io/knowledgebase/articles/341182-xpaths-regex
http://www.w3schools.com/xsl/xpath_intro.asp
Required column
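A small sketch of one plausible reading of a required column: a row is kept only if every required column has a value. The function name, the row format (a list of dicts) and the sample rows are assumptions for illustration, not the behaviour of any particular tool.

```python
def is_empty(value):
    """Treat None and blank strings as missing; 0 and other values count as present."""
    return value is None or (isinstance(value, str) and not value.strip())


def split_by_required(rows, required):
    """Split rows into (valid, rejected) depending on the required columns."""
    valid, rejected = [], []
    for row in rows:
        missing = [col for col in required if is_empty(row.get(col))]
        (rejected if missing else valid).append(row)
    return valid, rejected


rows = [
    {"name": "Widget", "price": "$19.99"},
    {"name": "Gadget", "price": "   "},  # blank price
    {"name": "Gizmo"},                   # price column absent
]
valid, rejected = split_by_required(rows, required=["price"])
print(len(valid), "valid,", len(rejected), "rejected")
```

Keeping the rejected rows, instead of discarding them, makes it possible to review why they failed.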
measuring data quality
completeness
coverage
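The slides do not define these two measures, so the sketch below assumes working definitions: completeness is the share of rows with a non-empty value in a column, and coverage is the share of input queries that returned at least one row. The "query" key and the sample data are hypothetical.

```python
def is_empty(value):
    """Treat None and blank strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def completeness(rows, columns):
    """Per column: fraction of rows that have a non-empty value."""
    total = len(rows)
    if total == 0:
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for row in rows if not is_empty(row.get(col))) / total
        for col in columns
    }


def coverage(queries, rows, query_key="query"):
    """Fraction of input queries that produced at least one row."""
    if not queries:
        return 0.0
    answered = {row.get(query_key) for row in rows}
    return len(answered & set(queries)) / len(set(queries))


queries = ["q1", "q2", "q3", "q4"]
rows = [
    {"query": "q1", "name": "A", "price": "$1"},
    {"query": "q1", "name": "B", "price": ""},
    {"query": "q2", "name": "C", "price": "$3"},
    {"query": "q2", "name": "D", "price": "$4"},
]
print(completeness(rows, ["name", "price"]))
print(coverage(queries, rows))
```

Tracking both numbers per extractor over time shows when a source page has changed: completeness drops when a column stops matching, and coverage drops when whole queries stop returning data.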
post-extraction data quality
improvements?
how we do it
Smart automation
anomaly detection
variance, variability, noise
normalization
confidence score
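The talk does not say which anomaly detection method is used. As one simple illustration, the sketch below flags outliers in a numeric column with a modified z-score based on the median and the median absolute deviation (MAD). The 3.5 threshold is a commonly used default, and the sample prices are invented.

```python
import statistics


def flag_anomalies(values, threshold=3.5):
    """Return one boolean per value: True if it looks like an outlier.

    Uses the modified z-score 0.6745 * (x - median) / MAD, which is less
    sensitive to the outliers themselves than a mean/stdev z-score.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # At least half the values are identical: flag anything that differs.
        return [v != median for v in values]
    return [abs(0.6745 * (v - median) / mad) > threshold for v in values]


prices = [19.99, 20.49, 19.5, 21.0, 20.0, 1999.0]  # last one lost its decimal point
for price, is_outlier in zip(prices, flag_anomalies(prices)):
    print(price, "ANOMALY" if is_outlier else "ok")
```

Flagged values are candidates for review, not automatic deletions, which fits the human input step.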
Human input
Transparency
summary
Data Quality is essential
think about it from the very
beginning
develop a process to measure
data quality before scaling up
if you don’t want to reinvent the
wheel - contact us!
Thank you
ignacio.elola@import.io

2015 - Extract SF - Data Quality