Using Hadoop to build a Data Quality
Service for both real-time and batch data
Griffin – https://github.com/ebay/griffin
About Us:
• Alex Lv (lzhixing@ebay.com)
Senior Staff Software Engineer – Data Products
Platform & Engineering at eBay
• Amber Vaidya (amvaidya@ebay.com)
Lead Product Manager - Data Products
Platform & Engineering at eBay
Agenda
• Background
• Introduction to Griffin
• Demo
Background
eBay Marketplace at a Glance
Q2 2016 data
• $19.8B GMV in Q2 2016
• 10M new listings added via mobile per week
• 300M searches each day
• 65% of transactions ship for free (in US, UK, DE)
• 80% of items sold as new
• 1B live listings
One of the world’s largest and most vibrant marketplaces
Velocity Stats
US
• 3 car parts or accessories are sold every 1 sec
• A smartphone is sold every 4 sec
• A dress is sold every 6 sec
UK
• A necklace is sold every 10 sec
• A make-up product is sold every 3 sec
• A Lego product is sold every 19 sec
GERMANY
• A truck or car is sold every 5 min
• A pair of women’s jeans is sold every 4 sec
• A video game is sold every 11 sec
AUSTRALIA
• A pair of men’s sunglasses is sold every 1 min
• A home décor item is sold every 12 sec
• A car or truck part is sold every 4 sec
Mobile Velocity Stats
US
• A woman’s handbag is sold every 10 sec
• A car or truck is sold every 5 min
• An action figure is sold every 10 sec
UK
• A tablet is sold every 1 min
• A cookware item is sold every 6 sec
• A car is sold every 2 min
GERMANY
• A pair of women’s shoes is sold every 20 sec
• A watch is sold every 48 sec
• A tire or car part is sold every 35 sec
AUSTRALIA
• A piece of jewelry is sold every 12 sec
• A baby clothing item is sold every 46 sec
• A motorcycle part is sold every 51 sec
Big Data @ eBay
We manage one of the
largest data platforms in
the world
We utilize one of the
largest data platforms in
the world
Challenging to ensure data quality for such scale!
Challenges at eBay:
• No unified view of data quality across multiple systems and teams
• No shared platform to manage data quality
• No system to measure near real-time data quality
What is Data Quality?
Definition
• How well it meets the expectations of
data consumers
• How well it represents the objects,
events, and concepts it is created to
represent
Dimensions
Completeness
Uniqueness
Timeliness
Validity
Accuracy
Consistency
Core Dimensions
Virtuous Cycle of Data Quality
Define → Measure → Analyze → Improve
• Define the scope, dimensions, goals,
thresholds, etc.
• Measure data quality values
• Analyze data quality results
• Improve data quality
Our Goal
A solution with all the below capabilities
Capability                     Commercial DQ software   Open source DQ software
Support eBay’s scale           x                        x
Data Quality measurement       √                        x
Support real-time data         x                        x
Support unstructured data      x                        x
Service based API              √                        x
Data Profiling                 √                        √
Pluggable measurement types    x                        x
Griffin
What is Griffin?
• Data Quality Platform built on Hadoop and Spark
  ◦ Batch data
  ◦ Real-time data
  ◦ Unstructured data
• A unified process to detect DQ issues
  ◦ Incomplete
  ◦ Inaccurate
  ◦ Invalid
  ◦ …
• An open source solution
https://github.com/ebay/griffin
Data Quality Framework in Griffin
Define → Measure → Analyze
• Define Data Quality Dimensions
• Define Metrics, Goals, Thresholds
Calculators running on
Source
RDBMS
Accuracy
Completeness
Uniqueness
Timeliness
Validity
Consistency
Metrics
Metrics
Repository
Scorecards
• Scorecard Reports generated and displayed
• Measurement values and quality scores calculated
and stored
• Data quality trending graphs generated
Measure
Repository
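
The Define, Measure, and Analyze stages above can be sketched roughly as follows. This is a minimal illustration in plain Python, not Griffin's actual data model: the `MeasureDefinition` class, the `status` and `scorecard` helpers, the status labels, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MeasureDefinition:
    """Define stage: what is measured and what counts as good enough."""
    name: str
    dimension: str    # e.g. "Accuracy"
    goal: float       # desired metric value, in percent
    threshold: float  # minimum acceptable value, in percent


def status(definition: MeasureDefinition, value: float) -> str:
    """Analyze stage: compare one measured value with the goal and threshold."""
    if value >= definition.goal:
        return "on goal"
    if value >= definition.threshold:
        return "below goal"
    return "below threshold"


def scorecard(definition: MeasureDefinition, history: List[float]) -> Dict[str, object]:
    """Summarize stored metric values: latest value, its status, and the trend."""
    latest = history[-1]
    change = latest - history[0]
    return {
        "measure": definition.name,
        "dimension": definition.dimension,
        "latest": latest,
        "status": status(definition, latest),
        "trend": "up" if change > 0 else "down" if change < 0 else "flat",
    }


# Hypothetical definition, and hypothetical daily values from the Measure stage.
viewitem = MeasureDefinition("viewitem_accuracy", "Accuracy", goal=99.0, threshold=95.0)
daily_values = [99.5, 98.0, 94.0]
card = scorecard(viewitem, daily_values)
print(card)
```

Keeping the goal and threshold in the definition, separate from the measured values, lets the same stored metrics be re-evaluated if the targets change.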
Technical Highlights
Component Design
Measure Calculator Example – Accuracy of Viewitem
~300M customer view events per day
Accuracy Calculator
Metric
Target
Source
Item_view
• User Id
• Page Id
• Site Id
• Title
• Date
• ……
\[
\text{Accuracy Rate}\,(\%) = \frac{\text{Count}(\text{source.field1} == \text{target.field1} \;\&\&\; \dots)}{\text{Count}(\text{source})} \times 100\%
\]
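
As a rough illustration of the accuracy rate formula, the sketch below counts the source records that have a target record matching on every compared field. It is plain Python rather than a Spark job, and the function name `accuracy_rate`, the snake_case field names, and the sample records are assumptions for the example, not Griffin's API.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, object]


def accuracy_rate(source: List[Record], target: Iterable[Record],
                  fields: Tuple[str, ...]) -> float:
    """Percentage of source records with a target record equal on every listed field."""
    if not source:
        raise ValueError("source must contain at least one record")
    # Index the target by the tuple of compared fields so each lookup is a set test.
    target_keys = {tuple(rec.get(f) for f in fields) for rec in target}
    matched = sum(1 for rec in source
                  if tuple(rec.get(f) for f in fields) in target_keys)
    return matched / len(source) * 100.0


# Hypothetical view events: the source stream and the persisted target.
source = [
    {"user_id": 1, "page_id": 10, "site_id": 0, "title": "Lego set"},
    {"user_id": 2, "page_id": 11, "site_id": 3, "title": "Dress"},
    {"user_id": 3, "page_id": 12, "site_id": 77, "title": "Tablet"},
    {"user_id": 4, "page_id": 13, "site_id": 15, "title": "Watch"},
]
target = [
    {"user_id": 1, "page_id": 10, "site_id": 0, "title": "Lego set"},
    {"user_id": 2, "page_id": 11, "site_id": 3, "title": "Dress"},
    {"user_id": 3, "page_id": 12, "site_id": 77, "title": "tablet"},  # title differs
]
FIELDS = ("user_id", "page_id", "site_id", "title")

rate = accuracy_rate(source, target, FIELDS)
print(f"{rate:.1f}%")  # two of the four source records match on all fields
```

Here one record is missing from the target and one differs in `title`, so both count against accuracy.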
Use Cases
• Griffin has been deployed in production at eBay and provided the centralized data
quality service for several eBay systems.
~1.2 PB of data · 800M+ daily records · 100+ metrics
Now life is easier……
Demo
We are open source
and welcome contributions
Github: https://github.com/eBay/griffin
Blog: http://www.ebaytechblog.com/?p=5877/
Contact: ebay-griffin-dev@googlegroups.com

Editor's Notes

  1. Business challenges: previously, we maintained many scripts to measure data quality, most of them run daily to generate the metrics, so data quality issues were often recognized too late.
  2. Completeness: are all data sets and data items recorded? Uniqueness: is there a single view of the data set? Timeliness: does the data represent reality from the required point in time? Validity: does the data match the rules? Accuracy: does the data reflect the data set? Consistency: can we match the data set across data stores? In Griffin, we have implemented Accuracy so far; we need more volunteers to support the other dimensions.
  3. Griffin covers the Define, Measure, and Analyze stages. Define: select the dimension and define the measurement and threshold. Measure: calculate the metrics and generate reports and scorecards. Analyze: download sample data for root-cause analysis (RCA) of DQ issues.
  4. What do you mean by “Data Quality measurement” here?
  5. Now let’s take a look at the flow of a data quality assessment in Griffin. The first step is to define the measurement: you decide which data quality dimensions to use, along with the metrics, goals, and thresholds. Next comes the Measure stage, whose key component is the measure calculators running on Spark. Each measurement type needs a dedicated calculator. The calculators pick up the datasets from the data sources, run the calculations, and produce metric values. Finally, from the metric values we generate scorecards and graphs, so that end users can analyze the quality of their data.
  6. Real-time: the measure calculators are based on Kafka streaming, so they can handle near real-time data. Fast: we have optimized algorithms for the data quality domain, to support various data quality dimensions. Massive: we support large-scale data hosted on a Spark cluster. Pluggable: it is easy to plug in a new measurement type; you just deploy your customized jar package in Griffin to support new dimensions.
  7. Here is the diagram of the component design in Griffin. All measurement definitions are stored in the measure repository, and a scheduler picks up measure jobs according to those definitions. A measure launcher is then started to process each measure job; it sends the job to the measure calculators asynchronously. When the calculations are complete, the launcher is notified and the metric values are persisted. With the metric values, we can monitor data quality, send out alerts, or generate reports and scorecards (the reports also need information from the measure definition).
  8. As mentioned previously, the calculators are key components in Griffin, so I’d like to give you an example of how a calculator works. This is an example of the accuracy calculator for viewitem. Viewitem is event data from eBay sites; it represents users’ view activities on these sites, and we have internal processes that persist this kind of event data. To assess the accuracy of the persisted data, we need to compare it with its source, which is streaming data. The accuracy calculator can pick up both streaming and batch data; when the viewitem data is ready, the calculator runs the algorithm on Spark and produces metric values.
  9. In total, Griffin handles about 1.2 PB of data, and the volume keeps growing as we support more systems. There are more than 800 million daily records, and we have onboarded more than 100 metrics.
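
The component flow in notes 6 and 7 (pluggable calculators, a launcher that dispatches jobs asynchronously, and persisted metric values) might look roughly like this. It is a simplified sketch in plain Python using a thread pool, not Griffin's implementation: the `CALCULATORS` registry, the `register` decorator, the `launch` function, and the job and dataset layout are all hypothetical names chosen for the example, and an in-memory dictionary stands in for the metrics repository.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Dataset = Dict[str, List[dict]]
Calculator = Callable[[Dataset], float]

# Registry of calculators, keyed by measurement type (hypothetical names).
CALCULATORS: Dict[str, Calculator] = {}


def register(measure_type: str) -> Callable[[Calculator], Calculator]:
    """Plug a calculator in under a measurement type."""
    def wrap(fn: Calculator) -> Calculator:
        CALCULATORS[measure_type] = fn
        return fn
    return wrap


@register("accuracy")
def accuracy(data: Dataset) -> float:
    """Percentage of source records that also appear in the target."""
    source, target = data["source"], data["target"]
    matched = sum(1 for record in source if record in target)
    return matched / len(source) * 100.0


def launch(jobs: List[dict], datasets: Dict[str, Dataset]) -> Dict[str, float]:
    """Send each job to its calculator asynchronously, then store the results."""
    metrics: Dict[str, float] = {}  # stands in for the metrics repository
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {
            job["name"]: pool.submit(CALCULATORS[job["type"]], datasets[job["name"]])
            for job in jobs
        }
        for name, future in futures.items():
            metrics[name] = future.result()  # blocks until that calculator finishes
    return metrics


# Measure jobs as a scheduler might hand them to the launcher.
jobs = [{"name": "viewitem_accuracy", "type": "accuracy"}]
datasets = {
    "viewitem_accuracy": {
        "source": [{"user_id": 1}, {"user_id": 2}, {"user_id": 3}, {"user_id": 4}],
        "target": [{"user_id": 1}, {"user_id": 2}, {"user_id": 3}],
    }
}
metrics = launch(jobs, datasets)
print(metrics)
```

Supporting a new dimension in this sketch only means registering another function under a new measurement type; the launcher itself does not change.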