
Using Hadoop to build a Data Quality Service for both real-time and batch data


  1. Using Hadoop to build a Data Quality Service for both real-time and batch data Griffin – https://github.com/ebay/griffin
  2. About Us: • Alex Lv (lzhixing@ebay.com) Senior Staff Software Engineer – Data Products Platform & Engineering at eBay • Amber Vaidya (amvaidya@ebay.com) Lead Product Manager – Data Products Platform & Engineering at eBay
  3. Agenda • Background • Introduction to Griffin • Demo
  4. Background
  5. eBay Marketplace at a Glance (Q2 2016 data) • $19.8B GMV in Q2 2016 • 10M new listings added via mobile per week • 300M searches each day • 65% of transactions ship for free (in US, UK, DE) • 80% of items sold as new • 1B live listings • One of the world’s largest and most vibrant marketplaces
  6. Velocity Stats • US: 3 car parts or accessories are sold every 1 sec, a smartphone every 4 sec, a dress every 6 sec • UK: a necklace is sold every 10 sec, a make-up product every 3 sec, a Lego product every 19 sec • Germany: a truck or car is sold every 5 min, a pair of women’s jeans every 4 sec, a video game every 11 sec • Australia: a pair of men’s sunglasses is sold every 1 min, a home décor item every 12 sec, a car or truck part every 4 sec
  7. Mobile Velocity Stats • US: a woman’s handbag is sold every 10 sec, a car or truck every 5 min, an action figure every 10 sec • UK: a tablet is sold every 1 min, a cookware item every 6 sec, a car every 2 min • Germany: a pair of women’s shoes is sold every 20 sec, a watch every 48 sec, a tire or car part every 35 sec • Australia: a piece of jewelry is sold every 12 sec, a baby clothing item every 46 sec, a motorcycle part every 51 sec
  8. Big Data @ eBay – We manage and utilize one of the largest data platforms in the world
  9. It is challenging to ensure data quality at such scale! Challenges at eBay: • No unified view of data quality across multiple systems and teams • No shared platform to manage data quality • No system to measure near real-time data quality
  10. What is Data Quality? Definition: • How well it meets the expectations of data consumers • How well it represents the objects, events, and concepts it is created to represent. Core Dimensions: Completeness, Uniqueness, Timeliness, Validity, Accuracy, Consistency
  11. Virtuous Cycle of Data Quality: Define → Measure → Analyze → Improve • Define the scope, dimensions, goals, thresholds, etc. • Measure data quality values • Analyze data quality results • Improve data quality
  12. Our Goal – A solution with all of the capabilities below (Commercial DQ software / Open source DQ software): • Support eBay’s scale: x / x • Data Quality measurement: √ / x • Support real-time data: x / x • Support unstructured data: x / x • Service based API: √ / x • Data Profiling: √ / √ • Pluggable measurement types: x / x
  13. Griffin
  14. What is Griffin? • Data Quality Platform built on Hadoop and Spark – batch data, real-time data, unstructured data • A unified process to detect DQ issues – incomplete, inaccurate, invalid, …… • An open source solution https://github.com/ebay/griffin
  15. Data Quality Framework in Griffin: Define → Measure → Analyze • Define: define Data Quality dimensions (Accuracy, Completeness, Uniqueness, Timeliness, Validity, Consistency), metrics, goals, and thresholds; definitions are stored in the Measure Repository • Measure: calculators run against the source data; measurement values and quality scores are calculated and stored in the Metrics Repository • Analyze: scorecard reports and data quality trending graphs are generated and displayed
  16. Technical Highlights
  17. Component Design
  18. Measure Calculator Example – Accuracy of ViewItem. ~300M customer view events per day. The Accuracy Calculator compares the source and target Item_view datasets on fields such as User Id, Page Id, Site Id, Title, Date, …… and produces the metric: Accuracy Rate (%) = Count(source.field1 == target.field1 && …) / Count(source) × 100%
  19. Use Cases • Griffin has been deployed in production at eBay and provides a centralized data quality service for several eBay systems: ~1.2PB of data, 800+M daily records, 100+ metrics
  20. Now life is easier……
  21. Demo
  22. We are open source and welcome contributions. GitHub: https://github.com/eBay/griffin Blog: http://www.ebaytechblog.com/?p=5877/ Contact: ebay-griffin-dev@googlegroups.com

Editor's Notes

  • Business challenges here

    Previously, we maintained a lot of scripts to measure data quality; most of them ran daily to generate the metrics, so it was often too late to recognize data quality issues.
  • Completeness
    • Are all data sets and data items recorded?
    Uniqueness
    • Is there a single view of the data set?
    Timeliness
    • Does the data represent reality from the required point in time?
    Validity
    • Does the data match the rules?
    Accuracy
    • Does the data correctly describe the real-world object or event?
    Consistency
    • Can we match the data set across data stores?


    In Griffin, we’ve implemented “Accuracy”. We need more volunteers to support the other dimensions.
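
    As a rough illustration of how these dimension questions could become measurable checks, here is a minimal Spark sketch for completeness and validity; the DataFrame, column names, and rule are assumptions for illustration, not Griffin’s actual API.

    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions._

    object DimensionCheckSketch {
      // Completeness: share of records (in %) whose required fields are all populated.
      def completeness(df: DataFrame, requiredCols: Seq[String]): Double = {
        val populated = requiredCols.map(c => col(c).isNotNull).reduce(_ && _)
        df.filter(populated).count().toDouble / df.count() * 100.0
      }

      // Validity: share of records (in %) that satisfy a given business rule.
      def validity(df: DataFrame, rule: Column): Double =
        df.filter(rule).count().toDouble / df.count() * 100.0
    }

    // Usage sketch (paths and column names are hypothetical):
    //   val events = spark.read.parquet("/data/item_view")
    //   val completenessPct = DimensionCheckSketch.completeness(events, Seq("user_id", "item_id"))
    //   val validityPct     = DimensionCheckSketch.validity(events, col("site_id").isin(0, 3, 77))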
  • Griffin covers the stages of Define, Measure, and Analyze.
    Define: select the dimensions, define the measurement and thresholds
    Measure: calculate the metrics, generate reports and scorecards
    Analyze: download sample data to do root cause analysis (RCA) of DQ issues
  • What do you mean by “Data Quality measurement” here?

  • Now let’s take a look at the flow of data quality assessment in Griffin. The first step is to define the measurement: you need to decide the data quality dimensions you’d like to use,
    along with the metrics, goals, and thresholds.

    After that comes the “Measure” stage, whose key components are the measure calculators running on Spark. Each measurement type needs a dedicated calculator. The calculators pick up the datasets from the data sources, do the calculations, and produce metric values.

    Finally, with the metric values, we can generate scorecards and graphs so that end users can analyze the quality of their data.
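
    To make the Define step concrete, a measurement can be thought of as a small record bundling the dimension, the datasets to compare, and the goal/threshold. A minimal sketch under that assumption (field names are illustrative, not Griffin’s actual measure schema):

    object MeasureDefinitionSketch {
      // Illustrative model only; Griffin's real definitions are richer and stored in the Measure Repository.
      case class MeasureDefinition(
        name: String,
        dimension: String,                     // e.g. "accuracy", "completeness"
        source: String,                        // dataset acting as the source of truth
        target: String,                        // dataset being assessed
        comparedFields: Seq[(String, String)], // (source field, target field) pairs
        goal: Double,                          // desired score, e.g. 99.5 (%)
        threshold: Double                      // alert if the score drops below this
      )

      val viewItemAccuracy = MeasureDefinition(
        name           = "viewitem_accuracy",
        dimension      = "accuracy",
        source         = "kafka://viewitem_events",      // hypothetical identifiers
        target         = "hive://persisted_item_view",
        comparedFields = Seq("user_id" -> "user_id", "page_id" -> "page_id", "site_id" -> "site_id"),
        goal           = 99.5,
        threshold      = 95.0
      )
    }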
  • Real-time: the measure calculators are based on Kafka streaming, so they can handle near real-time data
    Fast: we have algorithms optimized for the data quality domain, supporting various data quality dimensions
    Massive: we support large-scale data hosted on the Spark cluster
    Pluggable: it’s easy to plug in a new measurement type – you just need to deploy your customized jar package in Griffin to support new dimensions
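
    A minimal sketch of the near real-time side, assuming Spark Structured Streaming reading the source events from Kafka; the topic, payload schema, window size, and console sink are illustrative assumptions, and Griffin’s actual streaming calculators are more involved:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object StreamingDqSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("streaming-dq-sketch").getOrCreate()
        import spark.implicits._

        // Source events arriving on a (hypothetical) Kafka topic.
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "viewitem_events")
          .load()
          .select($"timestamp", $"value".cast("string").as("payload"))

        // Near real-time metric: per 5-minute window, the share of events whose payload
        // carries a non-empty user id (a simple stand-in for a real dimension calculation).
        val metric = events
          .withColumn("user_id", get_json_object($"payload", "$.user_id"))
          .groupBy(window($"timestamp", "5 minutes"))
          .agg((sum(when($"user_id".isNotNull, 1).otherwise(0)) / count(lit(1)) * 100).as("completeness_pct"))

        // Print metric values as they are updated; Griffin would persist them to its metrics store instead.
        metric.writeStream.outputMode("complete").format("console").start().awaitTermination()
      }
    }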
  • Here’s the diagram of the component design in Griffin.
    All the measurement definitions are stored in the measure repository, and a scheduler picks up measure jobs according to the definition
    of the measurement. Then a measure launcher is started to process the measure job; it sends the job out to the measure calculators
    in an asynchronous way. When the calculations are completed, the launcher is notified and the metric values are persisted.
    With the metric values, we can monitor the DQ, send out alerts, or generate reports/scorecards. (The reports also need the information from the measure definition.)
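
    A minimal sketch of that scheduler → launcher → calculator handshake, using plain Scala Futures to stand in for the asynchronous dispatch; all class and method names here are illustrative, not Griffin’s actual components:

    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global

    object LauncherSketch {
      case class MeasureJob(name: String, dimension: String)
      case class MetricValue(measure: String, score: Double)

      trait MeasureCalculator {
        def calculate(job: MeasureJob): Future[MetricValue]   // runs on the cluster, asynchronously
      }

      trait MetricsRepository {
        def persist(value: MetricValue): Unit
      }

      class MeasureLauncher(calculator: MeasureCalculator, repo: MetricsRepository) {
        // The scheduler hands measure jobs to the launcher; the calculation runs asynchronously,
        // and on completion the launcher persists the metric value and can trigger alerts/reports.
        def launch(job: MeasureJob, alertThreshold: Double): Future[Unit] =
          calculator.calculate(job).map { metric =>
            repo.persist(metric)
            if (metric.score < alertThreshold)
              println(s"ALERT: ${metric.measure} dropped to ${metric.score}%")
          }
      }
    }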
  • As mentioned previously, the calculators are key components in Griffin, so I’d like to give you an example of how a calculator works.
    This is an example of the accuracy calculator for ViewItem.
    ViewItem is event data from eBay sites; it represents users’ view activities on these sites, and we have internal processes to persist this kind of event data.
    To assess the accuracy of the persisted data, we need to compare it with its source, which is streaming data. The accuracy calculator can pick up both
    streaming and batch data; when the ViewItem data is ready, the calculator runs the algorithm on Spark and produces metric values.
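
    A minimal batch sketch of that accuracy algorithm on Spark, following the Accuracy Rate formula from slide 18; the dataset locations, join keys, and compared fields are assumptions for illustration:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    object ViewItemAccuracySketch {
      // Accuracy Rate (%) = Count(source rows whose compared fields all match the target) / Count(source) * 100
      def accuracyRate(source: DataFrame, target: DataFrame,
                       keys: Seq[String], fields: Seq[String]): Double = {
        val joinCond  = keys.map(k => col(s"s.$k") === col(s"t.$k")).reduce(_ && _)
        val matchCond = fields.map(f => col(s"s.$f") === col(s"t.$f")).reduce(_ && _)
        val matched = source.alias("s")
          .join(target.alias("t"), joinCond, "inner")
          .filter(matchCond)
          .count()
        matched.toDouble / source.count() * 100.0
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("viewitem-accuracy-sketch").getOrCreate()
        // Hypothetical locations: a snapshot of the source events and the persisted target table.
        val source = spark.read.parquet("/data/viewitem/source_snapshot")
        val target = spark.read.parquet("/data/viewitem/persisted")
        val rate = accuracyRate(source, target,
          keys = Seq("event_id"), fields = Seq("user_id", "page_id", "site_id", "title"))
        println(f"ViewItem accuracy rate: $rate%.2f%%")
      }
    }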
  • In total it deals with about 1.2 PB of data, and the data volume is still increasing as we support more systems.
    There are more than 800 million daily records, and in total we’ve onboarded more than 100 metrics.
