Real time data quality on Flink

My use case is to monitor and improve overall search data quality, to detect unusual patterns in users' search behavior, and to report the on-site intent back to the relevant business stakeholders. To that end, I evaluated several big data processing engines that can apply complex business logic to very large data volumes in real time, and eventually chose Flink stream processing. This talk shows how I used Flink to accomplish that goal.

  1. Real Time DQMM on Flink
     Jaydeep, Staff Engineer in Search Team, Apache Oozie Committer
     June 2019
  2. Table of Contents
     • What is Real Time Aggregation?
     • Use Case
     • What do we deal with?
     • System Requirements
     • Spark vs Flink
     • Flink Cluster setup
     • Flink on Yarn
     • Architecture
     • 100% data completeness
     • Open Items
  3. What is Real Time Aggregation?
     • What is real time?
     • What is the processing delay today?
     • What does real time offer?
     • Why do we need it?
  4. Use Case
     • Bug detection in Response log
     • Bot detection
     • Best Seller Item
     • Item Catalogue health
     • Item out of stock (especially on event days)
     • Best seller item tracking
     • Top query monitoring
     • Category performance
  5. What do we deal with?
     ~4 billion logs per day
     ~8 million records per minute
     ~800 GB of data per day
  6. System Requirements
     • Support for real-time processing
     • Support to track the events
     • Easy to recover from failure
     • Exactly-once processing
     • Backpressure handling
     • Support for event-based, time-based and dynamic windows
     • Highly available
  7. Spark vs Flink

     | Criteria                     | Spark        | Flink             |
     |------------------------------|--------------|-------------------|
     | Data Processing              | Mini Batch   | Stream Processing |
     | Data Shuffling               | Polling      | Trigger           |
     | Window Function              | Time Based   | Time/Event/Custom |
     | Memory Management            | Configurable | Auto Managed      |
     | Recovery                     | DAG level    | State level       |
     | Re-Utilization and Iteration | By Stage     | By event          |

  8. Flink Cluster setup
     • Standalone
     • Flink on Mesos
     • Flink on Yarn
  9. Flink on Yarn
  10. Architecture
  11. 100% Data Completeness

      | Event Arrival Time  | Actual Event Time   | Clicks |
      |---------------------|---------------------|--------|
      | 2019-06-01 10:01:00 | 2019-06-01 10:01:00 | 3      |
      | 2019-06-01 10:02:00 | 2019-06-01 10:02:00 | 1      |
      | 2019-06-01 10:04:00 | 2019-06-01 10:03:00 | 4      |
      | 2019-06-01 10:06:00 | 2019-06-01 10:04:00 | 5      |
      | 2019-06-01 10:08:00 | 2019-06-01 10:04:00 | 1      |

      | Processed Time      | Event time Window   | Clicks |
      |---------------------|---------------------|--------|
      | 2019-06-01 10:05:00 | 2019-06-01 10:05:00 | 8      |
      | 2019-06-01 10:10:00 | 2019-06-01 10:10:00 | 6      |

  12. 100% Data Completeness
      • Event Time data processing
      • Handling the delayed event
      • Prevent false anomaly detection
      • Probability based Model for data completeness
  13. Open Items
      • Real time Model training
      • Handling Seasonality while detecting Anomaly
  14. Walmart Labs – Privileged and Confidential
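
As a rough illustration of slide 11, the sketch below groups the five rows of the first table into 5-minute tumbling windows twice: once by arrival time and once by actual event time. It is plain Python rather than Flink code, and the helper names (`window_end`, `clicks_per_window`) and the window alignment are assumptions made for this example. Under that reading, the counts in the slide's second table (8 and 6) correspond to grouping by arrival time, while grouping by event time puts all 14 clicks in the window ending at 10:05.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
WINDOW = timedelta(minutes=5)  # assumed window size, inferred from the slide
EPOCH = datetime(1970, 1, 1)

# (arrival time, actual event time, clicks), copied from slide 11.
EVENTS = [
    ("2019-06-01 10:01:00", "2019-06-01 10:01:00", 3),
    ("2019-06-01 10:02:00", "2019-06-01 10:02:00", 1),
    ("2019-06-01 10:04:00", "2019-06-01 10:03:00", 4),
    ("2019-06-01 10:06:00", "2019-06-01 10:04:00", 5),
    ("2019-06-01 10:08:00", "2019-06-01 10:04:00", 1),
]


def window_end(ts):
    """Return the end of the 5-minute tumbling window that contains ts."""
    index = (ts - EPOCH) // WINDOW  # integer number of whole windows since EPOCH
    return EPOCH + (index + 1) * WINDOW


def clicks_per_window(events, time_column):
    """Sum clicks per window, keyed by window end.

    time_column is 0 to group by arrival time, 1 to group by event time.
    """
    totals = defaultdict(int)
    for row in events:
        ts = datetime.strptime(row[time_column], FMT)
        totals[window_end(ts).strftime(FMT)] += row[2]
    return dict(totals)


by_arrival = clicks_per_window(EVENTS, 0)
by_event_time = clicks_per_window(EVENTS, 1)

print("grouped by arrival time:", by_arrival)
print("grouped by event time:  ", by_event_time)
```

In Flink terms, this is the difference between processing-time windows and event-time windows.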

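Slide 12 mentions a probability-based model for data completeness but gives no details, so the following is only one possible sketch, not the author's model. It assumes completeness can be estimated from the empirical distribution of arrival delays (here the five delays from slide 11), and that an anomaly check is skipped until the estimated completeness passes a threshold. The function names, the 0.95 threshold, and the 50% drop rule are all hypothetical.

```python
from bisect import bisect_right

# Arrival delay in minutes (arrival time minus event time) for the five
# rows on slide 11. Using them as a delay history is an assumption.
DELAYS_MIN = sorted([0, 0, 1, 2, 4])


def estimated_completeness(minutes_after_window_end):
    """Empirical P(delay <= minutes_after_window_end).

    Conservative simplification: every event is treated as if it
    occurred right at the end of its window.
    """
    if minutes_after_window_end < 0:
        return 0.0
    arrived = bisect_right(DELAYS_MIN, minutes_after_window_end)
    return arrived / len(DELAYS_MIN)


def should_alert(observed, expected, minutes_after_window_end,
                 min_completeness=0.95, drop_ratio=0.5):
    """Flag a drop in clicks only when the window is probably complete.

    min_completeness and drop_ratio are hypothetical tuning values.
    """
    if estimated_completeness(minutes_after_window_end) < min_completeness:
        return False  # too early to judge: late events may still arrive
    return observed < expected * drop_ratio


for wait in (0, 2, 4):
    print(wait, "min after window end:", estimated_completeness(wait))
```

With these inputs, a window that looks low right at its end is not flagged, because the estimate says many events have probably not arrived yet. That is one way to read "prevent false anomaly detection" on the slide.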