Growing into a proactive Data Platform

In this meetup, Yaar Reuveni (Team Leader) and Nir Hedvat (Software Engineer) from the LivePerson Data Platform R&D team talk about the journey we made from the early days of the data platform in production, with high friction and low awareness of issues, to a mature, measurable data platform that is visible and trustworthy.

Published in: Data & Analytics

  1. 1. Yaar Reuveni & Nir Hedvat Becoming a Proactive Data Platform
  2. 2. Yaar Reuveni • 6 years at LivePerson: 1 in Reporting & BI, 3 in Data Platform, 2 as Data Platform team lead • I love to travel • And
  3. 3. Nir Hedvat • Software Engineer B.Sc • 3 years as a C++ Developer at IBM Rational Rhapsody™ • 1.5 years at LivePerson • Cloud and Parallel Computing Enthusiast • Love Math and Powerlifting
  4. 4. Agenda • Our Scale & Operation • Evolution in becoming proactive i. Hope & Low awareness ii. Storming & Troubleshooting iii. Fortifying iv. Internalization & Comprehension v. Being Proactive • Showcases • Implementation
  5. 5. Our Scale • 2M daily chats • 100M daily monitored visitor sessions • 20B events per day • 2TB raw data per day • 2PB total in Hadoop clusters • Hundreds of producers × event types × consumers
  6. 6. LivePerson technology stack
  7. 7. Stage 1: Hope & Low awareness. We built it and it's awesome. (Diagram: online producer, offline producer, local files, DSPT jobs, raw data; DSPT = data single point of truth)
  8. 8. Stage 1: Hope & Low awareness. We've got customers. (Diagram: dashboards, data science apps, reporting, data science, data access, ad-hoc queries)
  9. 9. Stage 2: Storming & Troubleshooting You’ve got NOC & SCS on speed dial Issues arise: • Data loss • Data delays • Partial data out of frame • Missing/faulty calculations for consumers • One producer does not send for over a week
  10. 10. Stage 2: Storming & Troubleshooting You’ve got NOC & SCS on speed dial Common issue types and generators: • Hadoop ops • Production ops • Events schema • New data producers • High rate of new features (LE2.0) • Data stuck in pipeline • Bugs
  11. 11. Stage 3: Fortifying Every interruption drives a new protection
  12. 12. Stage 3: Fortifying Every interruption drives a new protection
  13. 13. Stage 3: Fortifying Every interruption drives a new protection • Monitors on jobs, failures, success rate • Monitors on service status • Simple data freshness checks, e.g. measuring the newest event • Measure latency of specific parts of the pipeline
  14. 14. Stage 4: Internalization & Comprehension Auditing requirements • Measure principles: – Loss • How much? • Which customer? • What type? • Where in the pipeline? – Freshness • Percentiles • Trends – Statistics • Event type count • Events per LP customer • Trends
  15. 15. Stage 4: Internalization & Comprehension Auditing architecture (diagram: producers, events, audit events, control, Audit Loader, Audit DB, Audit Aggregator, freshness)
  16. 16. Stage 4: Internalization & Comprehension Mechanism: 1. Enrich events with audit metadata (data + common header + audit header). 2. Send control events every X minutes (a control event carries the common header, the audit header, and the audit aggregation).
  17. 17. Stage 4: Internalization & Comprehension Mechanism (diagram: the old data flow carries plain events with a common header; the audited data flow adds an audit header to every event and emits periodic control events holding the audit aggregation)
  18. 18. Stage 4: Internalization & Comprehension How to measure loss? • Tag all events going through our API with an auditing header: <host_name>:<bulk_id>:<sequence_id> Where: • host_name - the logical identification of the producer server • bulk_id - an arbitrary unique number that identifies a bulk (changes every X minutes) • sequence_id - an auto-incremented, persistent number used to identify missing bulks • Every X minutes send an audit control event: { eventType: AuditControlEvent, Bulks: [{bulk_id:"srv-xyz:111:97", data_tier:"shark producer", total_count:785}, {bulk_id:"srv-xyz:112:98", data_tier:"shark producer", total_count:1715}] }
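
To make the tagging concrete, here is a minimal producer-side sketch in Java. The class name, the rotation trigger, and the control-event serialization are illustrative assumptions; only the <host_name>:<bulk_id>:<sequence_id> header format and the every-X-minutes control event come from the slide above.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch only: not LivePerson's actual producer code.
    public class AuditTagger {
        private final String hostName;
        private final AtomicLong sequenceId = new AtomicLong();     // persisted in a real producer
        private volatile long bulkId = System.currentTimeMillis();  // arbitrary id, rotated every X minutes
        private final Map<String, Long> bulkCounts = new ConcurrentHashMap<>();

        public AuditTagger(String hostName) { this.hostName = hostName; }

        // Returns the audit header <host_name>:<bulk_id>:<sequence_id> to attach to an outgoing event.
        public String tag() {
            String header = hostName + ":" + bulkId + ":" + sequenceId.get();
            bulkCounts.merge(header, 1L, Long::sum);
            return header;
        }

        // Called every X minutes: build one control event summarizing the finished bulk, then rotate.
        public synchronized String rotateAndBuildControlEvent(String dataTier) {
            StringBuilder bulks = new StringBuilder();
            bulkCounts.forEach((id, count) -> bulks.append(String.format(
                "{bulk_id:\"%s\", data_tier:\"%s\", total_count:%d},", id, dataTier, count)));
            bulkCounts.clear();
            bulkId = System.currentTimeMillis();  // new arbitrary bulk id
            sequenceId.incrementAndGet();         // gaps in this number reveal missing bulks downstream
            return "{eventType: AuditControlEvent, Bulks: [" + bulks + "]}";
        }
    }
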
  19. 19. Stage 4: Internalization & Comprehension What’s next? • Immediate gain: enables loss research directly on the raw data. Next: • Count events per auditing bulk • Load into a DB for dashboarding. Example table (audit metadata, data tier, insertion time, events count): srv-xyz:1a2b3c:25, Producer, 08:34, 750; srv-xyz:1a2b3c:25, HDFS, 09:05, 405; srv-xyz:1a2b3c:25, HDFS, 10:13, 250. Assuming we look at the table after 11:34 and treat anything that takes more than 3 hours as loss: server srv-xyz created 750 events in bulk 1a2b3c, but only 405 + 250 = 655 of them arrived within 3 hours, so we detect a loss of 95 events from this server.
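
Assuming the per-tier counts land in a MySQL table like the one above (called audit_counts here, with columns audit_metadata, data_tier, insertion_time, and events_count; the names are placeholders), loss per bulk could be derived with a query along these lines. For the rows shown it would report the 95 lost events.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class LossReport {
        public static void main(String[] args) throws SQLException {
            // Compare what each producer bulk reported against what reached HDFS within 3 hours.
            String sql =
                "SELECT p.audit_metadata, " +
                "       p.events_count - COALESCE(SUM(h.events_count), 0) AS lost_events " +
                "FROM audit_counts p " +
                "LEFT JOIN audit_counts h " +
                "       ON h.audit_metadata = p.audit_metadata " +
                "      AND h.data_tier = 'HDFS' " +
                "      AND h.insertion_time <= p.insertion_time + INTERVAL 3 HOUR " +
                "WHERE p.data_tier = 'Producer' " +
                "GROUP BY p.audit_metadata, p.events_count " +
                "HAVING lost_events > 0";
            try (Connection conn = DriverManager.getConnection("jdbc:mysql://audit-db/audit", "user", "pass");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    // For the table above: srv-xyz:1a2b3c:25 lost 95 events
                    System.out.printf("%s lost %d events%n", rs.getString(1), rs.getLong(2));
                }
            }
        }
    }
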
  20. 20. Stage 4: Internalization & Comprehension How to measure freshness? • Run incrementally on the raw data • Group events by – Total – Event type – LP customer • Per event calculate insertion time - creation time • Per group: – Total count – Min, max & average – Count into time buckets (0-30; 30-60; 60-120; 120-∞)
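
A small helper, assuming creation and insertion times in epoch milliseconds, showing how one event's latency maps into the buckets listed above:

    public class FreshnessBuckets {
        // Bucket upper bounds in minutes: 0-30, 30-60, 60-120, 120-∞
        private static final long[] BOUNDS_MINUTES = {30, 60, 120};

        public static int bucketOf(long creationTimeMs, long insertionTimeMs) {
            long latencyMinutes = (insertionTimeMs - creationTimeMs) / 60_000;
            for (int i = 0; i < BOUNDS_MINUTES.length; i++) {
                if (latencyMinutes < BOUNDS_MINUTES[i]) return i;
            }
            return BOUNDS_MINUTES.length; // the 120-∞ bucket
        }

        public static void main(String[] args) {
            long created = System.currentTimeMillis() - 45 * 60_000; // created 45 minutes ago
            System.out.println(bucketOf(created, System.currentTimeMillis())); // prints 1 (the 30-60 bucket)
        }
    }
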
  21. 21. Stage 5: Being Proactive Tools - loss dashboard
  22. 22. Stage 5: Being Proactive Tools - loss detailed dashboard
  23. 23. Stage 5: Being Proactive Tools - loss trends
  24. 24. Stage 5: Being Proactive Tools - freshness
  25. 25. Stage 5: Being Proactive Tools - freshness
  26. 26. Stage 5: Being Proactive Tools - data statistics
  27. 27. Showcase I Bug in a new producer
  28. 28. Showcase II Deployment issue • Constant loss • Only in one farm • Depends on traffic • Only a specific producer type • From all of its nodes
  29. 29. Showcase III Consumer jobs issues • Our auditing detected a loss in Alpha • Data stuck in a job failure dir • Functional monitoring missed it • We streamed the stuck data
  30. 30. Showcase IV Producer issues • Offline producer gets stuck • Functional monitoring misses
  31. 31. Implementation Auditing architecture (diagram: producers, events, audit events, control, Audit Loader, Audit DB, Audit Aggregator, freshness)
  32. 32. Implementation Auditing architecture (same diagram, repeated before the Audit Loader details)
  33. 33. Implementation Audit Loader • Storm topology • Load audit events from Kafka to MySQL. Example audit table (bulk, tier, TS, count): xyz:123, WRPA, 08:34, 750; xyz:123, DSPT, 09:05, 405; xyz:123, DSPT, 10:13, 250. (Diagram: audit events → Audit Loader → Audit DB)
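
A rough sketch of what such a loader bolt might look like with Storm's Java API (package names are for Storm 1.x and later; the JSON parsing helper, the Bulk holder, and the audit_counts table are assumptions, and a production bolt would open its connection once in prepare() rather than per tuple):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Collections;
    import java.util.List;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Tuple;

    public class AuditLoaderBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String json = tuple.getString(0); // a raw AuditControlEvent read off the Kafka spout
            try (Connection conn = DriverManager.getConnection("jdbc:mysql://audit-db/audit", "user", "pass");
                 PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO audit_counts (audit_metadata, data_tier, insertion_time, events_count) " +
                     "VALUES (?, ?, NOW(), ?)")) {
                for (Bulk bulk : parseBulks(json)) {  // one row per bulk in the control event
                    ps.setString(1, bulk.bulkId);
                    ps.setString(2, bulk.dataTier);
                    ps.setLong(3, bulk.totalCount);
                    ps.addBatch();
                }
                ps.executeBatch();
            } catch (Exception e) {
                throw new RuntimeException("Failed to load audit control event", e);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { /* terminal bolt, no output stream */ }

        // Minimal holder for one bulk entry of a control event.
        static class Bulk { String bulkId; String dataTier; long totalCount; }

        // Hypothetical JSON parsing helper; a real bolt would use a JSON library here.
        private static List<Bulk> parseBulks(String json) { return Collections.emptyList(); }
    }
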
  34. 34. Implementation Auditing architecture (same diagram, repeated before the Audit Aggregator details)
  35. 35. Implementation Audit Aggregator • Load data from HDFS • Aggregate events according to audit metadata • Save aggregated audit data to MySQL • Spark implementation
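
In Spark's Java API the aggregation step could look roughly like this. The HDFS path, the serialized event format, and extractAuditHeader() are placeholders, and printing stands in for the MySQL save (see the revised design below for how the save is done together with the offset):

    import java.util.Map;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class AuditAggregatorJob {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("audit-aggregator");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Count events per audit header (<host_name>:<bulk_id>:<sequence_id>).
                Map<String, Long> countsPerBulk = sc.textFile("hdfs:///data/raw/events/*")
                    .mapToPair(line -> new Tuple2<>(extractAuditHeader(line), 1L))
                    .reduceByKey(Long::sum)
                    .collectAsMap();
                countsPerBulk.forEach((bulk, count) -> System.out.println(bulk + " -> " + count));
            }
        }

        // Hypothetical helper: pull the audit header field out of a serialized event line.
        private static String extractAuditHeader(String rawEvent) {
            return rawEvent.split("\t")[0];
        }
    }
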
  36. 36. Audit Aggregator job First Generation (diagram: read data from HDFS, aggregate per bulk (∑ #1 = N1, ∑ #2 = N2, ∑ #3 = N3), collect & save to the DB; the offset is kept in ZooKeeper)
  37. 37. Audit Aggregator job Overcoming Pitfalls • Our jobs run incrementally or manually • Offset management by ZooKeeper • Failing during the saving stage leads to a lost offset • Saving data and offset on the same stream
  38. 38. Audit Aggregator job Revised Design (diagram: read data from HDFS, aggregate per bulk (∑ #1 = N1, ∑ #2 = N2, ∑ #3 = N3), collect & save data and offset together in the DB). Example audit table (bulk, tier, TS, count): xyz:123, WRPA, 08:34, 750; xyz:123, DSPT, 09:05, 405; xyz:123, DSPT, 10:13, 250.
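
The point of the revised design is that the aggregated counts and the new offset are persisted together, so a failure during the save cannot lose the offset. A minimal JDBC sketch, assuming hypothetical audit_counts and job_offsets tables in the same MySQL database:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Map;

    public class TransactionalSaver {
        public static void save(Map<String, Long> countsPerBulk, long newOffset) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:mysql://audit-db/audit", "user", "pass")) {
                conn.setAutoCommit(false);
                try (PreparedStatement counts = conn.prepareStatement(
                         "INSERT INTO audit_counts (audit_metadata, data_tier, events_count) VALUES (?, 'DSPT', ?)");
                     PreparedStatement offset = conn.prepareStatement(
                         "UPDATE job_offsets SET last_offset = ? WHERE job = 'audit-aggregator'")) {
                    for (Map.Entry<String, Long> e : countsPerBulk.entrySet()) {
                        counts.setString(1, e.getKey());
                        counts.setLong(2, e.getValue());
                        counts.addBatch();
                    }
                    counts.executeBatch();
                    offset.setLong(1, newOffset);
                    offset.executeUpdate();
                    conn.commit();    // counts and offset become visible together
                } catch (Exception e) {
                    conn.rollback();  // neither counts nor offset are persisted; the job can safely re-run
                    throw e;
                }
            }
        }
    }
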
  39. 39. Audit Aggregator job Why Spark • Precedent - Spark Streaming for online auditing • We see our future with Spark • Cluster utilization • Performance – In-memory computation – Supports multiple shuffles – Unified data processing: batch/streaming
  40. 40. Implementation Auditing architecture (same diagram, repeated before the Data Freshness details)
  41. 41. Implementation Data Freshness • End-to-end latency assessment • Freshness per criteria • Output - various stats
  42. 42. Freshness job Design (diagram: events from HDFS go through Map and Reduce, grouped by total, LP customer, and event type; output per group: count, min, max, avg, buckets)
  43. 43. Freshness job Mechanism • Driver – Collects LP events from HDFS • Map – Compute freshness latencies – Segment events per criterion by generating a composite key • Reduce – Compute count, min, max, avg and buckets – Write stats to HDFS
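
A condensed Hadoop MapReduce sketch of that mechanism; the tab-separated event layout and the composite-key format are assumptions:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class FreshnessJob {
        public static class FreshnessMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
                // Assumed event layout: eventType \t lpCustomer \t creationTimeMs \t insertionTimeMs
                String[] f = value.toString().split("\t");
                long latencyMin = (Long.parseLong(f[3]) - Long.parseLong(f[2])) / 60_000;
                // Emit one record per segmentation criterion (composite key): total, event type, LP customer.
                ctx.write(new Text("total"), new LongWritable(latencyMin));
                ctx.write(new Text("type:" + f[0]), new LongWritable(latencyMin));
                ctx.write(new Text("customer:" + f[1]), new LongWritable(latencyMin));
            }
        }

        public static class FreshnessReducer extends Reducer<Text, LongWritable, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> latencies, Context ctx)
                    throws IOException, InterruptedException {
                long count = 0, min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum = 0;
                long[] buckets = new long[4]; // 0-30, 30-60, 60-120, 120-∞ minutes
                for (LongWritable l : latencies) {
                    long v = l.get();
                    count++; sum += v;
                    min = Math.min(min, v);
                    max = Math.max(max, v);
                    buckets[v < 30 ? 0 : v < 60 ? 1 : v < 120 ? 2 : 3]++;
                }
                ctx.write(key, new Text(String.format("count=%d min=%d max=%d avg=%.1f buckets=%d/%d/%d/%d",
                    count, min, max, (double) sum / count, buckets[0], buckets[1], buckets[2], buckets[3])));
            }
        }
    }
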
  44. 44. Freshness job Output usage
  45. 45. Hadoop Platform Overcoming Pitfalls • Our data model is built over Avro • Avro comes with schema evolution • Avro data is stored along with its schema • High model-modification rate • LOBs schema changes are synchronized Producer → Consumer
  46. 46. Hadoop Platform Overcoming Pitfalls • MR/Spark job is revision-compiled when using SpecificRecord • Using GenericRecord removes the burden of recompiling each time schema changes
  47. 47. Implementation Auditing architecture (same diagram, repeated as a closing recap)
  48. 48. THANK YOU! We are hiring
  49. 49. YouTube.com/LivePersonDev Twitter.com/LivePersonDev Facebook.com/LivePersonDev Slideshare.net/LivePersonDev
