Lambda-less Stream Processing @Scale in LinkedIn

  1. Lambda-less Stream Processing @Scale in LinkedIn. Yi Pan (Apache Samza PMC/Committer), Kartik Paramasivam (Mgr, Streams Infra). June 2016
  2. Agenda • Rise of Stream Processing Applications • Some Hard Problems in Stream Processing – Data Accuracy – Reprocessing • Conclusion
  3. Newsfeed
  4. Cyber-security
  5. Internet of Things
  6. Agenda • Rise of Stream Processing Applications • Some Hard Problems in Stream Processing – Data Accuracy – Reprocessing • Conclusion
  7. Data Accuracy • Can Stream Processing generate accurate results? – Yes, but it is not trivial.
  8. Case Study Ads HTML 1:00pm AdViewEvents AdQuality processor
  9. Case Study Ads HTML 1:01pm AdViewEvents AdQuality processor AdClickEvents
  10. Case Study Ads HTML 1:01pm AdViewEvents AdQuality processor AdClickEvents • Did AdClick happen within 2min of AdView? Yes: Good Ad. No: Bad Ad.
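
The rule on slide 10 can be written down in a few lines. This is a minimal sketch, not the AdQuality processor itself: the function name `classify_ad` and the use of `datetime` values are assumptions made for illustration, and a view with no click at all is assumed to count as a bad ad.

```python
from datetime import datetime, timedelta
from typing import Optional

# The 2-minute rule from the slide.
CLICK_WINDOW = timedelta(minutes=2)


def classify_ad(view_time: datetime, click_time: Optional[datetime]) -> str:
    """Return "good" if a click followed the view within CLICK_WINDOW, else "bad".

    Assumption: a view that never receives a click is classified as "bad".
    """
    if click_time is None:
        return "bad"
    delta = click_time - view_time
    # A click before the view, or more than 2 minutes after it, does not count.
    if timedelta(0) <= delta <= CLICK_WINDOW:
        return "good"
    return "bad"
```

The hard part, which the following slides address, is that `view_time` and `click_time` may reach the processor late or in the wrong order.
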
  11. Delays in Event Stream Ad Quality Processor (Samza) Services Tier Kafka Services Tier Ad Quality Processor (Samza) Kafka Mirrored Yi DATACENTER 1 DATACENTER 2 AdViewEvent LB
  12. Delays in Event Stream: Late Arrival Real Time Processing (Samza) Services Tier Kafka Services Tier Real Time Processing (Samza) Kafka Mirrored Yi DATACENTER 1 DATACENTER 2 AdClick Event LB
  13. Delays in Event Stream: Out of Order Arrival Real Time Processing (Samza) Services Tier Kafka Services Tier Real Time Processing (Samza) Kafka Mirrored Yi DATACENTER 1 DATACENTER 2 AdClick Event LB
  14. Lambda at LinkedIn Real Time Processing (Samza) Batch Processing (Hadoop/Spark) Voldemort R/O Processing Bulk upload Espresso Services Tier Ingestion Serving Clients (browser, devices, ...) Kafka
  15. Data Accuracy - with Lambda • Basic Assumption: Batch jobs have full data set • But, how about edges? Smaller batch size == more edges! Graph kudos to Stream Processing 101 from Tyler Akidau (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101) (Figure axis: system time, 10:00 to 13:00)
  16. Fixing Lambda Real Time Processing (Samza) Batch Processing (Hadoop/Spark) Voldemort R/O Processing Bulk upload Espresso Services Tier Ingestion Serving Clients (browser, devices, ...) Kafka Kafka Audit Check Safe Start Time
  17. Observation • Data Accuracy is still very hard with Lambda – Additional system (e.g. Kafka Audit) has to be used to safely start the batch jobs • Duplication in Online/Offline system: – Development cost – Operational overhead – Maintenance overhead
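
To make the "Safe Start Time" idea from slides 16 and 17 concrete, here is a rough sketch. It assumes the audit system exposes per-hour message counts at the producing tier and at the batch landing zone. That data shape, the function name `safe_start_hour`, and the `tolerance` parameter are all hypothetical and not taken from the slides.

```python
from typing import Dict, Optional


def safe_start_hour(
    produced: Dict[int, int],
    landed: Dict[int, int],
    tolerance: float = 0.0,
) -> Optional[int]:
    """Return the latest hour H such that every audited hour up to H is complete.

    produced / landed map an hour bucket to a message count (assumed shape).
    An hour is complete when the landed count reaches the produced count,
    minus an optional fractional tolerance. Returns None if even the earliest
    hour is incomplete.
    """
    safe = None
    for hour in sorted(produced):
        required = produced[hour] * (1.0 - tolerance)
        if landed.get(hour, 0) < required:
            break  # stop at the first gap; later hours are not safe either
        safe = hour
    return safe
```

A batch job would then process only hours up to the returned value, which is the extra coordination step the Observation slide counts as a cost of Lambda.
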
  18. Going Lambda-less • Handle late arrivals and out of order arrivals • Eventually correct results – Compute results at end of ‘window’. – Re-compute when events arrive late • Influenced by “Google MillWheel”
  19. Going Lambda-less AdViewEvent AdClickEvent AdQuality processor 1:00pm 1:01pm 1:01pm 1:02pm 1:02pm 1:00pm 1:02pm Window output is computed at the end of window = (2min after the window is created) window(“1:00pm”, “2min”) Kafka Kafka
  20. Handling ‘late arrival’ 1:00pm 1:01pm 1:01pm 1:02pm 1:02pm 1:00pm 1:02pm 1:01pm Late-arrival Re-compute window(“1:00pm”, “2min”) Kafka Kafka AdViewEvent AdClickEvent AdQuality processor
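
The "compute at end of window, re-compute on late arrival" behaviour from slides 18 to 20 can be sketched as below. This is a simplified illustration under stated assumptions: it uses fixed 2-minute tumbling windows keyed by event time in seconds and counts events, whereas the slides describe a window created per AdView. The class `WindowedCounter` and its methods are hypothetical names, not Samza API.

```python
from collections import defaultdict

WINDOW_SECONDS = 120  # 2-minute window, as in window("1:00pm", "2min")


class WindowedCounter:
    """Counts events per event-time window and corrects results on late arrival."""

    def __init__(self):
        self.counts = defaultdict(int)  # window start -> events seen so far
        self.closed = set()             # windows whose first result was emitted
        self.emitted = []               # (window_start, count, is_correction)

    @staticmethod
    def window_start(event_time: int) -> int:
        return event_time - event_time % WINDOW_SECONDS

    def on_event(self, event_time: int) -> None:
        start = self.window_start(event_time)
        self.counts[start] += 1
        if start in self.closed:
            # Late arrival: the window already produced output, so re-compute
            # and emit a corrected result instead of dropping the event.
            self.emitted.append((start, self.counts[start], True))

    def on_timer(self, now: int) -> None:
        # Emit the first result for every window that has reached its end.
        for start in sorted(self.counts):
            if start not in self.closed and now >= start + WINDOW_SECONDS:
                self.closed.add(start)
                self.emitted.append((start, self.counts[start], False))
```

Downstream consumers of such output have to accept updates to a result they have already seen, which is what "eventually correct" implies.
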
  21. Handling ‘out of order arrival’ 1:01pm 1:02pm 1:00pm 1:02pm null join result in window(“1:00pm”, “2min”) Kafka Kafka AdViewEvent AdClickEvent AdQuality processor
  22. Handling ‘out of order arrival’ 1:01pm 1:02pm 1:00pm 1:01pm 1:00pm 1:02pm Re-compute window(“1:00pm”, “2min”) Out-of-order arrival Kafka Kafka AdViewEvent AdClickEvent AdQuality processor
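
Slides 21 and 22 show a click arriving before its view: the join first yields null and is re-computed once the view shows up. A minimal sketch of that idea follows, assuming both streams carry a shared join key (called `ad_id` here) and event times in seconds. The class `AdQualityJoin` is a hypothetical illustration, and it returns a provisional result on every arrival rather than waiting for the window to end.

```python
from typing import Dict, Optional

CLICK_WINDOW_SECONDS = 120  # click must follow the view within 2 minutes


class AdQualityJoin:
    """Buffers both sides of the join so arrival order does not matter."""

    def __init__(self):
        self.views: Dict[str, int] = {}   # ad_id -> view event time
        self.clicks: Dict[str, int] = {}  # ad_id -> click event time

    def _result(self, ad_id: str) -> Optional[str]:
        view = self.views.get(ad_id)
        if view is None:
            # Click seen but view not yet arrived: null join result for now.
            return None
        click = self.clicks.get(ad_id)
        if click is not None and 0 <= click - view <= CLICK_WINDOW_SECONDS:
            return "good"
        return "bad"

    def on_view(self, ad_id: str, event_time: int) -> Optional[str]:
        self.views[ad_id] = event_time
        return self._result(ad_id)  # re-compute with whatever is buffered

    def on_click(self, ad_id: str, event_time: int) -> Optional[str]:
        self.clicks[ad_id] = event_time
        return self._result(ad_id)
```

Because the comparison uses event times carried in the messages, the outcome is the same whichever side arrives first.
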
  23. Samza based Solution SamzaContainer-1 Kafka AdClicks SamzaContainer-0 Task1 Task2 Task3 AdView Events are saved into RocksDB based local message store which is backed up durably in Kafka Kafka Samza Job Changelog in Kafka
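
The local-store-plus-changelog pattern on slide 23 can be illustrated without any real infrastructure. In this sketch a Python dict stands in for RocksDB and a Python list stands in for the Kafka changelog topic. The class `ChangelogBackedStore` is a hypothetical stand-in and does not reflect Samza's actual key-value store interface.

```python
from typing import Any, List, Optional, Tuple

Changelog = List[Tuple[str, Optional[Any]]]


class ChangelogBackedStore:
    """Local key-value store whose every write is also appended to a changelog."""

    def __init__(self, changelog: Changelog):
        self.changelog = changelog  # list standing in for a Kafka topic
        self.local = {}             # dict standing in for RocksDB

    def put(self, key: str, value: Any) -> None:
        self.changelog.append((key, value))  # durable copy
        self.local[key] = value              # fast local copy

    def delete(self, key: str) -> None:
        self.changelog.append((key, None))   # None acts as a tombstone
        self.local.pop(key, None)

    def get(self, key: str) -> Optional[Any]:
        return self.local.get(key)

    @classmethod
    def restore(cls, changelog: Changelog) -> "ChangelogBackedStore":
        """Rebuild local state on a fresh machine by replaying the changelog."""
        store = cls(changelog)
        for key, value in list(changelog):
            if value is None:
                store.local.pop(key, None)
            else:
                store.local[key] = value
        return store
```

Reads stay local, and a replacement container can recover its state by replaying the log, which is the durability property the slide describes.
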
  24. Performance SamzaContainer-1 Kafka AdClicks SamzaContainer-0 Task1 Task2 Task3 AdView Performance of Samza’s local RocksDB store: - 1.1 Million TPS (read/write) on single machine (SSD) - Largest production job has 1.5 Terabyte of local state Kafka Samza Job Changelog in Kafka
  25. Agenda • Rise of Stream Processing Applications • Some Hard Problems in Stream Processing – Data Accuracy – Reprocessing • Conclusion
  26. Reprocessing • What is reprocessing? – Process events that happened in the past.
  27. Case Study: Title Standardization LinkedIn Profile change ‘Title’: Before: Architect After: Chief Technology Nerd Title Standardizer Search Ads ….
  28. Title Standardizer - Implementation output Member Database (espresso) Profile Updates (Samza) Title-Standardizer Machine Learning model Kafka Databus
  29. Reprocessing - dealing with bugs output Member Database (espresso) Profile Updates (Samza) Title-Standardizer Kafka Databus rewind 4 hours Machine Learning model
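
"Rewind 4 hours" on slide 29 amounts to finding the first log position at or after a target time and resuming from there. The sketch below models the input log as a list of message timestamps in ascending order. The function `rewind_offset` is a hypothetical helper for illustration and is not a Kafka or Databus call.

```python
import bisect
from typing import Sequence


def rewind_offset(timestamps: Sequence[int], now: int, rewind_seconds: int) -> int:
    """Return the first offset whose timestamp is >= now - rewind_seconds.

    timestamps[i] is the timestamp of the message at offset i and is assumed
    to be non-decreasing. Returns len(timestamps) if every message is older
    than the target, and 0 if the whole log falls inside the rewind period.
    """
    target = now - rewind_seconds
    return bisect.bisect_left(timestamps, target)
```

After fixing the bug, the job would restart from the returned offset so that every update from the last four hours passes through the corrected logic again.
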
  30. Reprocessing - entire Dataset output Member Database (espresso) Profile Updates (Samza) Title-Standardizer Kafka Databus Bootstrap Backup Database Backup (NFS) set offset=0 Machine Learning model (NEW)
  31. Reprocessing - entire Dataset Profile Updates (Samza) Title-Standardizer (Samza) Title-Standardizer Bootstrap Backup Machine Learning model (NEW) output Kafka Databus Databus Member Database (espresso) Database Backup (NFS) set offset=0
  32. Reprocessing - entire Dataset Profile Updates (Samza) Title-Standardizer (Samza) Title-Standardizer Bootstrap Backup Machine Learning model (NEW) output Kafka Databus Databus (Samza) Merge and Store Results
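
The slides do not say how the "Merge and Store Results" job on slide 32 decides between an online result and a reprocessed one. One plausible rule, shown here purely as an assumption, is to prefer the result derived from the newer profile update and, on a tie, the one from the newer model. The `Result` fields and the `merge` function are hypothetical names.

```python
from typing import Dict, NamedTuple


class Result(NamedTuple):
    member_id: str
    title: str
    source_version: int  # assumed: version of the profile update that was read
    model_version: int   # assumed: version of the ML model that produced it


def merge(store: Dict[str, Result], incoming: Result) -> None:
    """Keep the result from the newest profile update, then the newest model."""
    current = store.get(incoming.member_id)
    incoming_rank = (incoming.source_version, incoming.model_version)
    if current is None or incoming_rank >= (
        current.source_version,
        current.model_version,
    ):
        store[incoming.member_id] = incoming
```

With a rule like this, a bootstrap job replaying an old backup cannot overwrite a fresher online result, yet it still upgrades results for profiles that have not changed since the backup.
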
  33. Reprocessing - Caveats • Stream processors are fast. They can DoS the system if you reprocess – Control max-concurrency of your job – Quotas for Kafka, Databases • Reprocessing a 100 TB source? – Capacity? – Saturation of NICs, top-of-rack switches
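
One common way to enforce the kind of quota slide 33 calls for is a token bucket placed in front of calls to a downstream system. The slides do not name a specific mechanism, so this is a swapped-in technique for illustration. The class `TokenBucket` is hypothetical, and the clock is passed in explicitly so the behaviour is easy to check.

```python
class TokenBucket:
    """Allows at most `rate_per_sec` calls per second, with bursts up to `capacity`."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = now

    def try_acquire(self, now: float) -> bool:
        """Return True if one call may proceed at time `now`, else False."""
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A reprocessing task would call `try_acquire` before each database or Kafka request and back off when it returns False, so a full rewind cannot exceed the rate the serving systems were provisioned for.
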
  34. Reprocessing larger datasets Profile Updates (Samza) Title-Standardizer Machine Learning model output Kafka Databus (Samza) Merge and Store Results Database Dump in HDFS (Samza) Title-Standardizer ML Model in HDFS Hadoop
  35. Experimentation Database Dump in HDFS (Samza) Title-Standardizer Hadoop ML Model in HDFS Output in HDFS ● Offline experimentation before pushing the logic online ○ Most datasets are already available in Hadoop (at LinkedIn) ○ Fast iteration with minimum impact to production
  36. Conclusion 1. It is possible to avoid code duplication (hot/cold path) to support – Accuracy – Reprocessing 2. Some Lambda-related problems still linger when reprocessing entire datasets – e.g. merging online/reprocessing results
  37. References • MillWheel: http://research.google.com/pubs/pub41378.html • DataFlow: http://research.google.com/pubs/pub43864.html • Samza: http://samza.apache.org/ • Window Operator in Samza: https://issues.apache.org/jira/browse/SAMZA-552 • Lambda Architecture: https://www.manning.com/books/big-data • Stream Processing 101: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 • Stream Processing 102: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
  38. Thank You!