Lambda-less Stream Processing
@Scale in LinkedIn
  1. Lambda-less Stream Processing @Scale in LinkedIn
     Yi Pan (Apache Samza PMC/Committer)
     Kartik Paramasivam (Mgr - Streams Infra)
     June, 2016
  2. Agenda
     • Rise of Stream Processing Applications
     • Some Hard Problems in Stream Processing
       – Data Accuracy
       – Reprocessing
     • Conclusion
  3. Newsfeed
  4. Cyber-security
  5. Internet of Things
  6. Agenda (repeated from slide 2)
  7. Data Accuracy
     • Can stream processing generate accurate results?
       – Yes, but it is not trivial.
  8. Case Study
     [Diagram labels: Ads (HTML), 1:00pm, AdViewEvents, AdQuality processor]
  9. Case Study
     [Diagram labels: Ads (HTML), 1:01pm, AdViewEvents, AdClickEvents, AdQuality processor]
  10. Case Study
      [Diagram labels: Ads (HTML), 1:01pm, AdViewEvents, AdClickEvents, AdQuality processor]
      Did the AdClick happen within 2min of the AdView?
      • Yes: Good Ad
      • No: Bad Ad
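The decision on slide 10 can be written down as a small rule. The following is a minimal sketch in Python, assuming each event carries an event-time timestamp; the function name `classify_ad` and the choice to treat a missing click as a bad ad are illustrative assumptions, not taken from the slides.

```python
from datetime import datetime, timedelta
from typing import Optional

# The 2-minute threshold comes from slide 10; the name is an assumption.
CLICK_WINDOW = timedelta(minutes=2)


def classify_ad(view_time: datetime, click_time: Optional[datetime]) -> str:
    """Return 'good' if a click followed the view within CLICK_WINDOW, else 'bad'."""
    if click_time is None:
        # Assumption: no click at all counts as a bad ad.
        return "bad"
    delay = click_time - view_time
    if timedelta(0) <= delay <= CLICK_WINDOW:
        return "good"
    return "bad"
```

The rule itself is trivial; the following slides are about the harder part, which is that the click and the view may not reach the processor in time or in order.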
  11. Delays in Event Stream
      [Diagram labels: DATACENTER 1 and DATACENTER 2, each with Services Tier, Kafka, and Ad Quality Processor (Samza); Kafka Mirrored; LB; AdViewEvent]
  12. Delays in Event Stream: Late Arrival
      [Diagram labels: DATACENTER 1 and DATACENTER 2, each with Services Tier, Kafka, and Real Time Processing (Samza); Kafka Mirrored; LB; AdClick Event]
  13. Delays in Event Stream: Out of Order Arrival
      [Diagram labels: DATACENTER 1 and DATACENTER 2, each with Services Tier, Kafka, and Real Time Processing (Samza); Kafka Mirrored; LB; AdClick Event]
  14. Lambda at LinkedIn
      [Diagram labels: Clients (browser, devices, ..), Services Tier, Kafka, Ingestion, Real Time Processing (Samza), Batch Processing (Hadoop/Spark), Processing, Bulk upload, Serving, Voldemort R/O, Espresso]
  15. Data Accuracy - with Lambda
      • Basic assumption: batch jobs have the full dataset
      • But what about the edges?
      Smaller batch size == more edges!
      [Graph: system time axis, 10:00, 11:00, 12:00, 13:00. Graph kudos to Stream Processing 101 from Tyler Akidau (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101)]
  16. Fixing Lambda
      [Diagram labels: Clients (browser, devices, ….), Services Tier, Kafka, Ingestion, Real Time Processing (Samza), Batch Processing (Hadoop/Spark), Processing, Bulk upload, Serving, Voldemort R/O, Espresso, Kafka Audit, Check Safe Start Time]
  17. Observation
      • Data accuracy is still very hard with Lambda
        – An additional system (e.g. Kafka Audit) has to be used to safely start the batch jobs
      • Duplication in online/offline systems:
        – Development cost
        – Operational overhead
        – Maintenance overhead
  18. Going Lambda-less
      • Handle late arrivals and out of order arrivals
      • Eventually correct results
        – Compute results at the end of the ‘window’.
        – Re-compute when events arrive late
      • Influenced by “Google MillWheel”
  19. Going Lambda-less
      [Diagram labels: AdViewEvent (Kafka), AdClickEvent (Kafka), AdQuality processor; event timestamps on the two streams: 1:00pm, 1:01pm, 1:01pm, 1:02pm, 1:02pm and 1:00pm, 1:02pm; window(“1:00pm”, “2min”)]
      Window output is computed at the end of the window (2min after the window is created)
  20. Handling ‘late arrival’
      [Diagram labels: AdViewEvent (Kafka), AdClickEvent (Kafka), AdQuality processor; event timestamps on the two streams: 1:00pm, 1:01pm, 1:01pm, 1:02pm, 1:02pm and 1:00pm, 1:02pm; late-arrival event: 1:01pm]
      Re-compute window(“1:00pm”, “2min”)
  21. Handling ‘out of order arrival’
      [Diagram labels: AdViewEvent (Kafka), AdClickEvent (Kafka), AdQuality processor; event timestamps on the two streams: 1:01pm, 1:02pm and 1:00pm, 1:02pm]
      null join result in window(“1:00pm”, “2min”)
  22. Handling ‘out of order arrival’
      [Diagram labels: AdViewEvent (Kafka), AdClickEvent (Kafka), AdQuality processor; event timestamps on the two streams: 1:01pm, 1:02pm, 1:00pm, 1:01pm and 1:00pm, 1:02pm; out-of-order arrival]
      Re-compute window(“1:00pm”, “2min”)
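Slides 18 to 22 describe emitting a result when the window ends and re-computing it when a late or out-of-order event shows up. The sketch below illustrates that idea in plain Python; it is not Samza's window operator. The class `AdQualityJoin`, its method names, and keying the window by a hypothetical impression id (rather than by time range, as in window(“1:00pm”, “2min”)) are all assumptions made to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional

WINDOW_SECONDS = 120  # the 2min window from the slides


@dataclass
class AdWindow:
    view_ts: Optional[float] = None
    click_ts: Optional[float] = None
    closed: bool = False               # True once the window end has passed
    last_result: Optional[str] = None  # last verdict emitted for this window


class AdQualityJoin:
    """Joins views and clicks per impression and corrects results after the fact."""

    def __init__(self):
        self.windows = {}  # impression id -> AdWindow (hypothetical key)
        self.output = []   # emitted (impression id, verdict); later entries supersede earlier

    @staticmethod
    def _join(w):
        if w.view_ts is None:
            return None  # null join result: click seen, view not yet seen
        if w.click_ts is not None and 0 <= w.click_ts - w.view_ts <= WINDOW_SECONDS:
            return "good"
        return "bad"

    def on_event(self, kind, impression_id, event_ts):
        w = self.windows.setdefault(impression_id, AdWindow())
        if kind == "view":
            w.view_ts = event_ts
        elif kind == "click":
            w.click_ts = event_ts
        else:
            raise ValueError("unknown event kind: " + kind)
        if w.closed:
            # Late or out-of-order arrival: re-compute the already emitted window.
            result = self._join(w)
            if result != w.last_result:
                w.last_result = result
                self.output.append((impression_id, result))

    def close_window(self, impression_id):
        """Called when the window ends; emits the first result for it."""
        w = self.windows.setdefault(impression_id, AdWindow())
        w.closed = True
        w.last_result = self._join(w)
        self.output.append((impression_id, w.last_result))
```

A downstream consumer of `output` has to accept that a verdict can be replaced later, which is what "eventually correct results" on slide 18 implies. The sketch keeps windows forever; a real job would need some retention limit.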
  23. Samza based Solution
      [Diagram labels: Kafka (AdClicks, AdView), Samza Job, SamzaContainer-0, SamzaContainer-1, Task1, Task2, Task3, Changelog in Kafka]
      Events are saved into a RocksDB based local message store which is backed up durably in Kafka
  24. Performance
      [Diagram labels: Kafka (AdClicks, AdView), Samza Job, SamzaContainer-0, SamzaContainer-1, Task1, Task2, Task3, Changelog in Kafka]
      Performance of Samza’s local RocksDB store:
      - 1.1 Million TPS (read/write) on a single machine (SSD)
      - Largest production job has 1.5 Terabyte of local state
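The pattern on slides 23 and 24, a fast local store whose writes are also logged durably so the state can be rebuilt, can be sketched as follows. This is an in-memory illustration only: the dict stands in for RocksDB, the list stands in for the Kafka changelog topic, and the class name `ChangelogBackedStore` and the use of `None` as a delete marker are assumptions.

```python
class ChangelogBackedStore:
    """Local key-value store that appends every mutation to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stand-in for a Kafka changelog topic
        self.local = {}             # stand-in for the RocksDB store

    def put(self, key, value):
        if value is None:
            raise ValueError("None is reserved as the delete marker")
        self.local[key] = value
        self.changelog.append((key, value))

    def get(self, key, default=None):
        return self.local.get(key, default)

    def delete(self, key):
        self.local.pop(key, None)
        self.changelog.append((key, None))  # delete marker

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state on a new machine by replaying the changelog."""
        store = cls(changelog)
        for key, value in list(changelog):
            if value is None:
                store.local.pop(key, None)
            else:
                store.local[key] = value
        return store
```

Reads and writes stay local, which is where the throughput on slide 24 comes from; the changelog is only read back when a task has to be restored elsewhere.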
  25. Agenda (repeated from slide 2)
  26. Reprocessing
      • What is reprocessing?
        – Process events that happened in the past.
  27. Case Study: Title Standardization
      LinkedIn Profile, change ‘Title’:
      Before: Architect
      After: Chief Technology Nerd
      [Diagram labels: Title Standardizer, Search, Ads, ….]
  28. Title Standardizer - Implementation
      [Diagram labels: Member Database (espresso), Databus, Profile Updates, Kafka, (Samza) Title-Standardizer, Machine Learning model, output]
  29. Reprocessing - dealing with bugs
      [Diagram labels: Member Database (espresso), Databus, Profile Updates, Kafka, (Samza) Title-Standardizer, Machine Learning model, output; rewind 4 hours]
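Slide 29 shows the job being rewound by 4 hours after a bug. One way to picture the rewind is as a lookup from a wall-clock time to a position in the input log. The sketch below assumes an in-memory list of message timestamps in offset order; the function name `rewind_offset` is hypothetical and this is not a Kafka or Databus API.

```python
import bisect


def rewind_offset(timestamps, now, rewind_seconds):
    """Return the first offset whose timestamp is >= now - rewind_seconds.

    timestamps: message timestamps in offset order, assumed non-decreasing.
    """
    target = now - rewind_seconds
    return bisect.bisect_left(timestamps, target)
```

For example, with one message per hour at times 0, 3600, ..., 18000 and `now = 18000`, rewinding 4 hours (14400 seconds) resumes at offset 1. Reprocessing from that offset re-emits results for the affected period, so the output has to tolerate being overwritten.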
  30. Reprocessing - entire Dataset
      [Diagram labels: Member Database (espresso), Databus, Profile Updates, Kafka, (Samza) Title-Standardizer, Machine Learning model (NEW), output, Bootstrap, Backup, Database Backup (NFS); set offset=0]
  31. Reprocessing - entire Dataset
      [Diagram labels: Member Database (espresso), Databus (two instances), Profile Updates, Kafka, (Samza) Title-Standardizer (two instances), Machine Learning model (NEW), output, Bootstrap, Backup, Database Backup (NFS); set offset=0]
  32. Reprocessing - entire Dataset
      [Diagram labels: Databus (two instances), Profile Updates, Kafka, (Samza) Title-Standardizer (two instances), Machine Learning model (NEW), output, Bootstrap, Backup, (Samza) Merge and Store Results]
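The slides do not say how the "Merge and Store Results" job on slide 32 decides between an online result and a reprocessed one. The sketch below shows one possible policy, last writer wins by the timestamp of the source profile update; the function name `merge_results`, the `(timestamp, value)` record shape, and the tie-break in favour of the online result are all assumptions.

```python
def merge_results(reprocessed, online):
    """Merge two result maps of key -> (source_update_ts, value).

    The record derived from the newer source update wins; on a tie the
    online result is kept.
    """
    merged = dict(reprocessed)
    for key, (ts, value) in online.items():
        if key not in merged or ts >= merged[key][0]:
            merged[key] = (ts, value)
    return merged
```

Under this policy a member who edits their title while the bootstrap is still running keeps the result computed from the newer edit, regardless of which path finishes last.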
  33. Reprocessing - Caveats
      • Stream processors are fast. They can DoS the system if you reprocess
        – Control the max-concurrency of your job
        – Quotas for Kafka, databases
      • Reprocessing a 100 TB source?
        – Capacity?
        – Saturation of NICs, top-of-rack switches
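The first caveat, capping the concurrency of a reprocessing job so it cannot overwhelm downstream systems, can be illustrated with a bounded worker pool. This is a generic Python sketch, not a Samza setting; `reprocess`, `handler`, and `max_concurrency` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def reprocess(records, handler, max_concurrency):
    """Apply handler to every record with at most max_concurrency calls in flight."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map returns results in input order.
        return list(pool.map(handler, records))
```

A cap like this limits the number of in-flight calls but not their rate; the quotas mentioned on the slide would still be needed on the Kafka and database side.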
  34. Reprocessing larger datasets
      [Diagram labels: Databus, Profile Updates, Kafka, (Samza) Title-Standardizer, Machine Learning model, output, (Samza) Merge and Store Results; Hadoop: Database Dump in HDFS, (Samza) Title-Standardizer, ML Model in HDFS]
  35. Experimentation
      [Diagram labels: Hadoop: Database Dump in HDFS, (Samza) Title-Standardizer, ML Model in HDFS, Output in HDFS]
      ● Offline experimentation before pushing the logic online
        ○ Most datasets are already available in Hadoop (at LinkedIn)
        ○ Fast iteration with minimum impact to production
  36. Conclusion
      1. It is possible to avoid code duplication (hot/cold path) to support
         – Accuracy
         – Reprocessing
      2. Some Lambda related problems still linger when reprocessing entire datasets
         – e.g. merging online/reprocessing results
  37. References
      • MillWheel: http://research.google.com/pubs/pub41378.html
      • DataFlow: http://research.google.com/pubs/pub41378.html
      • Samza: http://samza.apache.org/
      • Window Operator in Samza: https://issues.apache.org/jira/browse/SAMZA-552
      • Lambda Architecture: https://www.manning.com/books/big-data
      • Stream Processing 101: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
      • Stream Processing 102: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
  38. Thank You!
