6.
Agenda
• Rise of Stream Processing Applications
• Some Hard Problems in Stream Processing
–Data Accuracy
–Reprocessing
• Conclusion
7.
Data Accuracy
• Can stream processing generate accurate results?
– Yes, but it is not trivial.
8.
Case Study
Ads
HTML
1:00pm
AdViewEvents
AdQuality processor
9.
Case Study
Ads
HTML
1:01pm
AdViewEvents
AdQuality processor
AdClickEvents
10.
Case Study
Ads
HTML
1:01pm
AdViewEvents
AdQuality processor
AdClickEvents
Did AdClick happen within 2 min of AdView?
Yes: Good Ad
No: Bad Ad
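The decision on this slide can be written down as a small rule. The following is a minimal sketch only: the function name `classify_ad`, the string labels, and the treatment of a click timestamped before its view are assumptions made for illustration, not part of the talk.

```python
from datetime import datetime, timedelta
from typing import Optional

# Threshold from the slide: a click counts if it lands within 2 minutes of the view.
CLICK_WINDOW = timedelta(minutes=2)


def classify_ad(view_time: datetime, click_time: Optional[datetime]) -> str:
    """Return "good" if a click followed the view within CLICK_WINDOW, else "bad".

    Hypothetical helper: the name and the labels are illustrative only.
    """
    if click_time is None:
        return "bad"  # no click was ever seen for this view
    delay = click_time - view_time
    # Assumption: a click timestamped before its view does not count as a match.
    if timedelta(0) <= delay <= CLICK_WINDOW:
        return "good"
    return "bad"


view = datetime(2016, 1, 1, 13, 0)
print(classify_ad(view, datetime(2016, 1, 1, 13, 1)))  # click 1 minute after the view
print(classify_ad(view, None))                         # no click at all
```

The rule itself is simple; the following slides are about why the two timestamps may not be available at the same time or in the right order.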
11.
Delays in Event
Stream
Ad Quality
Processor
(Samza)
Services Tier
Kafka
Services Tier
Ad Quality
Processor
(Samza)
KafkaMirrored
Yi
DATACENTER 1 DATACENTER 2
AdViewEvent
LB
12.
Real Time
Processing
(Samza)
Services Tier
Kafka
Services Tier
Real Time
Processing
(Samza)
KafkaMirrored
Yi
DATACENTER 1 DATACENTER 2
AdClick Event
LB
Delays in Event
Stream
Late
Arrival
13.
Real Time
Processing
(Samza)
Services Tier
Kafka
Services Tier
Real Time
Processing
(Samza)
KafkaMirrored
Yi
DATACENTER 1 DATACENTER 2
AdClick Event
LB
Delays in Event
stream
Out of
Order
Arrival
14.
Lambda at
LinkedIn
Real Time
Processing
(Samza)
Batch
Processing
(Hadoop/Spark)
Voldemort R/O
Processing
Bulk
upload
Espresso
Services Tier
Ingestion Serving
Clients(browser,devices,..)
Kafka
15.
Data Accuracy - with Lambda
• Basic assumption: batch jobs have the full dataset
• But what about the edges of each batch?
Smaller batch size == more edges!
Graph credit: "Streaming 101" by Tyler Akidau
(https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101)
[Graph: x-axis is system time, 10:00 to 13:00]
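The edge problem can be reproduced with a toy join. This is a sketch under stated assumptions: the sample events, the helper names, and the choice to slice batches by event time in seconds are all hypothetical. It applies the 2-minute view/click rule once over the full data and once per batch, where each batch only sees its own slice.

```python
from collections import defaultdict

WINDOW_SECONDS = 120  # the 2-minute rule from the ad-quality case study

# (ad_id, seconds since midnight); hypothetical sample events.
views = [("ad-1", 10 * 3600 + 1800), ("ad-2", 11 * 3600 - 30)]   # 10:30:00, 10:59:30
clicks = [("ad-1", 10 * 3600 + 1860), ("ad-2", 11 * 3600 + 30)]  # 10:31:00, 11:00:30


def matched_ads(views, clicks):
    """Ads whose click arrived within WINDOW_SECONDS after the view."""
    click_times = defaultdict(list)
    for ad, t in clicks:
        click_times[ad].append(t)
    return {ad for ad, v in views
            if any(0 <= c - v <= WINDOW_SECONDS for c in click_times[ad])}


def matched_ads_per_batch(views, clicks, batch_seconds):
    """Same join, but each batch only sees events from its own time slice."""
    def in_batch(events, batch):
        return [(ad, t) for ad, t in events if t // batch_seconds == batch]

    found = set()
    for batch in {t // batch_seconds for _, t in views + clicks}:
        found |= matched_ads(in_batch(views, batch), in_batch(clicks, batch))
    return found


print(sorted(matched_ads(views, clicks)))                  # join over the full data set
print(sorted(matched_ads_per_batch(views, clicks, 3600)))  # hourly batches
print(sorted(matched_ads_per_batch(views, clicks, 60)))    # one-minute batches
```

With hourly batches the pair that straddles 11:00 is lost; with one-minute batches both pairs straddle a boundary, which is the "smaller batch size == more edges" effect.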
16.
Fixing Lambda
Real Time
Processing
(Samza)
Batch
Processing
(Hadoop/Spark)
Voldemort
R/O
Processing
Bulk
upload
Espresso
Services Tier
Ingestion Serving
Clients(browser,devices, ….)
Kafka
Kafka
Audit
Check
Safe Start
Time
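One way to read the "Audit Check" and "Safe Start Time" boxes is sketched below. The slides do not describe how the check works, so this is an assumption: it supposes the audit system reports, per time bucket, how many events were produced and how many were delivered, and that a batch job may only cover the leading run of buckets where the two counts agree. The function name and the data layout are hypothetical.

```python
def safe_start_time(buckets):
    """Return the end of the last bucket in the leading run of complete buckets.

    `buckets` is a list of (bucket_end, produced, delivered), sorted by bucket_end.
    Returns None if even the first bucket is still incomplete.
    """
    safe = None
    for bucket_end, produced, delivered in buckets:
        if delivered < produced:
            break  # events for this bucket are still in flight
        safe = bucket_end
    return safe


# Hypothetical audit counts: the 12:00 bucket is still missing 15 events.
audit = [("10:00", 500, 500), ("11:00", 480, 480), ("12:00", 510, 495), ("13:00", 20, 20)]
print(safe_start_time(audit))
```

Note that the complete 13:00 bucket does not help: the batch job can only safely cover data up to the first incomplete bucket.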
17.
Observation
• Data Accuracy is still very hard with Lambda
– An additional system (e.g. Kafka Audit) has to be
used to decide when the batch jobs can safely start
• Duplication across online and offline systems:
–Development cost
–Operational overhead
–Maintenance overhead
18.
Going Lambda-less
• Handle late arrivals and out-of-order arrivals
• Eventually correct results
– Compute results at the end of the ‘window’.
– Re-compute when events arrive late
• Influenced by “Google MillWheel”
19.
Going Lambda-less
AdViewEvent
AdClickEvent
AdQuality processor
1:00pm 1:01pm 1:01pm 1:02pm 1:02pm
1:00pm 1:02pm
Window output is computed at the end of the window (2 min after the window is created)
window(“1:00pm”, “2min”)
Kafka
Kafka
21.
Handling ‘out of order arrival’
1:01pm 1:02pm
1:00pm 1:02pm
Null join result in window(“1:00pm”, “2min”)
Kafka
Kafka
AdViewEvent
AdClickEvent
AdQuality processor
22.
Handling ‘out of order arrival’
1:01pm 1:02pm 1:00pm 1:01pm
1:00pm 1:02pm
Re-compute
window(“1:00pm”, “2min”)
Out-of-order arrival
Kafka
Kafka
AdViewEvent
AdClickEvent
AdQuality processor
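The "compute at window end, re-compute on late arrival" behaviour can be sketched as a toy in-memory join. This is not Samza's API: the class `AdQualityJoin`, its method names, and keying the state by ad id are assumptions made to keep the example short.

```python
WINDOW_SECONDS = 120  # 2-minute window from the slides


class AdQualityJoin:
    """Toy join: emits a verdict at window end and re-emits when a late event arrives."""

    def __init__(self):
        self.views = {}    # ad_id -> event time of the AdViewEvent, in seconds
        self.clicks = {}   # ad_id -> event time of the AdClickEvent, in seconds
        self.emitted = {}  # ad_id -> last verdict emitted downstream

    def _verdict(self, ad_id):
        view = self.views.get(ad_id)
        click = self.clicks.get(ad_id)
        if view is None:
            return None  # null join result: the view has not arrived yet
        if click is not None and 0 <= click - view <= WINDOW_SECONDS:
            return "good"
        return "bad"

    def on_window_end(self, ad_id):
        """Called by a timer when the window closes; emits the first verdict."""
        self.emitted[ad_id] = self._verdict(ad_id)
        return self.emitted[ad_id]

    def on_event(self, kind, ad_id, event_time):
        """Store the event; if its window already closed, re-compute and re-emit."""
        store = self.views if kind == "view" else self.clicks
        store[ad_id] = event_time
        if ad_id in self.emitted:
            self.emitted[ad_id] = self._verdict(ad_id)
            return self.emitted[ad_id]
        return None  # window still open: nothing emitted yet


join = AdQualityJoin()
join.on_event("click", "ad-7", 13 * 3600 + 60)     # click at 1:01pm arrives first
first = join.on_window_end("ad-7")                 # window closes, view still missing
second = join.on_event("view", "ad-7", 13 * 3600)  # view at 1:00pm arrives late
print(first, second)
```

The first output is the null join result; the late view triggers a re-computation, so downstream consumers must accept that a result can be revised.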
23.
SamzaContainer-1
Samza based Solution
Kafka
AdClicks
SamzaContainer-0
Task1
Task2
Task3
AdView
Events are saved in a RocksDB-based local message
store, which is durably backed up in Kafka
Kafka
Samza Job
Changelog
in Kafka
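The idea of a local store backed by a changelog can be illustrated without RocksDB or Kafka. In this sketch a dict stands in for the local store and a list stands in for the changelog topic; the class and method names are hypothetical and do not correspond to Samza's interfaces.

```python
class ChangelogBackedStore:
    """Toy key-value store whose writes are mirrored to an append-only changelog.

    The dict stands in for the local RocksDB store and the list stands in for
    the Kafka changelog topic; both are simplifications for illustration.
    """

    def __init__(self, changelog=None):
        self.changelog = changelog if changelog is not None else []
        self.data = {}
        for key, value in self.changelog:  # restore state by replaying the log
            self._apply(key, value)

    def _apply(self, key, value):
        if value is None:
            self.data.pop(key, None)  # None acts as a delete marker
        else:
            self.data[key] = value

    def put(self, key, value):
        self._apply(key, value)
        self.changelog.append((key, value))

    def delete(self, key):
        self.put(key, None)

    def get(self, key):
        return self.data.get(key)


store = ChangelogBackedStore()
store.put("view:ad-1", 1300)
store.put("view:ad-2", 1301)
store.delete("view:ad-1")

# Simulate losing the machine: rebuild a fresh store from the changelog alone.
restored = ChangelogBackedStore(changelog=list(store.changelog))
print(restored.data == store.data)
```

Reads and writes stay local, and the changelog is only read back when a task has to be restored elsewhere.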
24.
SamzaContainer-1
Performance
Kafka
AdClicks
SamzaContainer-0
Task1
Task2
Task3
AdView
Performance of Samza’s local RocksDB store:
- 1.1 million TPS (read/write) on a single machine (SSD)
- Largest production job has 1.5 terabytes of local state
Kafka
Samza Job
Changelog
in Kafka
25.
Agenda
• Rise of Stream Processing Applications
• Some Hard Problems in Stream Processing
– Data Accuracy
–Reprocessing
• Conclusion
26.
Reprocessing
• What is reprocessing?
– Processing events that happened in the past.
27.
Case Study : Title Standardization
LinkedIn
Profile
Change ‘Title’:
Before: Architect
After: Chief Technology Nerd
Title
Standardizer
Search Ads ….
28.
Title Standardizer -
Implementation
output
Member
Database
(espresso)
Profile
Updates
(Samza) Title-
Standardizer
Machine Learning
model
Kafka
Databus
29.
Reprocessing - dealing with bugs
output
Member
Database
(espresso)
Profile
Updates
(Samza) Title-
Standardizer
Kafka
Databus
rewind 4 hours
Machine Learning
model
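"Rewind 4 hours" amounts to finding the first message at or after a target timestamp and restarting from there. The sketch below models a partition as a list of message timestamps indexed by offset, which is an assumption for illustration; a real job would ask the messaging system for this offset rather than scan a list, and the function name is hypothetical.

```python
from bisect import bisect_left


def rewind_offset(timestamps, now, rewind_seconds):
    """Offset of the first message whose timestamp is >= now - rewind_seconds.

    `timestamps[i]` is the timestamp of the message at offset i and is assumed
    to be non-decreasing. Returns len(timestamps) if no message is that recent.
    """
    return bisect_left(timestamps, now - rewind_seconds)


# Hypothetical log: one message every 30 minutes, offsets 0..19.
log = [i * 1800 for i in range(20)]
now = log[-1]

print(rewind_offset(log, now, 4 * 3600))    # restart point for a 4-hour rewind
print(rewind_offset(log, now, 100 * 3600))  # rewinding past the start of the log
```

Rewinding further than the log retains simply lands on the earliest available offset, which is why reprocessing the entire dataset needs a separate bootstrap source.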
30.
Reprocessing - entire Dataset
output
Member
Database
(espresso)
Profile
Updates
(Samza) Title-
Standardizer
Kafka
Databus
Bootstrap
Backup
Database
Backup
(NFS)
set offset=0
Machine Learning
model (NEW)
32.
Reprocessing - entire Dataset
Profile
Updates
(Samza) Title-
Standardizer
(Samza) Title-
Standardizer
Bootstrap Backup
Machine Learning
model (NEW)
output
Kafka
Databus
Databus
(Samza)
Merge and
Store
Results
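The slides do not say how the "Merge and Store Results" step resolves conflicts between the live path and the bootstrap path. One possible rule, sketched here purely as an assumption, is to tag every result with the version of the profile it was derived from and to let the newer version win. The function name, the version field, and the placeholder values are hypothetical.

```python
def merge_result(store, member_id, title, source_version):
    """Store the result unless a result from a newer profile version is already there.

    Equal versions overwrite, so a replayed record processed with the new model
    replaces the result the old model produced for the same profile version.
    Returns True if the store was updated.
    """
    current = store.get(member_id)
    if current is None or source_version >= current[1]:
        store[member_id] = (title, source_version)
        return True
    return False


# Existing output for member m2, produced by the old model from profile version 2.
results = {"m2": ("old-model-title", 2)}

merge_result(results, "m1", "live-title", 5)                  # live update arrives first
stale = merge_result(results, "m1", "replayed-title", 3)      # replay of an older snapshot
refreshed = merge_result(results, "m2", "new-model-title", 2) # replay with the new model
print(stale, refreshed, results)
```

Without some ordering rule of this kind, a slow replay of old snapshots could overwrite fresher results written by the live job.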
33.
Reprocessing - Caveats
• Stream processors are fast. They can DoS the
system if you reprocess
– Control the max concurrency of your job
– Quotas for Kafka, databases
• Reprocessing a 100 TB source?
– Capacity?
– Saturation of NICs, top-of-rack switches
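Capping concurrency can be sketched with a bounded worker pool. This is a generic illustration using Python's standard library, not a Samza setting: `reprocess`, `call_downstream`, and the sleep that stands in for a database or service call are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Shared counters used only to observe how many calls overlap.
stats = {"in_flight": 0, "peak": 0}
stats_lock = threading.Lock()


def call_downstream(record):
    """Stand-in for a call to a database or service during reprocessing."""
    with stats_lock:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
    time.sleep(0.01)  # pretend the downstream call takes some time
    with stats_lock:
        stats["in_flight"] -= 1
    return record.upper()


def reprocess(records, handler, max_concurrency):
    """Run handler over records with at most max_concurrency calls in flight."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(handler, records))  # results keep the input order


output = reprocess(["a", "b", "c", "d", "e", "f"], call_downstream, max_concurrency=2)
print(output, stats["peak"])
```

The pool size bounds the load on the downstream system regardless of how fast the replayed input can be read; quotas on Kafka and the databases provide the same protection from the serving side.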
34.
Reprocessing larger datasets
Profile
Updates
(Samza) Title-
Standardizer
Machine Learning
model
output
Kafka
Databus
(Samza)
Merge and
Store
Results
Database
Dump in
HDFS
(Samza) Title-
Standardizer
ML Model in
HDFS
Hadoop
35.
Experimentation
Database
Dump in HDFS
(Samza) Title-
Standardizer
Hadoop
ML Model in
HDFS
Output in HDFS
● Offline experimentation before pushing the logic online
○ Most datasets are already available in Hadoop (at LinkedIn)
○ Fast iteration with minimal impact on production
36.
Conclusion
1. It is possible to avoid code
duplication (hot/cold path) to support
– Accuracy
– Reprocessing
2. Some Lambda-related problems still linger
when reprocessing entire datasets
– e.g. merging online/reprocessing results