Running large scale Kafka upgrades at Yelp (Manpreet Singh, Yelp), Kafka Summit... | confluent
Over the years at Yelp, we have relied on Kafka to build many complex applications and stream-processing data pipelines that solve a multitude of use cases, including powering our product experimentation workflow, search indexing, asynchronous task processing and more. Today, Kafka is at the core of our infrastructure. These applications use different versions of Kafka clients and different programming languages. To fulfill the requirements of these diverse use cases, we run several specialized Kafka clusters for high availability, consistency, exactly-once delivery and infinite retention. We endeavor to keep our clusters up to date with newer Kafka versions, which bring several critical bug fixes and exciting features like dynamic broker configuration, exactly-once semantics, Kafka offset management and improved tooling. Our journey with Kafka started with version 0.8.2.0. Upgrading Kafka while ensuring client compatibility, zero downtime and negligible performance degradation across our ever-growing multi-regional cluster deployment exposed us to a plethora of unique challenges. This session will focus on the challenges we encountered and how we evolved our infrastructure tooling and upgrade strategy to overcome them. I will be talking about:
- How we rolled out new features such as Kafka offset storage, message timestamps, reassignment auto-throttling, etc.
- Core technical issues discovered during upgrades, such as log cleaner failures caused by large offsets.
- The in-house test suite that we built to validate new Kafka versions against our existing tooling and client libraries, exercise the upgrade and rollback process, and benchmark performance.
- The automation we built for safe and fast rolling upgrades and broker configuration deployment.
Upgrades suck. We get it. They are risky and time-consuming, and you have better things to do. In this talk we'll present good reasons to upgrade anyway and give suggestions on how to de-risk your upgrades, straight from the team that upgrades Kafka almost every week. We'll review all the releases in the past year (major, minor and bug-fix), explain the differences between them and what you can expect from each. We'll go into the most important features and the most critical fixes and improvements, so you'll have ample ammunition when you explain to your boss why you really have to upgrade Kafka. Then we'll discuss how we validate new releases and suggest a safe upgrade process, because we know that uneventful upgrades are the key to the next upgrade.
Design and Implementation of Incremental Cooperative Rebalancing | confluent
Watch this talk here: https://www.confluent.io/online-talks/design-and-implementation-of-incremental-cooperative-rebalancing-on-demand
Since its initial release, the Kafka group membership protocol has offered Connect, Streams and Consumer applications an ingenious and robust way to balance resources among distributed processes. The process of rebalancing, as it’s widely known, allows Kafka APIs to define an embedded protocol for load balancing within the group membership protocol itself.
Until now, rebalancing has been working under the simple assumption that every time a new group generation is created, the members join after first releasing all of their resources, getting a whole new load assignment by the time the new group is formed. This allows Kafka APIs to provide task fault-tolerance and elasticity on top of the group membership protocol.
However, due to its side effects on multi-tenancy and scalability, this simple approach to rebalancing, also known as the stop-the-world effect, limits larger-scale deployments. Because of stop-the-world, application tasks get interrupted, only for most of them to receive the same resources back after rebalancing. In this technical deep dive, we'll discuss Incremental Cooperative Rebalancing as a way to alleviate stop-the-world and optimize rebalancing in the Kafka APIs.
This talk will cover:
- The internals of Incremental Cooperative Rebalancing
- Use cases that benefit from Incremental Cooperative Rebalancing
- Implementation in Kafka Connect
- Performance results in Kafka Connect clusters
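The talk itself covers Kafka Connect's implementation, but the same protocol idea later shipped for plain consumers as the CooperativeStickyAssignor (KIP-429). A minimal sketch of opting in, assuming a local broker and hypothetical topic/group names:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CooperativeConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: local broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        // Opt in to incremental cooperative rebalancing: members keep their
        // partitions across a rebalance and only revoke the ones that move.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                  CooperativeStickyAssignor.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic")); // hypothetical topic
            // poll loop elided
        }
    }
}
```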
Production Ready Kafka on Kubernetes (Devendra Tagare, Lyft), Kafka Summit SF... | confluent
Getting Kafka running on Kubernetes is only step one of a journey to create a production-ready Kafka cluster. This talk walks through the other steps: 1) Monitoring and remediating faults. 2) Updates to Kubernetes nodes for clusters not using shared storage. 3) Automating Kafka updates and restarts. We present how to create fault-tolerant Kafka clusters on Kubernetes without sacrificing availability, durability, or latency. Learn about Lyft's overlay-free Kubernetes networking driver and how we use it to keep performance on par with non-Kubernetes clusters.
Can Kafka Handle a Lyft Ride? (Andrey Falko & Can Cecen, Lyft), Kafka Summit 2020 | Hosted by Confluent
What does a Kafka administrator need to do if they have a user who demands that message delivery be guaranteed, fast, and low cost? In this talk we walk through the architecture we created to deliver for such users. Learn about the alternatives we considered and the pros and cons of the design we settled on.
In this talk, we’ll be forced to dive into broker restart and failure scenarios and the things we need to do to prevent leader elections from slowing down incoming requests. We’ll need to take care of the consumers as well to ensure that they don’t process the same request twice. We also plan to describe our architecture with a demo of simulated requests being produced into Kafka clusters and consumers processing them, while we aggressively cause failures on the Kafka clusters.
We hope the audience walks away with a deeper understanding of what it takes to build robust Kafka clients and how to tune them to accomplish stringent delivery guarantees.
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015 | Monal Daxini
Keystone: processing over half a trillion events per day, with peaks of 8 million events and 17 GB per second, and at-least-once processing semantics. We will explore in detail how we employ Kafka, Samza, and Docker at scale to implement a multi-tenant pipeline. We will also look at the evolution to its current state and where the pipeline is headed next in offering a self-service stream processing infrastructure atop the Kafka-based pipeline, with support for Spark Streaming.
What's the time? ...and why? (Matthias Sax, Confluent), Kafka Summit SF 2019 | confluent
Data stream processing is built on the core concept of time. However, understanding time semantics and reasoning about time is not simple, especially if deterministic processing is expected. In this talk, we explain the difference between processing, ingestion, and event time and what their impact is on data stream processing. Furthermore, we explain how Kafka clusters and stream processing applications must be configured to achieve specific time semantics. Finally, we deep dive into the time semantics of the Kafka Streams DSL and KSQL operators, and explain in detail how the runtime handles time. Apache Kafka offers many ways to handle time on the storage layer, i.e., the brokers, allowing users to build applications with different semantics. Time semantics in the processing layer, i.e., Kafka Streams and KSQL, are even richer and more powerful, but also more complicated. Hence, it is paramount for developers to understand the different time semantics and to know how to configure Kafka to achieve them. This talk therefore enables developers to design applications with their desired time semantics, helps them reason about runtime behavior with regard to time, and allows them to understand processing/query results.
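As a concrete illustration of the processing-layer side (not taken from the talk itself), here is a minimal sketch of a custom Kafka Streams TimestampExtractor that reads event time from the record payload and falls back to the record's own timestamp; the MyEvent payload type is invented for illustration:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class PayloadTimestampExtractor implements TimestampExtractor {
    // Hypothetical payload type carrying its own event time.
    public record MyEvent(long eventTimeMs, String body) {}

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        // Event time: taken from the payload itself, if present.
        if (record.value() instanceof MyEvent event) {
            return event.eventTimeMs();
        }
        // Fallback: the record's own timestamp, which is CreateTime (producer clock)
        // or LogAppendTime (broker clock), depending on message.timestamp.type.
        return record.timestamp();
    }
}
```

The extractor would be wired in via the Streams config default.timestamp.extractor; the storage-layer counterpart the abstract mentions is the broker/topic setting message.timestamp.type.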
Kafka Streams: Revisiting the decisions of the past (How I could have made it better) | Jason Bell, Kafka DevOps Engineer @ Digitalis.io | confluent
https://www.meetup.com/Cleveland-Kafka/events/272339276/
Discover Kafka on OpenShift: Processing Real-Time Financial Events at Scale | confluent
"To provide exceptional customer experiences at scale, the data pipelines that can move data reliably across the systems and applications in real-time should be seamlessly scalable. For the past several years, we relied on Message Queue based data pipelines to facilitate the transfer of data across the applications. However, as the number of use cases that require real-time data transfer increased rapidly, it became difficult to scale the messaging platform. Moving to Kafka helped us to resolve the data pipeline scaling issues and reduce the Publisher/Subscriber on-boarding time from several weeks to a few days. To support the on-demand scaling of Kafka clusters, we run them on RedHat OpenShift, an Enterprise Kubernetes. While managing Kafka that handles critical financial events, we have learned some lessons and developed efficient strategies to manage production-grade Kafka clusters on OpenShift. In this talk, we will present:
1. Some of the challenges that we faced with Kafka on OpenShift and how we evolved our infrastructure to overcome them.
2. Share our experiences from operating Kafka clusters at Scale in Production.
3. Our strategy for performing automated Kafka deployment and rollback in OpenShift.
4. Explain our fail-over strategy using Confluent’s Replicator to ensure service availability during cluster failures."
Consistency and Completeness: Rethinking Distributed Stream Processing in Apache Kafka | Guozhang Wang
We present Apache Kafka’s core design for stream processing, which relies on its persistent log architecture as the storage and inter-processor communication layers to achieve correctness guarantees. Kafka Streams, a scalable stream processing client library in Apache Kafka, defines the processing logic as read-process-write cycles in which all processing state updates and result outputs are captured as log appends. Idempotent and transactional write protocols are utilized to guarantee exactly-once semantics. Furthermore, revision-based speculative processing is employed to emit results as soon as possible while handling out-of-order data. We also demonstrate how Kafka Streams behaves in practice with large-scale deployments and performance insights exhibiting its flexible and low-overhead trade-offs.
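For readers who want to see what those write protocols look like at the API level, here is a minimal sketch of Kafka's transactional producer, the same primitive Kafka Streams builds on for exactly-once; the broker address, topics, and transactional id are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.errors.AuthorizationException;
import org.apache.kafka.common.errors.OutOfOrderSequenceException;
import org.apache.kafka.common.errors.ProducerFencedException;

public class TransactionalWriter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-txn-1");      // hypothetical id
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions(); // fences any older producer with the same transactional.id
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("output-a", "k", "result-1")); // hypothetical topics
            producer.send(new ProducerRecord<>("output-b", "k", "result-2"));
            producer.commitTransaction(); // both records become visible atomically
        } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
            // Fatal errors: this producer instance must be abandoned.
        } catch (KafkaException e) {
            producer.abortTransaction(); // read_committed consumers never see the writes
        }
        producer.close();
    }
}
```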
Recently, interest in highly scalable stream processing engines has risen, and many projects have appeared. Apache Samza is a distributed stream-processing framework that uses Apache Kafka for messaging and Apache Hadoop YARN for fault tolerance and resource management. It is one of the most popular stream processing engines out there, used by many high-profile companies. On the other hand, Amazon Kinesis is a fully managed service for real-time processing of streaming data which allows users to scale the amount of data ingested without worrying about infrastructure details. This presentation gives a brief introduction to the very popular Samza-Kafka integration, then focuses on the new Samza-Kinesis integration and the new opportunities it opens up for users.
Better Kafka Performance Without Changing Any Code | Simon Ritter, Azul | Hosted by Confluent
Apache Kafka is the most popular open-source stream-processing software for collecting, processing, storing, and analyzing data at scale. Best known for its excellent performance, low latency, fault tolerance, and high throughput, it's capable of handling thousands of messages per second. For mission-critical applications, how do you ensure that the performance delivered is the performance required? This is especially important as Kafka is written in Java and Scala and runs on the JVM. The JVM is a fantastic platform that delivers at internet scale. In this session, we'll explore how making changes to the JVM design can eliminate the problems of garbage collection pauses and raise the throughput of applications. For cloud-based Kafka applications, this can deliver both lower latency and reduced infrastructure costs. All without changing a line of code!
Apache Kafka, Apache Cassandra and Kubernetes are open source big data technologies enabling applications and business operations to scale massively and rapidly. While Kafka and Cassandra underpin the data layer of the stack, providing the capability to stream, disseminate, store and retrieve data at very low latency, Kubernetes is a container orchestration technology that helps automate application deployment and scale application clusters. In this presentation, we will reveal how we architected a massive-scale deployment of a streaming data pipeline with Kafka and Cassandra to cater to an example anomaly detection application running on a Kubernetes cluster, generating and processing massive amounts of events. Anomaly detection is a method used to detect unusual events in an event stream. It is widely used in a range of applications such as financial fraud detection, security, threat detection, website user analytics, sensors, IoT, and system health monitoring. When such applications operate at massive scale, generating millions or billions of events, they impose significant computational, performance and scalability challenges on anomaly detection algorithms and data layer technologies. We will demonstrate the scalability, performance and cost effectiveness of Apache Kafka, Cassandra and Kubernetes, with results from our experiments allowing the anomaly detection application to scale to 19 billion anomaly checks per day.
Administrative techniques to reduce Kafka costs | Anna Kepler, Viasat | Hosted by Confluent
When your Kafka clusters start growing, so does the cost associated with them. As administrators, we have to ensure that the service we support operates in the most reliable way to satisfy our customers. However, for our business it is just as important that the same service is also cost-efficient. There are two ways to optimize the cost of the service: tuning the broker machines and tuning the data transfers. Minimizing data transfer is the largest return on investment, since that is what accounts for the most spend. With the use of Kafka administrative tools and metrics, we can find multiple ways to reduce the data transfers in the clusters.
The presentation will cover various techniques that administrators of a Kafka service can employ to reduce data transfers and save operational costs: reducing cross-AZ traffic, optimizing batching with the help of the DumpLogSegments tool, utilizing Kafka metrics to shut down unused data streams, and more.
With the objective of making our Kafka deployment as cost-effective as possible, we have gathered money-saving tricks, and we would love to share them with the community.
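One concrete technique in this family (follower fetching from KIP-392, which may or may not be among the talk's exact tricks) lets a consumer read from a replica in its own availability zone instead of a cross-AZ leader. A hedged sketch, with the matching broker settings shown as comments and all names invented:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class RackAwareConsumerConfig {
    public static Properties build() {
        // Broker side (server.properties), shown here as comments:
        //   broker.rack=us-east-1a
        //   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // assumption
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cost-aware-group");     // hypothetical group
        // Tell the broker which zone this consumer lives in so it can route
        // fetches to an in-zone follower instead of a cross-AZ leader.
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "us-east-1a");
        return props;
    }
}
```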
Introducing Exactly Once Semantics To Apache Kafka | Apurva Mehta
Here are slides from my talk on introducing exactly once semantics to Apache Kafka. The talk was given at the Kafka Summit NYC, 8 May 2017.
The slides dive into the design of transactions in Apache Kafka.
On September 21, we had the pleasure of hosting in our offices a meetup given by our colleague Paco Guerrero on the Apache Flink platform.
"Apache Flink is an open source real-time processing platform that is on the rise, offering features that competing technologies lack, without any impact on performance. In this session we will introduce the philosophy and processing engine that make Flink so special and powerful. We will also walk through the basic pillars that establish Flink as the most promising streaming platform today."
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluent) | confluent
Kafka is notoriously tricky for multi-DC use cases. The log abstraction and client failover break down when you cannot at least guarantee offset consistency. In this talk, we define the current state of Kafka in terms of multi-DC usage, examine how different approaches provide different guarantees as well as the remaining gaps, and discuss how the community is addressing them.
Streaming in Practice - Putting Apache Kafka in Production | confluent
This presentation focuses on how to integrate Kafka and its surrounding components into an enterprise environment and what you need to consider as you move into production.
We will touch on the following topics:
- Patterns for integrating with existing data systems and applications
- Metadata management at enterprise scale
- Tradeoffs in performance, cost, availability and fault tolerance
- Choosing which cross-datacenter replication patterns fit with your application
- Considerations for operating Kafka-based data pipelines in production
Essential ingredients for real time stream processing @Scale by Kartik Paramasivam | Big Data Spain
At LinkedIn, we ingest more than 1 trillion events per day pertaining to user behavior, application and system health, and more into our pub-sub system (Kafka). Another source of events is the updates happening on our SQL and NoSQL databases. For example, every time a user changes their LinkedIn profile, a ton of downstream applications need to know what happened and react to it. We have a system (Databus) which listens to changes in the database transaction logs and makes them available for downstream processing. We process ~2.1 trillion such database change events per week.
We use Apache Samza for processing these event streams in real time. In this presentation we will discuss some of the challenges we faced and the various techniques we used to overcome them.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.bigdataspain.org/program/thu/slot-3.html
The talk describes how Yelp deploys Zipkin and integrates it with its 250+ services. It also covers the challenges we faced while scaling it up and how we tuned it.
How to Improve the Observability of Apache Cassandra and Kafka applications...Paul Brebner
As distributed cloud applications grow more complex, dynamic, and massively scalable, “observability” becomes more critical.
Observability is the practice of using metrics, monitoring and distributed tracing to understand how a system works.
We’ll explore two complementary open source technologies:
- Prometheus for monitoring application metrics, and
- OpenTracing and Jaeger for distributed tracing.
We’ll discover how they improve the observability of an Anomaly Detection application, deployed on AWS Kubernetes and using Instaclustr-managed Apache Cassandra and Kafka clusters.
Low latency scalable web crawling on Apache Storm | Julien Nioche
In this talk I will introduce Storm-Crawler https://github.com/DigitalPebble/storm-crawler, a collection of resources for building low-latency, large scale web crawlers on Apache Storm. We will compare with similar projects like Apache Nutch and present several use cases where the storm-crawler is being used. In particular we will see how the Storm-crawler can be used with ElasticSearch and Kibana for crawling and indexing web pages.
A walk through the current state of stream processing, the key differentiators that make Samza stand out in the crowd, what's new in Samza, and what's coming next.
Foundations for Scaling ML in Apache Spark | Databricks
Apache Spark has become the most active open source Big Data project, and its machine learning library MLlib has seen rapid growth in usage. A critical aspect of MLlib and Spark is the ability to scale: the same code used on a laptop can scale to hundreds or thousands of machines. This talk will describe ongoing and future efforts to make MLlib even faster and more scalable by integrating with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management, cache-awareness, and code generation. This talk will discuss the goals, the challenges, and the benefits for MLlib users and developers. More generally, we will reflect on the importance of integrating ML with the many other aspects of big data analysis.
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk | Spark Summit
Apache Spark MLlib provides scalable implementations of popular machine learning algorithms, which let users train models on big datasets and iterate fast. The existing implementations assume that the number of parameters is small enough to fit in the memory of a single machine. However, many applications require solving problems with billions of parameters on a huge amount of data, such as ads CTR prediction and deep neural networks. This requirement far exceeds the capacity of existing MLlib algorithms, many of which use L-BFGS as the underlying solver. In order to fill this gap, we developed Vector-free L-BFGS for MLlib. It can solve optimization problems with billions of parameters in the Spark SQL framework, where the training data are often generated. The algorithm scales very well and enables a variety of MLlib algorithms to handle a massive number of parameters over large datasets. In this talk, we will illustrate the power of Vector-free L-BFGS via logistic regression with a real-world dataset and requirements. We will also discuss how this approach could be applied to other ML algorithms.
Real Time Data Processing using Spark Streaming | Data Day Texas 2015 | Cloudera, Inc.
Speaker: Hari Shreedharan
Data Day Texas 2015
Apache Spark has emerged over the past year as the likely successor to Hadoop MapReduce. Spark can process data in memory at very high speed, while still being able to spill to disk if required. Spark’s powerful yet flexible API allows users to write complex applications very easily, without worrying about the internal workings or how the data gets processed on the cluster.
Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, and Amazon Kinesis, allowing users to process data as it comes in.
In this talk, Hari will discuss the basics of Spark Streaming, its API, and its integration with Flume, Kafka and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application's execution will be presented, which can help in understanding good practices for writing such applications. Hari will finally discuss how to write a custom application and a custom receiver to receive data from other systems.
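As a minimal illustration of the Kafka integration discussed above, here is a sketch using the newer spark-streaming-kafka-0-10 direct-stream API (a later artifact than the 2015 talk); the broker, topic, and group names are assumptions:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class ClickCounter {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("ClickCounter").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put("group.id", "click-counter"); // hypothetical group

        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(List.of("clicks"), kafkaParams));

        // Count records per micro-batch as a stand-in for real processing.
        stream.count().print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```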
Overview of Apache Flink: Next-Gen Big Data Analytics Framework | Slim Baltagi
These are the slides of my talk on June 30, 2015 at the first event of the Chicago Apache Flink meetup. Although most of the current buzz is about Apache Spark, the talk shows how Apache Flink offers the only hybrid open source (Real-Time Streaming + Batch) distributed data processing engine supporting many use cases: Real-Time stream processing, machine learning at scale, graph analytics and batch processing.
In these slides, you will find answers to the following questions: What is the Apache Flink stack and how does it fit into the Big Data ecosystem? How does Apache Flink integrate with Apache Hadoop and other open source tools for data input and output, as well as deployment? What is the architecture of Apache Flink? What are the different execution modes of Apache Flink? Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? Who is using Apache Flink? Where can you learn more about Apache Flink?
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your recovery story?" | Flink Forward
Failures are inevitable. How can we recover a Flink job from an outage? How do we reprocess data from the outage period? What are the implications for downstream consumers? These are important questions that we need to answer when running Flink for critical data processing applications. We implemented two solutions for our stream processing platform: (1) use a data warehouse, like Hive, as a backfill source; (2) rewind the Flink job using an external checkpoint. We will describe both solutions in detail and discuss the pros and cons of each approach. We will also take a look at some of the caveats to watch out for.
Samza at LinkedIn: Taking Stream Processing to the Next Level | Martin Kleppmann
Slides from my talk at Berlin Buzzwords, 27 May 2014. Unfortunately Slideshare has screwed up the fonts. See https://speakerdeck.com/ept/samza-at-linkedin-taking-stream-processing-to-the-next-level for a version of the deck with correct fonts.
Stream processing is an essential part of real-time data systems, such as news feeds, live search indexes, real-time analytics, metrics and monitoring. But writing stream processes is still hard, especially when you're dealing with so much data that you have to distribute it across multiple machines. How can you keep the system running smoothly, even when machines fail and bugs occur?
Apache Samza is a new framework for writing scalable stream processing jobs. Like Hadoop and MapReduce for batch processing, it takes care of the hard parts of running your message-processing code on a distributed infrastructure, so that you can concentrate on writing your application using simple APIs. It is in production use at LinkedIn.
This talk will introduce Samza, and show how to use it to solve a range of different problems. Samza has some unique features that make it especially interesting for large deployments, and in this talk we will dig into how they work under the hood. In particular:
• Samza is built to support many different jobs written by different teams. Isolation between jobs ensures that a single badly behaved job doesn't affect other jobs. It is robust by design.
• Samza can handle jobs that require large amounts of state, for example joining multiple streams, augmenting a stream with data from a database, or aggregating data over long time windows. This makes it a very powerful tool for applications.
The slides for the Stream Processing Meetup (7/19/2018): https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/251481797/
This presentation introduces the newly developed Samza runner for Apache Beam. You will see the capabilities of the Samza runner and how it supports key Beam features. You will also see a few use cases and our future roadmap.
Netflix Keystone Pipeline at Samza Meetup 10-13-2015 | Monal Daxini
The Netflix Keystone Pipeline, processing 600 billion events a day: a detailed treatise on the modification and use of Samza for real-time routing of events, including Docker.
Stream processing in Python with Apache Samza and Beam | Hai Lu
Apache Samza is the streaming engine used at LinkedIn, processing around 2 trillion messages daily. A while back we announced Samza's integration with Apache Beam, a great success that led to our Samza Beam API. Now comes an upgrade of our APIs: we now support stream processing in Python! This work has made stream processing more accessible and enabled many interesting use cases, particularly in the area of machine learning. The Python API is based on our work on the Samza runner for Apache Beam. In this talk, we will quickly review our work on the Samza runner and then show how we extended it to support portability in Beam (Python specifically). In addition to technical and architectural details, we will also talk about how we bridged the Python and Java ecosystems at LinkedIn with the Python API, together with different use cases.
Exactly-Once Financial Data Processing at Scale with Flink and Pinot | Flink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end-to-end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power of Flink, Kafka, and Pinot. The pipeline provides an exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillions of rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by Xiang Zhang, Pratyush Sharma & Xiaoman Dong
Overcoming Variable Payloads to Optimize for Performance | ScyllaDB
When you have a significant number of events coming in from individual customers but do not want to spend the majority of your time on latency issues, how do you optimize for performance? This becomes increasingly difficult when you are dealing with payload sizes that span multiple orders of magnitude, complex data that impacts processing, and a stream of data that is impossible to predict. In this session, you’ll hear from Armin Ronacher, Principal Architect at Sentry and creator of the Flask web framework for Python, on how to build ingestion and processing pipelines that accommodate complex events, helping to ensure your teams reach a throughput of hundreds of thousands of events per second.
The need for gleaning answers from unbounded data streams is moving from a nicety to a necessity. Netflix is a data-driven company and needs to process over 1 trillion events a day, amounting to 3 PB of data, to derive business insights.
To ease extracting insight, we are building a self-serve, scalable, fault-tolerant, multi-tenant "Stream Processing as a Service" platform so users can focus on data analysis. I'll share our experience using Flink to help build the platform.
http://www.oreilly.com/pub/e/3764
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. Monal Daxini details how they used Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. He'll also share plans for offering Stream Processing as a Service for all of Netflix's use cases.
QCON 2015: Gearpump, Realtime Streaming on Akka | Sean Zhong
Gearpump is an Akka-based real-time streaming engine that uses Actors to model everything. It offers excellent performance and flexibility: 18,000,000 messages/second with a latency of 8 ms on a cluster of 4 machines.
Similar to Essential Ingredients of Realtime Stream Processing @ Scale (20)
Software Engineering, Software Consulting, Tech Lead, Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Transaction, Spring MVC, OpenShift Cloud Platform, Kafka, REST, SOAP, LLD & HLD.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
Utilocate offers a comprehensive solution for locate ticket management by automating and streamlining the entire process. By integrating with Geospatial Information Systems (GIS), it provides accurate mapping and visualization of utility locations, enhancing decision-making and reducing the risk of errors. The system's advanced data analytics tools help identify trends, predict potential issues, and optimize resource allocation, making the locate ticket management process smarter and more efficient. Additionally, automated ticket management ensures consistency and reduces human error, while real-time notifications keep all relevant personnel informed and ready to respond promptly.
The system's ability to streamline workflows and automate ticket routing significantly reduces the time taken to process each ticket, making the process faster and more efficient. Mobile access allows field technicians to update ticket information on the go, ensuring that the latest information is always available and accelerating the locate process. Overall, Utilocate not only enhances the efficiency and accuracy of locate ticket management but also improves safety by minimizing the risk of utility damage through precise and timely locates.
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteGoogle
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
AI Genie Review: World’s First Open AI WordPress Website CreatorGoogle
AI Genie Review: World’s First Open AI WordPress Website Creator
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-genie-review
AI Genie Review: Key Features
✅Creates Limitless Real-Time Unique Content, auto-publishing Posts, Pages & Images directly from Chat GPT & Open AI on WordPress in any Niche
✅First & Only Google Bard Approved Software That Publishes 100% Original, SEO Friendly Content using Open AI
✅Publish Automated Posts and Pages using AI Genie directly on Your website
✅50 DFY Websites Included Without Adding Any Images, Content Or Doing Anything Yourself
✅Integrated Chat GPT Bot gives Instant Answers on Your Website to Visitors
✅Just Enter the title, and your Content for Pages and Posts will be ready on your website
✅Automatically insert visually appealing images into posts based on keywords and titles.
✅Choose the temperature of the content and control its randomness.
✅Control the length of the content to be generated.
✅Never Worry About Paying Huge Money Monthly To Top Content Creation Platforms
✅100% Easy-to-Use, Newbie-Friendly Technology
✅30-Days Money-Back Guarantee
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
#AIGenieApp #AIGenieBonus #AIGenieBonuses #AIGenieDemo #AIGenieDownload #AIGenieLegit #AIGenieLiveDemo #AIGenieOTO #AIGeniePreview #AIGenieReview #AIGenieReviewandBonus #AIGenieScamorLegit #AIGenieSoftware #AIGenieUpgrades #AIGenieUpsells #HowDoesAlGenie #HowtoBuyAIGenie #HowtoMakeMoneywithAIGenie #MakeMoneyOnline #MakeMoneywithAIGenie
Graspan: A Big Data System for Big Code AnalysisAftab Hussain
We built a disk-based parallel graph system, Graspan, that uses a novel edge-pair centric computation model to compute dynamic transitive closures on very large program graphs.
We implement context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux shows that their Graspan implementations scale to millions of lines of code and are much simpler than their original implementations.
These analyses were used to augment the existing checkers; these augmented checkers found 132 new NULL pointer bugs and 1308 unnecessary NULL tests in Linux 4.4.0-rc5, PostgreSQL 8.3.9, and Apache httpd 2.2.18.
- Accepted in ASPLOS ‘17, Xi’an, China.
- Featured in the tutorial, Systemized Program Analyses: A Big Data Perspective on Static Analysis Scalability, ASPLOS ‘17.
- Invited for presentation at SoCal PLS ‘16.
- Invited for poster presentation at PLDI SRC ‘16.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
GraphSummit Paris - The art of the possible with Graph TechnologyNeo4j
Sudhir Hasbe, Chief Product Officer, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
2. About Me
• 'Streams Infrastructure' at LinkedIn
  – Pub-sub messaging: Apache Kafka
  – Change capture from various data systems: Databus
  – Stream processing platform: Apache Samza
• Previous
  – Microsoft Cloud/IoT Messaging (EventHub) and Enterprise Messaging (Queues/Topics)
  – .NET WebServices and Workflow stack
  – BizTalk Server
3. Agenda
• What is Stream Processing?
• Scenarios
• Canonical Architecture
• Essential Ingredients of Stream Processing
• Close
14. Basics: Scaling Ingestion
- Streams are partitioned
- Messages are sent to partitions based on a PartitionKey
- Time-based message retention
[Diagram: producers send messages to the partitions of Stream A by PartitionKey (Pkey=10, Pkey=25, Pkey=45); consumerA instances on machine1 and machine2 each read a subset of the partitions]
e.g. Kafka, AWS Kinesis, Azure EventHub
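To make key-based partitioning concrete, here is a minimal sketch using the Kafka Java producer; the topic name and keys are made up for illustration. Messages sent with the same key hash to the same partition, so a single consumer instance sees them in order.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    // try-with-resources closes the producer, flushing any pending sends.
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Both records carry key "10" (cf. Pkey=10 in the diagram above), so
      // they are routed to the same partition and consumed in order.
      producer.send(new ProducerRecord<>("stream-a", "10", "view:ad-42"));
      producer.send(new ProducerRecord<>("stream-a", "10", "click:ad-42"));
    }
  }
}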
16. Samza – Streaming Dataflow
[Diagram: a streaming dataflow of Samza jobs connected by streams: Job 1 and Job 2 chained via Streams A, B, C and D]
17. Horizontal Scaling is great! But..
• More machines means more $$
• Need to do more with less
• So what's the key bottleneck during event/stream processing?
18. Key Bottleneck: “Accessing Data”
• Big impact on CPU, network, disk
• Types of data access:
  1. Adjunct data – read-only data
  2. Scratchpad/derived data – read-write data
19. Adjunct Data – typical access
[Diagram: an AdClicks stream from Kafka feeds a processing job, which reads member info from a remote Member Database and writes AdQuality updates back to Kafka]
Concerns
1. Latency
2. CPU
3. Network
4. DDOS
20. Scratchpad/Derived Data – typical access
[Diagram: a sensor-data stream from Kafka feeds a processing job, which reads and updates per-device state in a remote Device State Database and emits alerts to Kafka]
Concerns
1. Latency
2. CPU
3. Network
4. DDOS
21. Adjunct Data – with Samza
[Diagram: the AdClicks stream from Kafka feeds a partitioned processing job (Task1, Task2, Task3); member updates flow from the Member Database (Espresso) through Databus into each task's local store; results go to an output Kafka topic]
Kafka, Databus, the database and the Samza job are all partitioned by MemberId
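A rough sketch, not LinkedIn's actual job, of what this looks like with Samza's low-level API: the task consumes the Databus-derived member-updates stream to keep a partition-local KeyValueStore fresh, then joins ad clicks against that store without any remote call. The store and stream names ("member-store", "member-updates", "ad-quality-updates") are hypothetical.

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class AdQualityTask implements StreamTask, InitableTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "ad-quality-updates");
  private KeyValueStore<String, String> memberStore; // partition-local store

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    memberStore = (KeyValueStore<String, String>) context.getStore("member-store");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                      TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    String memberId = (String) envelope.getKey();

    if ("member-updates".equals(stream)) {
      // Change-capture event (Databus): refresh the local copy of member info.
      memberStore.put(memberId, (String) envelope.getMessage());
    } else {
      // Ad-click event: join against local state, no remote DB call.
      String memberInfo = memberStore.get(memberId);
      collector.send(new OutgoingMessageEnvelope(OUTPUT, memberId,
          "click=" + envelope.getMessage() + ", member=" + memberInfo));
    }
  }
}

Because every input (Kafka, Databus, the database and the job) is partitioned by MemberId, each task only ever needs the slice of member data that its own partitions reference.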
22. Fault Tolerance in a stateful Samza job
[Diagram: Tasks 0-3, each with its local state partition (P0-P3), spread across Host-A, Host-B and Host-C; every local write is also recorded in a changelog stream]
Stable state
23. Fault Tolerance in a stateful Samza job
[Diagram: same layout as the previous slide]
Host A dies/fails
24. Fault Tolerance in a stateful Samza job
[Diagram: the tasks from Host-A now placed on Host-E]
YARN allocates the tasks to a container on a different host!
25. Fault Tolerance in a stateful Samza job
[Diagram: same layout, with Host-E rebuilding its local state partitions]
Restore local state by reading from the changelog
26. Fault Tolerance in a stateful Samza job
[Diagram: same layout as before, fully recovered]
Back to stable state
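For reference, a changelog-backed store is declared in a Samza job's .properties configuration roughly as follows (store and topic names here are hypothetical). Every local write is mirrored to a log-compacted Kafka topic, which is what the restore step above replays:

# Hypothetical store configuration; names are illustrative.
stores.member-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.member-store.key.serde=string
stores.member-store.msg.serde=string
# Every write to the store is also sent to this log-compacted Kafka topic,
# which is replayed to rebuild local state when a task moves to a new host.
stores.member-store.changelog=kafka.member-store-changelog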
27. Performance Numbers with Samza
Hardware spec: 24 cores, 1 Gig NIC, SSD
• (Baseline) Simple pass-through job with no local state: 1.2 million msg/sec
• Samza job with local state: 400k msg/sec
• Samza job with local state with Kafka backup: 300k msg/sec
28. Local State – Summary
• Great for both read-only data and read-write data
• Secret sauce to make local state work:
  1. Change capture system: Databus / DynamoDB Streams
  2. Durable backup with Kafka log-compacted topics
29. Essential Ingredients to Stream Processing
1. Scale
2. Reprocessing
3. Accuracy of results
4. Easy to program
31. Why do we need it?
• Software upgrades.. yes, bugs are a reality
• Business logic changes
• First-time job deployment
32. Reprocessing Data – with Samza
[Diagram: a Company/Title/Location Standardization job applies a machine-learning model to member updates flowing from the Member Database (Espresso) via Databus, writing results to an output Kafka topic; to reprocess, the job bootstraps from the full stream of member records]
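One way reprocessing is typically triggered in Samza is through stream-level configuration. The sketch below (system and stream names hypothetical) rewinds the input to the oldest offset and marks the change-capture stream as a bootstrap stream, so it is fully consumed before any other input is processed:

# Hypothetical reprocessing configuration; names are illustrative.
# Re-read the member-updates stream from the beginning...
systems.kafka.streams.member-updates.samza.offset.default=oldest
systems.kafka.streams.member-updates.samza.reset.offset=true
# ...and fully consume it before processing other input streams.
systems.kafka.streams.member-updates.samza.bootstrap=true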
33. Reprocessing – Caveats
• Stream processors are fast.. they can DOS the system if you reprocess
  – Control the max concurrency of your job
  – Quotas for Kafka and databases
  – Async load into databases (Project Venice)
• Capacity
  – Reprocessing a 100 TB source?
• Doesn't reprocessing mean you are no longer real-time?
34. Essential Ingredients to Stream Processing
1. Scale, but not at any cost
2. Reprocessing
3. Accuracy of results
4. Easy to program
36. Querying over an infinite stream
[Diagram: User1 generates an Ad View event at 1:00 pm and an Ad Click event at 1:01 pm; the Ad Quality Processor must answer: did the user click the ad within 2 minutes of seeing it?]
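A sketch of one plausible way to answer that query with local state (not the actual LinkedIn implementation; store and stream names are made up): remember each ad view in a local store keyed by member and ad, then check the timestamp difference when the click arrives.

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class AdViewClickJoinTask implements StreamTask, InitableTask {
  private static final long TWO_MINUTES_MS = 2 * 60 * 1000L;
  private static final SystemStream OUTPUT = new SystemStream("kafka", "quality-clicks");
  // Keyed by "memberId:adId"; value = view timestamp in millis.
  private KeyValueStore<String, Long> viewStore;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    viewStore = (KeyValueStore<String, Long>) context.getStore("ad-view-store");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                      TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    String key = (String) envelope.getKey();                        // "memberId:adId"
    long eventTime = Long.parseLong((String) envelope.getMessage()); // event timestamp

    if ("ad-views".equals(stream)) {
      viewStore.put(key, eventTime);
    } else if ("ad-clicks".equals(stream)) {
      Long viewTime = viewStore.get(key);
      // Late/out-of-order arrivals: the view may not be here yet, in which
      // case a real job would buffer the click rather than drop it.
      if (viewTime != null && eventTime - viewTime <= TWO_MINUTES_MS) {
        collector.send(new OutgoingMessageEnvelope(OUTPUT, key, "quality-click"));
      }
    }
  }
}

The hard part, as the next slides show, is that the view and the click may arrive late or out of order, so a naive lookup like this is only the starting point.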
37. Delays – an example
[Diagram: an AdView event enters through a load balancer to the services tier in Datacenter 1, is written to the local Kafka cluster, and is mirrored to the Kafka cluster in Datacenter 2; an Ad Quality Processor (Samza) runs in each datacenter]
38. Delays – an example
[Diagram: the same two-datacenter setup, now for the matching AdClick event: it enters through a load balancer, is written to the local Kafka cluster and mirrored to the other datacenter, where real-time processing (Samza) runs; so the view and its click may arrive in different datacenters at different times]
39. What do we need to do to get accurate results?
Deal with:
• Late arrivals
  – e.g. the AdClick event showed up 5 minutes late
• Out-of-order arrival
  – e.g. the AdClick event showed up before the AdView event
• Influenced by "Google MillWheel"
41. Myth: This isn't a problem with Lambda Architecture..
• Theory: since the processing happens an hour or several hours later, delays are not a problem
• OK.. but what about the "edges"?
  – Some "sessions" start before the cut-off time for processing.. and end after it
  – Delays and out-of-order processing make things worse at the edges
42. Essential Ingredients to Stream Processing
1. Scale, but not at any cost
2. Reprocessing
3. Accuracy of results
4. Easy programmability
43. Easy Programmability
• Support for "accurate" windowing/joins (Google Cloud Dataflow)
• Ability to express workflows/DAGs in config and a DSL (e.g. Storm)
• SQL support for querying over streams (Azure Stream Insight)
• Apache Samza – working on the above
44. Agenda
• Stream processing Intro
• Scenarios
• Canonical Architecture
• Essential Ingredients of Stream Processing
• Close
45. Some scale numbers at LinkedIn
• 1.3 trillion messages get ingested into Kafka per day
  – Each message gets consumed 4-5 times
• Database change capture: more than 2 trillion messages get consumed per week
• Samza jobs in production process more than 1 million messages/sec
Note: these numbers are not reflective of LinkedIn site traffic