Essential Ingredients of Stream Processing @ Scale
Kartik Paramasivam
About Me
• ‘Streams Infrastructure’ at LinkedIn
– Pub-sub messaging : Apache Kafka
– Change Capture from various data systems: Databus
– Stream Processing platform : Apache Samza
• Previous
– Microsoft Cloud/IoT Messaging (EventHub) and Enterprise Messaging (Queues/Topics)
– .NET WebServices and Workflow stack
– BizTalk Server
Agenda
• What is Stream Processing ?
• Scenarios
• Canonical Architecture
• Essential Ingredients of Stream Processing
• Close
Response latency: synchronous (0 ms) → milliseconds to minutes → later, possibly much later
Scenarios
Newsfeed
Cyber-security
Internet of Things
CANONICAL ARCHITECTURE
[Diagram components: Clients (browser, devices, sensors, ...); Services Tier; Ingestion: Kafka, Databus, Espresso; Processing: Real Time Processing (Samza), Batch Processing (Hadoop/Spark); Serving: Voldemort R/O, e.g. Espresso; Bulk upload]
Essential Ingredients to Stream
Processing
1. Scale
2. Reprocessing
3. Accuracy of results
4. Easy to program
SCALE.. but not at any cost
Basics : Scaling Ingestion
- Streams are partitioned
- Messages sent to partitions based on PartitionKey
- Time based message retention
[Diagram: producers write to Stream A with partition keys Pkey=10, Pkey=25, Pkey=45; consumerA instances on machine1 and machine2 read from the stream]
e.g. Kafka, AWS Kinesis, Azure EventHub
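The key-based routing above can be sketched in a few lines. This is a minimal illustration, not how any particular broker does it: the function names `partition_for` and `route` are hypothetical, and CRC32 is an assumed stand-in for whatever hash a real system applies to the partition key.

```python
import zlib


def partition_for(partition_key: str, num_partitions: int) -> int:
    """Map a partition key to a partition index with a stable hash."""
    # zlib.crc32 is deterministic across processes, unlike Python's built-in hash().
    # Assumption: real messaging systems use their own hash function.
    return zlib.crc32(partition_key.encode("utf-8")) % num_partitions


def route(messages, num_partitions):
    """Group (key, payload) pairs into per-partition lists, preserving arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, payload in messages:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions
```

Because the same key always maps to the same partition, all messages for one key are seen, in order, by whichever consumer owns that partition.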
Scaling Processing.. E.g. Samza
[Diagram: a Samza Job with Task 1, Task 2, Task 3 consuming Stream A and Stream B]
Samza – Streaming Dataflow
[Diagram: Stream A, Stream B, Stream C and Stream D connected through Job 1 and Job 2]
Horizontal Scaling is great! But..
• More machines means more $$
• Need to do more with less.
• So what’s the key bottleneck during Event/Stream Processing?
Key Bottleneck: “Accessing Data”
• Big impact on CPU, Network, Disk
• Types of Data Access
1. Adjunct data – Read only data
2. Scratchpad/derived data – Read-Write data
Adjunct Data – typical access
[Diagram: AdClicks (Kafka) → Processing Job → AdQuality update (Kafka); the Processing Job issues “Read Member Info” calls to the Member Database]
Concerns
1. Latency
2. CPU
3. Network
4. DDoS
Scratch pad/Derived Data – typical access
[Diagram: Sensor Data (Kafka) → Processing Job → Alerts (Kafka); the Processing Job does a Read + Update per Device Info against the Device State Database]
Concerns
1. Latency
2. CPU
3. Network
4. DDoS
Adjunct Data – with Samza
[Diagram: AdClicks (Kafka) → Processing Job (Task1, Task2, Task3) → output (Kafka); Member Database (Espresso) → Databus → Member Updates → Processing Job]
Kafka, Databus, Database, Samza Job are all partitioned by MemberId
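A rough sketch of what one task does in this setup: it applies member updates from the change-capture stream to a local table and enriches ad clicks from that table, with no remote database call. The class name `MemberEnrichmentTask`, its methods, and the event shapes are hypothetical and are not the Samza API; a plain dict stands in for the task's local store.

```python
class MemberEnrichmentTask:
    """One task owns the same partition of the click stream and the member-update stream."""

    def __init__(self):
        # Local store: member_id -> latest known profile (a dict stands in for a local KV store).
        self.members = {}

    def on_member_update(self, member_id, profile):
        """Apply one change-capture event; a profile of None means the member was deleted."""
        if profile is None:
            self.members.pop(member_id, None)
        else:
            self.members[member_id] = profile

    def on_ad_click(self, click):
        """Join the click with locally held member info instead of querying the database."""
        profile = self.members.get(click["member_id"])
        return {**click, "member": profile}
```

The lookup only works because both inputs are partitioned by MemberId, so the task that receives a member's clicks also receives that member's updates.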
Fault Tolerance in a stateful Samza job
[Diagram: Task-0, Task-1, Task-2, Task-3 running on Host-A, Host-B, Host-C, with local state for partitions P0 to P3 backed by partitions P0 to P3 of a Changelog Stream]
1. Stable State
2. Host A dies/fails
3. YARN allocates the tasks to a container on a different host (Host-E)
4. Restore local state by reading from the Changelog
5. Back to Stable State
Performance Numbers with Samza
Hardware Spec: 24 cores, 1Gig NIC, SSD
• (Baseline) Simple pass through job with no local state
– 1.2 Million msg/sec
• Samza job with local state
– 400k msg/sec
• Samza job with local state with Kafka backup
– 300k msg/sec
Local State – Summary
• Great for both read-only data and read-write data
• Secret sauce to make local state work
1. Change Capture System: Databus/DynamoDB streams
2. Durable backup with Kafka Log Compacted topics
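The changelog-backed recovery and compacted backup described above can be sketched as follows. This is an illustrative model only: `ChangelogBackedStore` and `compact` are hypothetical names, a Python list stands in for one changelog partition, and `None` is assumed to be the tombstone value (so `None` cannot be stored as real data).

```python
class ChangelogBackedStore:
    """Local key-value state whose every write is also appended to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # list standing in for a durable changelog partition
        self.state = {}

    def put(self, key, value):
        self.state[key] = value
        self.changelog.append((key, value))

    def delete(self, key):
        self.state.pop(key, None)
        self.changelog.append((key, None))  # tombstone marks the deletion

    def get(self, key, default=None):
        return self.state.get(key, default)

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state on a new host by replaying the changelog from the start."""
        store = cls(changelog)
        for key, value in list(changelog):
            if value is None:
                store.state.pop(key, None)
            else:
                store.state[key] = value
        return store


def compact(changelog):
    """Keep only the latest value per key and drop deleted keys, as log compaction would."""
    latest = {}
    for key, value in changelog:
        latest[key] = value
    return [(key, value) for key, value in latest.items() if value is not None]
```

Replaying the compacted log yields the same state as replaying the full log, which is why compaction keeps restore time proportional to the number of live keys rather than to the number of writes.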
REPROCESSING
Why do we need it ?
• Software upgrades.. Yes bugs are a reality
• Business logic changes
• First time job deployment
Reprocessing Data – with Samza
[Diagram: Member Database (Espresso) → Databus → Member Updates → Company/Title/Location Standardization Job → output (Kafka); Machine Learning model bootstrap]
Reprocessing – Caveats
• Stream processors are fast. They can DoS the system if you reprocess
– Control max-concurrency of your job
– Quotas for Kafka, Databases
– Async load into databases (Project Venice)
• Capacity
– Reprocessing a 100 TB source?
• Doesn’t reprocessing mean you are no longer being real-time?
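One way to picture the "control max-concurrency" caveat above is a reprocessing loop that never has more than a fixed number of records in flight against downstream systems. The sketch below uses Python's standard thread pool; `reprocess` and `ConcurrencyProbe` are hypothetical helpers for illustration, not part of Samza or Kafka.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def reprocess(records, handler, max_concurrency):
    """Apply handler to every record using at most max_concurrency worker threads."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map returns results in input order.
        return list(pool.map(handler, records))


class ConcurrencyProbe:
    """Wraps a handler and records the peak number of simultaneous calls."""

    def __init__(self, handler):
        self._handler = handler
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def __call__(self, record):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        try:
            return self._handler(record)
        finally:
            with self._lock:
                self._active -= 1
```

Wrapping the real handler in `ConcurrencyProbe` is a cheap way to confirm that a backfill stays within the limit the downstream database was sized for.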
ACCURACY OF RESULTS
Querying over an infinite stream
[Diagram: User1 generates an Ad View Event at 1:00 pm and an Ad Click Event at 1:01 pm; both go to the Ad Quality Processor]
Did the user click the Ad within 2 minutes of seeing the Ad?
DELAYS – AN EXAMPLE
[Diagram: DATACENTER 1 and DATACENTER 2, each with LB, Services Tier, Kafka and an Ad Quality Processor (Samza); Kafka is mirrored between the datacenters. First slide: AdViewEvent from user kartik. Second slide: AdClick Event from user kartik.]
What do we need to do to get accurate results?
Deal with
• Late Arrivals
– E.g. AdClick event showed up 5 minutes late.
• Out of order arrival
– E.g. AdClick event showed up before AdView event
• Influenced by “Google MillWheel”
Solution
[Diagram: AdClicks (Kafka) and AdView (Kafka) → Processing Job (Task1, Task2, Task3, each with a local Message Store) → output (Kafka)]
1. All events are stored locally
2. Find impacted ‘window/s’ for late arrivals
3. Recompute result
4. Choose strategy for emitting results (absolute or relative value)
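The four steps above can be sketched for the ad-click example. This is a simplified model under stated assumptions: `AdQualityTask` is a hypothetical class, timestamps are event times in seconds, the "impacted window" is reduced to the events held for one (user, ad) pair, and in-memory dicts stand in for the local message store.

```python
from collections import defaultdict

WINDOW_SECONDS = 120  # "within 2 minutes of seeing the Ad"


class AdQualityTask:
    def __init__(self, emit_relative=False):
        self.emit_relative = emit_relative
        self.views = defaultdict(list)    # (user, ad) -> view event times
        self.clicks = defaultdict(list)   # (user, ad) -> click event times
        self.per_key = defaultdict(int)   # (user, ad) -> matched views at last emit
        self.per_ad = defaultdict(int)    # ad -> total matched views at last emit

    def _count_matches(self, key):
        # A view is matched if some click falls within the window after it.
        return sum(
            any(0 <= c - v <= WINDOW_SECONDS for c in self.clicks[key])
            for v in self.views[key]
        )

    def process(self, kind, user, ad, ts):
        """Store the event, recompute the affected key, and emit only if the result changed."""
        key = (user, ad)
        (self.views if kind == "view" else self.clicks)[key].append(ts)  # step 1
        new = self._count_matches(key)                                   # steps 2 and 3
        delta = new - self.per_key[key]
        if delta == 0:
            return None
        self.per_key[key] = new
        self.per_ad[ad] += delta
        # Step 4: emit either the correction (relative) or the new total (absolute).
        return (ad, delta if self.emit_relative else self.per_ad[ad])
```

Because events are matched on their own timestamps rather than on arrival order, a click that arrives before its view, or several minutes late, still produces the same final count.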
Myth: This isn’t a problem with Lambda Architecture..
• Theory: Since the processing happens 1 hour or several hours later, delays are not a problem.
• Ok.. But what about the “edges”
– Some “sessions” start before the cut-off time for processing.. and end after the cut-off time.
– Delays and out-of-order processing make things worse on the edges
Easy Programmability
• Support for “accurate” Windowing/Joins (Google Cloud Dataflow)
• Ability to express workflows/DAGs in config and DSL (e.g. Storm)
• SQL support for querying over streams
– Azure Stream Insight
• Apache Samza – working on the above
Some scale numbers at LinkedIn
• 1.3 Trillion Messages get ingested into Kafka per day
– Each message gets consumed 4-5 times
• Database change capture:
– More than 2 Trillion Messages get consumed per week
• Samza jobs in production which process more than 1 Million messages/sec
Note: These numbers are not reflective of LinkedIn Site traffic
References
• http://samza.apache.org/
• http://kafka.apache.org/
• https://github.com/linkedin/databus
• http://cs.brown.edu/~ugur/8rulesSigRec.pdf
• http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p734-akidau.pdf
Thank You!

Globus Compute Introduction - GlobusWorld 2024
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
AI Genie Review: World’s First Open AI WordPress Website Creator
AI Genie Review: World’s First Open AI WordPress Website CreatorAI Genie Review: World’s First Open AI WordPress Website Creator
AI Genie Review: World’s First Open AI WordPress Website Creator
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
 
Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Pro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp BookPro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp Book
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 

Essential Ingredients of Realtime Stream Processing @ Scale

  • 1. Essential Ingredients of Stream Processing @ Scale Kartik Paramasivam
  • 2. About Me • ‘Streams Infrastructure’ at LinkedIn – Pub-sub messaging : Apache Kafka – Change Capture from various data systems: Databus – Stream Processing platform : Apache Samza • Previous – Microsoft Cloud/IOT Messaging (EventHub) and Enterprise Messaging(Queues/Topics) – .NET WebServices and Workflow stack – BizTalk Server
  • 3. Agenda • What is Stream Processing ? • Scenarios • Canonical Architecture • Essential Ingredients of Stream Processing • Close
  • 4. Response latency Milliseconds to minutes Synchronous Later. Possibly much later. 0 ms
  • 5. Agenda • Stream processing Intro • Scenarios • Canonical Architecture • Essential Ingredients of Stream Processing • Close
  • 9. Agenda • Stream processing Intro • Scenarios • Canonical Architecture • Essential Ingredients of Stream Processing • Close
  • 11. Agenda • Stream processing Intro • Scenarios • Canonical Architecture • Essential Ingredients of Stream Processing • Close
  • 12. Essential Ingredients to Stream Processing 1. Scale 2. Reprocessing 3. Accuracy of results 4. Easy to program
  • 13. SCALE.. but not at any cost
  • 14. Basics : Scaling Ingestion - Streams are partitioned - Messages sent to partitions based on PartitionKey - Time based message retention Stream A producers Pkey=10 consumerA (machine1) consumerA (machine2) Pkey=25 Pkey=45 e.g. Kafka, AWS Kinesis, Azure EventHub
  • 15. Scaling Processing.. E.g. Samza Stream A Task 1 Task 2 Task 3 Stream B Samza Job
  • 16. Samza – Streaming Dataflow Stream A Stream C Stream D Job 1 Job 2 Stream B
  • 17. Horizontal Scaling is great ! But.. • But more machines means more $$ • Need to do more with less. • So what’s the key bottleneck during Event/Stream Processing ?
  • 18. Key Bottleneck: “Accessing Data” • Big impact on CPU, Network, Disk • Types of Data Access 1. Adjunct data – Read only data 2. Scratchpad/derived data - Read-Write data
  • 19. Adjunct Data – typical access Kafka AdClicks Processing Job AdQuality update Kafka Member Database Read Member Info Concerns 1. Latency 2. CPU 3. Network 4. DDOS
  • 20. Scratch pad/Derived Data – typical access Kafka Sensor Data Processing Job Alerts Kafka Device State Database Concerns 1. Latency 2. CPU 3. Network 4. DDOS Read + Update per Device Info
  • 21. Adjunct Data – with Samza Kafka AdClicks Processing Job output Kafka Member Database (Espresso) Databus Kafka, Databus, Database, Samza Job are all partitioned by MemberId Member Updates Task1 Task2 Task3
  • 22. Fault Tolerance in a stateful Samza job P0 P1 P2 P3 Task-0 Task-1 Task-2 Task-3 P0 P1 P2 P3 Host-A Host-B Host-C Changelog Stream Stable State
  • 23. Fault Tolerance in a stateful Samza job P0 P1 P2 P3 Task-0 Task-1 Task-2 Task-3 P0 P1 P2 P3 Host-A Host-B Host-C Changelog Stream Host A dies/fails
  • 24. Fault Tolerance in a stateful Samza job P0 P1 P2 P3 Task-0 Task-1 Task-2 Task-3 P0 P1 P2 P3 Host-E Host-B Host-C Changelog Stream YARN allocates the tasks to a container on a different host!
  • 25. Fault Tolerance in a stateful Samza job P0 P1 P2 P3 Task-0 Task-1 Task-2 Task-3 P0 P1 P2 P3 Host-E Host-B Host-C Changelog Stream Restore local state by reading from the ChangeLog
  • 26. Fault Tolerance in a stateful Samza job P0 P1 P2 P3 Task-0 Task-1 Task-2 Task-3 P0 P1 P2 P3 Host-E Host-B Host-C Changelog Stream Back to Stable State
  • 27. Performance Numbers with Samza • Hardware Spec: 24 cores, 1Gig NIC, SSD • (Baseline) Simple pass through job with no local state – 1.2 Million msg/sec • Samza job with local state – 400k msg/sec • Samza job with local state with Kafka backup – 300k msg/sec
  • 28. Local State - Summary • Great for both read-only data and read-write data • Secret sauce to make local state work 1. Change Capture System: Databus/DynamoDB streams 2. Durable backup with Kafka Log Compacted topics
  • 29. Essential Ingredients to Stream Processing 1. Scale 2. Reprocessing 3. Accuracy of results 4. Easy to program
  • 31. Why do we need it ? • Software upgrades.. Yes bugs are a reality • Business logic changes • First time job deployment
  • 32. Reprocessing Data – with Samza output Kafka Member Database (Espresso) Databus Member Updates Company/Title/Location Standardization Job Machine Learning model bootstrap
  • 33. Reprocessing – Caveats • Stream processors are fast. They can DOS the system if you reprocess – Control max-concurrency of your job – Quotas for Kafka, Databases – Async load into databases (Project Venice) • Capacity – Reprocessing a 100 TB source? • Doesn’t reprocessing mean you are no longer being real-time?
  • 34. Essential Ingredients to Stream Processing 1. Scale but not at any cost 2. Reprocessing 3. Accuracy of results 4. Easy to Program
  • 36. Querying over an infinite stream 1:00 pm Ad View Event 1:01 pm Ad Click Event Ad Quality Processor User1 Did user click the Ad within 2 minutes of seeing the Ad?
  • 37. DELAYS – AN EXAMPLE Ad Quality Processor (Samza) Services Tier Kafka Services Tier Ad Quality Processor (Samza) Kafka Mirrored kartik DATACENTER 1 DATACENTER 2 AdViewEvent LB
  • 38. DELAYS – AN EXAMPLE Real Time Processing (Samza) Services Tier Kafka Services Tier Real Time Processing (Samza) Kafka Mirrored kartik DATACENTER 1 DATACENTER 2 AdClick Event LB
  • 39. What do we need to do to get accurate results? Deal with • Late Arrivals – E.g. AdClick event showed up 5 minutes late. • Out of order arrival – E.g. AdClick event showed up before AdView event • Influenced by “Google MillWheel”
  • 40. Solution Kafka AdClicks Processing Job output Kafka Task1 Task2 Task3 Message Store Kafka AdView Message Store Message Store 1. All events are stored locally 2. Find impacted ‘window/s’ for late arrivals 3. Recompute result 4. Choose strategy for emitting results (absolute or relative value)
  • 41. Myth: This isn’t a problem with Lambda Architecture.. • Theory: Since the processing happens 1 hour or several hours later delays are not a problem. • Ok.. But what about the “edges” – Some “sessions” start before the cut off time for processing.. And end after the cut off time. – Delays and out of order processing make things worse on the edges
  • 42. Essential Ingredients to Stream Processing 1. Scale but not at any cost 2. Reprocessing 3. Accuracy of results 4. Easy Programmability
  • 43. Easy Programmability • Support for “accurate” Windowing/Joins. ( Google Cloud Dataflow ) • Ability to express workflows/DAGs in config and DSL (e.g. Storm) • SQL support for querying over streams – Azure Stream Insight • Apache Samza – working on the above
  • 44. Agenda • Stream processing Intro • Scenarios • Canonical Architecture • Essential Ingredients of Stream Processing • Close
  • 45. Some scale numbers at LinkedIn • 1.3 Trillion Messages get ingested into Kafka per day – Each message gets consumed 4-5 times • Database change capture : – More than 2 Trillion Messages get consumed per week • Samza jobs in production which process more than 1 Million messages/sec Note: These numbers are not reflective of LinkedIn Site traffic
  • 46. References • http://samza.apache.org/ • http://kafka.apache.org/ • https://github.com/linkedin/databus • http://cs.brown.edu/~ugur/8rulesSigRec.pdf • http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p734-akidau.pdf
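
The four steps on slide 40 (store all events locally, find the impacted window for a late arrival, recompute, emit) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Samza's API: the class `AdClickJoiner`, its method names, and the in-memory dictionaries standing in for a task's local message store are all hypothetical, and timestamps are plain seconds.

```python
from collections import defaultdict

WINDOW_SECONDS = 120  # "within 2 minutes" from the ad-quality question on slide 36


class AdClickJoiner:
    """Hypothetical sketch: keep every event and recompute the result on each arrival."""

    def __init__(self):
        # In-memory stand-ins for a task's local message store (assumption).
        self.views = defaultdict(list)   # (user, ad) -> view timestamps
        self.clicks = defaultdict(list)  # (user, ad) -> click timestamps

    def on_event(self, kind, user, ad, ts):
        key = (user, ad)
        store = self.views if kind == "view" else self.clicks
        store[key].append(ts)              # step 1: store the event locally
        result = self._recompute(key)      # steps 2-3: only this key is impacted
        return key, result                 # step 4: emit the absolute value

    def _recompute(self, key):
        # True if any click falls within WINDOW_SECONDS after some view of the same ad.
        return any(
            0 <= click - view <= WINDOW_SECONDS
            for view in self.views[key]
            for click in self.clicks[key]
        )


joiner = AdClickJoiner()
# Out-of-order arrival: the click (t=60s) shows up before the view (t=0s).
early = joiner.on_event("click", "user1", "ad7", 60)
late = joiner.on_event("view", "user1", "ad7", 0)
```

Because events are retained rather than discarded after processing, the late or out-of-order arrival simply triggers a recomputation for its key. Emitting the absolute value, as here, lets downstream consumers overwrite the earlier result; emitting a relative value (a delta) is the other strategy the slide mentions.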

Editor's Notes

  1. The three areas: ingestion, processing, serving.
  2. Sure, we can add caches. But then how do the caches get populated and kept in sync?
  3. Note: These numbers are not reflective of LinkedIn site traffic.
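
The question in the notes about how caches get populated and kept in sync is what slides 21 to 28 answer with local state: a change-capture stream feeds the task's local store, and a changelog allows the store to be rebuilt on another host. A minimal sketch of that idea, assuming a plain Python list as a stand-in for a log-compacted Kafka topic; `LocalStateTask` and its methods are hypothetical names, not Samza's API:

```python
class LocalStateTask:
    """Hypothetical sketch of a task holding member data in local state."""

    def __init__(self, changelog):
        self.store = {}             # local key-value state (in memory here)
        self.changelog = changelog  # stand-in for a log-compacted topic (assumption)

    def on_member_update(self, member_id, info):
        # Change-capture event: update local state and back it up to the changelog.
        self.store[member_id] = info
        self.changelog.append((member_id, info))

    def on_ad_click(self, member_id):
        # Adjunct-data lookup is a local read, with no remote database call.
        return self.store.get(member_id)

    @classmethod
    def restore(cls, changelog):
        # After a host failure, replay the changelog; the last write per key wins.
        task = cls(changelog)
        for member_id, info in changelog:
            task.store[member_id] = info
        return task


changelog = []
task = LocalStateTask(changelog)
task.on_member_update(42, {"title": "Engineer"})
task.on_member_update(42, {"title": "Manager"})

# Simulate the task being reallocated to a different host.
recovered = LocalStateTask.restore(changelog)
```

This only works when the input stream, the change-capture stream, and the job are partitioned by the same key (MemberId on slide 21), so that each task sees the updates for exactly the members whose events it processes.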