Apache Incubator Samza: Stream Processing at LinkedIn

Chris Riccomini
Apache Samza*
Stream Processing at LinkedIn
Chris Riccomini
9/27/2013

* Incubating
Stream Processing?
0 ms

Response latency
Milliseconds to minutes
Synchronous

Later. Possibly much later.
Newsfeed
News
Ad Relevance
Email
Search Indexing Pipeline
Metrics and Monitoring
Motivation
Real-time Feeds
• User activity
• Metrics
• Monitoring
• Database Changes
Real-time Feeds
• 10+ billion writes per day
• 172,000 messages per second (average)
• 55+ billion messages per day to real-time consumers
Stream Processing is Hard
• Partitioning
• State
• Re-processing
• Failure semantics
• Joins to services or database
• Non-determinism
Samza Concepts & Architecture
Streams
Partition 0: 1 2 3 4 5 6
Partition 1: 1 2 3 4 5
Partition 2: 1 2 3 4 5 6 7
Each partition is appended to at its end ("next append").
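
The stream picture can be sketched in code: a stream is split into partitions, each partition is an ordered, append-only sequence, and every message receives the next offset within its partition. The class and method names below are hypothetical, and picking the partition from a key hash is an assumption made for illustration. Offsets here start at 0, whereas the slide numbers messages from 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a partitioned, append-only stream.
class PartitionedStream {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedStream(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Assumption: the partition is chosen from the key's hash, so equal keys share a partition.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Appends to the end of the key's partition and returns the new message's offset.
    long append(String key, String message) {
        List<String> partition = partitions.get(partitionFor(key));
        partition.add(message);
        return partition.size() - 1;
    }

    // Reads one message; ordering is only defined within a single partition.
    String read(int partition, long offset) {
        return partitions.get(partition).get((int) offset);
    }

    public static void main(String[] args) {
        PartitionedStream stream = new PartitionedStream(3);
        long first = stream.append("member-1", "viewed /home");
        long second = stream.append("member-1", "viewed /jobs");
        int p = stream.partitionFor("member-1");
        System.out.println("partition " + p + ": offsets " + first + ", " + second);
        System.out.println(stream.read(p, second));
    }
}
```

Because both messages share a key, they land in the same partition and keep their relative order; messages in different partitions have no ordering guarantee in this model.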
Tasks
Partition 0
Task 1
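import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.avro.generic.GenericRecord;
import org.apache.samza.system.*;
import org.apache.samza.task.*;

class PageKeyViewsCounterTask implements StreamTask {
    // In-memory view counts per page key (lost if the task restarts).
    private final Map<String, AtomicInteger> pageKeyViews = new HashMap<>();
    private final SystemStream countStream = new SystemStream("kafka", "page-key-counts"); // placeholder names

    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        GenericRecord record = (GenericRecord) envelope.getMessage();
        String pageKey = record.get("page-key").toString();
        // Create the counter the first time a key is seen, then increment it.
        int newCount = pageKeyViews.computeIfAbsent(pageKey, k -> new AtomicInteger()).incrementAndGet();
        collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
    }
}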
Tasks
[Diagram: Page Views - Partition 0 (messages 1, 2, 3, 4) -> PageKeyViewsCounterTask -> Output Count Stream (Partition 0, Partition 1). The task also writes to a Checkpoint Stream (Partition 1), shown holding the value 2.]
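
The checkpoint diagram suggests that a task records its input position so that a restarted task can resume from there. A minimal sketch of that idea follows, assuming the checkpoint holds the offset of the last message the task finished and that a restart resumes one past it. The class and its fields are hypothetical, and the checkpoint stream is modeled as a plain list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of checkpointing: the task logs the offset of the last
// message it finished, and a restarted task resumes right after that offset.
class CheckpointingTask {
    private final List<Long> checkpointStream; // shared log that survives task restarts
    final List<String> processed = new ArrayList<>();

    CheckpointingTask(List<Long> checkpointStream) {
        this.checkpointStream = checkpointStream;
    }

    // First offset to read: one past the latest checkpoint, or 0 if there is none.
    long startingOffset() {
        return checkpointStream.isEmpty()
                ? 0
                : checkpointStream.get(checkpointStream.size() - 1) + 1;
    }

    // Processes offsets from startingOffset() up to endExclusive, checkpointing every `interval` messages.
    void run(List<String> input, int endExclusive, int interval) {
        for (long offset = startingOffset(); offset < endExclusive; offset++) {
            processed.add(input.get((int) offset));
            if ((offset + 1) % interval == 0) {
                checkpointStream.add(offset);
            }
        }
    }

    public static void main(String[] args) {
        List<String> pageViews = Arrays.asList("view1", "view2", "view3", "view4");
        List<Long> checkpoints = new ArrayList<>();

        CheckpointingTask first = new CheckpointingTask(checkpoints);
        first.run(pageViews, 3, 2); // handles offsets 0..2, checkpoints offset 1, then "crashes"

        CheckpointingTask restarted = new CheckpointingTask(checkpoints);
        restarted.run(pageViews, 4, 2); // resumes at offset 2, so view3 is handled a second time
        System.out.println(restarted.processed);
    }
}
```

In this sketch the message at offset 2 is processed once before the simulated crash and once after the restart, because it came after the last checkpoint. That is at-least-once behavior, which is why downstream consumers of the output stream may need to tolerate duplicates.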
Jobs
[Diagram: a job is a set of tasks (Task 1, Task 2, Task 3) that read input streams (Stream A, Stream B) and write an output stream (Stream C). Example: a job reads AdViews and AdClicks and writes AdClickThroughRate.]
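
The AdViews and AdClicks example reads as a two-input job that emits a rate. A minimal sketch, assuming both streams carry an ad ID and that the click-through rate is clicks divided by views; the class, method names and ad ID are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the AdViews + AdClicks -> AdClickThroughRate job.
// Assumption: both input streams carry an ad ID, and the rate is clicks / views.
class AdClickThroughRateJob {
    private final Map<String, Long> views = new HashMap<>();
    private final Map<String, Long> clicks = new HashMap<>();

    // Called for each message on the AdViews stream; returns the updated rate.
    double onAdView(String adId) {
        views.merge(adId, 1L, Long::sum);
        return rate(adId);
    }

    // Called for each message on the AdClicks stream; returns the updated rate.
    double onAdClick(String adId) {
        clicks.merge(adId, 1L, Long::sum);
        return rate(adId);
    }

    // Clicks per view for one ad; 0.0 until at least one view has been seen.
    double rate(String adId) {
        long v = views.getOrDefault(adId, 0L);
        long c = clicks.getOrDefault(adId, 0L);
        return v == 0 ? 0.0 : (double) c / v;
    }

    public static void main(String[] args) {
        AdClickThroughRateJob job = new AdClickThroughRateJob();
        for (int i = 0; i < 4; i++) {
            job.onAdView("ad-42");
        }
        System.out.println(job.onAdClick("ad-42")); // 1 click over 4 views
    }
}
```

For counts like these to be correct when the job runs as several tasks, the views and clicks for one ad have to reach the same task, which in this sketch would mean partitioning both input streams by ad ID.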
Dataflow
[Diagram: a dataflow graph in which Job 1, Job 2 and Job 3 are connected through Streams A, B, C, D and E; a stream written by one job can be read by another.]
YARN
Containers
[Diagram: the job's tasks (Task 1, Task 2, Task 3), which read Stream A and write Stream B, are grouped into Samza Container 1 and Samza Container 2.]
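
One way to picture the grouping of tasks into containers is a simple assignment function. The sketch below assumes round-robin placement, which the slides do not state; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: spread a job's tasks over a fixed number of containers.
// Assumption: round-robin assignment; the slides do not state the actual policy.
class ContainerAssignment {
    // Returns, for each container, the list of task IDs it would run.
    static List<List<Integer>> assign(int taskCount, int containerCount) {
        List<List<Integer>> containers = new ArrayList<>();
        for (int c = 0; c < containerCount; c++) {
            containers.add(new ArrayList<>());
        }
        for (int task = 0; task < taskCount; task++) {
            containers.get(task % containerCount).add(task);
        }
        return containers;
    }

    public static void main(String[] args) {
        // Three tasks over two containers, as in the diagram.
        System.out.println(assign(3, 2)); // prints [[0, 2], [1]]
    }
}
```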
YARN
[Diagram: Host 1 and Host 2 each run a NodeManager, a Samza Container (1 and 2 respectively) and a Kafka Broker; the Samza YARN AM runs on one of the hosts.]
YARN
[Diagram: for comparison, Host 1 and Host 2 each run a NodeManager, a MapReduce Container and HDFS; the MapReduce YARN AM runs on one of the hosts.]
YARN
[Diagram: Host 1 with a NodeManager, a Kafka Broker, Samza Container 1 and Samza Container 2, together with Stream A and Stream C.]
CGroups
[Diagram: Host 1 and Host 2 each run a NodeManager, a Samza Container (1 and 2 respectively) and a Kafka Broker; the Samza YARN AM runs on one of the hosts.]
(Not Running) Multi-Framework
[Diagram: Host 1 and Host 2 each run a NodeManager; the other boxes shown are Samza Container 1, Kafka, a MapReduce Container, the Samza YARN AM and HDFS.]
Stateful Processing
SELECT col1, count(*)
FROM stream1
INNER JOIN stream2 ON stream1.col3 = stream2.col3
WHERE col2 > 20
GROUP BY col1
ORDER BY count(*) DESC
LIMIT 50;
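
Evaluating a query like this over unbounded streams needs state: the GROUP BY needs a running count per group, and the ORDER BY and LIMIT are computed over those counts. The sketch below covers only the WHERE, GROUP BY, ORDER BY and LIMIT clauses and assumes the rows are already joined; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the state behind the GROUP BY / ORDER BY / LIMIT part
// of the query. The join is left out; rows are assumed to be already joined.
class RunningTopCounts {
    private final Map<String, Long> countsByCol1 = new HashMap<>();

    // WHERE col2 > 20, then GROUP BY col1 with count(*): one counter per group.
    void onRow(String col1, int col2) {
        if (col2 > 20) {
            countsByCol1.merge(col1, 1L, Long::sum);
        }
    }

    // ORDER BY count(*) DESC LIMIT n, evaluated over the counts seen so far.
    List<Map.Entry<String, Long>> top(int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(countsByCol1.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        RunningTopCounts query = new RunningTopCounts();
        query.onRow("a", 30);
        query.onRow("b", 25);
        query.onRow("a", 40);
        query.onRow("c", 10); // filtered out by col2 > 20
        System.out.println(query.top(50));
    }
}
```

The map of counts is exactly the kind of state that has to survive a task restart, which is the problem the rest of this section addresses.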
How do people do this?
Remote Stores
[Diagram: Task 1, Task 2 and Task 3 read Stream A, call out to a remote Key-Value Store, and write Stream B.]
Remote RPC is slow
• Stream: ~500k records/sec/container
• DB: << less
Online vs. Async
No undo
• Database state is non-deterministic
• Can’t roll back mutations if task crashes
Tables & Streams
put(a, w)
put(b, x)
Database

put(a, y)

put(b, z)

Time
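
The four puts on the time axis illustrate that a table can be viewed as the result of applying a stream of changes in order. A minimal sketch that replays those puts; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a table is what you get by replaying a stream of puts in order.
class TableFromStream {
    // One change event: set `key` to `value`.
    static final class Put {
        final String key;
        final String value;

        Put(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Applies the puts oldest-first; a later put to the same key overwrites the earlier one.
    static Map<String, String> replay(List<Put> changes) {
        Map<String, String> table = new LinkedHashMap<>();
        for (Put put : changes) {
            table.put(put.key, put.value);
        }
        return table;
    }

    public static void main(String[] args) {
        List<Put> changes = new ArrayList<>();
        changes.add(new Put("a", "w"));
        changes.add(new Put("b", "x"));
        changes.add(new Put("a", "y"));
        changes.add(new Put("b", "z"));
        System.out.println(replay(changes)); // prints {a=y, b=z}
    }
}
```

Only the last put per key affects the final table, so a change stream that retains just the newest entry for each key would rebuild the same table.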
Stateful Tasks
[Diagram: Task 1, Task 2 and Task 3 read Stream A and write Stream B; their state changes are written to a Changelog Stream.]
Key-Value Store
• put(table_name, key, value)
• get(table_name, key)
• delete(table_name, key)
• range(table_name, key1, key2)
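
The four operations can be sketched as a small Java class. This is not Samza's actual store interface: it assumes an in-memory sorted map per table, string keys and values, an exclusive upper bound for range, and a changelog modeled as a list to which every mutation is appended. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical local key-value store offering the four operations on the slide.
// Assumption: every mutation is also appended to a changelog (a null value marks a delete).
class LocalKeyValueStore {
    private final Map<String, TreeMap<String, String>> tables = new HashMap<>();
    final List<String[]> changelog = new ArrayList<>(); // entries: {table, key, value}

    private TreeMap<String, String> table(String tableName) {
        return tables.computeIfAbsent(tableName, name -> new TreeMap<>());
    }

    void put(String tableName, String key, String value) {
        table(tableName).put(key, value);
        changelog.add(new String[] {tableName, key, value});
    }

    String get(String tableName, String key) {
        return table(tableName).get(key);
    }

    void delete(String tableName, String key) {
        table(tableName).remove(key);
        changelog.add(new String[] {tableName, key, null});
    }

    // Entries with key1 <= key < key2, in key order (the exclusive upper bound is an assumption).
    SortedMap<String, String> range(String tableName, String key1, String key2) {
        return table(tableName).subMap(key1, key2);
    }

    // Rebuilds a store by replaying another store's changelog, e.g. after a task moves hosts.
    static LocalKeyValueStore restore(List<String[]> sourceChangelog) {
        LocalKeyValueStore store = new LocalKeyValueStore();
        for (String[] change : sourceChangelog) {
            if (change[2] == null) {
                store.delete(change[0], change[1]);
            } else {
                store.put(change[0], change[1], change[2]);
            }
        }
        return store;
    }

    public static void main(String[] args) {
        LocalKeyValueStore store = new LocalKeyValueStore();
        store.put("page-views", "home", "3");
        store.put("page-views", "jobs", "5");
        store.put("page-views", "profile", "1");
        store.delete("page-views", "profile");
        System.out.println(store.range("page-views", "a", "k")); // keys in [a, k)
        System.out.println(restore(store.changelog).get("page-views", "jobs"));
    }
}
```

Reads and writes here are local method calls rather than remote RPCs, and the changelog is what lets a replacement task rebuild the same contents.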
Whew!
Let’s be Friends!
• We are incubating, and you can help!
• Get up and running in 5 minutes
http://bit.ly/hello-samza
• Grab some newbie JIRAs
http://bit.ly/samza_newbie_issues

Editor's Notes

  1. - stream processing for us = anything asynchronous, but not batch computed.
     - 25% of code is async. 50% is RPC/online. 25% is batch.
     - stream processing is the worst supported of the three.
  6. - compute top shares, pull in, scrape, entity tag
     - language detection
     - send emails: friend was in the news
     - requirement: has to be fast, since news is trendy
  7. - relevance pipeline
  8. - we send relatively data-rich emails
     - some emails are time sensitive (need to be sent soon)
  9. - time sensitive
     - data ingestion pattern
     - other systems that follow this pattern: real-time OLAP system and social graph system
  10. - ecosystem at LinkedIn (some unique traits)
      - hard unsolved problems in this space
  11. - once we had all this data in Kafka, we wanted to do stuff with it.
      - persistent, reliable, distributed message queue
      - Kafka = first among equals, but stream systems are pluggable, just like Hadoop with HDFS vs. S3.
  12. - started with just a simple web service that consumes and produces Kafka messages.
      - realized that there are a lot of hard problems that needed to be solved.
      - reprocessing: what if my algorithm changes and I need to reprocess all events?
      - non-determinism: queries to external systems, time dependencies, ordering of messages.
  13. - open area of research
      - been around for 20 years
  14. partitioned
  15. re-playable, ordered, fault tolerant, infinite. Very heavyweight definition of a stream (vs. S4, Storm, etc.)
  16. At-least-once messaging. Duplicates are possible. Future: exactly-once semantics. Transparent to user. No ack’ing API.
  17. connected by stream name only; fully buffered
  18. - group by, sum, count
  19. - stream to stream, stream to table, table to table
  20. - buffered sorting
  21. UDP is an over-optimization, since most processors try to remote join, which is very slow.
  22. Changelog/redo log. State machine model.
  23. Can also consume these streams from other jobs.
  24. - can’t keep messages forever.
      - log compaction: delete over-written keys over time.
  26. Store API is pluggable: Lucene, buffered sort, external sort, bitmap index, bloom filters and sketches
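
The changelog and log-compaction notes (22 and 24) can be illustrated with a small sketch. This is not Samza or Kafka code: the class `ChangelogSketch`, its `Change` type, and the methods `restore` and `compact` are hypothetical names made up for illustration, and the changelog is assumed to be a plain in-memory list. The idea shown is that state can be rebuilt by replaying the changelog in order, and that deleting over-written keys leaves the rebuilt state unchanged.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not part of the Samza or Kafka APIs. */
class ChangelogSketch {

    /** One changelog entry: a key and the value written for it. */
    static final class Change {
        final String key;
        final int value;

        Change(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Rebuilds state by replaying the changelog from the beginning, in order. */
    static Map<String, Integer> restore(List<Change> changelog) {
        Map<String, Integer> state = new HashMap<>();
        for (Change change : changelog) {
            // A later write for the same key overwrites the earlier one.
            state.put(change.key, change.value);
        }
        return state;
    }

    /** Compacts the log: keeps only the last entry for each key, in log order. */
    static List<Change> compact(List<Change> changelog) {
        // First pass: remember the position of the last write for every key.
        Map<String, Integer> lastIndex = new HashMap<>();
        for (int i = 0; i < changelog.size(); i++) {
            lastIndex.put(changelog.get(i).key, i);
        }
        // Second pass: drop every entry that a later write has over-written.
        List<Change> compacted = new ArrayList<>();
        for (int i = 0; i < changelog.size(); i++) {
            if (lastIndex.get(changelog.get(i).key) == i) {
                compacted.add(changelog.get(i));
            }
        }
        return compacted;
    }
}
```

In this sketch, replaying the compacted log yields the same map as replaying the full log, which is why over-written keys can be deleted over time without losing the ability to restore state.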