SlideShare a Scribd company logo
1 of 39
Apache Samza:
Taking stream processing
to the next level
Martin Kleppmann
 Co-founded two startups, Rapportive ⇒ LinkedIn
 Committer on Avro & Samza ⇒ Apache
 Writing book on data-intensive apps ⇒ O’Reilly
 martinkl.com | @martinkl
Apache Kafka Apache Samza
Apache Kafka Apache Samza
Credit: Jason Walsh on Flickr
https://www.flickr.com/photos/22882695@N00/2477241427/
Credit: Lucas Richarz on Flickr
https://www.flickr.com/photos/22057420@N00/5690285549
Things we would like to do
(better)
Provide timely, relevant updates to your newsfeed
Update search results with new information as it
appears
“Real-time” analysis of logs and metrics
Tools?
Response latency
Kafka & Samza
Milliseconds to minutes
Loosely coupled
REST
Synchronous
Closely coupled
Hours to days
Loosely
Service 1
Kafka
events/messages
Analytics
Cache
maintenance
Notifications
subscribe subscribe subscribe
publish publish
Service 2
Publish / subscribe
 Event / message = “something happened”
– Tracking: User x clicked y at time z
– Data change: Key x, old value y, set to new value z
– Logging: Service x threw exception y in
request z
– Metrics: Machine x had free memory y at time
z
 Many independent consumers
Kafka at LinkedIn
 350+ Kafka brokers
 8,000+ topics
 140,000+ Partitions
 278 Billion messages/day
 49 TB/day in
 176 TB/day out
 Peak Load
– 4.4 Million messages per second
– 6 Gigabits/sec Inbound
– 21 Gigabits/sec Outbound
public interface StreamTask {
void process( IncomingMessageEnvelope
envelope,
MessageCollector collector,
TaskCoordinator coordinator);
}
Samza API: processing messages
getKey(), getMsg()
commit(), shutdown()
sendMsg(topic, key, val
Familiar ideas from
MR/Pig/Cascading/…
 Filter records matching condition
 Map record ⇒ func(record)
 Join two/more datasets by key
 Group records with the same value in field
 Aggregate records within the same group
 Pipe job 1’s output ⇒ job 2’s input
 MapReduce assumes fixed dataset.
Can we adapt this to unbounded streams?
Operations on streams
 Filter records matching condition✔ easy
 Map record ⇒ func(record) ✔ easy
 Join two/more datasets by key
…within time window, need buffer
 Group records with the same value in field
…within time window, need buffer
 Aggregate records within the same group
✔ ok… when do you emit result?
 Pipe job 1’s output ⇒ job 2’s input
✔ ok… but what about faults?
Stateful stream processing
(join, group, aggregate)
Joining streams requires state
 User goes to lunch ⇒ click long after impression
 Queue backlog ⇒ click before impression
 “Window join”
Join and aggregate
Click-through rate
Key-value
store
Ad impressionsAd clicks
Remote state or local state?
Samza job partition 0 Samza job partition 1
e.g. Cassandra, MongoDB,
…
100-500k msg/sec/node 100-500k msg/sec/node
1-5k queries/sec??
Remote state or local state?
Samza job partition 0 Samza job partition 1
Local
LevelDB/
RocksDB
Local
LevelDB/
RocksDB
Another example: Newsfeed &
following
 User 138 followed user 582
 User 463 followed user 536
 User 582 posted: “I’m at Berlin Buzzwords and it rocks”
 User 507 unfollowed user 115
 User 536 posted: “Nice weather today, going for a walk”
 User 981 followed user 575
 Expected output: “inbox” (newsfeed) for each user
Newsfeed & following
Fan out messages to
followers
Delivered messages
582 => [
138, 721,
…
]
Follow/unfollow
events
Posted messages
User 582 posted: “I’m at Berlin
Buzzwords and it rocks” User 138 followed user 582
Notify user 138: {User 582 posted:
“I’m at Berlin Buzzwords and it rocks”}Push notifications etc.
Local state:
Bring computation and data
together in one place
Fault tolerance
Kafka Kafka
YARN NodeManager YARN NodeManager
YARN
RM Samza Container
Samza Container
Samza Container
Samza Container
Machine 1 Machine 2
Task Task
Task Task
Task Task
Task Task
BOOM
Kafka Kafka
YARN NodeManager YARN NodeManager
YARN
RM Samza Container
Samza Container
Samza Container
Samza Container
Machine 1 Machine 2
Task Task
Task Task
Task Task
Task Task
BOOM
YARN NodeManager
Samza Container
Samza Container
Kafka
YARN NodeManager
Samza Container
Samza Container
Machine 2
Task Task
Task Task
Kafka
Machine 3
Task Task
Task Task

Fault-tolerant local state
Samza job partition 0 Samza job partition 1
Local
LevelDB/
RocksDB
Local
LevelDB/
RocksDBDurable changelog
replicate writes
YARN NodeManager
Samza Container
Samza Container
Kafka
YARN NodeManager
Samza Container
Samza Container
Machine 2
Task Task
Task Task
Machine 3
Task Task
Task Task

Kafka
Samza’s fault-tolerant local state
 Embedded key-value: very fast
 Machine dies ⇒ local key-value store is lost
 Solution: replicate all writes to Kafka!
 Machine dies ⇒ restart on another machine
 Restore key-value store from changelog
 Changelog compaction in the background (Kafka 0.8.1)
When things go slow…
Owned by
Team X
Team Y Team Z
Cascades of jobs
Job 1
Stream BStream A
Job 2 Job 3
Job 4 Job 5
Consumer goes slow
Backpressur
e
Queue upDrop data
Other jobs grind
to a halt 
Run out of
memory 
Spill to disk
Oh wait… Kafka does this anyway!
No thanks 
Job 1
Stream BStream A
Job 2 Job 3
Job 4 Job 5
Job 1 output
Job 2 output Job 3 output
Samza always writes
job output to Kafka
MapReduce always writes
job output to HDFS
Every job output is a named stream
 Open: Anyone can consume it
 Robust: If a consumer goes slow, nobody else is
affected
 Durable: Tolerates machine failure
 Debuggable: Just look at it
 Scalable: Clean interface between teams
 Clean: loose coupling between jobs
Problem Solution
Need to buffer job output
for downstream consumers
Write it to Kafka!
Need to make local state
store fault-tolerant
Write it to Kafka!
Need to checkpoint job
state for recovery
Write it to Kafka!
Recap
Apache Kafka Apache Samza
kafka.apache.org samza.incubator.apache.org
Hello Samza (try Samza in 5 mins)
Thank you!
Samza:
• Getting started: samza.incubator.apache.org
• Underlying thinking: bit.ly/jay_on_logs
• Start contributing: bit.ly/samza_newbie_issues
Me:
• Twitter: @martinkl
• Blog: martinkl.com

More Related Content

What's hot

ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...Paul Brebner
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
 
Streaming and Messaging
Streaming and MessagingStreaming and Messaging
Streaming and MessagingXin Wang
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016Monal Daxini
 
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Monal Daxini
 
Netflix at-disney-09-26-2014
Netflix at-disney-09-26-2014Netflix at-disney-09-26-2014
Netflix at-disney-09-26-2014Monal Daxini
 
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafka
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache KafkaKafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafka
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafkaconfluent
 
Exactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsExactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsGuozhang Wang
 
Harvesting the Power of Samza in LinkedIn's Feed
Harvesting the Power of Samza in LinkedIn's FeedHarvesting the Power of Samza in LinkedIn's Feed
Harvesting the Power of Samza in LinkedIn's FeedMohamed El-Geish
 
How to manage large amounts of data with akka streams
How to manage large amounts of data with akka streamsHow to manage large amounts of data with akka streams
How to manage large amounts of data with akka streamsIgor Mielientiev
 
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...confluent
 
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, ConfluentTemporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, ConfluentHostedbyConfluent
 
Ingesting Healthcare Data, Micah Whitacre
Ingesting Healthcare Data, Micah WhitacreIngesting Healthcare Data, Micah Whitacre
Ingesting Healthcare Data, Micah Whitacreconfluent
 
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019confluent
 
Capture the Streams of Database Changes
Capture the Streams of Database ChangesCapture the Streams of Database Changes
Capture the Streams of Database Changesconfluent
 
Unbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniUnbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniMonal Daxini
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Monal Daxini
 
TenMax Data Pipeline Experience Sharing
TenMax Data Pipeline Experience SharingTenMax Data Pipeline Experience Sharing
TenMax Data Pipeline Experience SharingChen-en Lu
 
Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka...
Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka...Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka...
Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka...confluent
 
Deploying Kafka at Dropbox, Mark Smith, Sean Fellows
Deploying Kafka at Dropbox, Mark Smith, Sean FellowsDeploying Kafka at Dropbox, Mark Smith, Sean Fellows
Deploying Kafka at Dropbox, Mark Smith, Sean Fellowsconfluent
 

What's hot (20)

ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
Streaming and Messaging
Streaming and MessagingStreaming and Messaging
Streaming and Messaging
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
 
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
 
Netflix at-disney-09-26-2014
Netflix at-disney-09-26-2014Netflix at-disney-09-26-2014
Netflix at-disney-09-26-2014
 
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafka
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache KafkaKafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafka
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafka
 
Exactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsExactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka Streams
 
Harvesting the Power of Samza in LinkedIn's Feed
Harvesting the Power of Samza in LinkedIn's FeedHarvesting the Power of Samza in LinkedIn's Feed
Harvesting the Power of Samza in LinkedIn's Feed
 
How to manage large amounts of data with akka streams
How to manage large amounts of data with akka streamsHow to manage large amounts of data with akka streams
How to manage large amounts of data with akka streams
 
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
 
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, ConfluentTemporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
 
Ingesting Healthcare Data, Micah Whitacre
Ingesting Healthcare Data, Micah WhitacreIngesting Healthcare Data, Micah Whitacre
Ingesting Healthcare Data, Micah Whitacre
 
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019
 
Capture the Streams of Database Changes
Capture the Streams of Database ChangesCapture the Streams of Database Changes
Capture the Streams of Database Changes
 
Unbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniUnbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxini
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
 
TenMax Data Pipeline Experience Sharing
TenMax Data Pipeline Experience SharingTenMax Data Pipeline Experience Sharing
TenMax Data Pipeline Experience Sharing
 
Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka...
Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka...Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka...
Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Ka...
 
Deploying Kafka at Dropbox, Mark Smith, Sean Fellows
Deploying Kafka at Dropbox, Mark Smith, Sean FellowsDeploying Kafka at Dropbox, Mark Smith, Sean Fellows
Deploying Kafka at Dropbox, Mark Smith, Sean Fellows
 

Viewers also liked

Building Real-time Data Products at LinkedIn with Apache Samza
Building Real-time Data Products at LinkedIn with Apache SamzaBuilding Real-time Data Products at LinkedIn with Apache Samza
Building Real-time Data Products at LinkedIn with Apache SamzaTrieu Nguyen
 
Samza: Real-time Stream Processing at LinkedIn
Samza: Real-time Stream Processing at LinkedInSamza: Real-time Stream Processing at LinkedIn
Samza: Real-time Stream Processing at LinkedInC4Media
 
Event Stream Processing with Kafka and Samza
Event Stream Processing with Kafka and SamzaEvent Stream Processing with Kafka and Samza
Event Stream Processing with Kafka and SamzaZach Cox
 
Apache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedInApache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedInChris Riccomini
 
London hug-samza
London hug-samzaLondon hug-samza
London hug-samzahuguk
 
EventStore as a message broker
EventStore as a message brokerEventStore as a message broker
EventStore as a message brokerSzymonPobiega
 
Introduction Erlang - altnet fr Juin 2013
Introduction Erlang - altnet fr Juin 2013Introduction Erlang - altnet fr Juin 2013
Introduction Erlang - altnet fr Juin 2013Yann Schwartz
 
Beyond F5 - windbg et .Net
Beyond F5 - windbg et .NetBeyond F5 - windbg et .Net
Beyond F5 - windbg et .NetYann Schwartz
 
DDD-Enabling Architectures with EventStore
DDD-Enabling Architectures with EventStoreDDD-Enabling Architectures with EventStore
DDD-Enabling Architectures with EventStoreSzymonPobiega
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerMichael Spector
 
A Random Walk Through Search Research
A Random Walk Through Search ResearchA Random Walk Through Search Research
A Random Walk Through Search ResearchNick Watkins
 
Patterns and practices for real-world event-driven microservices
Patterns and practices for real-world event-driven microservicesPatterns and practices for real-world event-driven microservices
Patterns and practices for real-world event-driven microservicesRachel Reese
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataRostislav Pashuto
 
A Crush on Design Thinking
A Crush on Design ThinkingA Crush on Design Thinking
A Crush on Design ThinkingMatteo Burgassi
 
Enterprise architectsview 2015-apr
Enterprise architectsview 2015-aprEnterprise architectsview 2015-apr
Enterprise architectsview 2015-aprMongoDB
 
How to use graphs to identify credit card thieves?
How to use graphs to identify credit card thieves?How to use graphs to identify credit card thieves?
How to use graphs to identify credit card thieves?Linkurious
 
GraphConnect Europe 2016 - Creating an Innovative Task Management Engine - Mi...
GraphConnect Europe 2016 - Creating an Innovative Task Management Engine - Mi...GraphConnect Europe 2016 - Creating an Innovative Task Management Engine - Mi...
GraphConnect Europe 2016 - Creating an Innovative Task Management Engine - Mi...Neo4j
 
A Related Matter: Optimizing your webapp by using django-debug-toolbar, selec...
A Related Matter: Optimizing your webapp by using django-debug-toolbar, selec...A Related Matter: Optimizing your webapp by using django-debug-toolbar, selec...
A Related Matter: Optimizing your webapp by using django-debug-toolbar, selec...Christopher Adams
 

Viewers also liked (20)

Building Real-time Data Products at LinkedIn with Apache Samza
Building Real-time Data Products at LinkedIn with Apache SamzaBuilding Real-time Data Products at LinkedIn with Apache Samza
Building Real-time Data Products at LinkedIn with Apache Samza
 
Samza: Real-time Stream Processing at LinkedIn
Samza: Real-time Stream Processing at LinkedInSamza: Real-time Stream Processing at LinkedIn
Samza: Real-time Stream Processing at LinkedIn
 
Event Stream Processing with Kafka and Samza
Event Stream Processing with Kafka and SamzaEvent Stream Processing with Kafka and Samza
Event Stream Processing with Kafka and Samza
 
Apache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedInApache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedIn
 
Samza la hug
Samza la hugSamza la hug
Samza la hug
 
London hug-samza
London hug-samzaLondon hug-samza
London hug-samza
 
EventStore as a message broker
EventStore as a message brokerEventStore as a message broker
EventStore as a message broker
 
Introduction Erlang - altnet fr Juin 2013
Introduction Erlang - altnet fr Juin 2013Introduction Erlang - altnet fr Juin 2013
Introduction Erlang - altnet fr Juin 2013
 
From Ubu to kafka
From Ubu to kafkaFrom Ubu to kafka
From Ubu to kafka
 
Beyond F5 - windbg et .Net
Beyond F5 - windbg et .NetBeyond F5 - windbg et .Net
Beyond F5 - windbg et .Net
 
DDD-Enabling Architectures with EventStore
DDD-Enabling Architectures with EventStoreDDD-Enabling Architectures with EventStore
DDD-Enabling Architectures with EventStore
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
A Random Walk Through Search Research
A Random Walk Through Search ResearchA Random Walk Through Search Research
A Random Walk Through Search Research
 
Patterns and practices for real-world event-driven microservices
Patterns and practices for real-world event-driven microservicesPatterns and practices for real-world event-driven microservices
Patterns and practices for real-world event-driven microservices
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of data
 
A Crush on Design Thinking
A Crush on Design ThinkingA Crush on Design Thinking
A Crush on Design Thinking
 
Enterprise architectsview 2015-apr
Enterprise architectsview 2015-aprEnterprise architectsview 2015-apr
Enterprise architectsview 2015-apr
 
How to use graphs to identify credit card thieves?
How to use graphs to identify credit card thieves?How to use graphs to identify credit card thieves?
How to use graphs to identify credit card thieves?
 
GraphConnect Europe 2016 - Creating an Innovative Task Management Engine - Mi...
GraphConnect Europe 2016 - Creating an Innovative Task Management Engine - Mi...GraphConnect Europe 2016 - Creating an Innovative Task Management Engine - Mi...
GraphConnect Europe 2016 - Creating an Innovative Task Management Engine - Mi...
 
A Related Matter: Optimizing your webapp by using django-debug-toolbar, selec...
A Related Matter: Optimizing your webapp by using django-debug-toolbar, selec...A Related Matter: Optimizing your webapp by using django-debug-toolbar, selec...
A Related Matter: Optimizing your webapp by using django-debug-toolbar, selec...
 

Similar to Samza at LinkedIn: Taking Stream Processing to the Next Level

Apache Samza Past, Present and Future
Apache Samza  Past, Present and FutureApache Samza  Past, Present and Future
Apache Samza Past, Present and FutureKartik Paramasivam
 
Essential Ingredients of Realtime Stream Processing @ Scale
Essential Ingredients of Realtime Stream Processing @ ScaleEssential Ingredients of Realtime Stream Processing @ Scale
Essential Ingredients of Realtime Stream Processing @ ScaleKartik Paramasivam
 
Migrating to Multi Cluster Managed Kafka - ApacheKafkaIL
Migrating to Multi Cluster Managed Kafka - ApacheKafkaILMigrating to Multi Cluster Managed Kafka - ApacheKafkaIL
Migrating to Multi Cluster Managed Kafka - ApacheKafkaILNatan Silnitsky
 
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...Flink Forward
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineMonal Daxini
 
Raleigh DevDay 2017: Real time data processing using AWS Lambda
Raleigh DevDay 2017: Real time data processing using AWS LambdaRaleigh DevDay 2017: Real time data processing using AWS Lambda
Raleigh DevDay 2017: Real time data processing using AWS LambdaAmazon Web Services
 
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiSMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiCodemotion Dubai
 
ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?Jagadish Venkatraman
 
AWS re:Invent 2016: Cross-Region Replication with Amazon DynamoDB Streams (DA...
AWS re:Invent 2016: Cross-Region Replication with Amazon DynamoDB Streams (DA...AWS re:Invent 2016: Cross-Region Replication with Amazon DynamoDB Streams (DA...
AWS re:Invent 2016: Cross-Region Replication with Amazon DynamoDB Streams (DA...Amazon Web Services
 
Samza portable runner for beam
Samza portable runner for beamSamza portable runner for beam
Samza portable runner for beamHai Lu
 
Lambda-less stream processing - linked in
Lambda-less stream processing - linked inLambda-less stream processing - linked in
Lambda-less stream processing - linked inYi Pan
 
Streaming ETL with Apache Kafka and KSQL
Streaming ETL with Apache Kafka and KSQLStreaming ETL with Apache Kafka and KSQL
Streaming ETL with Apache Kafka and KSQLNick Dearden
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...Athens Big Data
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
 
Flink at netflix paypal speaker series
Flink at netflix   paypal speaker seriesFlink at netflix   paypal speaker series
Flink at netflix paypal speaker seriesMonal Daxini
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkRobert Metzger
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkRobert Metzger
 

Similar to Samza at LinkedIn: Taking Stream Processing to the Next Level (20)

Apache Samza Past, Present and Future
Apache Samza  Past, Present and FutureApache Samza  Past, Present and Future
Apache Samza Past, Present and Future
 
Essential Ingredients of Realtime Stream Processing @ Scale
Essential Ingredients of Realtime Stream Processing @ ScaleEssential Ingredients of Realtime Stream Processing @ Scale
Essential Ingredients of Realtime Stream Processing @ Scale
 
Migrating to Multi Cluster Managed Kafka - ApacheKafkaIL
Migrating to Multi Cluster Managed Kafka - ApacheKafkaILMigrating to Multi Cluster Managed Kafka - ApacheKafkaIL
Migrating to Multi Cluster Managed Kafka - ApacheKafkaIL
 
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipeline
 
Raleigh DevDay 2017: Real time data processing using AWS Lambda
Raleigh DevDay 2017: Real time data processing using AWS LambdaRaleigh DevDay 2017: Real time data processing using AWS Lambda
Raleigh DevDay 2017: Real time data processing using AWS Lambda
 
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiSMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
 
ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?
 
AWS re:Invent 2016: Cross-Region Replication with Amazon DynamoDB Streams (DA...
AWS re:Invent 2016: Cross-Region Replication with Amazon DynamoDB Streams (DA...AWS re:Invent 2016: Cross-Region Replication with Amazon DynamoDB Streams (DA...
AWS re:Invent 2016: Cross-Region Replication with Amazon DynamoDB Streams (DA...
 
Samza portable runner for beam
Samza portable runner for beamSamza portable runner for beam
Samza portable runner for beam
 
Lambda-less stream processing - linked in
Lambda-less stream processing - linked inLambda-less stream processing - linked in
Lambda-less stream processing - linked in
 
Lambda-less Stream Processing @Scale in LinkedIn
Lambda-less Stream Processing @Scale in LinkedIn Lambda-less Stream Processing @Scale in LinkedIn
Lambda-less Stream Processing @Scale in LinkedIn
 
Streaming ETL with Apache Kafka and KSQL
Streaming ETL with Apache Kafka and KSQLStreaming ETL with Apache Kafka and KSQL
Streaming ETL with Apache Kafka and KSQL
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
 
Flink at netflix paypal speaker series
Flink at netflix   paypal speaker seriesFlink at netflix   paypal speaker series
Flink at netflix paypal speaker series
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache Flink
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache Flink
 

Recently uploaded

Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Erbil Polytechnic University
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Industrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptIndustrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptNarmatha D
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - GuideGOPINATHS437943
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Coursebim.edu.pl
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating SystemRashmi Bhat
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 
Crushers to screens in aggregate production
Crushers to screens in aggregate productionCrushers to screens in aggregate production
Crushers to screens in aggregate productionChinnuNinan
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfRajuKanojiya4
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...Erbil Polytechnic University
 
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectDM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectssuserb6619e
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxVelmuruganTECE
 
Configuration of IoT devices - Systems managament
Configuration of IoT devices - Systems managamentConfiguration of IoT devices - Systems managament
Configuration of IoT devices - Systems managamentBharaniDharan195623
 
Autonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.pptAutonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.pptbibisarnayak0
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingBootNeck1
 

Recently uploaded (20)

Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Industrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptIndustrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.ppt
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Course
 
Designing pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptxDesigning pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptx
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the welding
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 
Crushers to screens in aggregate production
Crushers to screens in aggregate productionCrushers to screens in aggregate production
Crushers to screens in aggregate production
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdf
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectDM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptx
 
Configuration of IoT devices - Systems managament
Configuration of IoT devices - Systems managamentConfiguration of IoT devices - Systems managament
Configuration of IoT devices - Systems managament
 
Autonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.pptAutonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.ppt
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event Scheduling
 

Samza at LinkedIn: Taking Stream Processing to the Next Level

  • 1. Apache Samza: Taking stream processing to the next level
  • 2. Martin Kleppmann  Co-founded two startups, Rapportive ⇒ LinkedIn  Committer on Avro & Samza ⇒ Apache  Writing book on data-intensive apps ⇒ O’Reilly  martinkl.com | @martinkl
  • 4. Apache Kafka Apache Samza Credit: Jason Walsh on Flickr https://www.flickr.com/photos/22882695@N00/2477241427/ Credit: Lucas Richarz on Flickr https://www.flickr.com/photos/22057420@N00/5690285549
  • 5. Things we would like to do (better)
  • 6. Provide timely, relevant updates to your newsfeed
  • 7. Update search results with new information as it appears
  • 8. “Real-time” analysis of logs and metrics
  • 9. Tools? Response latency Kafka & Samza Milliseconds to minutes Loosely coupled REST Synchronous Closely coupled Hours to days Loosely
  • 11. Publish / subscribe  Event / message = “something happened” – Tracking: User x clicked y at time z – Data change: Key x, old value y, set to new value z – Logging: Service x threw exception y in request z – Metrics: Machine x had free memory y at time z  Many independent consumers
  • 12. Kafka at LinkedIn  350+ Kafka brokers  8,000+ topics  140,000+ Partitions  278 Billion messages/day  49 TB/day in  176 TB/day out  Peak Load – 4.4 Million messages per second – 6 Gigabits/sec Inbound – 21 Gigabits/sec Outbound
  • 13. public interface StreamTask { void process( IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator); } Samza API: processing messages getKey(), getMsg() commit(), shutdown() sendMsg(topic, key, val
  • 14. Familiar ideas from MR/Pig/Cascading/…  Filter records matching condition  Map record ⇒ func(record)  Join two/more datasets by key  Group records with the same value in field  Aggregate records within the same group  Pipe job 1’s output ⇒ job 2’s input  MapReduce assumes fixed dataset. Can we adapt this to unbounded streams?
  • 15. Operations on streams  Filter records matching condition✔ easy  Map record ⇒ func(record) ✔ easy  Join two/more datasets by key …within time window, need buffer  Group records with the same value in field …within time window, need buffer  Aggregate records within the same group ✔ ok… when do you emit result?  Pipe job 1’s output ⇒ job 2’s input ✔ ok… but what about faults?
  • 17. Joining streams requires state  User goes to lunch ⇒ click long after impression  Queue backlog ⇒ click before impression  “Window join” Join and aggregate Click-through rate Key-value store Ad impressionsAd clicks
  • 18. Remote state or local state? Samza job partition 0 Samza job partition 1 e.g. Cassandra, MongoDB, … 100-500k msg/sec/node 100-500k msg/sec/node 1-5k queries/sec??
  • 19. Remote state or local state? Samza job partition 0 Samza job partition 1 Local LevelDB/ RocksDB Local LevelDB/ RocksDB
  • 20. Another example: Newsfeed & following  User 138 followed user 582  User 463 followed user 536  User 582 posted: “I’m at Berlin Buzzwords and it rocks”  User 507 unfollowed user 115  User 536 posted: “Nice weather today, going for a walk”  User 981 followed user 575  Expected output: “inbox” (newsfeed) for each user
  • 21. Newsfeed & following Fan out messages to followers Delivered messages 582 => [ 138, 721, … ] Follow/unfollow events Posted messages User 582 posted: “I’m at Berlin Buzzwords and it rocks” User 138 followed user 582 Notify user 138: {User 582 posted: “I’m at Berlin Buzzwords and it rocks”}Push notifications etc.
  • 22. Local state: Bring computation and data together in one place
  • 24. Kafka Kafka YARN NodeManager YARN NodeManager YARN RM Samza Container Samza Container Samza Container Samza Container Machine 1 Machine 2 Task Task Task Task Task Task Task Task BOOM
  • 25. Kafka Kafka YARN NodeManager YARN NodeManager YARN RM Samza Container Samza Container Samza Container Samza Container Machine 1 Machine 2 Task Task Task Task Task Task Task Task BOOM
  • 26. YARN NodeManager Samza Container Samza Container Kafka YARN NodeManager Samza Container Samza Container Machine 2 Task Task Task Task Kafka Machine 3 Task Task Task Task 
  • 27. Fault-tolerant local state Samza job partition 0 Samza job partition 1 Local LevelDB/ RocksDB Local LevelDB/ RocksDBDurable changelog replicate writes
  • 28. YARN NodeManager Samza Container Samza Container Kafka YARN NodeManager Samza Container Samza Container Machine 2 Task Task Task Task Machine 3 Task Task Task Task  Kafka
  • 29. Samza’s fault-tolerant local state  Embedded key-value: very fast  Machine dies ⇒ local key-value store is lost  Solution: replicate all writes to Kafka!  Machine dies ⇒ restart on another machine  Restore key-value store from changelog  Changelog compaction in the background (Kafka 0.8.1)
  • 30. When things go slow…
  • 31. Owned by Team X Team Y Team Z Cascades of jobs Job 1 Stream BStream A Job 2 Job 3 Job 4 Job 5
  • 32. Consumer goes slow Backpressur e Queue upDrop data Other jobs grind to a halt  Run out of memory  Spill to disk Oh wait… Kafka does this anyway! No thanks 
  • 33. Job 1 Stream BStream A Job 2 Job 3 Job 4 Job 5 Job 1 output Job 2 output Job 3 output
  • 34. Samza always writes job output to Kafka MapReduce always writes job output to HDFS
  • 35. Every job output is a named stream  Open: Anyone can consume it  Robust: If a consumer goes slow, nobody else is affected  Durable: Tolerates machine failure  Debuggable: Just look at it  Scalable: Clean interface between teams  Clean: loose coupling between jobs
  • 36. Problem Solution Need to buffer job output for downstream consumers Write it to Kafka! Need to make local state store fault-tolerant Write it to Kafka! Need to checkpoint job state for recovery Write it to Kafka! Recap
  • 37. Apache Kafka Apache Samza kafka.apache.org samza.incubator.apache.org
  • 38. Hello Samza (try Samza in 5 mins)
  • 39. Thank you! Samza: • Getting started: samza.incubator.apache.org • Underlying thinking: bit.ly/jay_on_logs • Start contributing: bit.ly/samza_newbie_issues Me: • Twitter: @martinkl • Blog: martinkl.com

Editor's Notes

  1. Also provide interfaces for windowing tasks that are called specific amounts of time Also provide methods for initialization, configuration, etc. Checkpointing is handled behind the scenes