Apache Samza
Reliable Stream Processing Atop
Apache Kafka and Hadoop YARN

Jakob Homan

London HUG
Who I am

• Samza for five months
• Before that Hadoop, Hive, Giraph
• Say hi: @blueboxtraveler
Things we would like to do
(better)
Provide timely, relevant updates to your newsfeed
Update search results with new information as it appears
Sculpt metrics and logs into useful shapes
Tools?

RPC

Samza
Response latency
Milliseconds to minutes

Synchronous

Later. Possibly much later.
Frame(work) of reference
Storage layer

Execution
engine

Classic
Hadoop

HDFS

Map-Reduce

Samza

Kafka

YARN

API
map(k,...
Storage layer: Kafka
Apache Kafka
• Persistent,
reliable,
distributed
message queue

Shiny new logo!
At LinkedIn

10+ billion
writes per day

172k
messages per second
(average)

55+ billion
messages per day
to real-time con...
Quick aside…

Kafka: First among (pluggable) equals
LinkedIn: Espresso and Databus

Coming soon? HDFS, ActiveMQ, Amazon SQ...
Kafka in four bullet points
• Producers send messages to brokers
• Messages are key, value pairs
• Brokers store messages ...
A Kafka Topic

“The ref’s blind!”

534

“Car nicked!”

234

“Very sleepy”

755

534

Topic: StatusUpdateEvent
“Nicked a ca...
For our purposes, hash partitioned on the key!

Key

Message
contents

Message
content

Key

Message
contents

Key

Messag...
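
Hash partitioning on the key, as on the slide above, can be sketched in a few lines. This is a minimal illustration and not Kafka's actual partitioner: the class name `KeyPartitioner`, the method `partitionFor`, and the choice of hash function are all assumptions made for the example. The point it shows is that the same key always maps to the same partition.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class KeyPartitioner {
    // Hypothetical helper: maps a message key to one of numPartitions partitions.
    // Math.floorMod keeps the result non-negative even when the hash is negative.
    public static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        // User IDs taken from the StatusUpdateEvent example; 534 appears twice
        // and lands in the same partition both times.
        for (String userId : new String[] {"534", "234", "755", "534"}) {
            System.out.println("key " + userId + " -> partition "
                    + partitionFor(userId, numPartitions));
        }
    }
}
```

Because the mapping depends only on the key and the partition count, every status update from one user is seen by the same task instance.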
A Samza job

• StatusUpdateEvent
• NewConnectionEvent
• LikeUpdateEvent

MyStreamTask
implements StreamTask
{ …………. }

Inp...
Execution engine: YARN
What we use YARN for
• Distributing our tasks across multiple
machines
• Letting us know when one has died
• Distributing ...
YARN: Execution and reliability
MyStreamTask:process()
Samza TaskRunner: Partition 0

Samza App Master

MyStreamTask:proce...
Co-partitioning of topics
MyStreamTask:process()
StatusUpdateEvent, Partition 0

Samza TaskRunner: Partition 0

NewsUpdate...
API: process()
getKey(), getMsg()

public interface StreamTask {
void process(IncomingMessageEnvelope envelope,
MessageCollector collecto...
Awesome feature: State
MyStreamTask:process()
Samza TaskRunner: Partition 0
Store state

• Generic data store interface
• ...
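
The slide says only that state is restored by Samza upon task crash; it does not say how. One plausible way to get that behaviour is to record every write in a durable log and replay the log when a replacement task starts. The sketch below illustrates that idea under that assumption. `ChangelogStoreSketch`, `Changelog`, and `KeyValueStore` are hypothetical names for this example, and the in-memory list stands in for a log that would have to live outside the task's process.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChangelogStoreSketch {
    // Hypothetical stand-in for a durable, append-only log of writes.
    public static final class Changelog {
        private final List<String[]> entries = new ArrayList<>();

        public void append(String key, String value) {
            entries.add(new String[] {key, value});
        }

        public List<String[]> entries() {
            return entries;
        }
    }

    // Key-value store that rebuilds its contents from the changelog on creation.
    public static final class KeyValueStore {
        private final Map<String, String> data = new HashMap<>();
        private final Changelog changelog;

        public KeyValueStore(Changelog changelog) {
            this.changelog = changelog;
            // Replay earlier writes in order; later entries overwrite earlier ones.
            for (String[] entry : changelog.entries()) {
                data.put(entry[0], entry[1]);
            }
        }

        public void put(String key, String value) {
            changelog.append(key, value); // log first, then apply locally
            data.put(key, value);
        }

        public String get(String key) {
            return data.get(key);
        }
    }

    public static void main(String[] args) {
        Changelog log = new Changelog();
        KeyValueStore beforeCrash = new KeyValueStore(log);
        beforeCrash.put("534", "234,755");

        // Simulate a crash: the old store is discarded, a new one replays the log.
        KeyValueStore restored = new KeyValueStore(log);
        System.out.println(restored.get("534")); // prints 234,755
    }
}
```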
(Pseudo)code snippet: Newsfeed
• Consume StatusUpdateEvent
– Send those updates to all your connections via
the NewsUpdate...
public class NewsFeed implements StreamTask {
void process(envelope, collector, coordinator) {
msg = envelope.getMsg()
userId =...
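
The newsfeed pseudocode can be turned into something runnable with a few stand-in types. This is a sketch, not code against the real Samza API: `Envelope` and `Collector` are simplified, hypothetical replacements for the envelope and collector arguments, the coordinator argument is omitted, and an in-memory map stands in for the key-value store. The message type strings are also assumptions. It needs Java 9 or later for `Map.of` and `List.of`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NewsFeedSketch {
    // Hypothetical stand-in for the incoming message envelope.
    public static final class Envelope {
        private final Map<String, String> msg;

        public Envelope(Map<String, String> msg) {
            this.msg = msg;
        }

        public Map<String, String> getMsg() {
            return msg;
        }
    }

    // Hypothetical stand-in collector that records what would be sent.
    public static final class Collector {
        public final List<String> sent = new ArrayList<>();

        public void send(String topic, String key, String value) {
            sent.add(topic + ":" + key + ":" + value);
        }
    }

    // Stand-in for the task's key-value state: user ID -> connections.
    private final Map<String, List<String>> kvStore = new HashMap<>();

    public void process(Envelope envelope, Collector collector) {
        Map<String, String> msg = envelope.getMsg();
        String userId = msg.get("userID");
        if ("STATUS_UPDATE".equals(msg.get("type"))) {
            // Fan the new status out to every known connection of this user.
            for (String conn : kvStore.getOrDefault(userId, List.of())) {
                collector.send("NewsUpdatePost", conn, msg.get("newStatus"));
            }
        } else {
            // Otherwise treat the message as a new connection and remember it.
            kvStore.computeIfAbsent(userId, k -> new ArrayList<>())
                   .add(msg.get("newConnection"));
        }
    }

    public static void main(String[] args) {
        NewsFeedSketch task = new NewsFeedSketch();
        Collector out = new Collector();
        task.process(new Envelope(Map.of(
                "type", "NEW_CONNECTION", "userID", "534", "newConnection", "234")), out);
        task.process(new Envelope(Map.of(
                "type", "STATUS_UPDATE", "userID", "534", "newStatus", "Nicked a car!")), out);
        System.out.println(out.sent); // prints [NewsUpdatePost:234:Nicked a car!]
    }
}
```

Both event types must reach the same task instance for a given user, which is why the two input topics need to be partitioned the same way on the user ID.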
Current status
Hello, Samza!
Up and running in 3 minutes
Consume Wikipedia edits live

Generate stats on those edits
Cool, eh? bit.ly/hel...
samza.incubator.apache.org

bit.ly/samza_newbie_issues
Cheers!


Quick start: bit.ly/hello-samza
Project homepage: samza.incubator.apache.org
Newbie issues: bit.ly/sam...
London hug-samza


  1. Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN. Jakob Homan, London HUG
  2. Who I am • Samza for five months • Before that Hadoop, Hive, Giraph • Say hi: @blueboxtraveler
  3. Things we would like to do (better)
  4. Provide timely, relevant updates to your newsfeed
  5. Update search results with new information as it appears
  6. Sculpt metrics and logs into useful shapes
  7. Tools? RPC Samza Response latency Milliseconds to minutes Synchronous Later. Possibly much later.
  8. Frame(work) of reference • Classic Hadoop: storage layer HDFS; execution engine Map-Reduce; API map(k, v) => (k,v) and reduce(k, list(v)) => (k,v) • Samza: storage layer Kafka; execution engine YARN; API process(msg(k,v)) => msg(k,v)
  9. Storage layer: Kafka
  10. Apache Kafka • Persistent, reliable, distributed message queue • Shiny new logo!
  11. At LinkedIn • 10+ billion writes per day • 172k messages per second (average) • 55+ billion messages per day to real-time consumers
  12. Quick aside… Kafka: First among (pluggable) equals • LinkedIn: Espresso and Databus • Coming soon? HDFS, ActiveMQ, Amazon SQS
  13. Kafka in four bullet points • Producers send messages to brokers • Messages are key, value pairs • Brokers store messages in topics for consumers • Consumers pull messages from brokers
  14. A Kafka Topic • Topic: StatusUpdateEvent • Key: User ID of user who updated the status • Value: Timestamp, new status, geolocation, etc. • Example messages (key, status): 534 “The ref’s blind!”; 234 “Car nicked!”; 755 “Very sleepy”; 534 “Nicked a car!”
  15. Kafka topics are partitioned. For our purposes, hash partitioned on the key! [Diagram: keyed messages (key, message contents) spread across Partition 0, Partition 1, and Partition 2]
  16. A Samza job • Input topics: StatusUpdateEvent, NewConnectionEvent, LikeUpdateEvent • Some code: MyStreamTask implements StreamTask { …………. } • Output topics: NewsUpdatePost, UpdatesPerHourMetric
  17. Execution engine: YARN
  18. What we use YARN for • Distributing our tasks across multiple machines • Letting us know when one has died • Distributing a replacement • Isolating our tasks from each other
  19. YARN: Execution and reliability MyStreamTask:process() Samza TaskRunner: Partition 0 Samza App Master MyStreamTask:process() Samza TaskRunner: Partition 1 Node Manager 1 Node Manager 2 Kafka Broker Kafka Broker Machine 1 Machine 1
  20. Co-partitioning of topics MyStreamTask:process() StatusUpdateEvent, Partition 0 Samza TaskRunner: Partition 0 NewsUpdatePost NewConnectionEvent, Partition 0 An instance of StreamTask is responsible for a specific partition
  21. API: process()
  22. public interface StreamTask { void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) } Callouts on the slide: envelope: getKey(), getMsg() • collector: sendMsg(topic, key, value) • coordinator: commit(), shutdown()
  23. Awesome feature: State MyStreamTask:process() Samza TaskRunner: Partition 0 Store state • Generic data store interface • Key-value out-of-box – More soon? Bloom filter, Lucene, etc. • Restored by Samza upon task crash
  24. (Pseudo)code snippet: Newsfeed • Consume StatusUpdateEvent – Send those updates to all your connections via the NewsUpdatePost topic • Consume NewConnectionEvent – Maintain state of connections to know who to send to
  25. NewsFeed task (pseudocode):

          public class NewsFeed implements StreamTask {
            void process(envelope, collector, coordinator) {
              msg = envelope.getMsg()
              userId = msg.get("userID")
              if (msg.get("type") == STATUS_UPDATE) {
                // Fan the status out to every stored connection of this user
                foreach (conn : kvStore.get(userId)) {
                  collector.send("NewsUpdatePost", new Msg(conn, msg.get("newStatus")))
                }
              } else {
                // New connection: append it to this user's stored connections
                newConn = msg.get("newConnection")
                connections = kvStore.get(userId)
                kvStore.put(userId, connections ++ newConn)
              }
            }
          }
  26. Current status
  27. Hello, Samza! • Up and running in 3 minutes • Consume Wikipedia edits live • Generate stats on those edits • Cool, eh? bit.ly/hello-samza
  28. samza.incubator.apache.org • bit.ly/samza_newbie_issues
  29. Cheers! • Quick start: bit.ly/hello-samza • Project homepage: samza.incubator.apache.org • Newbie issues: bit.ly/samza_newbie_issues • Detailed Samza and YARN talk: bit.ly/samza_and_yarn • Twitter: @samzastream
