Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Overview of Apache Samza presented to the London HUG, October 22, 2013

  • RPC = lots of questions, but very quick and specific. Hadoop = fewer questions, but they can take a long time to ponder.
  • "Classic" Hadoop, because modern Hadoop also uses YARN and Tez. Samza leverages these existing technologies to build its own framework.
  • Very much a production system, critical to LinkedIn
  • Log or topic: same term. At-least-once semantics. Messages are kept around on the order of days.
  • Analogous to Map-Reduce. Input directories =
  • Pretty standard use of YARN. It came along at exactly the right time for Samza; nice not to have had to write something ourselves.
  • Gives us distribution, task restart
  • Guarantee that messages partitioned on the same key will be handled by the same task. In the same way that MapReduce allows you to group on keys, co-partitioning of the tasks on the keys allows you to group on the message keys. Very useful feature.
  • Also provides interfaces for windowing tasks, which are called after specific amounts of time or numbers of messages, and methods for initialization, configuration, etc. Checkpointing is handled behind the scenes.
  • Neat feature that’s unique among current streaming frameworks.
  • Note: Not how LinkedIn really does this!
  • One could imagine lots of Samza tasks consuming different events and publishing them to the NewsUpdatePost topic. Another task could then rank these and output them to a key-value store so that users see the most relevant posts.
  • In production at LinkedIn. In the Apache Incubator. Lots of documentation. Looking to build a new community; newbie JIRAs.
  • Transcript

    • 1. Apache Samza Reliable Stream Processing Atop Apache Kafka and Hadoop YARN Jakob Homan London HUG
    • 2. Who I am • Samza for five months • Before that Hadoop, Hive, Giraph • Say hi: @blueboxtraveler
    • 3. Things we would like to do (better)
    • 4. Provide timely, relevant updates to your newsfeed
    • 5. Update search results with new information as it appears
    • 6. Sculpt metrics and logs into useful shapes
    • 7. Tools? RPC: response latency in milliseconds, synchronous. Samza: response latency from milliseconds to minutes; the response comes later. Possibly much later.
    • 8. Frame(work) of reference. Classic Hadoop: storage layer = HDFS; execution engine = Map-Reduce; API = map(k, v) => (k, v), reduce(k, list(v)) => (k, v). Samza: storage layer = Kafka; execution engine = YARN; API = process(msg(k, v)) => msg(k, v).
    • 9. Storage layer: Kafka
    • 10. Apache Kafka • Persistent, reliable, distributed message queue Shiny new logo!
    • 11. At LinkedIn: 10+ billion writes per day; 172k messages per second (average); 55+ billion messages per day to real-time consumers.
    • 12. Quick aside… Kafka: first among (pluggable) equals. At LinkedIn: Espresso and Databus. Coming soon? HDFS, ActiveMQ, Amazon SQS.
    • 13. Kafka in four bullet points • Producers send messages to brokers • Messages are key, value pairs • Brokers store messages in topics for consumers • Consumers pull messages from brokers
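The four bullet points above can be sketched as a toy in-memory model. This is purely illustrative (ToyTopic, produce, and pull are invented names for this sketch); a real Kafka broker persists the log to disk and replicates it across machines, but the shape is the same: producers append keyed messages to a topic log, and consumers pull from it at their own offset.

```java
import java.util.List;
import java.util.ArrayList;

// Toy in-memory model of the flow on this slide: producers append keyed
// messages to a topic's log; consumers pull by offset rather than being
// pushed to. Illustrative only; not Kafka's actual implementation.
public class ToyTopic {
    public static class Msg {
        public final String key, value;
        Msg(String key, String value) { this.key = key; this.value = value; }
    }

    private final List<Msg> log = new ArrayList<>();

    // A producer sends a (key, value) message to the broker, which appends it.
    public void produce(String key, String value) { log.add(new Msg(key, value)); }

    // A consumer tracks its own offset and pulls everything written since.
    public List<Msg> pull(int offset) { return log.subList(offset, log.size()); }

    public static void main(String[] args) {
        ToyTopic statusUpdates = new ToyTopic();
        statusUpdates.produce("534", "Nicked a car!");
        statusUpdates.produce("234", "Very sleepy");
        for (Msg m : statusUpdates.pull(0)) {
            System.out.println(m.key + " -> " + m.value);
        }
    }
}
```

Note the pull model: the broker keeps messages around (on the order of days, per the speaker notes), so a slow or restarted consumer can resume from its last offset.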
    • 14. A Kafka topic: StatusUpdateEvent. Key: user ID of the user who updated the status. Value: timestamp, new status, geolocation, etc. Example messages: “The ref’s blind!” (534), “Car nicked!” (234), “Very sleepy” (755), “Nicked a car!” (534).
    • 15. Kafka topics are partitioned; for our purposes, hash-partitioned on the key. (Diagram: Partitions 0, 1, and 2, each holding a sequence of keyed messages.)
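The hash partitioning on this slide can be sketched as follows. This is an illustration of the idea only, not Kafka's actual default partitioner (which applies murmur2 to the serialized key bytes); partitionFor is an invented helper. The point is that the same key always maps to the same partition, so one consumer sees all messages for a given user.

```java
// Sketch of hash partitioning on the message key (illustrative only).
public class PartitionSketch {
    static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so negative hashCodes still index a valid partition.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // User 534's status updates always land in the same partition, so a
        // single task instance sees all of that user's messages, in order.
        System.out.println(partitionFor("534", 3));
        System.out.println(partitionFor("534", 3));
    }
}
```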
    • 16. A Samza job. Input topics: StatusUpdateEvent, NewConnectionEvent, LikeUpdateEvent. Some code: MyStreamTask implements StreamTask { … }. Output topics: NewsUpdatePost, UpdatesPerHourMetric.
    • 17. Execution engine: YARN
    • 18. What we use YARN for • Distributing our tasks across multiple machines • Letting us know when one has died • Distributing a replacement • Isolating our tasks from each other
    • 19. YARN: execution and reliability. (Diagram: two machines, each running a Node Manager and a Kafka broker. Node Manager 1 runs the Samza App Master and a Samza TaskRunner for Partition 0 executing MyStreamTask.process(); Node Manager 2 runs a Samza TaskRunner for Partition 1.)
    • 20. Co-partitioning of topics: the Samza TaskRunner for Partition 0 runs MyStreamTask.process() over Partition 0 of both StatusUpdateEvent and NewConnectionEvent, publishing to NewsUpdatePost. An instance of StreamTask is responsible for a specific partition.
    • 21. API: process()
    • 22. public interface StreamTask { void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) }. The envelope provides getKey() and getMsg(); the collector provides sendMsg(topic, key, value); the coordinator provides commit() and shutdown().
    • 23. Awesome feature: state. The TaskRunner keeps a state store locally alongside each task. Generic data store interface; key-value out of the box (more soon? Bloom filter, Lucene, etc.); restored by Samza upon task crash.
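As a rough illustration of the state feature, here is how the newsfeed task on the next slides might keep each user's connection list in a local key-value store. ConnectionState is an invented class and a plain HashMap stands in for Samza's store interface; in a real job the store is changelogged back to Kafka, which is how Samza restores it after a task crash.

```java
import java.util.*;

// Sketch of task-local state: connection lists keyed by user ID.
// A HashMap stands in for Samza's key-value store (illustrative only).
public class ConnectionState {
    private final Map<String, Set<String>> store = new HashMap<>();

    // Called when a NewConnectionEvent arrives for userId.
    public void addConnection(String userId, String newConnection) {
        store.computeIfAbsent(userId, k -> new HashSet<>()).add(newConnection);
    }

    // Called when a StatusUpdateEvent arrives: who should get the post?
    public Set<String> connectionsOf(String userId) {
        return store.getOrDefault(userId, Collections.emptySet());
    }
}
```

Because the store lives next to the task (rather than behind a remote call), lookups in process() stay fast even at hundreds of thousands of messages per second.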
    • 24. (Pseudo)code snippet: Newsfeed. Consume StatusUpdateEvent: send those updates to all your connections via the NewsUpdatePost topic. Consume NewConnectionEvent: maintain state of connections to know who to send to.
    • 25. public class NewsFeed implements StreamTask {
            void process(envelope, collector, coordinator) {
              msg = envelope.getMsg()
              userId = msg.get("userID")
              if (msg.get("type") == STATUS_UPDATE) {
                foreach (conn : kvStore.get(userId)) {
                  collector.send("NewsUpdatePost", new Msg(conn, msg.get("newStatus")))
                }
              } else {
                newConn = msg.get("newConnection")
                connections = kvStore.get(userId)
                kvStore.put(userId, connections ++ newConn)
              }
            }
          }
    • 26. Current status
    • 27. Hello, Samza! Up and running in 3 minutes: consume Wikipedia edits live and generate stats on those edits. Cool, eh? bit.ly/hello-samza
    • 28. samza.incubator.apache.org bit.ly/samza_newbie_issues
    • 29. Cheers! Quick start: bit.ly/hello-samza. Project homepage: samza.incubator.apache.org. Newbie issues: bit.ly/samza_newbie_issues. Detailed Samza and YARN talk: bit.ly/samza_and_yarn. Twitter: @samzastream
