Samza: Real-time Stream Processing at LinkedIn

  • 754 views
Uploaded on

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1eGbVJv. …

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1eGbVJv.

Chris Riccomini discusses: Samza's feature set, how Samza integrates with YARN and Kafka, how it's used at LinkedIn, and what's next on the roadmap. Filmed at qconsf.com.

Chris Riccomini is a Staff Software Engineer at LinkedIn, where he's is currently working as a committer and PMC member for Apache Samza. He's been involved in a wide range of projects at LinkedIn, including, "People You May Know", REST.li, Hadoop, engineering tooling, and OLAP systems. Prior to LinkedIn, he worked on data visualization and fraud modeling at PayPal.

More in: Technology , Business
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
754
On Slideshare
0
From Embeds
0
Number of Embeds
4

Actions

Shares
Downloads
0
Comments
0
Likes
16

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Apache Samza* Stream Processing at LinkedIn Chris Riccomini 11/13/2013 * Incubating
  • 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /samza-linkedin InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  • 3. Presented at QCon San Francisco www.qconsf.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Stream Processing?
  • 5. 0 ms Response latency
  • 6. 0 ms Response latency Synchronous
  • 7. 0 ms Response latency Synchronous Later. Possibly much later.
  • 8. 0 ms Response latency Milliseconds to minutes Synchronous Later. Possibly much later.
  • 9. Newsfeed
  • 10. News
  • 11. Ad Relevance
  • 12. Email
  • 13. Search Indexing Pipeline
  • 14. Metrics and Monitoring
  • 15. Motivation
  • 16. Real-time Feeds • • • • User activity Metrics Monitoring Database Changes
  • 17. Real-time Feeds • 10+ billion writes per day • 172,000 messages per second (average) • 55+ billion messages per day to real-time consumers
  • 18. Stream Processing is Hard • • • • • • Partitioning State Re-processing Failure semantics Joins to services or database Non-determinism
  • 19. Samza Concepts & Architecture
  • 20. Streams Partition 0 Partition 1 Partition 2
  • 21. Streams Partition 0 1 2 3 4 5 6 Partition 1 1 2 3 4 5 Partition 2 1 2 3 4 5 6 7
  • 22. Streams Partition 0 1 2 3 4 5 6 Partition 1 1 2 3 4 5 Partition 2 1 2 3 4 5 6 7
  • 23. Streams Partition 0 1 2 3 4 5 6 Partition 1 1 2 3 4 5 Partition 2 1 2 3 4 5 6 7
  • 24. Streams Partition 0 1 2 3 4 5 6 Partition 1 1 2 3 4 5 Partition 2 1 2 3 4 5 6 7
  • 25. Streams Partition 0 1 2 3 4 5 6 Partition 1 1 2 3 4 5 Partition 2 1 2 3 4 5 6 7
  • 26. Streams Partition 0 1 2 3 4 5 6 Partition 1 1 2 3 4 5 Partition 2 1 2 3 4 5 6 7 next append
  • 27. Tasks Partition 0
  • 28. Tasks Partition 0 Task 1
  • 29. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  • 30. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  • 31. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  • 32. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  • 33. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  • 34. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  • 35. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  • 36. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  • 37. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  • 38. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  • 39. Tasks Partition 0 Task 1
  • 40. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Partition 0 Partition 1 Output Count Stream
  • 41. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Partition 0 Partition 1 Output Count Stream
  • 42. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Partition 0 Partition 1 Output Count Stream
  • 43. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Output Count Stream Partition 0 Partition 1
  • 44. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Output Count Stream Partition 0 Partition 1
  • 45. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Output Count Stream Partition 0 Partition 1
  • 46. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Output Count Stream Partition 0 Partition 1
  • 47. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Output Count Stream Partition 0 Partition 1
  • 48. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  • 49. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  • 50. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  • 51. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  • 52. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  • 53. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  • 54. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  • 55. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  • 56. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  • 57. Jobs Stream A Task 1 Task 2 Stream B Task 3
  • 58. Jobs Stream A Task 1 Stream B Task 2 Stream C Task 3
  • 59. Jobs AdViews Task 1 AdClicks Task 2 AdClickThroughRate Task 3
  • 60. Jobs AdViews Task 1 AdClicks Task 2 AdClickThroughRate Task 3
  • 61. Jobs Stream A Task 1 Stream B Task 2 Stream C Task 3
  • 62. Dataflow Stream A Stream B Job 1 Stream D Job 2 Stream E Job 3 Stream B Stream C
  • 63. Dataflow Stream A Stream B Job 1 Stream D Job 2 Stream E Job 3 Stream B Stream C
  • 64. YARN
  • 65. YARN You: I want to run command X on two machines with 512M of memory.
  • 66. YARN You: I want to run command X on two machines with 512M of memory. YARN: Cool, where’s your code?
  • 67. YARN You: I want to run command X on two machines with 512M of memory. YARN: Cool, where’s your code? You: http://some-host/jobs/download/my.tgz
  • 68. YARN You: I want to run command X on two machines with 512M of memory. YARN: Cool, where’s your code? You: http://some-host/jobs/download/my.tgz YARN: I’ve run your command on grid-node-2 and grid-node-7.
  • 69. YARN Host 1 Host 2 Host 3
  • 70. YARN Host 1 Host 2 Host 3 NM NM NM
  • 71. YARN Host 0 RM Host 1 Host 2 Host 3 NM NM NM
  • 72. YARN Host 0 Client RM Host 1 Host 2 Host 3 NM NM NM
  • 73. YARN Host 0 Client RM Host 1 Host 2 Host 3 NM NM NM
  • 74. YARN Host 0 Client RM Host 1 Host 2 Host 3 NM NM NM
  • 75. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM
  • 76. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM
  • 77. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM
  • 78. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM Container
  • 79. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM Container
  • 80. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM
  • 81. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM
  • 82. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM
  • 83. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM
  • 84. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM Container
  • 85. Jobs Stream A Task 1 Task 2 Stream B Task 3
  • 86. Containers Stream A Task 1 Task 2 Stream B Task 3
  • 87. Containers Stream A Samza Container 1 Stream B Samza Container 2
  • 88. Containers Samza Container 1 Samza Container 2
  • 89. YARN Host 1 Samza Container 1 Host 2 Samza Container 2
  • 90. YARN Host 1 Host 2 NodeManager NodeManager Samza Container 1 Samza Container 2
  • 91. YARN Host 1 Host 2 NodeManager NodeManager Samza Container 1 Samza Container 2 Samza YARN AM
  • 92. YARN Host 1 Host 2 NodeManager NodeManager Samza Container 1 Kafka Broker Samza Container 2 Samza YARN AM Kafka Broker
  • 93. YARN Host 1 Host 2 NodeManager NodeManager MapReduce Container HDFS MapReduce YARN AM MapReduce Container HDFS
  • 94. YARN Host 1 Stream A NodeManager Samza Container 1 Samza Container 1 Kafka Broker Stream C Samza Container 2
  • 95. YARN Host 1 Stream A NodeManager Samza Container 1 Samza Container 1 Kafka Broker Stream C Samza Container 2
  • 96. YARN Host 1 Stream A NodeManager Samza Container 1 Samza Container 1 Kafka Broker Stream C Samza Container 2
  • 97. YARN Host 1 Stream A NodeManager Samza Container 1 Samza Container 1 Kafka Broker Stream C Samza Container 2
  • 98. YARN Host 1 Host 2 NodeManager NodeManager Samza Container 1 Kafka Broker Samza Container 2 Samza YARN AM Kafka Broker
  • 99. CGroups Host 1 Host 2 NodeManager NodeManager Samza Container 1 Kafka Broker Samza Container 2 Samza YARN AM Kafka Broker
  • 100. (Not Running) Multi-Framework Host 1 Host 2 NodeManager NodeManager Samza Container 1 Kafka MapReduce Container Samza YARN AM HDFS
  • 101. Stateful Processing
  • 102. SELECT col1, count(*) FROM stream1 INNER JOIN stream2 ON stream1.col3 = stream2.col3 WHERE col2 > 20 GROUP BY col1 ORDER BY count(*) DESC LIMIT 50;
  • 103. SELECT col1, count(*) FROM stream1 INNER JOIN stream2 ON stream1.col3 = stream2.col3 WHERE col2 > 20 GROUP BY col1 ORDER BY count(*) DESC LIMIT 50;
  • 104. SELECT col1, count(*) FROM stream1 INNER JOIN stream2 ON stream1.col3 = stream2.col3 WHERE col2 > 20 GROUP BY col1 ORDER BY count(*) DESC LIMIT 50;
  • 105. SELECT col1, count(*) FROM stream1 INNER JOIN stream2 ON stream1.col3 = stream2.col3 WHERE col2 > 20 GROUP BY col1 ORDER BY count(*) DESC LIMIT 10;
  • 106. How do people do this?
  • 107. Remote Stores Stream A Task 1 Task 2 Task 3 Key-Value Store Stream B
  • 108. Remote RPC is slow • Stream: ~500k records/sec/container • DB: << less
  • 109. Online vs. Async
  • 110. No undo • Database state is non-deterministic • Can’t roll back mutations if task crashes
  • 111. Tables & Streams put(a, w) put(b, x) Database put(a, y) put(b, z) Time
  • 112. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3
  • 113. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3
  • 114. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  • 115. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  • 116. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  • 117. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  • 118. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  • 119. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  • 120. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  • 121. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  • 122. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  • 123. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  • 124. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  • 125. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  • 126. Key-Value Store • • • • put(table_name, key, value) get(table_name, key) delete(table_name, key) range(table_name, key1, key2)
  • 127. Stateful Stream Task public class SimpleStatefulTask implements StreamTask, InitableTask { private KeyValueStore<String, String> store; public void init(Config config, TaskContext context) { this.store = context.getStore("mystore"); } public void process( IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = (GenericRecord) envelope.getMessage(); String memberId = record.get("member_id"); String name = record.get("name"); System.out.println("old name: " + store.get(memberId)); store.put(memberId, name); } }
  • 128. Stateful Stream Task public class SimpleStatefulTask implements StreamTask, InitableTask { private KeyValueStore<String, String> store; public void init(Config config, TaskContext context) { this.store = context.getStore("mystore"); } public void process( IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = (GenericRecord) envelope.getMessage(); String memberId = record.get("member_id"); String name = record.get("name"); System.out.println("old name: " + store.get(memberId)); store.put(memberId, name); } }
  • 129. Stateful Stream Task public class SimpleStatefulTask implements StreamTask, InitableTask { private KeyValueStore<String, String> store; public void init(Config config, TaskContext context) { this.store = context.getStore("mystore"); } public void process( IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = (GenericRecord) envelope.getMessage(); String memberId = record.get("member_id"); String name = record.get("name"); System.out.println("old name: " + store.get(memberId)); store.put(memberId, name); } }
  • 130. Stateful Stream Task public class SimpleStatefulTask implements StreamTask, InitableTask { private KeyValueStore<String, String> store; public void init(Config config, TaskContext context) { this.store = context.getStore("mystore"); } public void process( IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = (GenericRecord) envelope.getMessage(); String memberId = record.get("member_id"); String name = record.get("name"); System.out.println("old name: " + store.get(memberId)); store.put(memberId, name); } }
  • 131. Whew!
  • 132. Let’s be Friends! • We are incubating, and you can help! • Get up and running in 5 minutes http://bit.ly/hello-samza • Grab some newbie JIRAs http://bit.ly/samza_newbie_issues
  • 133. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/samzalinkedin