Samza: Real-time Stream Processing at LinkedIn

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1eGbVJv.

Chris Riccomini discusses Samza's feature set, how Samza integrates with YARN and Kafka, how it's used at LinkedIn, and what's next on the roadmap. Filmed at qconsf.com.

Chris Riccomini is a Staff Software Engineer at LinkedIn, where he is currently working as a committer and PMC member for Apache Samza. He's been involved in a wide range of projects at LinkedIn, including "People You May Know", REST.li, Hadoop, engineering tooling, and OLAP systems. Prior to LinkedIn, he worked on data visualization and fraud modeling at PayPal.

Published in: Technology, Business

Samza: Real-time Stream Processing at LinkedIn

  1. Apache Samza*: Stream Processing at LinkedIn. Chris Riccomini, 11/13/2013. (* Incubating)
  2. Watch the video with slide synchronization on InfoQ.com: http://www.infoq.com/presentations/samza-linkedin
  3. Presented at QCon San Francisco (www.qconsf.com)
  4. Stream Processing?
  5-8. Response latency, starting at 0 ms
      • Synchronous
      • Milliseconds to minutes
      • Later. Possibly much later.
  9. Newsfeed
  10. News
  11. Ad Relevance
  12. Email
  13. Search Indexing Pipeline
  14. Metrics and Monitoring
  15. Motivation
  16. Real-time Feeds
      • User activity
      • Metrics
      • Monitoring
      • Database changes
  17. Real-time Feeds
      • 10+ billion writes per day
      • 172,000 messages per second (average)
      • 55+ billion messages per day to real-time consumers
  18. Stream Processing is Hard
      • Partitioning
      • State
      • Re-processing
      • Failure semantics
      • Joins to services or database
      • Non-determinism
  19. Samza Concepts & Architecture
  20. Streams: Partition 0, Partition 1, Partition 2
  21-26. Streams (messages are numbered within each partition; the last slide marks where the next append goes)
      Partition 0: 1 2 3 4 5 6
      Partition 1: 1 2 3 4 5
      Partition 2: 1 2 3 4 5 6 7
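
The partition diagrams above can be modeled in a few lines of Java. This is a minimal in-memory sketch, not Kafka or Samza code: the class `PartitionedStream`, its method names, and the choice to pick a partition by hashing a key are all assumptions made for illustration. Offsets here start at 0, while the slides number messages from 1.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory model of a partitioned, append-only stream (hypothetical, not a real API). */
class PartitionedStream {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedStream(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    /** Assumption: the partition is chosen from the key, so equal keys land together. */
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    /** Appends to the end of the key's partition and returns the new message's offset. */
    long append(String key, String message) {
        List<String> partition = partitions.get(partitionFor(key));
        partition.add(message);
        return partition.size() - 1;
    }

    /** Reads the message stored at a given offset of a partition. */
    String read(int partition, long offset) {
        return partitions.get(partition).get((int) offset);
    }

    public static void main(String[] args) {
        PartitionedStream stream = new PartitionedStream(3);
        long first = stream.append("member-42", "viewed /home");
        long second = stream.append("member-42", "viewed /jobs");
        int partition = stream.partitionFor("member-42");
        System.out.println("partition " + partition + ", offsets " + first + " and " + second);
        System.out.println(stream.read(partition, second));
    }
}
```

Because messages are only ever added at the end, an offset identifies a message within its partition for as long as the message is retained.
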
  27. Tasks: Partition 0
  28. Tasks: Partition 0, Task 1
  29-38. Tasks: Partition 0
      import org.apache.avro.generic.GenericRecord;
      import org.apache.samza.system.IncomingMessageEnvelope;
      import org.apache.samza.system.OutgoingMessageEnvelope;
      import org.apache.samza.task.MessageCollector;
      import org.apache.samza.task.StreamTask;
      import org.apache.samza.task.TaskCoordinator;

      class PageKeyViewsCounterTask implements StreamTask {
        // The pageKeyViews and countStream fields are not shown on the slide.
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
          GenericRecord record = (GenericRecord) envelope.getMessage();
          String pageKey = record.get("page-key").toString();
          // Bump the in-memory counter for this page key.
          int newCount = pageKeyViews.get(pageKey).incrementAndGet();
          // Emit the updated count, keyed by page key.
          collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
        }
      }
  39. Tasks: Partition 0, Task 1
  40-47. Tasks (diagram): PageKeyViewsCounterTask reads Page Views, Partition 0 (messages 1 2 3 4); Output Count Stream with Partition 0 and Partition 1
  48-56. Tasks (diagram): as in slides 40-47, plus a Checkpoint Stream (Partition 1) showing the value 2
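
One way to read the checkpoint slides is that the task periodically records the offset of the last input message it handled, so that a restarted task can resume from that point. The sketch below is a toy model under that reading, not Samza's implementation: `CheckpointedTask`, `CheckpointLog`, and the `checkpointEvery` and `crashAfter` parameters are hypothetical names, and the checkpoint stream is reduced to a single in-memory value.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a task that resumes from its last checkpointed input offset. */
class CheckpointedTask {
    /** Stand-in for the checkpoint stream: keeps only the latest checkpointed offset. */
    static final class CheckpointLog {
        private long lastCheckpointed = -1; // -1 means nothing has been checkpointed yet
        void write(long offset) { lastCheckpointed = offset; }
        long read() { return lastCheckpointed; }
    }

    /**
     * Copies input messages to output, starting just after the last checkpoint.
     * A checkpoint is written every checkpointEvery messages; the loop stops after
     * crashAfter messages to simulate a failure. Returns the number handled.
     */
    static int run(List<String> input, CheckpointLog checkpoints, List<String> output,
                   int checkpointEvery, int crashAfter) {
        int handled = 0;
        for (long offset = checkpoints.read() + 1;
             offset < input.size() && handled < crashAfter;
             offset++) {
            output.add(input.get((int) offset));
            handled++;
            if (handled % checkpointEvery == 0) {
                checkpoints.write(offset);
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "c", "d");
        CheckpointLog checkpoints = new CheckpointLog();
        List<String> output = new ArrayList<>();

        run(input, checkpoints, output, 2, 3);                 // "crash" after three messages
        run(input, checkpoints, output, 2, Integer.MAX_VALUE); // restart and finish

        System.out.println(output); // [a, b, c, c, d]
    }
}
```

In this model the third message is handled twice, because the simulated crash happens after it was processed but before the next checkpoint. Any design that checkpoints less often than it processes has to tolerate this kind of replay.
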
  57. Jobs (diagram): Stream A, Stream B; Task 1, Task 2, Task 3
  58. Jobs (diagram): Stream A, Stream B, Stream C; Task 1, Task 2, Task 3
  59-60. Jobs (diagram): AdViews, AdClicks, AdClickThroughRate; Task 1, Task 2, Task 3
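
The AdViews and AdClicks example can be sketched as a small piece of per-task logic. This is an illustration in plain Java rather than Samza code: the class `ClickThroughRate` and its methods are hypothetical, and it assumes both input streams are partitioned by ad id so that a single task sees every view and click for a given ad.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy per-task logic: combine view and click events into a click-through rate. */
class ClickThroughRate {
    private final Map<String, Integer> views = new HashMap<>();
    private final Map<String, Integer> clicks = new HashMap<>();

    /** Called for each message from the (assumed) AdViews stream. */
    void onAdView(String adId) {
        views.merge(adId, 1, Integer::sum);
    }

    /** Called for each message from the (assumed) AdClicks stream. */
    void onAdClick(String adId) {
        clicks.merge(adId, 1, Integer::sum);
    }

    /** Clicks divided by views for one ad; 0.0 when the ad has no views yet. */
    double rate(String adId) {
        int viewCount = views.getOrDefault(adId, 0);
        if (viewCount == 0) {
            return 0.0;
        }
        return (double) clicks.getOrDefault(adId, 0) / viewCount;
    }

    public static void main(String[] args) {
        ClickThroughRate ctr = new ClickThroughRate();
        for (int i = 0; i < 4; i++) {
            ctr.onAdView("ad-1");
        }
        ctr.onAdClick("ad-1");
        System.out.println(ctr.rate("ad-1")); // 0.25
    }
}
```

The two maps are exactly the kind of task state that is lost if the process dies, which is the problem the stateful processing part of the talk addresses.
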
  61. Jobs (diagram): Stream A, Stream B, Stream C; Task 1, Task 2, Task 3
  62-63. Dataflow (diagram labels): Stream A, Stream B, Job 1, Stream D, Job 2, Stream E, Job 3, Stream B, Stream C
  64. YARN
  65-68. YARN
      You: I want to run command X on two machines with 512M of memory.
      YARN: Cool, where’s your code?
      You: http://some-host/jobs/download/my.tgz
      YARN: I’ve run your command on grid-node-2 and grid-node-7.
  69. YARN (diagram): Host 1, Host 2, Host 3
  70. YARN (diagram): Host 1, Host 2, Host 3, each with a NodeManager (NM)
  71. YARN (diagram): adds Host 0 with the ResourceManager (RM)
  72-74. YARN (diagram): adds a Client
  75-77. YARN (diagram): adds an Application Master (AM)
  78-79. YARN (diagram): adds a Container
  80-83. YARN (diagram): Client, RM, AM, and NMs, without the Container
  84. YARN (diagram): the Container is shown again
  85. Jobs (diagram): Stream A, Stream B; Task 1, Task 2, Task 3
  86. Containers (diagram): Stream A, Stream B; Task 1, Task 2, Task 3
  87. Containers (diagram): Stream A, Stream B; Samza Container 1, Samza Container 2
  88. Containers (diagram): Samza Container 1, Samza Container 2
  89. YARN (diagram): Host 1 with Samza Container 1; Host 2 with Samza Container 2
  90. YARN (diagram): adds a NodeManager on each host
  91. YARN (diagram): adds the Samza YARN AM
  92. YARN (diagram): adds a Kafka Broker on each host
  93. YARN (diagram): Host 1 and Host 2, each with a NodeManager, a MapReduce Container, and HDFS, plus the MapReduce YARN AM
  94-97. YARN (diagram labels), Host 1: Stream A, NodeManager, Samza Container 1, Samza Container 1, Kafka Broker, Stream C, Samza Container 2
  98. YARN (diagram): same layout as slide 92
  99. CGroups (diagram): same layout as slide 92
  100. (Not Running) Multi-Framework (diagram labels): Host 1, Host 2; two NodeManagers; Samza Container 1, Kafka, MapReduce Container, Samza YARN AM, HDFS
  101. Stateful Processing
  102-104. Example query
      SELECT col1, count(*)
      FROM stream1 INNER JOIN stream2 ON stream1.col3 = stream2.col3
      WHERE col2 > 20
      GROUP BY col1
      ORDER BY count(*) DESC
      LIMIT 50;
  105. The same query with LIMIT 10 in place of LIMIT 50
  106. How do people do this?
  107. Remote Stores (diagram): Stream A; Task 1, Task 2, Task 3; Key-Value Store; Stream B
  108. Remote RPC is slow
      • Stream: ~500k records/sec/container
      • DB: << less
  109. Online vs. Async
  110. No undo
      • Database state is non-deterministic
      • Can’t roll back mutations if task crashes
  111. Tables & Streams (diagram): put(a, w), put(b, x), put(a, y), put(b, z) applied to a Database over Time
  112-113. Stateful Tasks (diagram): Stream A; Task 1, Task 2, Task 3; Stream B
  114-125. Stateful Tasks (diagram): as in slides 112-113, plus a Changelog Stream
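
The changelog diagrams suggest a pattern in which each task keeps its state locally and also appends every write to a changelog stream, so the state can be rebuilt by replaying that stream. The following is a minimal sketch of that pattern under stated assumptions, not Samza's implementation: `ChangelogBackedStore` and `Change` are hypothetical names, and an in-memory list stands in for a durable changelog.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy local store whose writes are mirrored to a changelog and restored by replay. */
class ChangelogBackedStore {
    /** One changelog entry; a null value records a delete. */
    static final class Change {
        final String key;
        final String value;
        Change(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, String> local = new HashMap<>();
    private final List<Change> changelog; // stand-in for a durable changelog stream

    /** Rebuilds local state by replaying every existing changelog entry in order. */
    ChangelogBackedStore(List<Change> changelog) {
        this.changelog = changelog;
        for (Change change : changelog) {
            apply(change);
        }
    }

    private void apply(Change change) {
        if (change.value == null) {
            local.remove(change.key);
        } else {
            local.put(change.key, change.value);
        }
    }

    /** Writes locally and appends the same change to the changelog. */
    void put(String key, String value) {
        if (value == null) {
            throw new IllegalArgumentException("use delete() to remove a key");
        }
        Change change = new Change(key, value);
        apply(change);
        changelog.add(change);
    }

    void delete(String key) {
        Change change = new Change(key, null);
        apply(change);
        changelog.add(change);
    }

    String get(String key) {
        return local.get(key);
    }

    public static void main(String[] args) {
        List<Change> changelog = new ArrayList<>();
        ChangelogBackedStore store = new ChangelogBackedStore(changelog);
        store.put("a", "w");
        store.put("b", "x");
        store.put("a", "y");
        store.put("b", "z");

        // Simulate a restart on another machine: only the changelog survives.
        ChangelogBackedStore restored = new ChangelogBackedStore(changelog);
        System.out.println(restored.get("a") + " " + restored.get("b")); // y z
    }
}
```

Reads never leave the process in this sketch, which is the contrast with the remote store approach on the earlier slides.
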
  126. Key-Value Store
      • put(table_name, key, value)
      • get(table_name, key)
      • delete(table_name, key)
      • range(table_name, key1, key2)
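
The four operations listed on the key-value store slide map naturally onto a sorted map. The sketch below is one possible in-memory shape for them, not the store Samza ships: `TableStore` is a hypothetical class, and treating `range` as half-open (`key1` inclusive, `key2` exclusive) is an assumption, since the slide does not say how the bounds behave.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy table-scoped key-value store backed by one sorted map per table. */
class TableStore {
    private final Map<String, TreeMap<String, String>> tables = new HashMap<>();

    /** Returns the named table, creating it on first use. */
    private TreeMap<String, String> table(String tableName) {
        return tables.computeIfAbsent(tableName, name -> new TreeMap<>());
    }

    void put(String tableName, String key, String value) {
        table(tableName).put(key, value);
    }

    String get(String tableName, String key) {
        return table(tableName).get(key);
    }

    void delete(String tableName, String key) {
        table(tableName).remove(key);
    }

    /** Returns a copy of the entries with key1 <= key < key2, in key order. */
    SortedMap<String, String> range(String tableName, String key1, String key2) {
        return new TreeMap<>(table(tableName).subMap(key1, key2));
    }

    public static void main(String[] args) {
        TableStore store = new TableStore();
        store.put("members", "m1", "Ann");
        store.put("members", "m2", "Bob");
        store.put("members", "m3", "Cy");
        store.delete("members", "m3");
        System.out.println(store.get("members", "m1"));         // Ann
        System.out.println(store.range("members", "m1", "m3")); // {m1=Ann, m2=Bob}
    }
}
```

Keeping keys sorted is what makes `range` cheap; a hash-based store would have to scan every key to answer it.
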
  127-130. Stateful Stream Task
      import org.apache.avro.generic.GenericRecord;
      import org.apache.samza.config.Config;
      import org.apache.samza.storage.kv.KeyValueStore;
      import org.apache.samza.system.IncomingMessageEnvelope;
      import org.apache.samza.task.*;

      public class SimpleStatefulTask implements StreamTask, InitableTask {
        private KeyValueStore<String, String> store;

        @SuppressWarnings("unchecked")
        public void init(Config config, TaskContext context) {
          // Look up the store configured under the name "mystore".
          this.store = (KeyValueStore<String, String>) context.getStore("mystore");
        }

        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
          GenericRecord record = (GenericRecord) envelope.getMessage();
          // GenericRecord.get returns Object, so convert each field to a String.
          String memberId = record.get("member_id").toString();
          String name = record.get("name").toString();
          // Read the previous value before overwriting it.
          System.out.println("old name: " + store.get(memberId));
          store.put(memberId, name);
        }
      }
  131. Whew!
  132. Let’s be Friends!
      • We are incubating, and you can help!
      • Get up and running in 5 minutes: http://bit.ly/hello-samza
      • Grab some newbie JIRAs: http://bit.ly/samza_newbie_issues