Apache Incubator Samza: Stream Processing at LinkedIn

This is the slide deck that was presented at the Hadoop Users Group at LinkedIn on November 5, 2013.

The presentation covers what Samza is, why we built it, and how it works.

Published in: Technology, Business

  1. Apache Samza*: Stream Processing at LinkedIn. Chris Riccomini, 9/27/2013. (* Incubating)
  2. Stream Processing?
  3-6. Response latency: 0 ms (Synchronous) • Milliseconds to minutes • Later. Possibly much later.
  7. Newsfeed
  8. News
  9. Ad Relevance
  10. Email
  11. Search Indexing Pipeline
  12. Metrics and Monitoring
  13. Motivation
  14. Real-time Feeds • User activity • Metrics • Monitoring • Database changes
  15. Real-time Feeds • 10+ billion writes per day • 172,000 messages per second (average) • 55+ billion messages per day to real-time consumers
  16. Stream Processing is Hard • Partitioning • State • Re-processing • Failure semantics • Joins to services or database • Non-determinism
  17. Samza Concepts & Architecture
  18. Streams [diagram: Partition 0, Partition 1, Partition 2]
  19-24. Streams [diagram: Partition 0 with messages 1-6, Partition 1 with messages 1-5, Partition 2 with messages 1-7; slide 24 adds the label "next append"]
  25. Tasks [diagram: Partition 0]
  26. Tasks [diagram: Partition 0, Task 1]
  27-36. Tasks [diagram: Partition 0]
      // Counts page views per page key and emits the running count.
      class PageKeyViewsCounterTask implements StreamTask {
        // Assumed fields: the slide does not show these declarations,
        // and the system and stream names here are placeholders.
        private final SystemStream countStream = new SystemStream("kafka", "page-key-counts");
        private final Map<String, AtomicInteger> pageKeyViews = new HashMap<>();

        @Override
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
          GenericRecord record = (GenericRecord) envelope.getMessage();
          String pageKey = record.get("page-key").toString();
          // Create the counter the first time a key is seen, then increment it.
          int newCount = pageKeyViews.computeIfAbsent(pageKey, k -> new AtomicInteger()).incrementAndGet();
          collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
        }
      }
  37. Tasks [diagram: Partition 0, Task 1]
  38-45. Tasks [diagram: Page Views - Partition 0 with messages 1 2 3 4; PageKeyViewsCounterTask; Output Count Stream with Partition 0 and Partition 1]
  46-54. Tasks [diagram: Page Views - Partition 0 with messages 1 2 3 4; PageKeyViewsCounterTask; Checkpoint Stream (labels: 2, Partition 1); Output Count Stream with Partition 0 and Partition 1]
  55. Jobs [diagram: Stream A, Task 1, Task 2, Stream B, Task 3]
  56. Jobs [diagram: Stream A, Task 1, Stream B, Task 2, Stream C, Task 3]
  57-58. Jobs [diagram: AdViews, Task 1, AdClicks, Task 2, AdClickThroughRate, Task 3]
  59. Jobs [diagram: Stream A, Task 1, Stream B, Task 2, Stream C, Task 3]
  60-61. Dataflow [diagram: Stream A, Stream B, Job 1, Stream D, Job 2, Stream E, Job 3, Stream B, Stream C]
  62. YARN
  63. Jobs [diagram: Stream A, Task 1, Task 2, Stream B, Task 3]
  64. Containers [diagram: Stream A, Task 1, Task 2, Stream B, Task 3]
  65. Containers [diagram: Stream A, Samza Container 1, Stream B, Samza Container 2]
  66. Containers [diagram: Samza Container 1, Samza Container 2]
  67. YARN [diagram: Host 1, Samza Container 1, Host 2, Samza Container 2]
  68. YARN [diagram: Host 1, Host 2, NodeManager, NodeManager, Samza Container 1, Samza Container 2]
  69. YARN [diagram: Host 1, Host 2, NodeManager, NodeManager, Samza Container 1, Samza Container 2, Samza YARN AM]
  70. YARN [diagram: Host 1, Host 2, NodeManager, NodeManager, Samza Container 1, Kafka Broker, Samza Container 2, Samza YARN AM, Kafka Broker]
  71. YARN [diagram: Host 1, Host 2, NodeManager, NodeManager, MapReduce Container, HDFS, MapReduce YARN AM, MapReduce Container, HDFS]
  72-75. YARN [diagram: Host 1, Stream A, NodeManager, Samza Container 1, Samza Container 1, Kafka Broker, Stream C, Samza Container 2]
  76. YARN [diagram: Host 1, Host 2, NodeManager, NodeManager, Samza Container 1, Kafka Broker, Samza Container 2, Samza YARN AM, Kafka Broker]
  77. CGroups [diagram: Host 1, Host 2, NodeManager, NodeManager, Samza Container 1, Kafka Broker, Samza Container 2, Samza YARN AM, Kafka Broker]
  78. (Not Running) Multi-Framework [diagram: Host 1, Host 2, NodeManager, NodeManager, Samza Container 1, Kafka, MapReduce Container, Samza YARN AM, HDFS]
  79. Stateful Processing
  80-82. SELECT col1, count(*) FROM stream1 INNER JOIN stream2 ON stream1.col3 = stream2.col3 WHERE col2 > 20 GROUP BY col1 ORDER BY count(*) DESC LIMIT 50;
  83. SELECT col1, count(*) FROM stream1 INNER JOIN stream2 ON stream1.col3 = stream2.col3 WHERE col2 > 20 GROUP BY col1 ORDER BY count(*) DESC LIMIT 10;
  84. How do people do this?
  85. Remote Stores [diagram: Stream A, Task 1, Task 2, Task 3, Key-Value Store, Stream B]
  86. Remote RPC is slow • Stream: ~500k records/sec/container • DB: << less
  87. Online vs. Async
  88. No undo • Database state is non-deterministic • Can’t roll back mutations if task crashes
  89. Tables & Streams [diagram: put(a, w), put(b, x), put(a, y), put(b, z) applied to a Database along a Time axis]
  90-91. Stateful Tasks [diagram: Stream A, Task 1, Task 2, Stream B, Task 3]
  92-103. Stateful Tasks [diagram: Stream A, Task 1, Task 2, Stream B, Task 3, Changelog Stream]
  104. Key-Value Store • put(table_name, key, value) • get(table_name, key) • delete(table_name, key) • range(table_name, key1, key2)
  105. Whew!
  106. Let’s be Friends! • We are incubating, and you can help! • Get up and running in 5 minutes: http://bit.ly/hello-samza • Grab some newbie JIRAs: http://bit.ly/samza_newbie_issues
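
The "Streams" slides show a stream split into partitions, each an ordered run of numbered messages with new messages appended at the end. The following is a minimal sketch of that idea in plain Java, not Samza or Kafka code. The class name `PartitionedStream`, the hash-based choice of partition, and the 0-based offsets are all assumptions made for illustration (the slides number messages from 1 and do not say how a message is assigned to a partition).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a partitioned, append-only stream.
// Names and the hash partitioner are assumptions, not a real library API.
class PartitionedStream {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedStream(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Picks a partition from the key so that equal keys share a partition.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Appends to the end of the key's partition and returns the new offset.
    long append(String key, String message) {
        List<String> partition = partitions.get(partitionFor(key));
        partition.add(message);
        return partition.size() - 1;
    }

    // Reads one message by (partition, offset); earlier messages never change.
    String read(int partition, long offset) {
        return partitions.get(partition).get((int) offset);
    }

    // Offset that the next append to this partition will receive.
    long nextOffset(int partition) {
        return partitions.get(partition).size();
    }
}

class PartitionedStreamDemo {
    public static void main(String[] args) {
        PartitionedStream stream = new PartitionedStream(3);
        long first = stream.append("member-42", "viewed /home");
        long second = stream.append("member-42", "viewed /jobs");
        int p = stream.partitionFor("member-42");
        System.out.println("partition=" + p + " offsets=" + first + "," + second
                + " next=" + stream.nextOffset(p));
    }
}
```

Because ordering is only defined within one partition in this sketch, a consumer that needs all messages for a key in order has to read the single partition that key maps to.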
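
The "Tasks" slides pair a counting task with a checkpoint stream that records how far the task has read. A rough sketch of that interaction follows, assuming a checkpoint is written every few messages and that a restarted task resumes one past the newest checkpoint. `CheckpointingCounter` and its methods are hypothetical names in plain Java, and the checkpoint interval is an invented parameter rather than anything stated on the slides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a task that reads one partition in offset order,
// emits running counts, and periodically records its position.
// Plain Java; none of these names are Samza APIs.
class CheckpointingCounter {
    private final List<String> input;            // one input partition of page keys
    private final List<Long> checkpointStream;   // last processed offsets, oldest first
    private final List<String> output = new ArrayList<>();
    private final Map<String, Integer> counts = new HashMap<>();
    private final int checkpointEvery;

    CheckpointingCounter(List<String> input, List<Long> checkpointStream, int checkpointEvery) {
        this.input = input;
        this.checkpointStream = checkpointStream;
        this.checkpointEvery = checkpointEvery;
    }

    // Offset to start from: one past the newest checkpoint, or 0 if none exists.
    long startOffset() {
        if (checkpointStream.isEmpty()) {
            return 0;
        }
        return checkpointStream.get(checkpointStream.size() - 1) + 1;
    }

    // Processes at most maxMessages messages, then returns (simulating a stop or crash).
    void run(int maxMessages) {
        long offset = startOffset();
        int processed = 0;
        while (offset < input.size() && processed < maxMessages) {
            String pageKey = input.get((int) offset);
            int newCount = counts.merge(pageKey, 1, Integer::sum);
            output.add(pageKey + "=" + newCount);
            processed++;
            if (processed % checkpointEvery == 0) {
                checkpointStream.add(offset);
            }
            offset++;
        }
    }

    List<String> output() {
        return output;
    }
}

class CheckpointingCounterDemo {
    public static void main(String[] args) {
        List<String> pageViews = List.of("a", "b", "a", "c");
        List<Long> checkpoints = new ArrayList<>();
        CheckpointingCounter first = new CheckpointingCounter(pageViews, checkpoints, 2);
        first.run(3); // handles offsets 0..2, but only offset 1 gets checkpointed
        CheckpointingCounter restarted = new CheckpointingCounter(pageViews, checkpoints, 2);
        restarted.run(10); // resumes at offset 2, so that message is handled twice
        System.out.println(first.output() + " then " + restarted.output());
    }
}
```

In this sketch the restarted instance re-reads the message after the last checkpoint and starts with empty in-memory counts, which is the kind of gap the "Stateful Processing" slides go on to discuss.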
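
The "Tables & Streams", "Stateful Tasks", and "Key-Value Store" slides suggest task-local state whose writes also flow into a changelog stream. One way to picture this, as a sketch under stated assumptions rather than Samza's implementation: every `put` and `delete` is applied to a local sorted map and appended to a changelog, and a replacement task rebuilds its table by replaying that changelog. The method names come from the "Key-Value Store" slide; the class `ChangelogBackedStore`, the `Change` record, the use of a null value to mark a delete, and the half-open `range` bounds are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch of task-local state backed by a changelog.
// Method names follow the Key-Value Store slide; everything else is assumed.
class ChangelogBackedStore {
    // One changelog entry; a null value marks a delete.
    static final class Change {
        final String table;
        final String key;
        final String value;

        Change(String table, String key, String value) {
            this.table = table;
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, TreeMap<String, String>> tables = new HashMap<>();
    private final List<Change> changelog;

    ChangelogBackedStore(List<Change> changelog) {
        this.changelog = changelog;
        // Rebuild local state by replaying every recorded change in order.
        for (Change c : changelog) {
            apply(c);
        }
    }

    private void apply(Change c) {
        TreeMap<String, String> table = tables.computeIfAbsent(c.table, t -> new TreeMap<>());
        if (c.value == null) {
            table.remove(c.key);
        } else {
            table.put(c.key, c.value);
        }
    }

    // Applies the change locally, then appends it to the changelog.
    private void record(Change c) {
        apply(c);
        changelog.add(c);
    }

    void put(String table, String key, String value) {
        if (value == null) {
            throw new IllegalArgumentException("use delete() to remove a key");
        }
        record(new Change(table, key, value));
    }

    void delete(String table, String key) {
        record(new Change(table, key, null));
    }

    String get(String table, String key) {
        TreeMap<String, String> t = tables.get(table);
        return t == null ? null : t.get(key);
    }

    // Entries with key1 <= key < key2, in key order (the bounds are an assumption).
    SortedMap<String, String> range(String table, String key1, String key2) {
        TreeMap<String, String> t = tables.get(table);
        if (t == null) {
            return new TreeMap<>();
        }
        return t.subMap(key1, key2);
    }
}

class ChangelogBackedStoreDemo {
    public static void main(String[] args) {
        List<ChangelogBackedStore.Change> changelog = new ArrayList<>();
        ChangelogBackedStore store = new ChangelogBackedStore(changelog);
        // The same sequence of writes as on the "Tables & Streams" slide.
        store.put("t", "a", "w");
        store.put("t", "b", "x");
        store.put("t", "a", "y");
        store.put("t", "b", "z");
        // A replacement task rebuilds the same table from the changelog alone.
        ChangelogBackedStore restored = new ChangelogBackedStore(changelog);
        System.out.println(restored.range("t", "a", "c"));
    }
}
```

Reads and writes here stay in process memory, in contrast to the remote key-value store on the "Remote Stores" slide; the trade-off in this sketch is that recovery time grows with the length of the changelog.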
