Samza - LA HUG

Apache Samza talk at LA HUG

Speaker notes
  • - Stream processing for us = anything asynchronous, but not batch computed. 25% of code is async, 50% is RPC/online, 25% is batch. Stream processing is the worst supported.
  • Provide timely, relevant updates to your newsfeed
  • Update search results with new information as it appears
  • - Open area of research; has been around for 20 years.
  • Example – Stream 1 -> Ad Views
  • partitioned
  • Re-playable, ordered, fault tolerant, infinite. A very heavyweight definition of a stream (vs. S4, Storm, etc.).
  • At-least-once messaging; duplicates are possible. Future: exactly-once semantics. Transparent to the user; no ack’ing API.
  • Connected by stream name only; fully buffered.
  • Can also consume these streams from other jobs.
  • - Can’t keep messages forever. Log compaction: delete over-written keys over time.
  • Store API is pluggable: Lucene, buffered sort, external sort, bitmap index, Bloom filters and sketches.
  • Very much a production system, critical to LinkedIn

     1. 1. Apache Samza*: Reliable Stream Processing atop Apache Kafka and YARN. Sriram Subramanian. Me on LinkedIn. Me on Twitter - @sriramsub1. (* Incubating)
     2. 2. Agenda • Why Stream Processing? • What is Samza’s Design? • How is Samza’s Design Implemented? • How can you use Samza? • Example usage at LinkedIn
    3. 3. Why Stream Processing?
    4. 4. Response latency 0 ms
    5. 5. Response latency Synchronous 0 ms
    6. 6. Response latency Synchronous Later. Possibly much later. 0 ms
    7. 7. Response latency Milliseconds to minutes Synchronous Later. Possibly much later. 0 ms
    8. 8. Newsfeed Ad Relevance
    9. 9. Search Index Metrics and Monitoring
     10. 10. What is Samza’s Design?
    11. 11. Stream A JOB Stream B Stream C
    12. 12. Stream A JOB 1 Stream B Stream C Stream D JOB 2 Stream E Stream F JOB 3 Stream G
    13. 13. Streams Partition 0 Partition 1 Partition 2
    14. 14. Streams Partition 0 Partition 1 Partition 2 1 2 3 4 5 6 1 2 3 4 5 1 2 3 4 5 6 7
    19. 19. Streams Partition 0 Partition 1 Partition 2 next append 1 2 3 4 5 6 1 2 3 4 5 1 2 3 4 5 6 7
    20. 20. Jobs Stream A Stream B Task 1 Task 2 Task 3 Stream C
    21. 21. Jobs AdViews AdClicks Task 1 Task 2 Task 3 AdClickThroughRate
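The AdClickThroughRate job above joins ad clicks back to the ad views that produced them. Below is a rough sketch of such a task, written against the StreamTask and local key-value store APIs that appear later in this deck; the store name, field handling, and String payloads are illustrative assumptions, not LinkedIn's actual code.

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    /**
     * Illustrative stream-stream join: buffer ad views keyed by adId, and when a
     * click for the same adId arrives, emit a joined record. Assumes AdViews and
     * AdClicks are co-partitioned by adId.
     */
    public class AdClickThroughRateTask implements StreamTask, InitableTask {
      private static final SystemStream OUTPUT = new SystemStream("kafka", "AdClickThroughRate");
      private KeyValueStore<String, String> recentViews; // adId -> view payload

      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        this.recentViews = (KeyValueStore<String, String>) context.getStore("recent-ad-views");
      }

      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        String stream = envelope.getSystemStreamPartition().getStream();
        String adId = (String) envelope.getKey();        // assumes messages are keyed by adId
        String payload = (String) envelope.getMessage(); // assumes a String serde for simplicity

        if ("AdViews".equals(stream)) {
          recentViews.put(adId, payload);                // remember the view for a later click
        } else if ("AdClicks".equals(stream)) {
          String view = recentViews.get(adId);           // look up the view that led to this click
          if (view != null) {
            collector.send(new OutgoingMessageEnvelope(OUTPUT, adId, view + "|" + payload));
          }
        }
      }
    }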
    22. 22. Tasks AdViews CounterTask Partition 0 Partition 1 Ad Views - Partition 0 1 2 3 4 Output Count Stream
    30. 30. Tasks AdViews CounterTask Partition 0 Partition 1 1 2 3 4 2 Partition 1 Checkpoint Stream Ad Views - Partition 0 Output Count Stream
    39. 39. Dataflow Stream A Stream B Stream C Stream E Stream B Job 1 Job 2 Stream D Job 3
     41. 41. Stateful Processing • Windowed Aggregation – Counting the number of page views for each user per hour • Stream-Stream Join – Join a stream of ad clicks to a stream of ad views to identify the view that led to the click • Stream-Table Join – Join user region info to a stream of page views to create an augmented stream
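As a concrete illustration of the first pattern above, here is a minimal windowed-aggregation sketch. It assumes Samza's WindowableTask callback (driven by the task.window.ms setting) and an output topic named here purely for illustration; it keeps counts in memory, whereas the following slides argue for keeping such state in a local store.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;
    import org.apache.samza.task.WindowableTask;

    /** Counts page views per user and flushes the counts whenever the window fires
     *  (the window interval is set in config, e.g. task.window.ms=3600000 for hourly). */
    public class PageViewsPerUserTask implements StreamTask, WindowableTask {
      private static final SystemStream OUTPUT = new SystemStream("kafka", "page-views-per-user");
      private final Map<String, Integer> counts = new HashMap<>();

      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        String userId = (String) envelope.getKey();  // assumes page views are keyed by user id
        counts.merge(userId, 1, Integer::sum);
      }

      public void window(MessageCollector collector, TaskCoordinator coordinator) {
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
          collector.send(new OutgoingMessageEnvelope(OUTPUT, entry.getKey(), entry.getValue()));
        }
        counts.clear(); // start a fresh window
      }
    }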
     42. 42. How do people do this? • In-memory state with checkpointing – Periodically save out the task’s in-memory data – As state grows, this becomes very expensive – Some implementations checkpoint diffs, but that adds complexity
     43. 43. How do people do this? • Using an external store – Push state to an external store – Performance suffers because of remote queries – Lack of isolation – Limited query capabilities
    44. 44. Stateful Tasks Stream A Task 1 Task 2 Task 3 Stream B
    46. 46. Stateful Tasks Stream A Task 1 Task 2 Task 3 Stream B Changelog Stream
    58. 58. Key-Value Store • put(table_name, key, value) • get(table_name, key) • delete(table_name, key) • range(table_name, key1, key2)
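The slide shows a conceptual API with an explicit table_name argument; in Samza's actual KeyValueStore interface each named store plays the role of a table, obtained via context.getStore as in the stateful task example later in the deck. A small sketch of the correspondence (store contents and key names are made up):

    import org.apache.samza.storage.kv.Entry;
    import org.apache.samza.storage.kv.KeyValueIterator;
    import org.apache.samza.storage.kv.KeyValueStore;

    /** Rough mapping from the conceptual API on this slide to Samza's
     *  KeyValueStore interface, where a named store stands in for a table. */
    public class StoreApiSketch {
      static void demo(KeyValueStore<String, String> store) {
        store.put("member-123", "Alice");        // put(table_name, key, value)
        String name = store.get("member-123");   // get(table_name, key)
        store.delete("member-123");              // delete(table_name, key)

        // range(table_name, key1, key2): iterate over keys in the given range;
        // close the iterator when done to release its resources.
        KeyValueIterator<String, String> it = store.range("member-000", "member-999");
        try {
          while (it.hasNext()) {
            Entry<String, String> entry = it.next();
            System.out.println(entry.getKey() + " -> " + entry.getValue());
          }
        } finally {
          it.close();
        }
      }
    }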
    59. 59. How is Samza’s Design Implemented?
    60. 60. Apache Kafka • Persistent, reliable, distributed message queue
     61. 61. At LinkedIn: 10+ billion writes per day, 172k messages per second (average), 60+ billion messages per day to real-time consumers
    62. 62. Apache Kafka • Models streams as topics • Each topic is partitioned and each partition is replicated • Producer sends messages to a topic • Messages are stored in brokers • Consumers consume from a topic (pull from broker)
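For readers who have not used Kafka, here is a minimal producer against the model just described. It uses the current Java client rather than the 0.8-era API this talk predates, and the broker address, topic, and payload are placeholders.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class AdViewProducer {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");              // Kafka broker(s)
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          // The message key determines which partition of the topic the message
          // lands in, which is what lets Samza co-partition related streams.
          producer.send(new ProducerRecord<>("AdViews", "ad-42", "{\"adId\":\"ad-42\",\"member\":123}"));
        }
      }
    }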
     63. 63. YARN: Yet Another Resource Negotiator • Framework to run your code on a grid of machines • Distributes our tasks across multiple machines • Notifies our framework when a task has died • Isolates our tasks from each other
    64. 64. Jobs Stream A Task 1 Task 2 Task 3 Stream B
    65. 65. Containers Task 1 Task 2 Task 3 Stream B Stream A
    66. 66. Containers Stream B Stream A Samza Container 1 Samza Container 2
    67. 67. Containers Samza Container 1 Samza Container 2
    68. 68. YARN Samza Container 1 Samza Container 2 Host 1 Host 2
    69. 69. YARN Samza Container 1 Samza Container 2 NodeManager NodeManager Host 1 Host 2
    70. 70. YARN Samza Container 1 Samza Container 2 NodeManager NodeManager Samza YARN AM Host 1 Host 2
    71. 71. YARN Samza Container 1 Samza Container 2 NodeManager Kafka Broker NodeManager Samza YARN AM Kafka Broker Host 1 Host 2
    72. 72. YARN MapReduce Container MapReduce Container NodeManager HDFS NodeManager MapReduce YARN AM HDFS Host 1 Host 2
    73. 73. YARN Samza Container 1 NodeManager Kafka Broker Host 1 Stream C Stream A Samza Container 1 Samza Container 2
    77. 77. YARN Samza Container 1 Samza Container 2 NodeManager Kafka Broker NodeManager Samza YARN AM Kafka Broker Host 1 Host 2
    78. 78. CGroups Samza Container 1 Samza Container 2 NodeManager Kafka Broker NodeManager Samza YARN AM Kafka Broker Host 1 Host 2
     79. 79. How can you use Samza?
     80. 80. Tasks Partition 0
        class PageKeyViewsCounterTask implements StreamTask {
          // pageKeyViews (per-key counters) and countStream (the output SystemStream)
          // are task fields initialized elsewhere and not shown on the slide.
          public void process(IncomingMessageEnvelope envelope,
                              MessageCollector collector,
                              TaskCoordinator coordinator) {
            GenericRecord record = (GenericRecord) envelope.getMessage();
            String pageKey = record.get("page-key").toString();
            int newCount = pageKeyViews.get(pageKey).incrementAndGet();
            collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
          }
        }
     90. 90. Stateful Stream Task
        public class SimpleStatefulTask implements StreamTask, InitableTask {
          private KeyValueStore<String, String> store;

          public void init(Config config, TaskContext context) {
            this.store = (KeyValueStore<String, String>) context.getStore("mystore");
          }

          public void process(IncomingMessageEnvelope envelope,
                              MessageCollector collector,
                              TaskCoordinator coordinator) {
            GenericRecord record = (GenericRecord) envelope.getMessage();
            String memberId = record.get("member_id").toString();
            String name = record.get("name").toString();
            System.out.println("old name: " + store.get(memberId));
            store.put(memberId, name);
          }
        }
     94. 94. Example usage at LinkedIn
    95. 95. Call graph assembly get_unread_msg_count() get_PYMK() get_Pulse_news() get_relevant_ads() get_news_updates()
    96. 96. Lots of calls == lots of machines, logs get_unread_msg_count() get_PYMK() get_Pulse_news() get_relevant_ads() get_news_updates() unread_msg_service_call get_PYMK_service_call pulse_news_service_call add_relevance_service_call news_update_service_call
     97. 97. TreeID: Unique identifier page_view_event (123456) unread_msg_service_call (123456) another_service_call (123456) silly_service_call (123456) get_PYMK_service_call (123456) counter_service_call (123456) unread_msg_service_call (123456) count_invites_service_call (123456) count_msgs_service_call (123456)
    98. 98. OK, now lots of streams with TreeIDs… all_service_calls (partitioned by TreeID) Samza job: Repartition-By-TreeID *_service_call Samza job: Assemble Call Graph service_call_graphs • Near real-time holistic view of how we’re actually serving data • Compare day-over-day, cost, changes, outages
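A hedged sketch of what the Repartition-By-TreeID job might look like: each service-call event is re-emitted keyed by its TreeID, so the downstream call-graph assembly job receives every event for one page view in a single partition. The stream name, Avro field name, and schema handling are illustrative assumptions rather than LinkedIn's actual code.

    import org.apache.avro.generic.GenericRecord;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    /** Re-keys every incoming service-call event by its TreeID so that the
     *  downstream call-graph assembly job receives them co-partitioned. */
    public class RepartitionByTreeIdTask implements StreamTask {
      private static final SystemStream OUTPUT = new SystemStream("kafka", "all_service_calls");

      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        GenericRecord record = (GenericRecord) envelope.getMessage();
        String treeId = record.get("treeId").toString(); // hypothetical field name
        // Using treeId as the key makes Kafka route the record to the partition
        // owned by the task that assembles this page view's call graph.
        collector.send(new OutgoingMessageEnvelope(OUTPUT, treeId, record));
      }
    }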
    99. 99. Thank you • Quick start: bit.ly/hello-samza • Project homepage: samza.incubator.apache.org • Newbie issues: bit.ly/samza_newbie_issues • Detailed Samza and YARN talk: bit.ly/samza_and_yarn • A must-read: http://bit.ly/jay_on_logs • Twitter: @samzastream • Me on Twitter: @sriramsub1
