Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data Spain 2017

The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day.

https://www.bigdataspain.org/2017/talk/apache-samza-jake-maes

Big Data Spain 2017
November 16th - 17th Kinépolis Madrid

Published in: Technology

  1. 1. 1 Unified Processing at Scale with Apache Samza Jake Maes Staff SW Engineer at LinkedIn Apache Samza PMC
  2. 2. 2 About Me ● Apache Samza PMC member ● LinkedIn 3 years ● 8 years performance & infra development ● Passionate about scale ● Long walks on the peaks
  3. 3. 3 Agenda Intro to Stream Processing Stream Processing Ecosystem at LinkedIn Use Case: Pre-Existing Service Use Case: Batch  Streaming Future
  4. 4. 4 Agenda Intro to Stream Processing Stream Processing Ecosystem at LinkedIn Use Case: Pre-Existing Service Use Case: Batch  Streaming Future
  5. 5. 5 About ● Production at LinkedIn since 2014 ● Apache top level project since 2014 ● 16 Committers ● 74 Contributors ● Known for: scale, pluggability, Kafka integration
  6. 6. 6 Traditional Stream Processing ● Low latency ● One message at a time ● Checkpointing, state, durability ● All I/O with high-performance message brokers
  7. 7. 7 Stateful Processing [Diagram labels: Stream Processor, Task0, State0, Input Streams (partition 0), Output Streams, Changelog Stream (partition 0), Checkpoint Stream]
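Sketch for slide 7: the idea of task-local state backed by a changelog can be approximated in plain Java. This is a minimal illustration under stated assumptions, not Samza's implementation: `ChangelogBackedStore`, `Entry`, and the in-memory list standing in for the changelog stream partition are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: a task-local store whose writes are also
// appended to a changelog, so the state can be rebuilt after a restart.
class ChangelogBackedStore {
    // One changelog record: a key and its new value.
    static final class Entry {
        final String key;
        final long value;

        Entry(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, Long> local = new HashMap<>();
    private final List<Entry> changelog; // stands in for the changelog stream, partition 0

    ChangelogBackedStore(List<Entry> changelog) {
        this.changelog = changelog;
        // Restore: replay the changelog in order; the last write per key wins.
        for (Entry e : changelog) {
            local.put(e.key, e.value);
        }
    }

    long get(String key) {
        return local.getOrDefault(key, 0L);
    }

    void put(String key, long value) {
        local.put(key, value);
        changelog.add(new Entry(key, value)); // durable copy of the write
    }

    public static void main(String[] args) {
        List<Entry> log = new ArrayList<>();
        ChangelogBackedStore store = new ChangelogBackedStore(log);
        store.put("member-1", store.get("member-1") + 1);
        store.put("member-1", store.get("member-1") + 1);

        // Simulate a restart: a new instance replays the same changelog.
        ChangelogBackedStore restored = new ChangelogBackedStore(log);
        System.out.println(restored.get("member-1"));
    }
}
```

Reads are served from the local map only; the changelog is touched on writes and on restore, which is what keeps the per-message path local.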
  8. 8. 8 Co-Partitioned Streams
  9. 9. 9 Typical Flow - Two Stages Minimum [Diagram: PageViewEvent -> PageViewRepartitionTask (repartition) -> PageViewEventByMemberId -> PageViewByMemberIdCounterTask (window, map, sendTo) -> PageViewEventPerMemberStream]
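Sketch for slide 9: the two stages (repartition by key, then windowed counting) can be illustrated without any streaming framework. The `PageView` class, its fields, and the hash-modulo partitioner below are assumptions made for illustration; the slides do not show the event schema or the partitioner Samza uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RepartitionSketch {
    // Hypothetical event type; the real PageViewEvent schema is not on the slides.
    static final class PageView {
        final String memberId;
        final long timestampMs;

        PageView(String memberId, long timestampMs) {
            this.memberId = memberId;
            this.timestampMs = timestampMs;
        }
    }

    // Stage 1: pick the output partition from the key, so every event for a
    // member lands on the same partition (and therefore the same task).
    static int partitionFor(String memberId, int numPartitions) {
        return Math.floorMod(memberId.hashCode(), numPartitions);
    }

    // Stage 2: count events per member within fixed-size (tumbling) windows.
    // Result: window start time -> (memberId -> count).
    static Map<Long, Map<String, Integer>> countPerWindow(List<PageView> events, long windowMs) {
        Map<Long, Map<String, Integer>> counts = new TreeMap<>();
        for (PageView e : events) {
            long windowStart = (e.timestampMs / windowMs) * windowMs;
            counts.computeIfAbsent(windowStart, w -> new TreeMap<>())
                  .merge(e.memberId, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<PageView> events = new ArrayList<>();
        events.add(new PageView("m1", 1_000L));
        events.add(new PageView("m2", 2_000L));
        events.add(new PageView("m1", 3_000L));
        events.add(new PageView("m1", 301_000L)); // falls into the next 5-minute window
        System.out.println(partitionFor("m1", 4));
        System.out.println(countPerWindow(events, 300_000L));
    }
}
```

The repartition stage is needed because counting per member is only correct if a single task sees all of that member's events.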
  10. 10. 10 Agenda Intro to Stream Processing Stream Processing Ecosystem at LinkedIn Use Case: Pre-Existing Service Use Case: Batch  Streaming Future
  11. 11. 11 Stream Processing Ecosystem – The Dream [Diagram labels: Applications and Services, Samza, Kafka, Storage, External Streams, Storage & Serving, Brooklin]
  12. 12. 12 Stream Processing Ecosystem - Reality [Diagram labels: Applications and Services, Samza, Kafka, Storage, External Streams, Storage & Serving, Brooklin]
  13. 13. 13 Expansion of Stream Processing at LinkedIn ● Influx of new applications  10 -> over 200 ● New use cases  Batch  Streaming  Remote I/O  Composable API ● Incoming applications have different expectations ● Let’s take a look at two Services
  14. 14. 14 Agenda Intro to Stream Processing Stream Processing Ecosystem at LinkedIn Use Case: Pre-Existing Service Use Case: Batch  Streaming Future
  15. 15. 15 Online Service + Stream Processing Requirements: ● Deployment model: cluster environment not suitable ● Remote I/O: dependencies on other services; I/O latency stalls single-threaded processor; container parallelism has too much overhead
  16. 16. 16 Embedded Samza ● ZooKeeper-based JobCoordinator: uses ZooKeeper for leader election; leader assigns work to the processors [Diagram: ZooKeeper alongside three App Instances, each embedding a Stream Processor with a Samza Container and a Job Coordinator; one Job Coordinator is marked * (Leader)]
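Sketch for slide 16: the slide states only that a leader is elected and that the leader assigns work. The code below is a hedged illustration of that division of labor in plain Java. It uses no ZooKeeper client; picking the lowest id as leader loosely mirrors the common ZooKeeper recipe based on sequential nodes, and the round-robin assignment is an assumption, not Samza's documented strategy. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CoordinatorSketch {
    // Assumption: the processor with the smallest id wins the election.
    static String electLeader(List<String> processorIds) {
        return Collections.min(processorIds);
    }

    // The leader spreads task ids 0..numTasks-1 across processors round-robin.
    static Map<String, List<Integer>> assign(List<String> processorIds, int numTasks) {
        List<String> sorted = new ArrayList<>(processorIds);
        Collections.sort(sorted); // deterministic order regardless of join order
        Map<String, List<Integer>> assignment = new TreeMap<>();
        for (String p : sorted) {
            assignment.put(p, new ArrayList<>());
        }
        for (int task = 0; task < numTasks; task++) {
            assignment.get(sorted.get(task % sorted.size())).add(task);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> processors = new ArrayList<>();
        processors.add("p-0002");
        processors.add("p-0001");
        processors.add("p-0003");
        System.out.println("leader: " + electLeader(processors));
        System.out.println(assign(processors, 5));
    }
}
```

When a processor joins or leaves, the same function can be rerun over the new membership list to produce a fresh assignment.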
  17. 17. 17 Asynchronous Event Loop Stream Processor Event Loop  Single thread  1 : Task  n : Task Restful Services Java NIO, Netty
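Sketch for slide 17: one way to picture a single loop thread issuing remote calls without waiting for each one is to cap the number of calls in flight. The example below is an assumption-laden illustration, not Samza's event loop: `AsyncLoopSketch` is a hypothetical name, a small thread pool with a sleep stands in for a non-blocking remote client, and a semaphore enforces the concurrency limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class AsyncLoopSketch {
    // Dispatches `messages` simulated remote calls from one loop thread and
    // returns the highest number of calls that were in flight at once.
    static int run(int messages, int maxConcurrency) {
        ExecutorService ioPool = Executors.newFixedThreadPool(10); // stand-in for a remote client
        Semaphore permits = new Semaphore(maxConcurrency);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        try {
            for (int i = 0; i < messages; i++) {
                // The loop thread blocks only when the in-flight limit is reached.
                permits.acquireUninterruptibly();
                CompletableFuture<Void> call = CompletableFuture.runAsync(() -> {
                    int now = inFlight.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(5); // simulated remote latency
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    inFlight.decrementAndGet();
                }, ioPool).whenComplete((result, error) -> permits.release());
                pending.add(call);
            }
            // Wait for every outstanding callback before returning.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        } finally {
            ioPool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak in flight: " + run(20, 3));
    }
}
```

With a limit of 1 the loop behaves like the synchronous case; raising the limit lets remote latency overlap across messages.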
  18. 18. 18 Checkpointing ● Sync – Barrier ● Async – Watermark [Timeline diagram labels: t1, t2, t3, tc (checkpoint), t4 on a time axis; callback 1 complete, callback 2 complete, callback 3 complete, callback 4 complete]
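Sketch for slide 18: the slide names a watermark for asynchronous checkpointing but does not spell out the algorithm. One plausible reading, shown here as an assumption, is that the safe checkpoint is the highest offset below which every callback has completed, even when callbacks finish out of order. `WatermarkTracker` and its methods are hypothetical names.

```java
import java.util.TreeSet;

class WatermarkTracker {
    private long watermark; // every offset <= watermark has completed
    private final TreeSet<Long> completedAhead = new TreeSet<>(); // completed, but with gaps before them

    WatermarkTracker(long firstOffset) {
        this.watermark = firstOffset - 1;
    }

    // Called from a callback when the message at `offset` is fully processed.
    void complete(long offset) {
        if (offset <= watermark) {
            return; // already covered
        }
        completedAhead.add(offset);
        // Advance across any now-contiguous run of completed offsets.
        while (!completedAhead.isEmpty() && completedAhead.first() == watermark + 1) {
            watermark = completedAhead.pollFirst();
        }
    }

    // Offset that is safe to write to the checkpoint at this moment.
    long safeCheckpoint() {
        return watermark;
    }

    public static void main(String[] args) {
        WatermarkTracker tracker = new WatermarkTracker(1);
        tracker.complete(3); // callback 3 finishes first
        System.out.println(tracker.safeCheckpoint());
        tracker.complete(1);
        tracker.complete(2);
        System.out.println(tracker.safeCheckpoint());
    }
}
```

A synchronous barrier would instead stop dispatching and wait for all outstanding callbacks before checkpointing; the watermark avoids that stall at the cost of possibly replaying completed messages above it after a restart.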
  19. 19. 19 Performance for Remote I/O [Chart labels: Baseline; Single thread; Sync I/O with Multithreading; Thread pool size = 10, Max concurrency = 1; Thread pool size = 10, Max concurrency = 3]
  20. 20. 20 Case Study – Notification Scheduler [Diagram: inputs input1, input2, input3 (User Chat Event, User Action Event, Connection Activity Event) feed a Processor containing an Aggregation Engine, Channel Selection, and a State store; ① Local Data Access, ② Remote Database Lookup (Member profile database), ③ Remote Service Call (Restful Services); output]
  21. 21. 21 Agenda Intro to Stream Processing Stream Processing Ecosystem at LinkedIn Use Case: Pre-Existing Service Use Case: Batch  Streaming Future
  22. 22. 22 Offline Jobs Requirements: ● Performance and low latency ● Resource hungry: finite jobs can hog resources; infinite jobs need to be better citizens ● Composable API ● Same app in batch and streaming: best of both worlds ● HDFS I/O
  23. 23. 23 Low Level Logic
      public class PageViewByMemberIdCounterTask implements InitableTask, StreamTask, WindowableTask {
        private final SystemStream pageViewCounter = new SystemStream("kafka", "MemberPageViews");
        private KeyValueStore<String, PageViewPerMemberIdCounterEvent> windowedCounters;
        private Long windowSize;

        @Override
        public void init(Config config, TaskContext context) throws Exception {
          this.windowedCounters = (KeyValueStore<String, PageViewPerMemberIdCounterEvent>)
              context.getStore("windowed-counter-store");
          this.windowSize = config.getLong("task.window.ms");
        }

        @Override
        public void window(MessageCollector collector, TaskCoordinator coordinator) throws Exception {
          getWindowCounterEvent().forEach(counter ->
              collector.send(new OutgoingMessageEnvelope(pageViewCounter, counter.memberId, counter)));
        }

        @Override
        public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
            TaskCoordinator coordinator) throws Exception {
          PageViewEvent pve = (PageViewEvent) envelope.getMessage();
          countPageViewEvent(pve);
        }
      }
  24. 24. 24 High Level Logic (slide callout: built-in transform functions)
      public class RepartitionAndCounterExample implements StreamApplication {
        @Override
        public void init(StreamGraph graph, Config config) {
          MessageStream<PageViewEvent> pve =
              graph.getInputStream("pageViewEvent", (k, m) -> (PageViewEvent) m);
          OutputStream<String, MyOutputType, MyOutputType> mpv =
              graph.getOutputStream("memberPageViews", m -> m.memberId, m -> m);
          pve
              .partitionBy(m -> m.memberId)
              .window(Windows.keyedTumblingWindow(
                  m -> m.memberId, Duration.ofMinutes(5), () -> 0, (m, c) -> c + 1))
              .map(MyOutputType::new)
              .sendTo(mpv);
        }
      }
  25. 25. 25 High Level API - Composable Operators ● filter: select a subset of messages from the stream ● map: map one input message to an output message ● flatMap: map one input message to 0 or more output messages ● merge: union all inputs into a single output stream ● partitionBy: re-partition the input messages based on a specific field ● sendTo: send the result to an output stream ● sink: send the result to an external system (e.g. external DB) ● window: window aggregation on the input stream ● join: join messages from two input streams (Categories on the slide: Stateless Functions, I/O Functions, Stateful Functions)
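Sketch for slide 25: several of the listed operators have close analogues in `java.util.stream`, which is used below only to illustrate their semantics on a finite in-memory list. This is not the Samza API, and the class and method names are hypothetical; the grouping step is a rough stand-in for a keyed aggregation and ignores windowing over time.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class OperatorSemanticsSketch {
    // filter + map: drop blank lines, then transform each remaining line.
    static List<String> cleaned(List<String> lines) {
        return lines.stream()
                .filter(s -> !s.trim().isEmpty())
                .map(s -> s.trim().toLowerCase())
                .collect(Collectors.toList());
    }

    // flatMap: one input line yields zero or more output words.
    static List<String> words(List<String> lines) {
        return cleaned(lines).stream()
                .flatMap(s -> Arrays.stream(s.split("\\s+")))
                .collect(Collectors.toList());
    }

    // merge: union two inputs into a single output.
    static List<String> merged(List<String> a, List<String> b) {
        return Stream.concat(a.stream(), b.stream()).collect(Collectors.toList());
    }

    // Keyed aggregation: count occurrences per key.
    static Map<String, Long> counts(List<String> keys) {
        return keys.stream()
                .collect(Collectors.groupingBy(k -> k, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("Hello World", "   ");
        List<String> b = Arrays.asList("hello");
        System.out.println(counts(words(merged(a, b))));
    }
}
```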
  26. 26. 26 Batch AND Streaming
      Streaming config:
        streams.pageViewEvent.system=kafka
        streams.pageViewEvent.physical.name=PageViewEvent
        streams.memberPageViews.system=kafka
        streams.memberPageViews.physical.name=MemberPageViews
      Batch config:
        streams.pageViewEvent.system=hdfs
        streams.pageViewEvent.physical.name=hdfs://mydbsnapshot/PageViewEvent/
        streams.memberPageViews.system=hdfs
        streams.memberPageViews.physical.name=hdfs://myoutputdb/MemberPageViews
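Sketch for slide 26: the point of the two configs is that the application refers to logical stream ids (`pageViewEvent`, `memberPageViews`) and the configuration decides which system and physical name they map to. The plain-Java example below illustrates that lookup using the property keys from the slide; `StreamConfigSketch`, `parse`, and `resolve` are hypothetical helpers, not part of Samza.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class StreamConfigSketch {
    // Parse key=value lines into Properties.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Resolve a logical stream id to "system -> physical name".
    static String resolve(Properties config, String streamId) {
        String system = config.getProperty("streams." + streamId + ".system");
        String physicalName = config.getProperty("streams." + streamId + ".physical.name");
        return system + " -> " + physicalName;
    }

    public static void main(String[] args) {
        Properties streaming = parse(
                "streams.pageViewEvent.system=kafka\n"
              + "streams.pageViewEvent.physical.name=PageViewEvent\n");
        Properties batch = parse(
                "streams.pageViewEvent.system=hdfs\n"
              + "streams.pageViewEvent.physical.name=hdfs://mydbsnapshot/PageViewEvent/\n");
        // The same logical id resolves differently depending on the config.
        System.out.println(resolve(streaming, "pageViewEvent"));
        System.out.println(resolve(batch, "pageViewEvent"));
    }
}
```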
  27. 27. 27 Case Study - Unified Metrics with Samza [Diagram: a UMP analyst authors a Pig script, which is “compiled” to generate fluent code + runtime config, which is then deployed]
  28. 28. 28 Performance - HDFS ● Profile count, group by country ● 500 files ● 250GB
  29. 29. 29 Agenda Intro to Stream Processing Stream Processing Ecosystem at LinkedIn Use Case: Pre-Existing Service Use Case: Batch  Streaming Future
  30. 30. 30 What’s Next? ● SQL: prototyped 2015; now getting full-time attention ● High Level API extensions: better config, I/O, windowing, and more ● Beam Runner: Samza performance with Beam API ● Table support
  31. 31. 31 Questions Contact: ● Email: dev@samza.apache.org ● Social: http://twitter.com/jakemaes Links: ● http://samza.apache.org ● http://github.com/apache/samza ● https://engineering.linkedin.com/blog
