Samza 0.13 meetup slide v1.0.pptx

Unified single-stage and multi-stage pipelines, managed and user-defined deployment, and batch and stream processing in Samza 0.13

Published in: Engineering

  1. Unified processing with the Samza High-level API. Yi Pan, Streams Team @LinkedIn; Committer and PMC Chair, Apache Samza
  2. Agenda
     • High-level API
     • Flexible Deployment Model
     • Convergence between Batch and Stream Processing
  3. Application Example
     Application logic: count PageViewEvent for each member in a 5-minute window and send the counts to PageViewEventPerMemberStream.
     Pipeline: PageViewEvent -> re-partition by memberId -> window -> map -> sendTo -> PageViewEventPerMemberStream
  4. Application Example: application in low-level API
     Job-1 (PageViewRepartitionTask): PageViewEvent -> re-partition -> PageViewEventByMemberId
     Job-2 (PageViewByMemberIdCounterTask): PageViewEventByMemberId -> window -> map -> sendTo -> PageViewEventPerMemberStream
  5. Application in Low-level API
     • Job-1: Repartition job
         public class PageViewRepartitionTask implements StreamTask {
           // Intermediate stream that Job-2 consumes, keyed by memberId.
           private final SystemStream pageViewByMIDStream = new SystemStream("kafka", "PageViewEventByMemberId");
           @Override
           public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
               TaskCoordinator coordinator) throws Exception {
             PageViewEvent pve = (PageViewEvent) envelope.getMessage();
             // Use memberId as the message key so all events of a member land in the same partition.
             collector.send(new OutgoingMessageEnvelope(pageViewByMIDStream, pve.memberId, pve));
           }
         }
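The repartition job relies on the message key deciding the target partition. A minimal sketch of that idea in plain Java follows; `KeyPartitioner` and `partitionFor` are hypothetical names, and the `hashCode`-modulo rule is an assumption for illustration only (the actual partitioner used by the messaging system may hash differently).

```java
import java.util.Arrays;

// Hypothetical stand-in for key-based routing; not Samza or Kafka code.
class KeyPartitioner {
    // Map a key to one of numPartitions buckets. floorMod keeps the result
    // non-negative even when hashCode() is negative.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        // The same memberId always maps to the same partition.
        for (String memberId : Arrays.asList("member-1", "member-2", "member-1")) {
            System.out.println(memberId + " -> partition " + partitionFor(memberId, numPartitions));
        }
    }
}
```

Because equal keys always map to the same bucket, a downstream counter task sees every event of a given member in one place.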
  6. Application in Low-level API
     • Job-2: Window-based counter
         public class PageViewByMemberIdCounterTask implements InitableTask, StreamTask, WindowableTask {
           private final SystemStream pageViewCounterStream = new SystemStream("kafka", "PageViewEventPerMemberStream");
           private KeyValueStore<String, PageViewPerMemberIdCounterEvent> windowedCounters;
           private Long windowSize;
           @Override
           public void init(Config config, TaskContext context) throws Exception {
             this.windowedCounters = (KeyValueStore<String, PageViewPerMemberIdCounterEvent>)
                 context.getStore("windowed-counter-store");
             this.windowSize = config.getLong("task.window.ms");
           }
           // Called every task.window.ms: emit the counters of the current window.
           @Override
           public void window(MessageCollector collector, TaskCoordinator coordinator) throws Exception {
             getWindowCounterEvent().forEach(counter ->
                 collector.send(new OutgoingMessageEnvelope(pageViewCounterStream, counter.memberId, counter)));
           }
           // Called per message: update the counter for the event's member and window.
           @Override
           public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
               TaskCoordinator coordinator) throws Exception {
             PageViewEvent pve = (PageViewEvent) envelope.getMessage();
             countPageViewEvent(pve);
           }
         }
  7. Application in Low-level API
     • Job-2: Window-based counter
         public class PageViewByMemberIdCounterTask implements InitableTask, StreamTask, WindowableTask {
           ...
           // Range-scan the store for counters whose window start falls within the last windowSize ms.
           List<PageViewPerMemberIdCounterEvent> getWindowCounterEvent() {
             List<PageViewPerMemberIdCounterEvent> retList = new ArrayList<>();
             Long currentTimestamp = System.currentTimeMillis();
             Long cutoffTimestamp = currentTimestamp - this.windowSize;
             String lowerBound = String.format("%08d-", cutoffTimestamp);
             String upperBound = String.format("%08d-", currentTimestamp + 1);
             this.windowedCounters.range(lowerBound, upperBound)
                 .forEachRemaining(entry -> retList.add(entry.getValue()));
             return retList;
           }
           // Key = "<window start>-<memberId>", where window start is the timestamp rounded down to windowSize.
           void countPageViewEvent(PageViewEvent pve) {
             String key = String.format("%08d-%s", (pve.timestamp - pve.timestamp % this.windowSize), pve.memberId);
             PageViewPerMemberIdCounterEvent counter = this.windowedCounters.get(key);
             if (counter == null) {
               counter = new PageViewPerMemberIdCounterEvent(pve.memberId,
                   (pve.timestamp - pve.timestamp % this.windowSize), 0);
             }
             counter.count++;
             this.windowedCounters.put(key, counter);
           }
         }
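The key scheme on this slide (window start first, then memberId) is what makes the range scan work. The sketch below isolates it in plain Java, with a `TreeMap` standing in for the sorted key-value store. `WindowKeySketch` and its methods are hypothetical names, and the 13-digit padding is an assumption chosen so that epoch-millisecond timestamps all have the same width.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Plain-Java sketch; a TreeMap stands in for the sorted key-value store.
class WindowKeySketch {
    final long windowSizeMs;
    final TreeMap<String, Integer> counters = new TreeMap<>();

    WindowKeySketch(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    // Round a timestamp down to the start of its window.
    long windowStart(long timestampMs) {
        return timestampMs - timestampMs % windowSizeMs;
    }

    // Zero-padded window start comes first, so string order matches time order
    // as long as every timestamp fits in the padded width.
    String key(long timestampMs, String memberId) {
        return String.format("%013d-%s", windowStart(timestampMs), memberId);
    }

    // Increment the counter for the member in the window containing timestampMs.
    void count(long timestampMs, String memberId) {
        counters.merge(key(timestampMs, memberId), 1, Integer::sum);
    }

    // All counters whose window start lies in [fromMs, toMs), as "key=count" strings.
    List<String> windowsBetween(long fromMs, long toMs) {
        String lower = String.format("%013d-", fromMs);
        String upper = String.format("%013d-", toMs);
        List<String> out = new ArrayList<>();
        counters.subMap(lower, upper).forEach((k, v) -> out.add(k + "=" + v));
        return out;
    }
}
```

With a 5-minute window (300000 ms), events at 310000 and 599999 share the window starting at 300000, while an event at 600000 opens the next one.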
  8. High Level API
     • Samza High Level API (NEW)
       – Ability to express a multi-stage processing pipeline in a single user program
       – Built-in library to provide high-level stream transformation functions
  9. Application in High Level API (NEW), using built-in transform functions
         public class RepartitionAndCounterExample implements StreamApplication {
           @Override
           public void init(StreamGraph graph, Config config) {
             Supplier<Integer> initialValue = () -> 0;
             MessageStream<PageViewEvent> pageViewEvents =
                 graph.getInputStream("pageViewEventStream", (k, m) -> (PageViewEvent) m);
             OutputStream<String, MyStreamOutput, MyStreamOutput> pageViewEventPerMemberStream =
                 graph.getOutputStream("pageViewEventPerMemberStream", m -> m.memberId, m -> m);
             pageViewEvents
                 .partitionBy(m -> m.memberId)                    // re-partition by memberId
                 .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5),
                     initialValue, (m, c) -> c + 1))              // count per member in 5-minute windows
                 .map(MyStreamOutput::new)                        // convert window output to the output type
                 .sendTo(pageViewEventPerMemberStream);           // write to the output stream
           }
         }
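The `keyedTumblingWindow` call above takes a key function, a window duration, an initial-value supplier, and a fold function. As a rough in-memory model of how those four pieces could combine, here is a plain-Java sketch. `KeyedTumblingFold` is a hypothetical class, not the Samza implementation, and it assigns windows from an explicit timestamp function, which is an assumption made so the example is deterministic.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical in-memory model of a keyed tumbling window; not the Samza implementation.
class KeyedTumblingFold<M, K, V> {
    private final Function<M, K> keyFn;
    private final Function<M, Long> timeFn; // assumption: timestamps are supplied explicitly
    private final long windowMs;
    private final Supplier<V> initialValue;
    private final BiFunction<M, V, V> foldFn;
    // Results keyed by "<windowStart>/<key>" for simple inspection.
    private final Map<String, V> results = new TreeMap<>();

    KeyedTumblingFold(Function<M, K> keyFn, Function<M, Long> timeFn, long windowMs,
                      Supplier<V> initialValue, BiFunction<M, V, V> foldFn) {
        this.keyFn = keyFn;
        this.timeFn = timeFn;
        this.windowMs = windowMs;
        this.initialValue = initialValue;
        this.foldFn = foldFn;
    }

    // Fold one message into the accumulator of its (window, key) slot.
    void accept(M message) {
        long ts = timeFn.apply(message);
        long start = ts - ts % windowMs;
        String slot = start + "/" + keyFn.apply(message);
        V current = results.containsKey(slot) ? results.get(slot) : initialValue.get();
        results.put(slot, foldFn.apply(message, current));
    }

    Map<String, V> results() {
        return results;
    }
}
```

Counting, as on the slide, corresponds to an initial value of `() -> 0` and a fold of `(m, c) -> c + 1`; other aggregations only change those two arguments.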
  10. Application in High Level API (NEW)
      • Visualized execution plan (figure: execution plan visualization)
  11. High Level API
      • Built-in transformation functions in high-level API
        filter: select a subset of messages from the stream
        map: map one input message to an output message
        flatMap: map one input message to 0 or more output messages
        merge: union all inputs into a single output stream
        partitionBy: re-partition the input messages based on a specific field
        sendTo: send the result to an output stream
        sink: send the result to an external system (e.g. external DB)
        window: window aggregation on the input stream
        join: join messages from two input streams
      The slide groups these into stateless functions, I/O functions, and stateful functions.
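For readers more familiar with `java.util.stream`, the stateless functions in the table have close analogues there. The sketch below uses only the JDK and is an analogy, not Samza code; `StatelessOpsSketch` and its method names are hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// JDK-only analogy for filter, map, flatMap and merge; not Samza code.
class StatelessOpsSketch {
    // filter: keep a subset of the messages.
    static List<Integer> keepEven(List<Integer> in) {
        return in.stream().filter(n -> n % 2 == 0).collect(Collectors.toList());
    }

    // map: exactly one output per input.
    static List<String> label(List<Integer> in) {
        return in.stream().map(n -> "view-" + n).collect(Collectors.toList());
    }

    // flatMap: zero or more outputs per input (here, n copies of n; n must be non-negative).
    static List<Integer> repeat(List<Integer> in) {
        return in.stream().flatMap(n -> Collections.nCopies(n, n).stream()).collect(Collectors.toList());
    }

    // merge: union of two inputs into one output.
    static <T> List<T> merge(List<T> a, List<T> b) {
        return Stream.concat(a.stream(), b.stream()).collect(Collectors.toList());
    }
}
```

The stateful functions (window, join) have no direct one-line analogue because they keep state across messages.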
  12. Agenda
      • High-level API
      • Flexible Deployment Model
      • Convergence between Batch and Stream Processing
  13. Limitations of the current Samza deployment
      • Tight dependency on YARN
      • Can't easily port over to non-YARN clusters (e.g. Mesos, Kubernetes, AWS)
      • Can't directly embed stream processing in other applications (e.g. a web frontend)
  14. Flexible Deployment Model
      • Flexible deployment of Samza applications
        – Samza-as-a-library (NEW)
          • Run embedded stream processing in a user program
          • ZooKeeper-based coordination between multiple instances of the user program
        – Samza in a cluster
          • Run stream processing as a managed program in a cluster (e.g. SamzaContainer in YARN)
          • Use the cluster manager (e.g. YARN) to provide deployment, coordination, and resource management
  15. Samza-as-a-library
      A Samza job is composed of a collection of standalone processes
      ● Full control over
        ● Application's life cycle
        ● Physical resources allocated to Samza processors
        ● Configuration and initialization
      Diagram: several StreamProcessor instances, each containing a Samza Container and a Job Coordinator; one instance is marked Leader
  16. Samza-as-a-library
      ● ZooKeeper-based JobCoordinator (stateful use case)
      ● JobCoordinator uses ZooKeeper for leader election
      ● Leader will perform partition assignments among all active StreamProcessors
      Diagram: ZooKeeper connected to several StreamProcessor instances, each containing a Samza Container and a Job Coordinator
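The two coordination steps on this slide, electing a leader and having it assign partitions, can be modelled without ZooKeeper. The sketch below assumes the common ZooKeeper recipe in which each participant holds a sequence number and the smallest one leads, and it assumes a simple round-robin assignment. Both are illustrative assumptions rather than a description of the actual ZkJobCoordinator, and `CoordinationSketch` is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory model of leader election and partition assignment; not Samza code.
class CoordinationSketch {
    // Assumption: the processor holding the smallest sequence number is the leader.
    // Requires at least one registered processor.
    static String electLeader(SortedMap<Long, String> processorsBySequence) {
        return processorsBySequence.get(processorsBySequence.firstKey());
    }

    // Assumption: the leader spreads partitions round-robin over the sorted
    // list of active processors. Requires at least one active processor.
    static Map<String, List<Integer>> assign(List<String> activeProcessors, int numPartitions) {
        List<String> sorted = new ArrayList<>(activeProcessors);
        Collections.sort(sorted);
        Map<String, List<Integer>> assignment = new TreeMap<>();
        for (String processor : sorted) {
            assignment.put(processor, new ArrayList<>());
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = sorted.get(partition % sorted.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }
}
```

When a processor joins or leaves, the leader would recompute the assignment over the new set of active processors.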
  17. Samza-as-a-library
      ● Embedded application code example
          public class WikipediaZkLocalApplication {
            /**
             * Executes the application using the local application runner.
             * It takes two required command line arguments:
             *   config-factory: a fully qualified {@link org.apache.samza.config.factories.PropertiesConfigFactory} class name
             *   config-path: path to application properties
             *
             * @param args command line arguments
             */
            public static void main(String[] args) {
              CommandLine cmdLine = new CommandLine();
              OptionSet options = cmdLine.parser().parse(args);
              Config config = cmdLine.loadConfig(options);
              LocalApplicationRunner runner = new LocalApplicationRunner(config);
              WikipediaApplication app = new WikipediaApplication();
              runner.run(app);
              runner.waitForFinish();
            }
          }
  18. Samza-as-a-library
      ● Embedded application code example: the same WikipediaZkLocalApplication code as on slide 17, with the ZooKeeper coordination config
          job.coordinator.factory=org.apache.samza.zk.ZkJobCoordinatorFactory
          job.coordinator.zk.connect=my-zk.server:2191
  19. Samza-as-a-library
      • Embedded application launch sequence
        Participants: myApp.main(), StreamApplication, LocalApplicationRunner, StreamProcessor
        Calls: runner.run(), then streamProcessor.start() (labelled n in the diagram)
  20. Samza in a Cluster
      • Cluster-based application launch sequence
        Diagram elements, in transcript order: run-app.sh; RemoteApplicationRunner; JobRunner (jobRunner.run(), labelled n); main(); config app.class=my.app.MyStreamApplication; Yarn RM; run-jc.sh; config task.execute=run-local-app.sh; run-local-app.sh; StreamApplication (myApp.main()); LocalApplicationRunner (runner.run()); StreamProcessor (streamProcessor.start(), labelled n); Job Coordinator
  21. Unified StreamProcessor Design
  22. Overview
      • High-level API
      • Flexible Deployment Model
      • Convergence between Batch and Stream Processing
  23. Stream Application in Batch
      Application logic: count PageViewEvent for each member in a 5-minute window and send the counts to PageViewEventPerMemberStream.
      Pipeline: PageViewEvent -> re-partition by memberId -> window -> map -> sendTo -> PageViewEventPerMemberStream
      Input and output on HDFS:
        PageViewEvent: hdfs://mydbsnapshot/PageViewEvent/
        PageViewEventPerMemberStream: hdfs://myoutputdb/PageViewEventPerMemberFiles
  24. Stream Application in Batch
      • No code change in application
        Old config:
          streams.pageViewEventStream.system=kafka
          streams.pageViewEventPerMemberStream.system=kafka
        New config:
          streams.pageViewEventStream.system=hdfs
          streams.pageViewEventStream.physical.name=hdfs://mydbsnapshot/PageViewEvent/
          streams.pageViewEventPerMemberStream.system=hdfs
          streams.pageViewEventPerMemberStream.physical.name=hdfs://myoutputdb/PageViewEventPerMemberFiles
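The point of this slide is that the application refers to streams only by logical id, so the backing system can change in config alone. A small JDK-only sketch of that lookup follows. `StreamConfigSketch` and `describe` are hypothetical names, and falling back to the stream id when no `physical.name` is set is an assumption made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// JDK-only illustration of resolving a logical stream id through config; not Samza code.
class StreamConfigSketch {
    // Parse properties-style text into a Properties object.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Resolve "<system>:<physical name>" for a logical stream id.
    // Assumption: the physical name defaults to the stream id when not configured.
    static String describe(Properties config, String streamId) {
        String system = config.getProperty("streams." + streamId + ".system");
        String physical = config.getProperty("streams." + streamId + ".physical.name", streamId);
        return system + ":" + physical;
    }
}
```

The application code keeps asking for `pageViewEventStream`; only the text passed to `parse` differs between the streaming and batch runs.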
  25. Samza 0.13 Architecture
      API: High-level API (unified stream & batch processing)
      RUNNER: Remote Runner (run in remote cluster); Local Runner (run locally)
      DEPLOYMENT: Cluster-based (Yarn, (Mesos)); Embedded (ZooKeeper, Standalone)
      PROCESSOR: StreamProcessor
        Streams: Kafka, Kinesis, HDFS ...
        Local State: RocksDb, In-Memory
        Remote Data
        Multithreading
  26. Future Work
      • Samza runner for Apache Beam
      • Event-time processing
      • Support for exactly-once processing
      • Support for partition expansion in stateful applications
      • Easy access to adjunct datasets
      • SQL over streams
  27. Thank You! Q&A