Apache Samza Past, Present and Future


A walk through the current state of stream processing, the key differentiators that make Samza stand out in the crowd, what's new in Samza, and what's coming next.

Published in: Software

  1. 1. Apache Samza Past, Present and Future. Kartik Paramasivam, Director of Engineering, Streams Infra @ LinkedIn
  2. 2. Agenda 1. Stream Processing 2. State of the Union 3. Apache Samza : Key Differentiators 4. Apache Samza Futures
  3. 3. Stream Processing: Processing events as soon as they happen. ● Stateless Processing ■ Transformation etc. ■ Lookup of adjunct data (database lookups, service calls) ■ Producing results for every event ● Stateful Processing ■ Triggering/producing results periodically (time windows) ■ Maintaining intermediate state, e.g. joining across multiple streams of events ● Common Issues ■ Scale !! Scale !! Scale !! ■ Reliability !! ■ Everything else (upgrades, debugging, diagnostics, security, …)
  4. 4. Stream Processing: State of the Union ● Millwheel, Storm, Heron, Spark Streaming, S4, Dempsey, Samza, Flink, Beam, Dataflow, Azure Stream Analytics, AWS Kinesis Analytics, GearPump, Kafka Streams, Orleans ● Not meant to be an accurate timeline. ● Yes, it is CROWDED !!
  5. 5. Apache Samza ● Top level Apache project since Dec 2014 ● 5 big Releases (0.7, 0.8, 0.9, 0.10, 0.11) ● 62 Contributors ● 14 Committers ● Companies using : LinkedIn, Uber, MetaMarkets, Netflix, Intuit, TripAdvisor, MobileAware, Optimizely …. https://cwiki.apache.org/confluence/display/SAMZA/Powered+By ● Applications at LinkedIn : from ~20 to ~200 in 2 years.
  6. 6. Key Differentiators for Apache Samza ● Performance !! ● Stability ● Support for a variety of input sources ● Stream processing as a service AND as an embedded library
  7. 7. Performance : Accessing Adjunct Data
  8. 8. Performance : Maintaining Temporary State
  9. 9. Performance : Let us talk numbers ! ● 100x difference between using local state and a remote NoSQL store ● Local state details: ○ 1.1 million TPS on a single processing machine (SSD) ○ Used a 3-node Kafka cluster for storing the durable changelog ● Remote state details: ○ 8,500 TPS when the Samza job was changed to access a remote NoSQL store ○ The NoSQL store was also on a 3-node (SSD) cluster
  10. 10. Remote State : Asynchronous Event Processing ● Diagram labels: Event Loop (single thread), processAsync, Remote DB/Services ● Asynchronous I/O calls, using Java NIO, Netty... ● Responses are sent to the main thread via a callback ● The event loop is woken up to process the next message ● task.max.concurrency > 1 to enable pipelining ● Available with Samza 0.11
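The pipelining idea on this slide can be sketched in plain Java, without the Samza API. This is a minimal sketch under stated assumptions: `lookupRemote` is a hypothetical stand-in for a non-blocking remote call, and `MAX_CONCURRENCY` plays the role that `task.max.concurrency` plays on the slide, bounding how many messages are in flight at once.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Plain-Java sketch of callback-driven processing; not the Samza API.
class AsyncPipelineSketch {
    // Hypothetical stand-in for task.max.concurrency: messages allowed in flight.
    static final int MAX_CONCURRENCY = 4;

    // Hypothetical remote lookup; a real job would use a non-blocking client.
    static CompletableFuture<String> lookupRemote(String key, ExecutorService io) {
        return CompletableFuture.supplyAsync(() -> "value-for-" + key, io);
    }

    // One "event loop" thread hands out messages and blocks only when
    // MAX_CONCURRENCY lookups are already outstanding.
    static Map<String, String> processAll(List<String> messages) {
        ExecutorService io = Executors.newFixedThreadPool(MAX_CONCURRENCY);
        Semaphore inFlight = new Semaphore(MAX_CONCURRENCY);
        Map<String, String> results = new ConcurrentHashMap<>();
        try {
            for (String msg : messages) {
                inFlight.acquireUninterruptibly();        // wait for a free slot
                lookupRemote(msg, io).whenComplete((value, error) -> {
                    if (error == null) {
                        results.put(msg, value);          // callback receives the response
                    }
                    inFlight.release();                   // lets the loop take the next message
                });
            }
            // Drain: holding every permit means every callback has finished.
            inFlight.acquireUninterruptibly(MAX_CONCURRENCY);
        } finally {
            io.shutdown();
        }
        return results;
    }
}
```

With `MAX_CONCURRENCY` set to 1 the loop degenerates to one message at a time; raising it lets slow lookups overlap, which is the pipelining effect the slide describes.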
  11. 11. Remote State: Synchronous Processing on Multiple Threads ● Diagram labels: Event Loop (single thread), Schedule process(), Remote DB/Services, built-in thread pool ● Blocking I/O calls ● The event loop is woken up by the worker thread ● job.container.thread.pool.size = N ● Available with Samza 0.11
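The multithreaded model can be sketched in the same spirit. This is a plain-Java illustration, not Samza's implementation: `POOL_SIZE` stands in for `job.container.thread.pool.size`, and `blockingLookup` is a hypothetical blocking call that a worker thread would make on behalf of the event loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java sketch of blocking calls dispatched to a worker pool.
class ThreadPoolSketch {
    // Hypothetical stand-in for job.container.thread.pool.size.
    static final int POOL_SIZE = 3;

    // Hypothetical blocking call; the sleep simulates remote latency.
    static String blockingLookup(String key) {
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return key.toUpperCase();
    }

    // The calling thread schedules work and collects results in input order.
    static List<String> processAll(List<String> messages) {
        ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String msg : messages) {
                Callable<String> call = () -> blockingLookup(msg);
                pending.add(pool.submit(call));           // worker thread blocks, not the loop
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get());                     // wait for each worker to finish
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The trade-off against the asynchronous model is simplicity: the task code stays synchronous, at the cost of one blocked thread per outstanding call.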
  12. 12. Incremental Checkpointing : MVP for stateful apps ● Diagram label: Input stream (e.g. Kafka)
  13. 13. Key Differentiators for Apache Samza ● Performance !! ● Stability ● Support for a variety of input sources ● Stream processing as a service AND as an embedded library
  14. 14. Speed Thrills .. but can kill ● Local State Considerations: ○ State should NOT be reseeded under normal operations (e.g. upgrades, application restarts) ○ Minimal state should be reseeded - if a container dies or is removed - if a container is added
  15. 15. How does Samza keep local state 'stable'? ● Diagram labels: Samza Job, Input Stream, Change-log ● Enable Continuous Scheduling
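The role of the change-log can be illustrated with a small sketch. This is a plain-Java illustration under stated assumptions, not Samza's store implementation: an in-memory list stands in for the durable change-log (a Kafka topic in the slides), and `ChangelogStoreSketch` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a local store backed by a change-log.
class ChangelogStoreSketch {
    // One change-log entry: a single write to a key.
    static final class Entry {
        final String key;
        final long value;

        Entry(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, Long> local = new HashMap<>();
    private final List<Entry> changelog;   // stand-in for a durable log

    ChangelogStoreSketch(List<Entry> changelog) {
        this.changelog = changelog;
    }

    // Every write goes to the fast local state and is appended to the log.
    void put(String key, long value) {
        local.put(key, value);
        changelog.add(new Entry(key, value));
    }

    // Reads are served from local state only.
    Long get(String key) {
        return local.get(key);
    }

    // Reseed a fresh store by replaying the change-log in order;
    // later entries for a key overwrite earlier ones.
    static ChangelogStoreSketch restore(List<Entry> changelog) {
        ChangelogStoreSketch store = new ChangelogStoreSketch(changelog);
        for (Entry e : changelog) {
            store.local.put(e.key, e.value);
        }
        return store;
    }
}
```

Replay cost grows with the size of the log, which is why the preceding slide asks that reseeding be avoided for upgrades and restarts and kept minimal when containers are added or lost.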
  16. 16. Backpressure in a Pipeline ● Kafka or durable intermediate queues are leveraged to avoid backpressure issues in a pipeline. ● This allows each stage to be independent of the next stage.
  17. 17. Key Differentiators for Apache Samza ● Performance !! ● Stability ● Support for a variety of input sources ● Stream processing as a service AND as an embedded library
  18. 18. Pluggable system consumers … Azure EventHub, Azure Document DB, Google Pub-Sub etc.
  19. 19. Batch processing in Samza!! (NEW) ● HDFS system consumer for Samza ● Same Samza processor can be used for processing events from Kafka and HDFS with no code changes ● Scenarios : ○ Experimentation and Testing ○ Re-processing of large datasets ○ Some datasets are readily available on HDFS (company specific)
  20. 20. Samza - HDFS support ● Diagram: HDFS input and HDFS output, with the labels "New" and "Available since Samza 0.10" ● The batch job auto-terminates when the input is fully processed.
  21. 21. Brooklin ● Diagram labels: Brooklin, set offset=0
  22. 22. Backup ● Diagram labels: Databus, Database, Backup (HDFS)
  23. 23. Samza batch pipelines ● Diagram: stages connected through HDFS output and HDFS input
  24. 24. Samza - HDFS Early Performance Results !! ● Benchmark : Count number of records grouped by <Field> ● Data size : 250 GB ● Number of files : 487 ● Chart: time (seconds) versus number of containers, for Samza, Map/Reduce and Spark
  25. 25. Key Differentiators for Apache Samza ● Performance !! ● Stability ● Support for a variety of input sources (batch and streaming) ● Stream processing as a service AND (coming soon) as an embedded library
  26. 26. Stream Processing as a Service ● Based on YARN ○ YARN RM high availability ○ Work-preserving RM ○ Support for heterogeneous hardware with Node Labels (NEW) ● Easy upgrade of the Samza framework : use the Samza version deployed on the machine instead of packaging it with the application. ● Disk quotas for local state (e.g. RocksDB state) ● Samza Management Service (SAMZA-REST) -> next slide
  27. 27. Samza Management Service (Samza REST) (NEW) 1. Exposes the /jobs resource to start, stop, get status of jobs etc. 2. Cleans up stores from dead jobs ● Diagram labels: YARN Resource Managers (RM, SRR); nodes in the YARN cluster (NM, SRN); /v1/jobs; Samza containers ● Legend: Samza REST, YARN processes (RM/NM)
  28. 28. Agenda 1. Stream processing 2. State of the union 3. Apache Samza : Key differentiators 4. Apache Samza Futures
  29. 29. Coming Soon : Samza as a Library ● No YARN dependency ● Will use ZK for leader election ● Embed stream processing into your bigger application ● Diagram: several instances, each with Stream Processor code and a Job Coordinator; one is the Leader
      StreamProcessor processor = new StreamProcessor(config, "job-name", "job-id");
      processor.start();
      processor.awaitStart();
      …
      processor.stop();
  30. 30. Coming Soon: High Level API and Event Time (SAMZA-914/915) ● Count the number of PageViews by Region, every 30 minutes.
      @Override
      public void init(Collection<SystemMessageStream> sources) {
        sources.forEach(source -> {
          // Group page views by their region.
          Function<PageView, String> keyExtractor = view -> view.getRegion();
          source.map(msg -> new PageViewMessage(msg))
                .window(Windows.<PageViewMessage, String>intoSessionCounter(
                    keyExtractor, WindowType.Tumbling, 30 * 60)); // 30 minutes, assuming seconds
        });
      }
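The computation itself, counting page views per region in 30-minute tumbling windows, can be sketched in plain Java. This is a minimal sketch, not the proposed Samza API: `PageView`, `countByRegion`, and the epoch-second timestamps are assumptions for illustration. Each event is assigned to the window that contains its timestamp, and counts are kept per window start and region.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of a tumbling-window count; not the Samza high-level API.
class TumblingCountSketch {
    static final long WINDOW_SECONDS = 30 * 60;   // 30-minute windows

    // Hypothetical event type: a region plus an event time in epoch seconds.
    static final class PageView {
        final String region;
        final long epochSeconds;

        PageView(String region, long epochSeconds) {
            this.region = region;
            this.epochSeconds = epochSeconds;
        }
    }

    // Returns window start (epoch seconds) -> region -> count.
    static Map<Long, Map<String, Long>> countByRegion(List<PageView> views) {
        Map<Long, Map<String, Long>> counts = new TreeMap<>();
        for (PageView v : views) {
            // Round the event time down to the start of its window.
            long windowStart = Math.floorDiv(v.epochSeconds, WINDOW_SECONDS) * WINDOW_SECONDS;
            counts.computeIfAbsent(windowStart, k -> new TreeMap<>())
                  .merge(v.region, 1L, Long::sum);
        }
        return counts;
    }
}
```

Because windows are derived from the event's own timestamp rather than arrival time, the result does not depend on the order in which events are processed, which is the property event-time support is meant to provide.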
  31. 31. Coming Soon: First class support for Pipelines (SAMZA-1041) ● Diagram labels: input, Shuffle, Join, output
      public class MyPipeline implements PipelineFactory {
        public Pipeline create(Config config) {
          Processor myShuffler = getShuffle(config);
          Processor myJoiner = getJoin(config);
          Stream inStream1 = getStream(config, "inStream1");
          // … omitted for brevity
          PipelineBuilder builder = new PipelineBuilder();
          // Wire input -> shuffle -> join -> output.
          return builder.addInputStreams(myShuffler, inStream1)
              .addOutputStreams(myShuffler, intermediateOutStream)
              .addInputStreams(myJoiner, intermediateOutStream, inStream2)
              .addOutputStreams(myJoiner, finalOutStream)
              .build();
        }
      }
  32. 32. Future: Miscellaneous ● Exactly once processing ● Making it easier to auto-scale even with Local State (on-demand Standby containers) ● Turnkey Disaster Recovery for stateful applications ○ Easy Restore of changelog and checkpoints from some other datacenter. ● Improved support for Batch jobs ● SQL over Streams ● A default Dashboard :)
  33. 33. Questions ?