Samza Demo @scale 2017


Slides used for the Samza demo at the @Scale 2017 conference.

  1. Scale of Event Processing at LinkedIn
     Apache Kafka
     • 2.1 trillion messages ingested per day
     • 0.5 PB in, 2 PB out per day (compressed)
     • 16 million msg/sec at peak
     Apache Samza
     • Over 500 applications running in production
     • 10,000+ containers
     • Applications with several TB of local state
  2. Why Apache Samza?
     Best-in-Class Support for Stateful Stream Processing
     • Incremental checkpointing for large state and fast recovery.
     • Local state that works seamlessly across upgrades and failures.
     • Async processing for efficient remote I/O.
     Hardened at Internet Scale
     • In use at LinkedIn, Uber, Netflix, Intuit, Metamarkets, TripAdvisor, VMware, Optimizely, Redfin, etc.
     • Processing events from Kafka, Kinesis, EventHub, HDFS, ZeroMQ, DynamoDB Streams, MongoDB, Databus, Brooklin, etc.
     Unified API for Stream and Batch Processing
     • Process data in streams or in Hadoop without any code changes.
     Run as a Service or a Library
     • Write once, run anywhere.
     • Deploy in a managed cluster, or embed as a library in another application.
  3. Stream (data in motion) Processing
     • Click stream processing, interactive user feeds
     • Security, fraud detection
     • Application monitoring
     • Internet of Things
     • Ads, gaming, trading, etc.
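  4. Multi-Stage Dataflow Example
     Dataflow: Page View (in stream) → Repartition by member id → Window → Map → SendTo → Page View per Member (out stream)
     The partitionBy, window, map and sendTo calls are built-in transform functions.

         public class PageViewCountApplication implements StreamApplication {
           @Override
           public void init(StreamGraph graph, Config config) {
             // Input and output streams are looked up by their stream ids.
             MessageStream<PageViewEvent> pageViewEvents =
                 graph.getInputStream("pageViewStream");
             OutputStream<MyStreamOutput> pageViewPerMember =
                 graph.getOutputStream("pageViewPerMemberStream");

             pageViewEvents
                 // Repartition so that all events of one member reach the same task.
                 .partitionBy(m -> m.memberId)
                 // Count events per member in 5-minute tumbling windows;
                 // initialValue supplies the starting count (defined elsewhere).
                 .window(Windows.keyedTumblingWindow(
                     m -> m.memberId, Duration.ofMinutes(5), initialValue, (m, c) -> c + 1))
                 // Convert each window result into the output message type.
                 .map(MyStreamOutput::new)
                 .sendTo(pageViewPerMember);
           }
         }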
  5. Stream Application in Batch
     Application logic: count the number of ‘Page Views’ for each member in a 5-minute window and send the counts to ‘Page View Per Member’.
     Dataflow: Page View (in stream) → Repartition by member id → Window → Map → SendTo → Page View per Member (out stream)
     HDFS
     • PageView: hdfs://mydbsnapshot/PageViewFiles/
     • PageViewPerMember: hdfs://myoutputdb/PageViewPerMemberFiles
     Zero code changes
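     The application logic on slides 4 and 5 can be illustrated without Samza at all. What follows is a minimal, plain-Java sketch of the two core steps (repartition by member id, then a per-member count in 5-minute tumbling windows). It is not Samza's implementation: the class `PageViewWindowSketch`, the `timestampMs` field, and the hash-modulo partitioning are assumptions made for illustration only.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PageViewWindowSketch {

    /** Hypothetical event type; the slides only show a memberId field. */
    public static final class PageViewEvent {
        public final String memberId;
        public final long timestampMs; // assumed event-time field, not on the slides

        public PageViewEvent(String memberId, long timestampMs) {
            this.memberId = memberId;
            this.timestampMs = timestampMs;
        }
    }

    /**
     * Repartition step: route a member id to one of numPartitions partitions.
     * Hash-modulo is an assumption; a real system may partition differently.
     */
    public static int partitionFor(String memberId, int numPartitions) {
        return Math.floorMod(memberId.hashCode(), numPartitions);
    }

    /**
     * Window step: count events per member in tumbling windows.
     * Returns window start (ms) -> member id -> count.
     */
    public static Map<Long, Map<String, Integer>> countPerMember(
            List<PageViewEvent> events, Duration windowSize) {
        long sizeMs = windowSize.toMillis();
        Map<Long, Map<String, Integer>> counts = new TreeMap<>();
        for (PageViewEvent e : events) {
            // Align the timestamp down to the start of its window.
            long windowStart = Math.floorDiv(e.timestampMs, sizeMs) * sizeMs;
            counts.computeIfAbsent(windowStart, w -> new TreeMap<>())
                  .merge(e.memberId, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<PageViewEvent> events = new ArrayList<>();
        events.add(new PageViewEvent("alice", 0L));
        events.add(new PageViewEvent("alice", 60_000L));
        events.add(new PageViewEvent("bob", 120_000L));
        events.add(new PageViewEvent("alice", 300_000L)); // falls into the next window
        // prints {0={alice=2, bob=1}, 300000={alice=1}}
        System.out.println(countPerMember(events, Duration.ofMinutes(5)));
    }
}
```

     Because every event for a given member id maps to the same partition, each partition can keep its window counts locally without coordinating with the others, which is presumably why the dataflow repartitions before windowing.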
  6. Stream Processing as a Library
     Dataflow: Page View → Repartition by member id → Window → Map → SendTo → Page View per Member
     Launch Stream Processor:

         public static void main(String[] args) {
           // Load the job configuration from the command-line options.
           CommandLine cmdLine = new CommandLine();
           OptionSet options = cmdLine.parser().parse(args);
           Config config = cmdLine.loadConfig(options);

           // Run the same application embedded in this process.
           LocalApplicationRunner runner = new LocalApplicationRunner(config);
           PageViewCountApplication app = new PageViewCountApplication();
           runner.run(app);
           runner.waitForFinish();
         }

     Configuration:

         job.coordinator.factory=org.apache.samza.zk.ZkJobCoordinatorFactory
         job.coordinator.zk.connect=my-zk.server:2191

     Zero code changes
  7. Data Ingestion at LinkedIn
     Diagram labels:
     • Clients (browser, devices, …)
     • Services Tier
     • Ingestion
     • Apache Kafka
     • Brooklin
     • Espresso, Oracle, AWS Kinesis, Azure EventHub
     • Real Time Processing (Apache Samza)
  8. Backup
  9. Local State -- Throughput
     • Remote state is 30-150x worse than local state.
     • On disk with caching is comparable with in-memory.
     • The changelog adds minimal overhead.
  10. Failure Recovery
     • Roughly constant overhead with Host Affinity.
     • Parallel recovery: equal recovery time irrespective of the number of failed containers.
  11. Samza HDFS Benchmark
     • Profile count, group-by country
     • 500 files
     • 250 GB input