
Samza Demo @scale 2017


The slides used for the Samza demo at the @Scale 2017 conference.

Published in: Software


  1. Scale of Event Processing at LinkedIn
     Apache Kafka
       • 2.1 trillion messages ingested per day
       • 0.5 PB in, 2 PB out per day (compressed)
       • 16 million msg/sec at peak
     Apache Samza
       • Over 500 applications running in production
       • 10,000+ containers
       • Applications with several TB of local state
  2. Why Apache Samza?
     Best-in-Class Support for Stateful Stream Processing
       • Incremental checkpointing for large state and fast recovery.
       • Local state that works seamlessly across upgrades and failures.
       • Async processing for efficient remote I/O.
     Hardened at Internet Scale
       • In use at LinkedIn, Uber, Netflix, Intuit, Metamarkets, TripAdvisor, VMware, Optimizely, Redfin, etc.
       • Processing events from Kafka, Kinesis, EventHub, HDFS, ZeroMQ, DynamoDB Streams, MongoDB, Databus, Brooklin, etc.
     Unified API for Stream and Batch Processing
       • Process data in streams or in Hadoop without any code changes.
     Run as a Service or a Library
       • Write once, run anywhere.
       • Deploy in a managed cluster, or embed as a library in another application.
  3. Stream (Data in Motion) Processing
       • Click stream processing, interactive user feeds
       • Security, fraud detection
       • Application monitoring
       • Internet of Things
       • Ads, gaming, trading, etc.
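  4. Multi-Stage Dataflow Example
     Dataflow: Page View (in stream) -> Repartition by member id -> Window -> Map -> SendTo -> Page View per Member (out stream)
     The stages are expressed with built-in transform functions:

         public class PageViewCountApplication implements StreamApplication {
           @Override
           public void init(StreamGraph graph, Config config) {
             // Input and output streams are looked up by stream id.
             MessageStream<PageViewEvent> pageViewEvents =
                 graph.getInputStream("pageViewStream");
             MessageStream pageViewPerMember =
                 graph.getOutputStream("pageViewPerMemberStream");

             // The slide called an undeclared variable `pageView` here.
             pageViewEvents
                 .partitionBy(m -> m.memberId)   // repartition by member id
                 // Count views per member in 5-minute tumbling windows.
                 // `initialValue` is not defined on the slide.
                 .window(Windows.keyedTumblingWindow(m -> m.memberId,
                     Duration.ofMinutes(5), initialValue, (m, c) -> c + 1))
                 .map(MyStreamOutput::new)       // convert each window result to the output type
                 .sendTo(pageViewPerMember);
           }
         }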
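The "Repartition by member id" stage can be pictured with a small stand-alone sketch. This is plain Java, not Samza code: the class `RepartitionSketch`, its methods, and the hash-modulo scheme are assumptions made for illustration, and the slides do not say how Samza assigns keys to partitions. The point is only that every event with the same member id lands in the same partition, so a downstream per-member count sees all of that member's events.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RepartitionSketch {

    /** Maps a member id to a partition index in [0, numPartitions). Hypothetical scheme. */
    static int partitionFor(String memberId, int numPartitions) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(memberId.hashCode(), numPartitions);
    }

    /** Distributes member ids over partitions; equal ids always share a partition. */
    static List<List<String>> repartition(List<String> memberIds, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String id : memberIds) {
            partitions.get(partitionFor(id, numPartitions)).add(id);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<List<String>> parts =
            repartition(Arrays.asList("alice", "bob", "alice", "carol"), 4);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("partition " + i + ": " + parts.get(i));
        }
    }
}
```

Running `main` prints the contents of each of the four partitions; both "alice" entries appear in the same one.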
  5. Stream Application in Batch
     Application logic: count the number of ‘Page Views’ for each member in a 5-minute window and send the counts to ‘Page View Per Member’.
     Dataflow: Page View (in stream) -> Repartition by member id -> Window -> Map -> SendTo -> Page View per Member (out stream)
     Input and output on HDFS:
       • PageView: hdfs://mydbsnapshot/PageViewFiles/
       • PageViewPerMember: hdfs://myoutputdb/PageViewPerMemberFiles
     Zero code changes.
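The application logic stated on this slide (count page views per member in 5-minute windows) can be sketched over a bounded list of events, which is roughly what the batch case amounts to. This is a plain-Java illustration, not Samza's implementation: the `PageView` class, the `countPerMember` method, and the `memberId@windowStart` key format are hypothetical, and windows are assigned from an event timestamp here for determinism, which is an assumption the slides do not confirm.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TumblingWindowCountSketch {

    /** Hypothetical event: which member viewed a page, and when (epoch milliseconds). */
    static final class PageView {
        final String memberId;
        final long timestampMs;

        PageView(String memberId, long timestampMs) {
            this.memberId = memberId;
            this.timestampMs = timestampMs;
        }
    }

    /** Counts events per member and tumbling window; keys look like "memberId@windowStartMs". */
    static Map<String, Integer> countPerMember(List<PageView> events, Duration windowSize) {
        long sizeMs = windowSize.toMillis();
        Map<String, Integer> counts = new TreeMap<>();
        for (PageView e : events) {
            // Tumbling windows do not overlap: each timestamp falls in exactly one window.
            long windowStart = Math.floorDiv(e.timestampMs, sizeMs) * sizeMs;
            counts.merge(e.memberId + "@" + windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<PageView> events = Arrays.asList(
            new PageView("alice", 0L),
            new PageView("alice", 60_000L),
            new PageView("bob", 120_000L),
            new PageView("alice", 300_000L)); // falls in the second 5-minute window
        System.out.println(countPerMember(events, Duration.ofMinutes(5)));
    }
}
```

The same `countPerMember` logic does not care whether the events came from a file snapshot or were collected from a live stream, which is the intuition behind running one application in both modes.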
  6. Stream Processing as a Library
     Dataflow: Page View -> Repartition by member id -> Window -> Map -> SendTo -> Page View per Member
     Launch the stream processor:

         public static void main(String[] args) {
           // Load the job configuration from the command-line options.
           CommandLine cmdLine = new CommandLine();
           OptionSet options = cmdLine.parser().parse(args);
           Config config = cmdLine.loadConfig(options);

           LocalApplicationRunner runner = new LocalApplicationRunner(config);
           PageViewCountApplication app = new PageViewCountApplication();
           // The transcript has only a stray ";" here; the run call appears to have been lost.
           runner.run(app);
           runner.waitForFinish();
         }

     Configuration:

         job.coordinator.factory=org.apache.samza.zk.ZkJobCoordinatorFactory
         job.coordinator.zk.connect=my-zk.server:2191

     Zero code changes.
  7. Data Ingestion at LinkedIn
     Diagram labels: Clients (browser, devices, …), Services Tier, Ingestion, Apache Kafka, Brooklin, Espresso, Oracle, AWS Kinesis, Azure EventHub, Processing, Real Time Processing (Apache Samza)
  8. Backup
  9. Local State: Throughput
       • Remote state is 30-150x worse than local state.
       • On disk with caching is comparable with in memory.
       • Changelog adds minimal overhead.
  10. Failure Recovery
       • ~Constant overhead with Host Affinity.
       • Parallel recovery: equal recovery time irrespective of the number of failed containers.
  11. Samza HDFS Benchmark
       • Profile count, group-by country
       • 500 files
       • 250 GB input