Tale of two streaming frameworks (Karthik D - Walmart)

Speaker: Karthik Deivasigamani (https://www.linkedin.com/in/karthikdeivasigamani/)

Talk presented during Bangalore Kafka group's stream processing meetup at Walmart
https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/266777028/

Published in: Technology

  1. Tale of two stream processing frameworks: Apache Storm & Apache Flink. Karthik Deivasigamani, @WalmartLabs
  2. Streaming
     • Stream
       – Continuous flow
     • Streaming Data
       – Data that is continuously generated by different sources
       – Unbounded data
     • Stream Processing
       – Processing of data in motion, that is, computing on data directly as it is produced or received
       – Done by a data processing engine designed with infinite data sets in mind
  3. Retail Data
     • Catalog Data
     • Pricing Data
     • Clickstream logs
     • Payments
     • Order Data
     • Inventory
     • Delivery Logistics
  4. Not so long ago...
     • Data submitted as feeds
     • Periodic data collection
     • Data processed in batches
     • Runs offline
     • Delay between actual time & processing time
     • Failures
  5. Need For Speed – Fast Data
     • Catalog Updates
     • Price Updates
     • Fraud Detection
     • Out of stock
     • Delivery alerts
     • Personalization
  6. (no text on this slide)
  7. Catalog Use Case
  8. Catalog Functions
     • Normalization
     • Classification
     • Product Matching
     • Shelving
     • Attribute Extraction
     • Grouping
     • Image
  9. Characteristics of ingestion pipeline
     • Zero message loss
     • Fault tolerance
     • Source-based priority queue
     • Scale to millions of product updates/hour
     • Near-real-time updates
     • Checkpoint at various stages
  10. Apache Storm
      • Created by Nathan Marz
      • Stream Abstraction
      • Spouts, Bolts, Topology
      • Trident
      • Kafka Integration
      • Message processing guarantees
  11. Storm Cluster
      • Nimbus
        – Distributes code
        – Assigns tasks to machines
        – Monitors for failures
      • Supervisor
        – Communicates with Nimbus through ZooKeeper
        – Starts and stops workers according to signals from Nimbus
      • ZooKeeper
        – Coordinates the Storm cluster
  12. Key Concepts
      • Tuples – named list of values, where each value can be of any type
      • Stream – unbounded sequence of tuples
      • Spout – source of streams in a computation
      • Bolts – process input streams and produce output streams
      • Topology – DAG: a network of spouts and bolts
  13. Stream Grouping
      • Shuffle Grouping
      • Fields Grouping
      • All Grouping
      • Global Grouping
      • Local or Shuffle Grouping
      • Direct Grouping
  14. Parallelism of a Storm Topology
      • Worker processes
        – Each executes a subset of a topology
      • Executors (threads)
        – An executor is a thread spawned by a worker process
        – It may run one or more tasks for the same component (spout or bolt)
      • Tasks
        – Perform the actual data processing; each spout or bolt that you implement in your code executes as many tasks across the cluster
  15. Guaranteeing Message Processing
  16. Micro Service vs Bolt
      • Choice of language
      • Teams operate independently
      • Platform with pluggable services
      Bolt
  17. Catalog Pipeline
  18. Challenges
      • Validations at various stages
      • Async IO using RxJava, Hystrix
      • Hystrix Circuit Breaker
      • Failing Tuples
      • Fetch-size, increase workers, increase bolt parallelism
      • Data Errors
      • Services taking longer
      • Service outage
      • Fatal Errors
      • Spike in traffic
  19. Lessons Learnt
      • Things will fail
      • Monitor everything
      • Automation
      • Scale is not a feature
      • Logs don't lie
  20. (no text on this slide)
  21. Pricing Use Case
      • Competitive pricing (EDLP)
      • Seller price updates
      • Handle spike during holidays
      • Promotions
      • Anomaly Detection
      • Accuracy
  22. Characteristics of ingestion pipeline
      • Exactly once
      • Order guarantee
      • Stateful
      • Handle tens of millions of updates/hour
      • Near-real-time (NRT) price updates on the website
      • Traceability
  23. Apache Flink
      • Started as Project Stratosphere at universities around Berlin
      • data Artisans founded in 2014
      • Processes unbounded and bounded data
      • Exactly once
      • Stateful & flexible API
      • Alibaba was using it at scale
  24. Apache Flink - Overview
      • Data source: incoming data that Flink processes
      • Transformations: the processing step, when Flink modifies incoming data
      • Data sink: where Flink sends data after processing
  25. Apache Flink - Runtime
  26. Stateful Stream Processing
      • State is shared between events.
      • Past events can influence the way current events are processed.
      • Embedded database (RocksDB) for state.
      • Local state needs to be protected against failures to avoid data loss.
      • Checkpointing guarantees persistence of state.
  27. Flink Checkpointing (Chandy-Lamport Algorithm)
  28. Exactly Once - Explained
      • The label "exactly-once" is misleading about what is done exactly once.
      • No stream processing system can guarantee exactly-once event processing.
      • Flink guarantees exactly-once state updates.
      • Flink uses a variant of the Chandy-Lamport algorithm to draw consistent snapshots of the current state, which form a checkpoint.
      • Flink restarts an application using the most recently completed checkpoint as a starting point.
  29. Duplicate Events
  30. Pricing Pipeline
  31. Challenges
      • HTTP/DB lookup calls
      • Huge payload choking network
      • Isolation
      • Buffer bloat
      • Async I/O Operator
      • Operator Chaining
      • Mesos / YARN
      • taskmanager.memory.segment-size
  32. What we learnt
      • Flink is fast, and its APIs are very easy to use.
      • Avoid network shuffles; use forward partitioning / operator chaining.
      • Use accumulators to monitor the progress of your application.
      • Checkpoint failures indicate that your application is running slow.
      • Monitor everything: lag, checkpoints, latency, etc.
      • For inherently slow applications, configure your buffers to account for buffer bloat so that checkpoints don't fail.
      • Join the Flink user mailing list and ask questions!
  33. Apache Storm vs Apache Flink (feature, each framework, and the winner where the transcript gives one)
      • True streaming
        – Storm: Yes
        – Flink: Yes
        – Winner: Tie
      • Speed
        – Storm: Fast
        – Flink: Amazingly fast
      • Overall maturity
        – Storm: Very stable; we haven't really encountered Storm bugs that hit us in production.
        – Flink: A little behind; we ran into lots of Flink bugs, some of which are addressed now.
      • API
        – Storm: Used to be very primitive until 1.0
        – Flink: Rich API; you can achieve a lot by writing very few lines of code.
      • Windowing, Join
        – Storm: Support added in 1.2
        – Flink: Excellent out-of-the-box support for windowing and joins.
        – Winner: Tie
      • Monitoring / Deployment
        – Storm: Better isolation of jobs with the process model
        – Flink: You need YARN/Mesos to get better isolation.
        – Winner: Tie (assumes you are running Flink on YARN)
      • Stateful stream processing
        – Storm: WIP (Apache Storm 2.0)
        – Flink: Supported with RocksDB. You can also query the state from outside your stream processing system.
      • Message processing guarantee
        – Storm: Supports at least once, at most once, exactly once (needs Trident)
        – Flink: Supports at least once, at most once, exactly once (state is touched exactly once)
        – Winner: Tie
      • Backpressure
        – Storm: Max spout pending can be used to adjust
        – Flink: Handled automatically
      • Async IO support
        – Storm: No native support
        – Flink: Out of the box
      • Streaming SQL
        – Storm: WIP (Apache Storm 2.0)
        – Flink: Very early stage
        – Winner: -
  34. What should I pick
  35. Future of streaming - Cloud
      • Amazon Kinesis Streams
      • Functions as stream processors
      • Cloud Flow
      • Confluent Cloud
      • Event Hub – Kafka Compatible
  36. Thank You! Yes, we are hiring! https://indiacareers.walmartlabs.com/
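
Slide 13 lists Storm's stream groupings without showing how they differ. The sketch below is a toy model in plain Python, not Storm's API: it contrasts a simplified shuffle grouping (random task choice) with fields grouping (a stable hash of one field, so tuples with the same value always reach the same task). The field name `item_id`, the sample updates, and the function names are all hypothetical.

```python
import random
import zlib


def shuffle_grouping(num_tasks, rng):
    """Pick a target task at random (simplified stand-in for shuffle grouping)."""
    return rng.randrange(num_tasks)


def fields_grouping(tup, field, num_tasks):
    """Route by a stable hash of one field, so equal values share a task."""
    key = str(tup[field]).encode("utf-8")
    return zlib.crc32(key) % num_tasks


def route(tuples, num_tasks, field):
    """Return {task_index: [tuples]} under fields grouping."""
    tasks = {i: [] for i in range(num_tasks)}
    for tup in tuples:
        tasks[fields_grouping(tup, field, num_tasks)].append(tup)
    return tasks


# Hypothetical product updates; "item_id" is an invented field name.
updates = [
    {"item_id": "A1", "price": 10},
    {"item_id": "B2", "price": 20},
    {"item_id": "A1", "price": 11},
    {"item_id": "C3", "price": 30},
    {"item_id": "B2", "price": 21},
]
routed = route(updates, num_tasks=3, field="item_id")
```

Fields grouping is the usual choice when a bolt keeps per-key data, since every update for one key lands on the same task; shuffle grouping suits stateless work where only even load matters.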
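
Slide 18 names the Hystrix circuit breaker as one answer to slow services and outages. The following is a minimal sketch of the circuit-breaker pattern itself, written in plain Python; it is not Hystrix's API, and the class name, parameters, and thresholds are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock            # injectable so tests can control time
        self.failures = 0
        self.opened_at = None         # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("failing fast; downstream presumed unhealthy")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0             # any success closes the breaker fully
        return result
```

Wrapping each downstream call in `breaker.call(...)` means that during an outage a processing step fails fast instead of holding a thread until a timeout, which is the behaviour the slide's "Services taking longer" and "Service outage" items point at.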
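
Slide 24 describes a Flink job as data source, transformations, and data sink. As a rough analogy only (plain Python generators, not the Flink DataStream API), the three stages can be sketched as below; the record fields and the validation rule are hypothetical.

```python
def source(records):
    """Data source: yields incoming records one at a time."""
    for record in records:
        yield record


def transform(stream):
    """Transformation: drop invalid prices and normalize the item id."""
    for record in stream:
        if record["price"] <= 0:
            continue
        yield {"item_id": record["item_id"].upper(), "price": record["price"]}


def sink(stream, out):
    """Data sink: append processed records to an output list."""
    for record in stream:
        out.append(record)
    return out


# Hypothetical price updates.
raw = [
    {"item_id": "a1", "price": 9.5},
    {"item_id": "b2", "price": -1},
    {"item_id": "c3", "price": 5.5},
]
results = sink(transform(source(raw)), [])
```

Because each stage pulls one record at a time from the previous one, nothing in the chain needs the full input up front, which is the property that lets the same shape work on unbounded data.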
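
Slides 26 to 28 make the point that "exactly once" refers to state updates, not to how many times an event is processed. The toy below illustrates that idea for a single operator in plain Python: it snapshots (input offset, state) periodically, and after a simulated crash it restores the last snapshot and replays from that offset. It is not Flink's barrier-based distributed snapshot; the function name, the checkpoint interval, and the sample keys are assumptions.

```python
import copy


def run(events, checkpoint_every, fail_at=None):
    """Count events per key; optionally crash once at input offset fail_at.

    Returns (state, times_processed). After the crash, state is restored
    from the last checkpoint and the input is replayed from that offset.
    """
    state = {}                      # keyed state: key -> count
    checkpoint = (0, {})            # (next offset to read, snapshot of state)
    offset = 0
    processed = 0
    crashed = False
    while offset < len(events):
        if offset == fail_at and not crashed:
            crashed = True
            # Simulated failure: roll back to the last completed checkpoint.
            offset, snapshot = checkpoint
            state = copy.deepcopy(snapshot)
            continue
        key = events[offset]
        state[key] = state.get(key, 0) + 1
        processed += 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, copy.deepcopy(state))
    return state, processed


events = ["sku1", "sku2", "sku1", "sku3", "sku1", "sku2", "sku3"]
clean_state, clean_processed = run(events, checkpoint_every=3)
recovered_state, recovered_processed = run(events, checkpoint_every=3, fail_at=5)
```

In the recovered run some events are processed twice, yet the final counts match the failure-free run, because the state and the input position are rolled back together. Side effects outside the checkpointed state (for example, a write to an external system) would still be repeated, which is why slide 29 treats duplicate events separately.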
