
Stateful processing of massive out-of-order streams with Apache Beam


With Apache Beam, you can process massive out-of-order streams (and standard batch workloads too) by defining high-level transformation pipelines that run on a variety of backends, including Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

This talk introduces a new feature of the Beam programming model: stateful processing with processing-time and event-time timers. This enhancement unlocks new use cases and efficiencies, such as:

- Microservice-like workflows ("register this user, remind them after a day, and expire their sign-up after a week")
- Customized output control ("only output when the signal has changed by more than 0.3")
- Carefully batched RPCs ("write as many items as possible at the same time, but no more than 500")
- Stream joins with custom output triggering ("join these two streams on an arbitrary join predicate with correct exactly-once results")

In this talk, you will learn how to use Beam to develop complex, stateful pipelines to easily implement scenarios like the above, which you can finely tailor to your precise use case.
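
As a rough illustration of the "customized output control" case listed above, here is a minimal sketch in plain Python, not the Beam API. A dictionary stands in for the per-key state that a stateful transform would keep. The function name `emit_on_change` and the sample readings are hypothetical, and the sketch assumes readings for each key arrive in order.

```python
from typing import Dict, Iterable, Iterator, Tuple


def emit_on_change(
    readings: Iterable[Tuple[str, float]], threshold: float = 0.3
) -> Iterator[Tuple[str, float]]:
    """Yield (key, value) only when value differs from the last value
    emitted for that key by more than `threshold`."""
    # Hypothetical stand-in for per-key state: last emitted value per key.
    last_emitted: Dict[str, float] = {}
    for key, value in readings:
        previous = last_emitted.get(key)
        if previous is None or abs(value - previous) > threshold:
            last_emitted[key] = value  # update state only when we emit
            yield key, value


# Made-up sample data: two keys with interleaved readings.
readings = [("a", 1.0), ("a", 1.2), ("a", 1.5), ("b", 0.0), ("a", 1.6), ("b", 0.5)]
result = list(emit_on_change(readings))
print(result)  # [('a', 1.0), ('a', 1.5), ('b', 0.0), ('b', 0.5)]
```

Note that the comparison is against the last emitted value, not the last seen value, so a slow drift still produces output once it accumulates past the threshold.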

Published in: Technology

  1. Stateful processing of massive out-of-order streams in Apache Beam. Kenneth Knowles, Apache Beam PMC, Software Engineer @ Google / @kennknowles. Dataworks Summit SJC 2017
  2. Agenda: (1) Massive out-of-order streams; (2) Apache Beam for streams; (3) Portable stateful processing with Beam
  3. Massive out-of-order streams
  4. Massive Out-of-order Streams; Computation
  5. Computation; Massive Out-of-order Streams
  6. Massive Out-of-order Streams
  7. Massive Out-of-order Streams
  8. Massive Out-of-order Streams
  9. Use cases for massive out-of-order streams: ● Operations and manufacturing ● Mobile gaming ● Web analytics ● Wearables ● Automotive ● Power grid ● Network monitoring ● (Mobile) banking … anything processing "events that happen" (you can also process things that aren't events; just use fewer features)
  10. Apache Beam for Streams
  11. Are you building one of these?
  12. Are you building one of these? Filter, Join
  13. Which one? Timeline, 2004 to 2016, showing: MapReduce (paper), Apache Hadoop, Dataflow Model (paper), MillWheel (paper), Heron, Apache Spark, Apache Storm, Apache Gearpump (incubating), Apache Apex, Apache Flink, Cloud Dataflow, FlumeJava (paper), Apache Beam, Apache Samza
  14. The Beam Vision: Sum Per Key. Java: input.apply(Sum.integersPerKey()). Python: input | Sum.PerKey(). ⋮ Backends: Cloud Dataflow (fully managed); Apache Spark (local, on-prem, cloud); Apache Flink (local, on-prem, cloud); Apache Apex (local, on-prem, cloud); Apache Gearpump (incubating); ⋮
  15. The Beam Vision: KafkaIO. Java: class KafkaIO extends UnboundedSource { … }. Python. ⋮ Backends: Cloud Dataflow (fully managed); Apache Spark (local, on-prem, cloud); Apache Flink (local, on-prem, cloud); Apache Apex (local, on-prem, cloud); Apache Gearpump (incubating); ⋮
  16. The Beam Model: What are you computing? (read, map, reduce) ● Where in event time? (event time windowing) ● When in processing time are results produced? (triggers) ● How do refinements relate? (accumulation mode). The focus of today
  17. Per element: ParDo (Map, etc). Every item processed independently. Stateless implementation.
  18. Per key: Combine (Reduce, etc). Items grouped by some key and combined. Stateful streaming implementation (buffering until trigger). But your code doesn't work with state, just an associative & commutative function.
  19. It "just works" with massive out-of-order streams. ParDo, Map, etc. Combine, Reduce, etc. ● "Parse incoming events and filter out bad data" ● "Sum per hour and output when you have the whole hour" ● "Put events in 10 minute windows sliding every 2 minutes" ● "Group into sessions and emit as fast as possible"
  20. But what if you need more control? ParDo, Map, etc. Combine, Reduce, etc. ● "I need some state on the side to tweak my FlatMap's behavior" ● "My aggregation is not an associative & commutative operator" ● "Triggers aren't specific enough for my use case" ● "I need to output even when data isn't coming in"
  21. Portable Stateful Processing With Beam
  22. What if you need more control? ParDo, Map, etc. Combine, Reduce, etc. ProcessFunction, MapWithState, Operator. State & timers for ParDo! … that "just works" with out-of-order events … is portable across engines
  23. Example: time-batched requests. On Element: buffer request; "call me back in 500ms". On Timer: batched requests; output return value of batched RPC.
  24. User's view of your transform: Events come in (out of order, windowing specified) ● Some requests (try to contain costs) ● Correct windowed output (don't care how you got them). Inside the transform: On Element, On Timer. Code: input.apply(Window.into( hours )).apply(new EnrichEvents())
  25. Event time windowing still "just works": window into fixed windows of 1 hour; window into 30 min sliding by 10 min
  26. State is per key and window. Bonus: automatically garbage collected when a window expires (vs manual clearing of per-key state)

      | Key     | Window      | MEDIAN_IDLE | MAIN_ACTIVITY | ... |
      | "kenn"  | 9am - 10am  | 10m         | "hack"        |     |
      |         | 12pm - 1pm  | 25m         | "eat"         |     |
      |         | 11pm - 12am | 60m         | "sleep"       |     |
      | "tgroh" | 8am - 9am   | 20m         | "bike"        |     |
      |         | 11am - 12pm | 3m          | "hack"        |     |
      | ...     | ...         |             |               |     |
  27. Unified present & historical processing: same input data, equivalent results
  28. What else can you do with state & timers? ● Domain-specific triggering ("output when five people who live in Seattle have checked in") ● Slowly changing dimensions ("update FX rates for currency ABC") ● Stream joins ("join-matrix" / "join-biclique") ● Fine-grained aggregation ("add odd elements to accumulator A and even elements to accumulator B") ● Per-key workflows (like user sign up flow w/ reminders & expiration)
  29. Summary: Stateful processing in Beam... ● … unlocks new use cases ● … is portable across data processing engines ● … works with event time windowing ● … works for present and historical data
  30. Thank you for listening! This talk: ● Me - @KennKnowles / ● These Slides - Go Deeper: ● Design - ● Blog - Join the Beam community: ● User discussions - ● Development discussions - ● Follow @ApacheBeam on Twitter
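
The time-batched requests example from the slides (buffer each request on element, ask for a callback in 500ms, send the batch on timer) can be sketched roughly as follows. This is a plain-Python simulation, not the Beam API: the class `TimeBatcher`, its methods, the explicit `now_ms` clock, and `fake_rpc` are all hypothetical names chosen for illustration. The size cap borrows the "no more than 500" idea from the talk abstract and is set to 3 here to keep the demo short.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class TimeBatcher:
    """Hypothetical stand-in for a stateful, timer-driven transform."""

    def __init__(self, rpc: Callable[[List[str]], List[str]],
                 delay_ms: int = 500, max_batch: int = 500) -> None:
        self.rpc = rpc
        self.delay_ms = delay_ms
        self.max_batch = max_batch
        self.buffers: Dict[str, List[str]] = defaultdict(list)  # per-key buffer state
        self.timers: Dict[str, int] = {}  # per-key timer deadline in ms
        self.outputs: List[str] = []

    def on_element(self, key: str, request: str, now_ms: int) -> None:
        buffer = self.buffers[key]
        buffer.append(request)
        if key not in self.timers:
            # "call me back in 500ms", counted from the first buffered request
            self.timers[key] = now_ms + self.delay_ms
        if len(buffer) >= self.max_batch:
            self._flush(key)  # cap reached: do not wait for the timer

    def advance_time(self, now_ms: int) -> None:
        """Fire every timer whose deadline has passed (simulated clock)."""
        due = [k for k, deadline in self.timers.items() if deadline <= now_ms]
        for key in due:
            self._flush(key)

    def _flush(self, key: str) -> None:
        batch = self.buffers.pop(key, [])
        self.timers.pop(key, None)
        if batch:
            # Output the return value of the batched RPC.
            self.outputs.extend(self.rpc(batch))


calls: List[List[str]] = []


def fake_rpc(batch: List[str]) -> List[str]:
    """Made-up RPC: records each batch and upper-cases every request."""
    calls.append(list(batch))
    return [request.upper() for request in batch]


batcher = TimeBatcher(fake_rpc, delay_ms=500, max_batch=3)
batcher.on_element("user1", "a", now_ms=0)
batcher.on_element("user1", "b", now_ms=100)
batcher.advance_time(400)  # timer set for 500ms, so nothing fires yet
batcher.advance_time(500)  # timer fires: one RPC with ["a", "b"]
for offset, request in enumerate(["c", "d", "e"]):
    batcher.on_element("user1", request, now_ms=600 + offset)  # third one hits the cap
```

After this run, `calls` holds two batches (`["a", "b"]` from the timer and `["c", "d", "e"]` from the size cap), and both the buffer and the timer for `"user1"` have been cleared. In the sketch the state must be cleared by hand; the slides note that Beam garbage-collects per-key-and-window state when the window expires.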