
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel Efficiency


Apache Beam (incubating) is a unified batch and streaming data processing programming model that is efficient and portable. Beam evolved from a decade of system-building at Google, and Beam pipelines run today on both open source (Apache Flink, Apache Spark) and proprietary (Google Cloud Dataflow) runners. This talk focuses on I/O and connectors in Apache Beam, specifically its APIs for efficient, parallel, adaptive I/O. It discusses how these APIs enable a Beam runner to dynamically rebalance work at runtime, to work around stragglers, and to automatically scale cluster size up and down as a job’s workload changes. Together, these APIs and techniques enable Apache Beam runners to use computing resources efficiently without compromising performance or correctness. Practical examples and a demonstration of Beam are included.

Published in: Data & Analytics


  1. Introduction to Apache Beam & No shard left behind: APIs for massive parallel efficiency. Dan Halperin, Apache Beam (incubating) PPMC; Senior Software Engineer, Google
  2. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  3. What is Apache Beam? The Beam programming model: What / Where / When / How. SDKs for writing Beam pipelines: Java, Python. Beam runners for existing distributed processing backends: Apache Apex, Apache Flink, Apache Gearpump, Apache Spark, Google Cloud Dataflow. [Diagram: Beam Java, Beam Python, and other-language SDKs feed the Beam model (pipeline construction + Fn runners), which executes on Apex, Flink, Gearpump, Spark, and Cloud Dataflow.]
  4. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  5-11. Quick overview of the Beam model: clean abstractions hide details. (Built up one item per slide:)
• PCollection – a parallel collection of timestamped elements that are in windows.
• Sources & Readers – produce PCollections of timestamped elements and a watermark.
• ParDo – flatmap over elements of a PCollection.
• (Co)GroupByKey – shuffle & group {{K: V}} → {K: [V]}.
• Side inputs – global view of a PCollection used for broadcast / joins.
• Window – reassign elements to zero or more windows; may be data-dependent.
• Triggers – user flow control based on window, watermark, element count, lateness.
  12. 1. Classic Batch 2. Batch with Fixed Windows 3. Streaming 4. Streaming with Speculative + Late Data 5. Sessions
  13. Simple clickstream analysis pipeline. Data: a JSON-encoded analytics stream from the site, e.g. {“user”:“dhalperi”, “page”:“oreilly.com/feed/7”, “tstamp”:”2016-08-31T15:07Z”, …}. Desired output: per-user session length and activity level, e.g. dhalperi, 17 pageviews, 2016-08-31 15:00-15:35. Other pipeline goals:
• Live data: can track ongoing sessions with speculative output (dhalperi, 10 pageviews, 2016-08-31 15:00-15:15 (EARLY)).
• Archival data: much faster, still correct output respecting event time.
  14. Simple clickstream analysis pipeline
PCollection<KV<User, Click>> clickstream =
    pipeline.apply(IO.Read(…))
            .apply(MapElements.of(new ParseClicksAndAssignUser()));
PCollection<KV<User, Long>> userSessions =
    clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
                      .triggering(AtWatermark()
                          .withEarlyFirings(AtPeriod(Minutes(1)))))
               .apply(Count.perKey());
userSessions.apply(MapElements.of(new FormatSessionsForOutput()))
            .apply(IO.Write(…));
pipeline.run();
  15. Unified unbounded & bounded PCollections
pipeline.apply(IO.Read(…)).apply(MapElements.of(new ParseClicksAndAssignUser()));
• Apache Kafka, ActiveMQ, tailing filesystem, …: a live, roughly in-order stream of messages; unbounded PCollections. KafkaIO.read().fromTopic(“pageviews”)
• HDFS, Google Cloud Storage, yesterday’s Kafka log, …: archival data, often readable in any order; bounded PCollections. TextIO.read().from(“gs://oreilly/pageviews/*”)
  16-22. Windowing and triggers
PCollection<KV<User, Long>> userSessions =
    clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
                      .triggering(AtWatermark()
                          .withEarlyFirings(AtPeriod(Minutes(1)))))
[Animated chart: event time 3:00-3:25 on one axis, processing time on the other. As processing time advances, early firings first report 1 session, 3:04–3:10 (EARLY), then 2 sessions, 3:04–3:10 & 3:15–3:20 (EARLY); once the watermark passes the end of the data, the final result is one session, 3:04–3:25.]
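The session-merging behavior the chart shows (two early sessions later merging into one final session once a bridging event arrives) follows from the gap rule: events for one user belong to the same session as long as consecutive events are no more than the gap apart. A plain-Java sketch of that rule, not the Beam implementation, with illustrative names:

```java
import java.util.*;

// Plain-Java sketch of session windowing (not the Beam API): sort one user's
// event times and start a new session whenever the gap between consecutive
// events exceeds gapMinutes. Assumes a non-empty input list.
public class SessionSketch {
    // Returns {start, end} pairs; end is the last event time in the session.
    static List<long[]> sessions(List<Long> eventMinutes, long gapMinutes) {
        List<Long> sorted = new ArrayList<>(eventMinutes);
        Collections.sort(sorted);
        List<long[]> out = new ArrayList<>();
        long start = sorted.get(0), last = sorted.get(0);
        for (long t : sorted.subList(1, sorted.size())) {
            if (t - last > gapMinutes) {          // gap too large: close session
                out.add(new long[] {start, last});
                start = t;
            }
            last = t;
        }
        out.add(new long[] {start, last});        // close the final session
        return out;
    }
}
```

With a 3-minute gap, events at minutes 4, 6, 15, 17 form two sessions (4-6 and 15-17); add an event at minute 9 or 12 and the sessions start chaining together, which is how the two EARLY sessions in the chart merge into one.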
  23. Writing output, and bundling
userSessions.apply(MapElements.of(new FormatSessionsForOutput()))
            .apply(IO.Write(…));
Writing is the dual of reading: format and then output. Not its own primitive, but implemented based on sink semantics using ParDo + graph structure.
  24-27. Two example runs of this pipeline. (Built up across slides:)
Streaming job consuming the Kafka stream: • uses 10 workers • pipeline lag of a few seconds • 2 million users over 1 day • ~4.7M early + final sessions • 240 worker-hours.
Batch job consuming the HDFS archive: • uses 200 workers • runs for 30 minutes • same input • ~2.1M final sessions • 100 worker-hours.
What does the user have to change to get these results? A: O(10 lines of code) + command-line arguments.
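As a sanity check on the resource comparison above (assuming the streaming job's 10 workers ran for the full 24-hour day of input, which the slide implies but does not state):

```java
// Worker-hours arithmetic behind the two runs: workers multiplied by wall-clock
// hours. Streaming: 10 workers x 24 h = 240; batch: 200 workers x 0.5 h = 100.
public class WorkerHours {
    static double workerHours(int workers, double hours) {
        return workers * hours;
    }
}
```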
  28. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  29-36. Beam abstractions empower runners: efficiency at the runner’s discretion. Instead of “read from this source, splitting it 1000 ways” (the user decides), the pipeline just says “read from this source.” APIs:
• long getEstimatedSize()
• List<Source> splitIntoBundles(size)
[Animated chart: the runner asks the source gs://logs/* for its size (50 TiB), weighs cluster utilization, quota, reservations, bandwidth, throughput, and bottlenecks, then splits the source (e.g. into 1 TiB bundles) and reads the resulting filenames.]
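The runner-side use of those two APIs can be sketched in plain Java: estimate the size, pick a bundle size from runtime conditions, and split. This is a conceptual sketch with illustrative names, not the Beam Source API.

```java
import java.util.*;

// Plain-Java sketch of the source-splitting idea (not the Beam API): the
// runner learns a source's estimated size, chooses a desired bundle size
// based on runtime conditions, and asks for bundles of about that size.
public class SplitSketch {
    // Split the byte range [0, totalBytes) into bundles of at most bundleBytes.
    static List<long[]> splitIntoBundles(long totalBytes, long bundleBytes) {
        List<long[]> bundles = new ArrayList<>();
        for (long start = 0; start < totalBytes; start += bundleBytes) {
            bundles.add(new long[] {start, Math.min(start + bundleBytes, totalBytes)});
        }
        return bundles;
    }
}
```

The key point of the slide survives in the sketch: the number of bundles is the runner's choice at run-time, not a constant baked into the user's pipeline.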
  37-42. Bundling and runner efficiency. A bundle is a group of elements processed and committed together. APIs (ParDo/DoFn), executed as one transaction:
• startBundle()
• processElement() n times
• finishBundle()
Streaming runner: small bundles, low-latency pipelining across stages, overhead of frequent commits. Classic batch runner: large bundles, fewer large commits, more efficient, long synchronous stages. Other runner strategies strike a different balance.
  43. Runner-controlled triggering. Beam triggers are flow control, not instructions: “it is okay to produce data,” not “produce data now.” Runners decide when to produce data, and can make local choices for efficiency.
• Streaming clickstream analysis: the runner may optimize for latency and freshness. Small bundles and frequent triggering → more files and more (speculative) records.
• Batch clickstream analysis: the runner may optimize for throughput and efficiency. Large bundles and no early triggering → fewer large files and no early records.
  44. Pipeline workload varies. [Charts, workload vs. time: a streaming pipeline’s input varies; batch pipelines go through stages.]
  45. Perils of fixed decisions. [Charts, workload vs. time: over-provisioned for the worst case vs. under-provisioned for the average case.]
  46. Ideal case. [Chart: provisioning tracks the workload over time.]
  47. The Straggler Problem. [Chart: work per worker over time.] Work is unevenly distributed across tasks. Reasons: underlying data; processing; runtime effects. Effects are cumulative per stage.
  48-49. Standard workarounds for stragglers. [Chart: work per worker over time.] Split files into equal sizes? Preemptively over-split? Detect slow workers and re-execute? Sample extensively and then split? All of these have major costs; none is a complete solution.
  50. No amount of upfront heuristic tuning (be it manual or automatic) is enough to guarantee good performance: the system will always hit unpredictable situations at run-time. A system that’s able to dynamically adapt and get out of a bad situation is much more powerful than one that heuristically hopes to avoid getting into it.
  51. Beam readers enable dynamic adaptation. Readers provide simple progress signals, enabling runners to take action based on execution-time characteristics. APIs for how much work is pending:
• Bounded: double getFractionConsumed()
• Unbounded: long getBacklogBytes()
Work-stealing:
• Bounded: Source splitAtFraction(double), int getParallelismRemaining()
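The splitAtFraction idea can be sketched in plain Java for a reader over an index range: the runner asks a running reader to give up everything past some fraction of its range, and the reader agrees only if it has not already read past that point. A conceptual sketch with illustrative names, not Beam's BoundedSource API.

```java
// Plain-Java sketch of dynamic work rebalancing (not the Beam API): a reader
// over the half-open range [start, end) can give up the unread tail past a
// fraction of the range, so the runner can hand it to an idle worker.
public class RebalanceSketch {
    long start, end;   // half-open index range this reader currently owns
    long position;     // next index to read

    RebalanceSketch(long start, long end) {
        this.start = start; this.end = end; this.position = start;
    }

    // Returns the split point and shrinks this reader to [start, splitPos),
    // or -1 if the requested point has already been consumed.
    long splitAtFraction(double fraction) {
        long splitPos = start + (long) (fraction * (end - start));
        if (splitPos <= position) return -1;  // too late: already read past it
        end = splitPos;                       // keep [start, splitPos) ...
        return splitPos;                      // ... give away [splitPos, oldEnd)
    }

    // Demo: a reader 10 elements into [0, 100) splits at the 50% point.
    static long splitDemo() {
        RebalanceSketch reader = new RebalanceSketch(0, 100);
        reader.position = 10;
        return reader.splitAtFraction(0.5);
    }

    // Demo: the same request fails once the reader is past the split point.
    static long tooLateDemo() {
        RebalanceSketch reader = new RebalanceSketch(0, 100);
        reader.position = 60;
        return reader.splitAtFraction(0.5);
    }
}
```

Because the split is negotiated with a live reader rather than decided upfront, no sampling or over-splitting heuristic is needed: work moves exactly when a straggler is detected.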
  52-55. Dynamic work rebalancing. [Animated chart: per-task done work, active work, and predicted completion vs. time. Some tasks are predicted to finish well past the average; as “now” advances, the runner splits off their remaining work and reassigns it, so all tasks finish close together.]
  56-58. Dynamic work rebalancing: a real example. Beam pipeline on the Google Cloud Dataflow runner. [Charts: a 2-stage pipeline, split “evenly” but uneven in practice; the same pipeline with dynamic work rebalancing enabled; the difference between the two runs is the savings.]
  59-63. Dynamic work rebalancing + autoscaling. Beam pipeline on the Google Cloud Dataflow runner. [Chart: the runner initially allocates ~80 workers based on size; multiple rounds of upsizing are enabled by dynamic splitting; it upscales to 1000 workers while tasks stay well-balanced and without oversplitting initially; long-running tasks are aborted without causing stragglers.]
  64. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  65. Unified model for portable execution: write a pipeline once, run it anywhere.
PCollection<KV<User, Click>> clickstream =
    pipeline.apply(IO.Read(…))
            .apply(MapElements.of(new ParseClicksAndAssignUser()));
PCollection<KV<User, Long>> userSessions =
    clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
                      .triggering(AtWatermark()
                          .withEarlyFirings(AtPeriod(Minutes(1)))))
               .apply(Count.perKey());
userSessions.apply(MapElements.of(new FormatSessionsForOutput()))
            .apply(IO.Write(…));
pipeline.run();
  66. Demo time
  67. Get involved with Beam: building the community.
• If you analyze data: try it out!
• If you have a data storage or messaging system: write a connector!
• If you have Big Data APIs: write a Beam SDK, DSL, or library!
• If you have a distributed processing backend: write a Beam runner! (and join Apache Apex, Flink, Spark, Gearpump, and Google Cloud Dataflow)
  68. Learn more. Apache Beam (incubating): https://beam.incubator.apache.org/ (documentation, mailing lists, downloads, etc.). Read our blog posts:
• Streaming 101 & 102: oreilly.com/ideas/the-world-beyond-batch-streaming-101
• No shard left behind: Dynamic work rebalancing in Google Cloud Dataflow
• Apache Beam blog: http://beam.incubator.apache.org/blog/
Follow @ApacheBeam on Twitter.
