Portable batch and streaming pipelines with Apache Beam (Big Data Applications Meetup May 2017)
Apache Beam is a top-level Apache project which aims at providing a unified API for efficient and portable data processing pipelines. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtimes, both open-source (e.g., Apache Flink, Apache Spark, Apache Apex, ...) and proprietary (e.g., Google Cloud Dataflow). This talk will cover the basics of Apache Beam, describe the main concepts of the programming model, and discuss the current state of the project (new Python support, first stable version). We'll illustrate the concepts with a use case running on several runners.

Published in: Data & Analytics
1. Future-proof, portable batch and streaming pipelines using Apache Beam. Malo Deniélou, Senior Software Engineer at Google, malo@google.com. Big Data Apps Meetup, May 2017
2. Apache Beam: Open Source data processing APIs
   ● Expresses data-parallel batch and streaming algorithms using one unified API
   ● Cleanly separates data processing logic from runtime requirements
   ● Supports execution on multiple distributed processing runtime environments
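   To make the unified-API point concrete, here is a minimal sketch of a Beam Java pipeline in the spirit of this deck. It is not verbatim from the talk: the gs:// paths and ParseEventFn are hypothetical placeholders, and imports from org.apache.beam.sdk.* are elided to match the slides' style.

       // Minimal sketch; paths and ParseEventFn are hypothetical.
       Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
       p.apply(TextIO.read().from("gs://my-bucket/game-events*"))   // bounded text input
        .apply(ParDo.of(new ParseEventFn()))   // hypothetical DoFn: line -> KV<team, score>
        .apply(Sum.integersPerKey())           // identical logic in batch and streaming
        .apply(MapElements.into(TypeDescriptors.strings())
                          .via(kv -> kv.getKey() + "," + kv.getValue()))
        .apply(TextIO.write().to("gs://my-bucket/scores"));
       p.run();

   Swapping the bounded text source for an unbounded one (e.g., Kafka or Pub/Sub) and adding the windowing shown later in the deck turns this same pipeline into a streaming job.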
3. Why use Apache Beam?
   1. Unified: The full spectrum from batch to streaming
   2. Extensible: Transform libraries, IO connectors, and DSLs -- oh my!
   3. Portable: Write once, run anywhere
   4. Demo: Beam, Dataflow, Spark, Kafka, Flink, ...
   5. Getting Started: Beaming into the Future
4. A free-to-play gaming analytics use case
   ● How many players made it to stage 12?
   ● What path did they take through the stage?
   ● Team points and other stats at this point in time?
   ● Of the players who took the same route where a certain condition was true, how many made an in-app purchase?
   ● What are the characteristics of the player segment who didn’t make the purchase vs. those who did?
   ● Why was this custom event so successful in driving in-app purchases compared to others?
   You need key indicators specific to your game in order to increase adoption, engagement, etc.
5. The Solution: collect real-time game events and player data; combine data in meaningful ways; apply real-time and historical batch analysis = impact engagement, retention, and spend.
6. How this would look on Google Cloud
7. Lots of alternatives …
8. No API! … but I need to rewrite all my pipelines!
9. Here comes Beam: your portable data processing API. No technological or environmental lock-in!
10. What is being computed?
11. Extensible: PTransforms let you build pipelines modularly.
12. What is being computed? Where in event time?
13. Leaderboard: streaming example
14. Demo! Beam, Dataflow, Spark, Kafka, Flink, ...
15. Beam Vision: as of May 2017
    ● Beam’s Java SDK runs on multiple runtime environments, including:
      ○ Apache Apex
      ○ Apache Spark
      ○ Apache Flink
      ○ Google Cloud Dataflow
      ○ [in development] Apache Gearpump
    ● Cross-language infrastructure is in progress.
      ○ Beam’s Python SDK currently runs on Google Cloud Dataflow
    ● First stable version at the end of May!
    [Diagram: Java and Python SDKs feed Beam Model pipeline construction; Beam Model Fn Runners execute on Apache Spark, Cloud Dataflow, Apache Flink, Apache Apex, and Apache Gearpump.]
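    As a hedged sketch of what running on multiple runtimes means in code: the runner is chosen through PipelineOptions, not in the pipeline itself (runner flag names and artifacts should be checked against the docs for your Beam version).

        // Sketch: the pipeline graph stays the same; only the options change,
        // e.g. --runner=SparkRunner, --runner=FlinkRunner, --runner=ApexRunner,
        // or --runner=DataflowRunner (each runner ships in its own artifact
        // that must be on the classpath).
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);
        // ... apply the same PTransforms as before ...
        p.run().waitUntilFinish();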
16. How do you build an abstraction layer?
    [Diagram: Apache Spark, Cloud Dataflow, Apache Flink, plus question marks for future runners.]
17. Beam: the intersection of runner functionality?
18. Beam: the union of runner functionality?
19. Beam: the future!
20. Getting Started with Apache Beam: Beaming into the Future
21. The Beam Model
    ● The Dataflow Model paper from VLDB 2015: http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
    ● Streaming 101 and 102: The World Beyond Batch
      https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
      https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
22. Apache Beam: http://beam.apache.org
    Quickstarts
    ● Java SDK
    ● Python SDK
    Example walkthroughs
    ● Word Count
    ● Mobile Gaming
    Detailed documentation
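    For orientation, the core of the Word Count walkthrough looks roughly like this. This is a sketch, not the official example verbatim: file paths are hypothetical, the tokenizing regex is simplified, and imports (org.apache.beam.sdk.*, java.util.Arrays) are elided.

        // Roughly the shape of the Word Count example (paths hypothetical).
        p.apply(TextIO.read().from("input.txt"))
         .apply(FlatMapElements.into(TypeDescriptors.strings())
                               .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
         .apply(Filter.by((String word) -> !word.isEmpty()))
         .apply(Count.perElement())                       // -> KV<String, Long>
         .apply(MapElements.into(TypeDescriptors.strings())
                           .via(kv -> kv.getKey() + ": " + kv.getValue()))
         .apply(TextIO.write().to("wordcounts"));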
23. Thanks!
24. Additional slides
25. Unified: The full spectrum from batch to streaming. Less code, better code.
26. Processing time vs. event time
27. The Beam Model: asking the right questions
    ● What results are calculated?
    ● Where in event time are results calculated?
    ● When in processing time are results materialized?
    ● How do refinements of results relate?
28. The Beam Model: What is being computed?
        PCollection<KV<String, Integer>> scores = input
            .apply(Sum.integersPerKey());
29. The Beam Model: What is being computed?
30. The Beam Model: Where in event time?
        PCollection<KV<String, Integer>> scores = input
            .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
            .apply(Sum.integersPerKey());
31. The Beam Model: Where in event time?
32. The Beam Model: When in processing time?
        PCollection<KV<String, Integer>> scores = input
            .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
                .triggering(AtWatermark()))
            .apply(Sum.integersPerKey());
33. The Beam Model: When in processing time?
34. The Beam Model: How do refinements relate?
        PCollection<KV<String, Integer>> scores = input
            .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
                .triggering(AtWatermark()
                    .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
                    .withLateFirings(AtCount(1)))
                .accumulatingFiredPanes())
            .apply(Sum.integersPerKey());
35. The Beam Model: How do refinements relate?
36. Customizing What / Where / When / How
    1. Classic batch
    2. Windowed batch
    3. Streaming
    4. Streaming + accumulation
37. Extensible: Transform libraries, IO connectors, and DSLs -- Oh my!
38. PTransforms
    ● PTransforms let you build pipelines modularly.
    ● All transformations are equal, so it’s easy to add new libraries.
    ● The runtime can use this structure to communicate with the user.
    More details on the Beam Blog: Where’s My PCollection.map()?
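    As a sketch of the modularity point: steps can be packaged as a composite PTransform and reused like any built-in transform. The class and method bodies here are illustrative, not from the talk; in current Beam Java SDKs the override point is expand().

        // Illustrative composite transform (names are mine, not from the talk).
        static class SumScoresPerTeam
            extends PTransform<PCollection<KV<String, Integer>>,
                               PCollection<KV<String, Integer>>> {
          @Override
          public PCollection<KV<String, Integer>> expand(
              PCollection<KV<String, Integer>> input) {
            // Any number of steps can be chained here; callers see one transform.
            return input.apply(Sum.integersPerKey());
          }
        }
        // Usage: scores.apply(new SumScoresPerTeam());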
39. Source API
    ● The Source API allows users to teach the system about new bounded and unbounded data formats.
    ● Even the most careful hand-tuning will fail as data, code, and environments shift.
    ● Beam’s Source API is designed to provide runtime hooks for efficient scaling.
    More details on the Beam Blog: Dynamic Work Rebalancing for Beam
    [Diagram: a reader’s work split into processed & committed, processed but uncommitted, and unprocessed portions.]
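    To illustrate the "runtime hooks" bullet, below is an outline of the two dynamic work rebalancing hooks on BoundedSource.BoundedReader in the Java SDK. It is deliberately left abstract rather than a full reader, and the offset fields and sourceForRange() helper are hypothetical; imports from org.apache.beam.sdk.io are elided.

        // Outline of the dynamic work rebalancing hooks. A real reader must
        // also implement start(), advance(), getCurrent(), getCurrentSource(),
        // and close().
        abstract class RangeReader extends BoundedSource.BoundedReader<Long> {
          long start, end;        // offset range this reader currently owns (hypothetical)
          volatile long current;  // last offset returned to the runner

          // Progress through [start, end); runners use this to spot stragglers.
          @Override
          public Double getFractionConsumed() {
            return (current - start) / (double) (end - start);
          }

          // The runner asks to steal unprocessed work: keep [start, split),
          // hand back a new source for [split, end). Returning null declines.
          @Override
          public BoundedSource<Long> splitAtFraction(double fraction) {
            long split = start + (long) (fraction * (end - start));
            if (split <= current || split >= end) {
              return null;  // too late (or too small) to split
            }
            BoundedSource<Long> residual = sourceForRange(split, end);
            end = split;    // shrink this reader's remaining work
            return residual;
          }

          // Hypothetical factory for a source over a sub-range.
          abstract BoundedSource<Long> sourceForRange(long from, long to);
        }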
40. Language-specific SDKs and DSLs
    ● Multiple language-specific SDKs that implement the full Beam model.
    ● Optional domain-specific languages that narrow or transform the abstractions to align with simple use cases or specific user communities.
    [Diagram: Language A/B/C SDKs and DSLs X/Z layered on top of the Beam Model.]
41. Portable: Write once, run anywhere
42. Beam Vision: mix and match SDKs and runtimes
    ● The Beam Model: the abstractions at the core of Apache Beam
    ● Choice of API: users write their pipelines in a language that’s familiar and integrated with their other tooling
    ● Choice of Runtime: users choose the right runner for their current needs -- on-prem / cloud, open source / not, fully managed / not
    ● Scalability for Developers: clean APIs allow developers to contribute modules independently
    [Diagram: Language A/B/C SDKs feed the Beam Model, which Runners 1/2/3 execute.]
43. Example Beam Runners
    Apache Spark
    ● Open-source cluster-computing framework
    ● Large ecosystem of APIs and tools
    ● Runs on premise or in the cloud
    Apache Flink
    ● Open-source distributed data processing engine
    ● High-throughput and low-latency stream processing
    ● Runs on premise or in the cloud
    Google Cloud Dataflow
    ● Fully-managed service for batch and stream data processing
    ● Provides dynamic auto-scaling, monitoring tools, and tight integration with Google Cloud Platform
44. Demo screenshots (because if I make them, I won’t need to use them)
