SlideShare a Scribd company logo
Portable Streaming
Pipelines with Apache Beam
Frances Perry
PMC for Apache Beam, Tech Lead at Google
Kafka Summit, May 2017
Apache Beam: Open Source data processing APIs
● Expresses data-parallel batch and streaming
algorithms using one unified API
● Cleanly separates data processing logic
from runtime requirements
● Supports execution on multiple distributed
processing runtime environments
The evolution of Apache Beam
MapReduce
Apache
Beam
Cloud
Dataflow
BigTable DremelColossus
FlumeMegastore Spanner
PubSub
Millwheel
Agenda
1. Beam Model: Model Basics
2. Extensible IO Connectors
3. Portability: Write Once, Run Anywhere
4. Demo
5. Getting Started
Model Basics
A unified model for batch and streaming
Processing time vs. event time
The Beam Model: asking the right questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
PCollection<KV<String, Integer>> scores = input
.apply(Sum.integersPerKey());
The Beam Model: What is being computed?
The Beam Model: What is being computed?
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
.apply(Sum.integersPerKey());
The Beam Model: Where in event time?
The Beam Model: Where in event time?
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))
.triggering(AtWatermark()))
.apply(Sum.integersPerKey());
The Beam Model: When in processing time?
The Beam Model: When in processing time?
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))
.triggering(AtWatermark()
.withEarlyFirings(
AtPeriod(Duration.standardMinutes(1)))
.withLateFirings(AtCount(1)))
.accumulatingFiredPanes())
.apply(Sum.integersPerKey());
The Beam Model: How do refinements relate?
The Beam Model: How do refinements relate?
Customizing What Where When How
3
Streaming
4
Streaming
+ Accumulation
1
Classic
Batch
2
Windowed
Batch
Extensible IO Connectors
Like Kafka!
The Beam vision for portablility
Write once, run anywhere
Beam Vision: mix and match SDKs and runtimes
● The Beam Model: the abstractions
at the core of Apache BeamLanguage B
SDK
Language A
SDK
Language C
SDK
Runner 1 Runner 3Runner 2
● Choice of SDK: Users write their
pipelines in a language that’s
familiar and integrated with their
other tooling
● Choice of Runners: Users choose
the right runtime for their current
needs -- on-prem / cloud, open
source / not, fully managed / not
● Scalability for Developers: Clean
APIs allow developers to contribute
modules independently
The Beam Model
Language A Language CLanguage B
The Beam Model
● Beam’s Java SDK runs on multiple
runtime environments, including:
• Apache Apex
• Apache Spark
• Apache Flink
• Google Cloud Dataflow
• [in development] Apache Gearpump
● Cross-language infrastructure is in
progress.
• Beam’s Python SDK currently runs
on Google Cloud Dataflow
Beam Vision: as of March 2017
Beam Model: Fn Runners
Apache
Spark
Cloud
Dataflow
Beam Model: Pipeline Construction
Apache
Flink
Java
Java
Python
Python
Apache
Apex
Apache
Gearpump
Example Beam Runners
Apache Spark
● Open-source cluster-
computing framework
● Large ecosystem of
APIs and tools
● Runs on premise or in
the cloud
Apache Flink
● Open-source
distributed data
processing engine
● High-throughput and
low-latency stream
processing
● Runs on premise or in
the cloud
Google Cloud Dataflow
● Fully-managed service
for batch and stream
data processing
● Provides dynamic
auto-scaling,
monitoring tools, and
tight integration with
Google Cloud
Platform
How do you build an abstraction layer?
Apache
Spark
Cloud
Dataflow
Apache
Flink
????????
????????
Beam: the intersection of runner functionality?
Beam: the union of runner functionality?
Beam: the future!
Categorizing Runner Capabilities
http://beam.incubator.apache.org/
documentation/runners/capability-matrix/
Parallel and portable pipelines in practice
Demo!
Getting Started with Apache Beam
Beaming into the Future
Getting Started with Apache Beam
Quickstarts
● Java SDK
● Python SDK
Example walkthroughs
● Word Count
● Mobile Gaming
Extensive documentation
Learn more!
Apache Beam
https://beam.apache.org
Join the Beam mailing lists
user-subscribe@beam.apache.org
dev-subscribe@beam.apache.org
Follow @ApacheBeam on Twitter
Demo screenshots
because if I make them, I won’t need to use them
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam

More Related Content

What's hot

AWS Elastic Beanstalk(初心者向け 超速マスター編)JAWSUG大阪
AWS Elastic Beanstalk(初心者向け 超速マスター編)JAWSUG大阪AWS Elastic Beanstalk(初心者向け 超速マスター編)JAWSUG大阪
AWS Elastic Beanstalk(初心者向け 超速マスター編)JAWSUG大阪崇之 清水
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
Amazon Web Services
 
ChatGPTをもっと使いたい.pptx
ChatGPTをもっと使いたい.pptxChatGPTをもっと使いたい.pptx
ChatGPTをもっと使いたい.pptx
TokioMiyaoka
 
Ethics of AI - AIの倫理-
Ethics of AI - AIの倫理-Ethics of AI - AIの倫理-
Ethics of AI - AIの倫理-
Daiyu Hatakeyama
 
AWS Black Belt Online Seminar 2016 AWS上でのファイルサーバ構築
AWS Black Belt Online Seminar 2016 AWS上でのファイルサーバ構築AWS Black Belt Online Seminar 2016 AWS上でのファイルサーバ構築
AWS Black Belt Online Seminar 2016 AWS上でのファイルサーバ構築
Amazon Web Services Japan
 
Secured API Acceleration with Engineers from Amazon CloudFront and Slack
Secured API Acceleration with Engineers from Amazon CloudFront and SlackSecured API Acceleration with Engineers from Amazon CloudFront and Slack
Secured API Acceleration with Engineers from Amazon CloudFront and Slack
Amazon Web Services
 
CloudFront経由でのCORS利用
CloudFront経由でのCORS利用CloudFront経由でのCORS利用
CloudFront経由でのCORS利用
Yuta Imai
 
モダン PHP テクニック 12 選 ―PsalmとPHP 8.1で今はこんなこともできる!―
モダン PHP テクニック 12 選 ―PsalmとPHP 8.1で今はこんなこともできる!―モダン PHP テクニック 12 選 ―PsalmとPHP 8.1で今はこんなこともできる!―
モダン PHP テクニック 12 選 ―PsalmとPHP 8.1で今はこんなこともできる!―
shinjiigarashi
 
php-src の歩き方
php-src の歩き方php-src の歩き方
php-src の歩き方
do_aki
 
Open ebs 101
Open ebs 101Open ebs 101
Open ebs 101
LibbySchulze
 
dm-writeboost-kernelvm
dm-writeboost-kernelvmdm-writeboost-kernelvm
dm-writeboost-kernelvm
Akira Hayakawa
 
高負荷に耐えうるWeb application serverの作り方
高負荷に耐えうるWeb application serverの作り方高負荷に耐えうるWeb application serverの作り方
高負荷に耐えうるWeb application serverの作り方
yuta-ishiyama
 
Kubernetesを使う上で抑えておくべきAWSの基礎概念
Kubernetesを使う上で抑えておくべきAWSの基礎概念Kubernetesを使う上で抑えておくべきAWSの基礎概念
Kubernetesを使う上で抑えておくべきAWSの基礎概念
Shinya Mori (@mosuke5)
 
【基礎編】社内向けMySQL勉強会
【基礎編】社内向けMySQL勉強会【基礎編】社内向けMySQL勉強会
【基礎編】社内向けMySQL勉強会
Yuji Otani
 
gRPC入門
gRPC入門gRPC入門
gRPC入門
Kenjiro Kubota
 
rsyncで差分バックアップしようぜ!
rsyncで差分バックアップしようぜ!rsyncで差分バックアップしようぜ!
rsyncで差分バックアップしようぜ!
Hiro H.
 
DSIRNLP #3 LZ4 の速さの秘密に迫ってみる
DSIRNLP #3 LZ4 の速さの秘密に迫ってみるDSIRNLP #3 LZ4 の速さの秘密に迫ってみる
DSIRNLP #3 LZ4 の速さの秘密に迫ってみる
Atsushi KOMIYA
 
How A Compiler Works: GNU Toolchain
How A Compiler Works: GNU ToolchainHow A Compiler Works: GNU Toolchain
How A Compiler Works: GNU Toolchain
National Cheng Kung University
 
Amazon SNS+SQSによる Fanoutシナリオの話
Amazon SNS+SQSによる Fanoutシナリオの話Amazon SNS+SQSによる Fanoutシナリオの話
Amazon SNS+SQSによる Fanoutシナリオの話
Yoichi Toyota
 
ニワトリでもわかるECS入門
ニワトリでもわかるECS入門ニワトリでもわかるECS入門
ニワトリでもわかるECS入門
Yoshiki Kobayashi
 

What's hot (20)

AWS Elastic Beanstalk(初心者向け 超速マスター編)JAWSUG大阪
AWS Elastic Beanstalk(初心者向け 超速マスター編)JAWSUG大阪AWS Elastic Beanstalk(初心者向け 超速マスター編)JAWSUG大阪
AWS Elastic Beanstalk(初心者向け 超速マスター編)JAWSUG大阪
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
ChatGPTをもっと使いたい.pptx
ChatGPTをもっと使いたい.pptxChatGPTをもっと使いたい.pptx
ChatGPTをもっと使いたい.pptx
 
Ethics of AI - AIの倫理-
Ethics of AI - AIの倫理-Ethics of AI - AIの倫理-
Ethics of AI - AIの倫理-
 
AWS Black Belt Online Seminar 2016 AWS上でのファイルサーバ構築
AWS Black Belt Online Seminar 2016 AWS上でのファイルサーバ構築AWS Black Belt Online Seminar 2016 AWS上でのファイルサーバ構築
AWS Black Belt Online Seminar 2016 AWS上でのファイルサーバ構築
 
Secured API Acceleration with Engineers from Amazon CloudFront and Slack
Secured API Acceleration with Engineers from Amazon CloudFront and SlackSecured API Acceleration with Engineers from Amazon CloudFront and Slack
Secured API Acceleration with Engineers from Amazon CloudFront and Slack
 
CloudFront経由でのCORS利用
CloudFront経由でのCORS利用CloudFront経由でのCORS利用
CloudFront経由でのCORS利用
 
モダン PHP テクニック 12 選 ―PsalmとPHP 8.1で今はこんなこともできる!―
モダン PHP テクニック 12 選 ―PsalmとPHP 8.1で今はこんなこともできる!―モダン PHP テクニック 12 選 ―PsalmとPHP 8.1で今はこんなこともできる!―
モダン PHP テクニック 12 選 ―PsalmとPHP 8.1で今はこんなこともできる!―
 
php-src の歩き方
php-src の歩き方php-src の歩き方
php-src の歩き方
 
Open ebs 101
Open ebs 101Open ebs 101
Open ebs 101
 
dm-writeboost-kernelvm
dm-writeboost-kernelvmdm-writeboost-kernelvm
dm-writeboost-kernelvm
 
高負荷に耐えうるWeb application serverの作り方
高負荷に耐えうるWeb application serverの作り方高負荷に耐えうるWeb application serverの作り方
高負荷に耐えうるWeb application serverの作り方
 
Kubernetesを使う上で抑えておくべきAWSの基礎概念
Kubernetesを使う上で抑えておくべきAWSの基礎概念Kubernetesを使う上で抑えておくべきAWSの基礎概念
Kubernetesを使う上で抑えておくべきAWSの基礎概念
 
【基礎編】社内向けMySQL勉強会
【基礎編】社内向けMySQL勉強会【基礎編】社内向けMySQL勉強会
【基礎編】社内向けMySQL勉強会
 
gRPC入門
gRPC入門gRPC入門
gRPC入門
 
rsyncで差分バックアップしようぜ!
rsyncで差分バックアップしようぜ!rsyncで差分バックアップしようぜ!
rsyncで差分バックアップしようぜ!
 
DSIRNLP #3 LZ4 の速さの秘密に迫ってみる
DSIRNLP #3 LZ4 の速さの秘密に迫ってみるDSIRNLP #3 LZ4 の速さの秘密に迫ってみる
DSIRNLP #3 LZ4 の速さの秘密に迫ってみる
 
How A Compiler Works: GNU Toolchain
How A Compiler Works: GNU ToolchainHow A Compiler Works: GNU Toolchain
How A Compiler Works: GNU Toolchain
 
Amazon SNS+SQSによる Fanoutシナリオの話
Amazon SNS+SQSによる Fanoutシナリオの話Amazon SNS+SQSによる Fanoutシナリオの話
Amazon SNS+SQSによる Fanoutシナリオの話
 
ニワトリでもわかるECS入門
ニワトリでもわかるECS入門ニワトリでもわかるECS入門
ニワトリでもわかるECS入門
 

Similar to Portable Streaming Pipelines with Apache Beam

Realizing the promise of portability with Apache Beam
Realizing the promise of portability with Apache BeamRealizing the promise of portability with Apache Beam
Realizing the promise of portability with Apache Beam
J On The Beach
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
DataWorks Summit/Hadoop Summit
 
Portable batch and streaming pipelines with Apache Beam (Big Data Application...
Portable batch and streaming pipelines with Apache Beam (Big Data Application...Portable batch and streaming pipelines with Apache Beam (Big Data Application...
Portable batch and streaming pipelines with Apache Beam (Big Data Application...
Malo Denielou
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
DataWorks Summit
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...
DataWorks Summit
 
ApacheBeam_Google_Theater_TalendConnect2017.pdf
ApacheBeam_Google_Theater_TalendConnect2017.pdfApacheBeam_Google_Theater_TalendConnect2017.pdf
ApacheBeam_Google_Theater_TalendConnect2017.pdf
RAJA RAY
 
ApacheBeam_Google_Theater_TalendConnect2017.pptx
ApacheBeam_Google_Theater_TalendConnect2017.pptxApacheBeam_Google_Theater_TalendConnect2017.pptx
ApacheBeam_Google_Theater_TalendConnect2017.pptx
RAJA RAY
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
DataWorks Summit
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache Beam
DataWorks Summit
 
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBaseHBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
Knoldus Inc.
 
Introduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationIntroduction to GCP Data Flow Presentation
Introduction to GCP Data Flow Presentation
Knoldus Inc.
 
Introduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationIntroduction to GCP DataFlow Presentation
Introduction to GCP DataFlow Presentation
Knoldus Inc.
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
 
Apache Beam @ GCPUG.TW Flink.TW 20161006
Apache Beam @ GCPUG.TW Flink.TW 20161006Apache Beam @ GCPUG.TW Flink.TW 20161006
Apache Beam @ GCPUG.TW Flink.TW 20161006
Randy Huang
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Provectus
 
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Alluxio, Inc.
 
Practices and tools for building better APIs
Practices and tools for building better APIsPractices and tools for building better APIs
Practices and tools for building better APIsNLJUG
 
Practices and tools for building better API (JFall 2013)
Practices and tools for building better API (JFall 2013)Practices and tools for building better API (JFall 2013)
Practices and tools for building better API (JFall 2013)
Peter Hendriks
 
Himansu-Java&BigdataDeveloper
Himansu-Java&BigdataDeveloperHimansu-Java&BigdataDeveloper
Himansu-Java&BigdataDeveloperHimansu Behera
 

Similar to Portable Streaming Pipelines with Apache Beam (20)

Realizing the promise of portability with Apache Beam
Realizing the promise of portability with Apache BeamRealizing the promise of portability with Apache Beam
Realizing the promise of portability with Apache Beam
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
Portable batch and streaming pipelines with Apache Beam (Big Data Application...
Portable batch and streaming pipelines with Apache Beam (Big Data Application...Portable batch and streaming pipelines with Apache Beam (Big Data Application...
Portable batch and streaming pipelines with Apache Beam (Big Data Application...
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...
 
ApacheBeam_Google_Theater_TalendConnect2017.pdf
ApacheBeam_Google_Theater_TalendConnect2017.pdfApacheBeam_Google_Theater_TalendConnect2017.pdf
ApacheBeam_Google_Theater_TalendConnect2017.pdf
 
ApacheBeam_Google_Theater_TalendConnect2017.pptx
ApacheBeam_Google_Theater_TalendConnect2017.pptxApacheBeam_Google_Theater_TalendConnect2017.pptx
ApacheBeam_Google_Theater_TalendConnect2017.pptx
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache Beam
 
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBaseHBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
 
Introduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationIntroduction to GCP Data Flow Presentation
Introduction to GCP Data Flow Presentation
 
Introduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationIntroduction to GCP DataFlow Presentation
Introduction to GCP DataFlow Presentation
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Beam @ GCPUG.TW Flink.TW 20161006
Apache Beam @ GCPUG.TW Flink.TW 20161006Apache Beam @ GCPUG.TW Flink.TW 20161006
Apache Beam @ GCPUG.TW Flink.TW 20161006
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
 
Practices and tools for building better APIs
Practices and tools for building better APIsPractices and tools for building better APIs
Practices and tools for building better APIs
 
Practices and tools for building better API (JFall 2013)
Practices and tools for building better API (JFall 2013)Practices and tools for building better API (JFall 2013)
Practices and tools for building better API (JFall 2013)
 
Himansu-Java&BigdataDeveloper
Himansu-Java&BigdataDeveloperHimansu-Java&BigdataDeveloper
Himansu-Java&BigdataDeveloper
 

More from confluent

Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
confluent
 
Evolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI EraEvolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI Era
confluent
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
confluent
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flink
confluent
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insights
confluent
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flink
confluent
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
confluent
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
confluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalk
confluent
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
confluent
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
confluent
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
confluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Mesh
confluent
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservices
confluent
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3
confluent
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernization
confluent
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time data
confluent
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2
confluent
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
confluent
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesis
confluent
 

More from confluent (20)

Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Evolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI EraEvolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI Era
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flink
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insights
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flink
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalk
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Mesh
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservices
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernization
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time data
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesis
 

Recently uploaded

Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
Globus
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
Cyanic lab
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Mind IT Systems
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Anthony Dahanne
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
vrstrong314
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Jay Das
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
kalichargn70th171
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
IES VE
 

Recently uploaded (20)

Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 

Portable Streaming Pipelines with Apache Beam

  • 1. Portable Streaming Pipelines with Apache Beam Frances Perry PMC for Apache Beam, Tech Lead at Google Kafka Summit, May 2017
  • 2. Apache Beam: Open Source data processing APIs ● Expresses data-parallel batch and streaming algorithms using one unified API ● Cleanly separates data processing logic from runtime requirements ● Supports execution on multiple distributed processing runtime environments
  • 3. The evolution of Apache Beam MapReduce Apache Beam Cloud Dataflow BigTable DremelColossus FlumeMegastore Spanner PubSub Millwheel
  • 4. Agenda 1. Beam Model: Model Basics 2. Extensible IO Connectors 3. Portability: Write Once, Run Anywhere 4. Demo 5. Getting Started
  • 5. Model Basics A unified model for batch and streaming
  • 6. Processing time vs. event time
  • 7. The Beam Model: asking the right questions What results are calculated? Where in event time are results calculated? When in processing time are results materialized? How do refinements of results relate?
  • 8. PCollection<KV<String, Integer>> scores = input .apply(Sum.integersPerKey()); The Beam Model: What is being computed?
  • 9. The Beam Model: What is being computed?
  • 10. PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))) .apply(Sum.integersPerKey()); The Beam Model: Where in event time?
  • 11. The Beam Model: Where in event time?
  • 12. PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)) .triggering(AtWatermark())) .apply(Sum.integersPerKey()); The Beam Model: When in processing time?
  • 13. The Beam Model: When in processing time?
  • 14. PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)) .triggering(AtWatermark() .withEarlyFirings( AtPeriod(Duration.standardMinutes(1))) .withLateFirings(AtCount(1))) .accumulatingFiredPanes()) .apply(Sum.integersPerKey()); The Beam Model: How do refinements relate?
  • 15. The Beam Model: How do refinements relate?
  • 16. Customizing What Where When How 3 Streaming 4 Streaming + Accumulation 1 Classic Batch 2 Windowed Batch
  • 18. The Beam vision for portablility Write once, run anywhere
  • 19. Beam Vision: mix and match SDKs and runtimes ● The Beam Model: the abstractions at the core of Apache BeamLanguage B SDK Language A SDK Language C SDK Runner 1 Runner 3Runner 2 ● Choice of SDK: Users write their pipelines in a language that’s familiar and integrated with their other tooling ● Choice of Runners: Users choose the right runtime for their current needs -- on-prem / cloud, open source / not, fully managed / not ● Scalability for Developers: Clean APIs allow developers to contribute modules independently The Beam Model Language A Language CLanguage B The Beam Model
  • 20. ● Beam’s Java SDK runs on multiple runtime environments, including: • Apache Apex • Apache Spark • Apache Flink • Google Cloud Dataflow • [in development] Apache Gearpump ● Cross-language infrastructure is in progress. • Beam’s Python SDK currently runs on Google Cloud Dataflow Beam Vision: as of March 2017 Beam Model: Fn Runners Apache Spark Cloud Dataflow Beam Model: Pipeline Construction Apache Flink Java Java Python Python Apache Apex Apache Gearpump
  • 21. Example Beam Runners Apache Spark ● Open-source cluster- computing framework ● Large ecosystem of APIs and tools ● Runs on premise or in the cloud Apache Flink ● Open-source distributed data processing engine ● High-throughput and low-latency stream processing ● Runs on premise or in the cloud Google Cloud Dataflow ● Fully-managed service for batch and stream data processing ● Provides dynamic auto-scaling, monitoring tools, and tight integration with Google Cloud Platform
  • 22. How do you build an abstraction layer? Apache Spark Cloud Dataflow Apache Flink ???????? ????????
  • 23. Beam: the intersection of runner functionality?
  • 24. Beam: the union of runner functionality?
  • 27. Parallel and portable pipelines in practice Demo!
  • 28. Getting Started with Apache Beam Beaming into the Future
  • 29. Getting Started with Apache Beam Quickstarts ● Java SDK ● Python SDK Example walkthroughs ● Word Count ● Mobile Gaming Extensive documentation
  • 30. Learn more! Apache Beam https://beam.apache.org Join the Beam mailing lists user-subscribe@beam.apache.org dev-subscribe@beam.apache.org Follow @ApacheBeam on Twitter
  • 31. Demo screenshots because if I make them, I won’t need to use them

Editor's Notes

  1. Good afternoon! My name is Frances Perry. I’m an engineer at Google and on the project management committee for Apache Beam. Today I’m going to give an introduction to Apache Beam… can you see me?
  2. -- which is a new open source project for expressing both batch and streaming data processing use cases When you use Beam, you’re focusing on your logic and your data, without letting runtime details leak through into your code. That separation means that a Beam pipeline can be run many existing runtimes that you know and love, including Apache Spark, Apache Flink, and Google Cloud Dataflow. To put Beam context in the broader BigData ecosystem, let’s talk briefly about its evolution.
  3. Google published the original paper on MapReduce in 2004 -- fundamentally change the way we do distributed processing. <animate> Inside Google, kept innovating, but initially just publishing papers. In 2014, we released Google Cloud Dataflow -- included both a new programming model and fully managed service <animate> Externally the open source community created Hadoop, and an entire ecosystem flourished around this. Beam is bring these two streams of work. It’s based on the Dataflow programming model, but generalized and integrated with the broader ecosystem.
  4. Today I’m going to go into more detail on two key pieces of Apache Beam the programming model that intuitively expresses data-parallel operations, including both batch and streaming use cases the portability infrastructure, which lets you execute the same Beam pipeline across multiple runtimes Next it’s time to get concrete -- I’ll show you how these concepts in practice with a demo of the same pipeline, reading from Kafka, and running on Apache Spark, Apache Flink, and Cloud Dataflow. And finally we’ll end with some pointers for getting started.
  5. We’ll start with a brief overview of the Beam model. If you been to other talks over the last few days, you may have heard my favorite example already, so I’ll just set the context briefly. We’re going to be be using a running example of analyzing mobile gaming logs. We’ve just launched an addictive new mobile game, where we’ve got users across the globe forming teams and scoring points on their mobile devices.
  6. Let’s take a look at some sample data -- the points scored for a specific team. On the x-axis we’ve got event time, and the y-axis is processing time. <animate>If everything was perfect, elements would arrive in our system immediately, and so we’d see things along this dashed line. But distributed systems often don’t cooperate. <animate> Sometimes it’s not so bad. So here this event from just before 12:07 maybe just encountered a small network delay and arrives almost immediately after 12:07 <animate> But this one over here was more like 7 minutes delayed. Perhaps our user was playing in an elevator or in a subway -- so the score is delayed by a temporary lack of network connectivity. And this graph can’t even contain what we’d see if our game supports an offline mode. If a user is playing on a transatlantic flight in airplane mode, it might be hours until that flight lands and we get those scores for processing. These types of infinite, out of order data sources can be really tricky to reason about… unless you know what questions to ask.
  7. The Beam model is based on four key questions: What results are calculated? Are you doing computing sums, joins, histograms, machine learning models? Where in event time are results calculated? How does the time each event originally occurred affect results? Are results aggregated for all time, in fixed windows, or as user activity sessions? When in processing time are results materialized? Does the time each element arrives in the system affect results? How do we know when to emit a result? What do we do about data that comes in late from those pesky users playing on transatlantic flights? How do refinements relate? If we choose to emit results multiple times, is each result independent and distinct, do they build upon one another? Let’s take a quick look at how we can use this questions to build a pipeline.
  8. Here’s a snippet from a pipeline that processes scoring results from that mobile gaming application. In Yellow, you can see the computation that we’re performing -- the what -- in this case taking team-score pairs and summing them per team. So now let’s see what happens to our sample data if we execute this in traditional batch style.
  9. In this looping animation, the grey line represents processing time. As the pipeline executes and processes elements, they’re accumulated into the intermediate state, just under the processing time line.. When processing completes, the system emits the result in yellow. This is pretty standard batch processing. But as we dive into the remaining three questions, that’s going to change.
  10. Let’s start by playing with event time. By specifying a windowing function, we can calculate independent results for different slices of event time. For example every minute, every hour, every day... In this case, our same integer summation will output one sum every two minutes.
  11. Now if we look at how things execute, you can see that we are calculating a independent answer for every two minute period of event time. But we’re still waiting until the entire computation completes to emit any results. That might work fine for bounded data sets, when we’ll eventually finish processing. But it’s not going to work if we’re trying to process an infinite amount of data!
  12. In that case we want to reduce the latency of individual results. We do that by asking for results to be triggered based on the system’s best estimate of when it has all the input data. We call this estimate the watermark.
  13. The watermark is drawn in green. And now the result for each window is emitted as soon as we roughly think we’re done. But again, the watermark is often just a heuristic. It’s the system’s best guess about data completeness. Right now, the watermark is too fast -- and in some cases we’re moving on without all the data. So that user who scored 9 points in the elevator is just plain out of luck. But we don’t want to be too slow either -- it’s no good if we wait to emit anything until all the flights everywhere had landed just in case someone in 16B is playing our game.
  14. So let’s use a more sophisticated trigger to request both speculative, early firings as data is still trickling in -- and also update results if late elements arrive. Once we do this though, we might get multiple results for the same window of event time. So we have answer the fourth question about how refined results relate. Here we choose to just continually accumulate the score.
  15. Now, there are multiple results for each window. Some windows, like the second, produce early, incomplete results as data arrives. There’s one on time result when we think we’ve pretty much got all the data. And there are late results if additional data comes in behind the watermark, like in the first window. And because we chose to accumulate, each result includes all the elements in the window, even if they have already been part of an earlier result.
  16. So we took an algorithm -- in this case it happened to be a simple integer summation. I could have used something more complicated -- but the animations would have gotten out of control. And just by tweaking just a line here or there, we went through a number of use cases -- from the simple traditional batch style through to advanced streaming situations. Just like the MapReduce model fundamentally changed the way we do distributed processing by providing the right set of abstractions, we hope that the Beam model will change the way we unify batch and streaming processing in the future.
  17. We’ll start with a brief overview of the Beam model. If you been to other talks over the last few days, you may have heard my favorite example already, so I’ll just set the context briefly. We’re going to be be using a running example of analyzing mobile gaming logs. We’ve just launched an addictive new mobile game, where we’ve got users across the globe forming teams and scoring points on their mobile devices.
  18. So that was the conceptual introduction to the types of use cases the Beam model can cover. Next, let’s talk about how the model enable portability.
  19. The heart of Beam is the model. We have multiple language-specific SDKs for constructing a Beam pipeline. Developers often have strong opinions about their language of choice, so we want to meet them where they are. Next, we have multiple runners for executing Beam pipelines on existing distributed processing engines. This let’s the user choose the right environment for their use case. It might be on premise, or in the cloud. It might be open source. It might be fully managed. And these needs may change over time. Now each of these runners needs to execute user processing, which means they need the ability to execute code in different languages. And to do that, while making Beam components modular, we are building APIs to that cleanly specify how a runner calls language-specific progressing.
  20. Now that was where the project is going. In reality, this is where we are. The Java SDK runs across multiple runtimes. However the Python SDK currently runs only on Cloud Dataflow in batch, as we’re still building out the streaming and cross-language infrastructure.
  21. Let’s go into a bit more detail about some of these runners. I’m going to focus today on Spark, Flink, and Dataflow. These were the original three runners in Beam and are also the three I’ll be demoing today. Many of you are probably familiar with Apache Spark. It’s a very popular choice right now in the Big Data world. It excels at in-memory and interactive computations. Apache Flink is more of a newcomer to the broader big data scene. It’s got really clean semantics for stream processing. And Cloud Dataflow is GCP’s fully managed service for data processing pipelines that evolved from all those years of internal work.
  22. And though each of these runners does parallel data processing, they have some significant differences in how they go about that. And that makes it tricky to build an abstraction layer around them.
  23. We can’t just take the intersection of the functionality of all the engines -- that’s too limited.
  24. And on the other hand, taking the union would be a kitchen sink of chaos…
  25. Really, Beam tries to be at the forefront of where data processing is going, both pushing functionality into and pulling patterns out of the runtime engines. Keyed State is a great example of functionality that existed in various engines for a while and enabled interesting and common use cases, and was only recently added to Beam. And vice versa, we hope that Beam will influence the roadmaps of various engines as well. For example, the semantics of Flink's DataStreams were influenced by the Beam model.
  26. This also means there may be, at times, some divergence between the Beam model and the support a given runner has. So that's why Beam is tracking which portions of the model each runner currently supports -- and this get updated as new functionality is being built out.
  27. So with those concepts behind us, let’s get hands on.
  28. Finally, let’s look at how you can get started using Apache Beam
  29. And of course, please come learn more about Beam in general. The Beam website has all sorts of good information, including details on all the different runners.