PORTABLE BATCH AND
STREAM DATA PROCESSING
WITH APACHE BEAM
William Vambenepe, Google
@vambenepe
SPEAKERS INFO
WILLIAM VAMBENEPE
Group Product Manager
Data Processing and Analytics
Google Cloud Platform
@vambenepe
• Open source (top-level Apache project)
• Portable
• Unifies batch and stream
• Cloud-native
• Built on 15 years of large scale data processing at Google
You don’t need to be a developer to benefit from Beam
APACHE BEAM: THE KEY TO MODERN DATA PROCESSING
[Figure: Google data processing systems shown on the slide: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Cloud Dataflow, Apache Beam]
THE EVOLUTION OF DATA PIPELINES
BEAM = Batch + StrEAM
Progressive evolution from batch to stream
- Stream as the new default
Cost/perf trade-offs without re-architecting
- Just turn the knob
ML: data preparation consistency between training & scoring
- Same pipeline to train in batch and score in stream
BENEFIT OF BATCH / STREAM UNIFICATION
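The unification point above can be sketched in a few lines. This is plain Python, not the Beam SDK, and the function name `prepare` and the record fields `user` and `amount` are hypothetical, chosen only for illustration. The idea it shows is that one data-preparation step can serve both a bounded input (batch, e.g. training data) and an input that arrives one record at a time (stream, e.g. scoring).

```python
from typing import Iterable, Iterator


def prepare(records: Iterable[dict]) -> Iterator[dict]:
    """Shared preparation step: drop incomplete records, normalize fields."""
    for r in records:
        if r.get("amount") is None:
            continue  # skip records with no amount
        yield {"user": r["user"].strip().lower(), "amount": float(r["amount"])}


# Bounded input (batch): a list that is fully available up front.
batch = [{"user": " Ann ", "amount": "3.5"}, {"user": "Bob", "amount": None}]
batch_out = list(prepare(batch))


# Stream-style input: a generator that yields records one at a time.
def stream() -> Iterator[dict]:
    yield {"user": "ANN", "amount": 2}
    yield {"user": "bob ", "amount": "1"}


stream_out = list(prepare(stream()))
```

Because `prepare` only assumes an iterable, the same logic is applied in both cases, which is the consistency between training and scoring that the slide refers to.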
PROCESSING TIME VS. EVENT TIME
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
THE BEAM MODEL: ASKING THE RIGHT QUESTIONS
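The four questions can be illustrated with a small sketch. It is plain Python rather than the Beam SDK, and the window size, the sample events, and the per-element emission policy are all assumptions made for the example. Each event carries an event time (when it happened) and a processing time (when it was observed); the third event arrives late and out of order.

```python
from collections import defaultdict

WINDOW = 60  # assumed fixed window size, in seconds of event time

# (event_time_s, processing_time_s, value); the last event arrives late.
events = [(5, 10, 3), (70, 72, 4), (50, 95, 2)]


def window_start(event_time: int) -> int:
    # WHERE: assign each element to a fixed window by its event time.
    return event_time - event_time % WINDOW


def run(events, accumulate: bool):
    totals = defaultdict(int)
    panes = []
    # WHEN: emit a result each time an element arrives, in processing-time order.
    for event_time, _proc_time, value in sorted(events, key=lambda e: e[1]):
        w = window_start(event_time)
        totals[w] += value  # WHAT: a per-window sum
        # HOW: accumulating results carry the running total,
        # discarding results carry only the new contribution.
        panes.append((w, totals[w] if accumulate else value))
    return panes


accumulating = run(events, accumulate=True)
discarding = run(events, accumulate=False)
```

With these inputs, the late event refines the result for the window starting at 0: the accumulating run reports the updated total for that window, while the discarding run reports only the late element's value.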
The Beam Model: WHAT is being computed?
The Beam Model: WHERE in event time?
The Beam Model: WHEN in processing time?
The Beam Model: HOW do refinements relate?
PORTABLE
Write once, run anywhere
• The Beam Model: the abstractions at the core of Apache Beam
• Choice of API: Users write their pipelines in a language that’s familiar and integrated with their other tooling
• Choice of Runtime: Users choose the right runner for their current needs -- on-prem / cloud, open source / not, fully managed / not
• Scalability for Developers: Clean APIs allow developers to contribute modules independently
[Figure: SDKs for Languages A, B, and C; the Beam Model; Runners 1, 2, and 3]
BEAM VISION: MIX AND MATCH SDKS AND RUNTIMES
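The separation between pipeline definition and runner can be sketched as follows. This is plain Python and does not use any Beam API; `EagerRunner` and `LazyRunner` are hypothetical classes invented for the example. The sketch only shows the shape of the idea: the pipeline is described once, and different runners execute it with different strategies while producing the same result.

```python
from typing import Callable, Iterable, List

Step = Callable[[Iterable[int]], Iterable[int]]

# The pipeline is described once, independently of where it runs.
pipeline: List[Step] = [
    lambda xs: (x * 2 for x in xs),       # double every element
    lambda xs: (x for x in xs if x > 2),  # keep elements greater than 2
]


class EagerRunner:
    """Hypothetical runner: materializes a list after every step."""

    def run(self, steps: List[Step], data: Iterable[int]) -> List[int]:
        for step in steps:
            data = list(step(data))
        return list(data)


class LazyRunner:
    """Hypothetical runner: chains generators and evaluates once at the end."""

    def run(self, steps: List[Step], data: Iterable[int]) -> List[int]:
        for step in steps:
            data = step(data)
        return list(data)


eager_result = EagerRunner().run(pipeline, [1, 2, 3])
lazy_result = LazyRunner().run(pipeline, [1, 2, 3])
```

Swapping the runner changes how the work is executed, not what the pipeline computes, which is the property the slide describes as choice of runtime.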
APACHE SPARK
• Open-source cluster-computing framework
• Large ecosystem of APIs and tools
• Runs on premises or in the cloud
APACHE FLINK
• Open-source distributed data processing engine
• High-throughput and low-latency stream processing
• Runs on premises or in the cloud
GOOGLE CLOUD DATAFLOW
• Fully-managed service for batch and stream data processing
• Provides dynamic auto-scaling, monitoring tools, and tight integration with Google Cloud Platform
EXAMPLE BEAM RUNNERS
[Figure: Google Cloud data pipeline with the stages Capture, Store, Process, Analyze, Use. Labels shown: GA 360, Adwords, DoubleClick, YouTube, Firebase, Stackdriver, Storage Transfer Service, Cloud Pub/Sub, Cloud Dataflow, Cloud Dataproc, Stream, Batch, Cloud Storage (files), Cloud Bigtable (NoSQL), BigQuery Storage (tables), BigQuery Analytics, SQL, Cloud Datalab, ML Engine, Real-time analytics, Real-time dashboard, Real-time alerts]
BEAM ON GOOGLE CLOUD: SERVERLESS DATA PROCESSING
• Apache Beam: https://beam.apache.org
• Google Cloud Platform: https://cloud.google.com
• Streaming 101 and 102: The World Beyond Batch
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
• The Dataflow Model paper from VLDB 2015
http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
MORE INFO
THANK YOU!

ApacheBeam_Google_Theater_TalendConnect2017.pdf
