3. Open source (top-level Apache project)
Portable
Unifies batch and stream
Cloud-native
Built on 15 years of large-scale data processing at Google
You don’t need to be a developer to benefit from Beam
APACHE BEAM: THE KEY TO MODERN DATA PROCESSING
6. Progressive evolution from batch to stream
- Stream as the new default
Cost/perf trade-offs without re-architecting
- Just turn the knob
ML: data preparation consistency between training & scoring
- Same pipeline to train in batch and score in stream
BENEFIT OF BATCH / STREAM UNIFICATION
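The "same pipeline to train in batch and score in stream" idea can be illustrated with a minimal plain-Python sketch. It does not use the Beam SDK: `prepare_features`, `run_pipeline`, `live_events`, and the sample records are hypothetical names and data chosen for illustration. The same preparation logic runs first over a bounded list (batch) and then over a generator standing in for an unbounded source (stream).

```python
from typing import Dict, Iterable, Iterator


def prepare_features(record: Dict[str, float]) -> Dict[str, float]:
    """Hypothetical data-preparation step shared by training and scoring."""
    return {
        "user": record["user"],
        # Guard against division by zero for users with no visits.
        "spend_per_visit": record["spend"] / max(record["visits"], 1),
    }


def run_pipeline(source: Iterable[Dict[str, float]]) -> Iterator[Dict[str, float]]:
    """Apply the same logic lazily to any source, bounded or unbounded."""
    for record in source:
        yield prepare_features(record)


# Batch: a bounded, in-memory dataset (for example, historical training data).
historical = [
    {"user": 1, "spend": 30.0, "visits": 3},
    {"user": 2, "spend": 10.0, "visits": 0},
]
batch_result = list(run_pipeline(historical))


def live_events() -> Iterator[Dict[str, float]]:
    """Generator standing in for an unbounded stream of scoring requests."""
    yield {"user": 1, "spend": 30.0, "visits": 3}
    yield {"user": 2, "spend": 10.0, "visits": 0}


# Stream: records are processed one at a time as they arrive.
stream_result = [features for features in run_pipeline(live_events())]

# Both modes produce identical features because they share one code path.
assert batch_result == stream_result
```

Because the preparation step lives in one function, a change to the feature logic applies to training and scoring at once, which is the consistency benefit the slide refers to.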
8. What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
THE BEAM MODEL: ASKING THE RIGHT QUESTIONS
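The four questions can be made concrete with a small plain-Python sketch. This is an illustration of the concepts only, not the Beam API: the window size, the sample events, the per-element trigger, and the names `window_start` and `run` are all assumptions made for this example. "What" is a per-window sum, "where" is a fixed 60-second event-time window, "when" is a trigger that fires on every arriving element, and "how" is the choice between accumulating and discarding panes.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WINDOW_SIZE = 60  # seconds of event time per fixed window (assumed value)

# (event_time_sec, processing_time_sec, value); the third event arrives late
# in processing time but belongs to the first event-time window.
EVENTS: List[Tuple[int, int, int]] = [
    (10, 12, 5),
    (70, 71, 3),
    (50, 95, 4),
]


def window_start(event_time: int) -> int:
    """Where: assign an element to a fixed window by its event time."""
    return event_time - event_time % WINDOW_SIZE


def run(events: List[Tuple[int, int, int]], accumulating: bool):
    """Emit one pane per arriving element, in processing-time order."""
    state: Dict[int, int] = defaultdict(int)
    panes = []
    # When: a per-element trigger, so a result materializes on every arrival.
    for event_time, processing_time, value in sorted(events, key=lambda e: e[1]):
        start = window_start(event_time)
        # What: the result being computed is a sum per window.
        state[start] += value
        panes.append((processing_time, start, state[start]))
        if not accumulating:
            # How: discarding mode resets state, so each pane holds a delta.
            state[start] = 0
    return panes


# Each pane is (processing_time, window_start, emitted_sum).
accumulating_panes = run(EVENTS, accumulating=True)
discarding_panes = run(EVENTS, accumulating=False)
```

In accumulating mode the late element refines the first window's earlier result into a new running total; in discarding mode the late pane carries only the new contribution, and a downstream consumer would add the panes together itself.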
15. The Beam Model: the abstractions at the core of Apache Beam
Choice of API: Users write their pipelines in a language that’s familiar and integrated with their other tooling
Choice of Runtime: Users choose the right runner for their current needs -- on-prem / cloud, open source / not, fully managed / not
Scalability for Developers: Clean APIs allow developers to contribute modules independently
[Diagram: SDKs for Language A, Language B, and Language C sit on top of the Beam Model, which in turn runs on Runner 1, Runner 2, or Runner 3]
BEAM VISION: MIX AND MATCH SDKS AND RUNTIMES
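The separation between a pipeline description and the runner that executes it can be sketched in plain Python. This is a toy illustration under stated assumptions, not the Beam SDK: `Pipeline`, `Runner`, `LocalRunner`, and `ThreadedRunner` are hypothetical classes invented here, and the real runner interface is considerably richer.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


class Pipeline:
    """Runner-independent description: an ordered list of element-wise steps."""

    def __init__(self) -> None:
        self.steps: List[Callable[[int], int]] = []

    def apply(self, step: Callable[[int], int]) -> "Pipeline":
        self.steps.append(step)
        return self  # returning self allows chained apply() calls


def apply_all(pipeline: Pipeline, element: int) -> int:
    """Run one element through every step of the pipeline, in order."""
    for step in pipeline.steps:
        element = step(element)
    return element


class Runner(ABC):
    """Hypothetical runner interface: executes any Pipeline on some backend."""

    @abstractmethod
    def run(self, pipeline: Pipeline, data: Iterable[int]) -> List[int]:
        ...


class LocalRunner(Runner):
    """Executes the pipeline sequentially in the calling thread."""

    def run(self, pipeline: Pipeline, data: Iterable[int]) -> List[int]:
        return [apply_all(pipeline, x) for x in data]


class ThreadedRunner(Runner):
    """Executes the same pipeline on a thread pool; map() keeps input order."""

    def run(self, pipeline: Pipeline, data: Iterable[int]) -> List[int]:
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(pool.map(lambda x: apply_all(pipeline, x), data))


# The pipeline is written once and handed to either runner unchanged.
pipeline = Pipeline().apply(lambda x: x + 1).apply(lambda x: x * 2)
local_result = LocalRunner().run(pipeline, [1, 2, 3])
threaded_result = ThreadedRunner().run(pipeline, [1, 2, 3])
```

The pipeline author never references a specific runner, so swapping the execution backend is a one-line change at the call site.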
16. APACHE SPARK
Open-source cluster-computing framework
Large ecosystem of APIs and tools
Runs on premises or in the cloud
APACHE FLINK
Open-source distributed data processing engine
High-throughput and low-latency stream processing
Runs on premises or in the cloud
GOOGLE CLOUD DATAFLOW
Fully managed service for batch and stream data processing
Provides dynamic auto-scaling, monitoring tools, and tight integration with Google Cloud Platform
EXAMPLE BEAM RUNNERS
17. [Diagram: Google Cloud data pipeline with stages labeled Capture, Process, Store, Analyze, and Use; processing paths labeled Stream and Batch]
Components shown: GA 360, Firebase, Adwords, DoubleClick, YouTube, Cloud Pub/Sub, Storage Transfer Service, Cloud Dataflow, Cloud Dataproc, BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files), BigQuery Analytics, SQL, Cloud Datalab, ML Engine, Stackdriver
Outputs shown: real-time analytics, real-time dashboard, real-time alerts
BEAM ON GOOGLE CLOUD: SERVERLESS DATA PROCESSING
18. Streaming 101 and 102: The World Beyond Batch
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
BEAM
MORE INFO
Apache Beam: https://beam.apache.org
Google Cloud Platform: https://cloud.google.com
The Dataflow Model paper from VLDB 2015
http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf