Introduction to Apache Beam
JB Onofré - Talend
Who am I?
● Talend
○ Software Architect
○ Apache team
● Apache
○ Member of the Apache Software Foundation
○ Champion/Mentor/PPMC/PMC/Committer for ~ 20 projects (Beam, Falcon, Lens, Brooklyn,
Slider, Karaf, Camel, ActiveMQ, ACE, Archiva, Aries, ServiceMix, Syncope, jClouds, Unomi,
Guacamole, BatchEE, Sirona, Incubator, …)
What is Apache Beam?
1. Agnostic (unified batch + stream) Beam programming model
2. Dataflow Java SDK (soon Python, DSLs)
3. Runners for Dataflow
a. Apache Flink (thanks to data Artisans)
b. Apache Spark (thanks to Cloudera)
c. Google Cloud Dataflow (fast, no-ops)
d. Local (in-process) runner for testing
e. OSGi/Karaf
Why Apache Beam?
1. Portable - The same code runs on different runners (abstraction) and
backends: on-premises, in the cloud, or locally
2. Unified - Same unified model for batch and stream processing
3. Advanced features - Event windowing, triggering, watermarking, lateness handling, etc.
4. Extensible model and SDK - Extensible API; can define custom sources to
read and write in parallel
Beam Programming Model
Data processing pipeline (executed via a Beam runner):
Input → PTransform/IO → PTransform → PTransform → Output
Beam Programming Model
1. Pipelines - data processing job as a directed graph of steps
2. PCollection - the data inside a pipeline
3. Transform - a step in the pipeline (taking PCollections as input and producing
PCollections); core transforms are sketched after this list
a. Core transforms - common transformations provided (ParDo, GroupByKey, …)
b. Composite transforms - combine multiple transforms
c. IO transforms - endpoints of a pipeline that create PCollections (consumer/root) or use
PCollections to “write” data outside of the pipeline (producer)
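A minimal sketch of the core-transform style, assuming a lines PCollection<String> produced earlier by an IO transform:
// Sketch: a ParDo (element-wise) followed by a GroupByKey (both core transforms).
PCollection<KV<String, Iterable<Long>>> grouped = lines
    .apply(ParDo.of(new DoFn<String, KV<String, Long>>() {
      @Override
      public void processElement(ProcessContext c) {
        // Emit each element as a key paired with a count of 1.
        c.output(KV.of(c.element(), 1L));
      }
    }))
    .apply(GroupByKey.<String, Long>create());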
Beam Programming Model - PCollection
1. A PCollection is immutable, does not support random access to elements, and
belongs to a pipeline
2. Each element in a PCollection has a timestamp (set by the IO Source)
3. Coders support different data types
4. Bounded (batch) or unbounded (streaming) PCollections (depending on the IO
Source)
5. Unbounded PCollections are grouped with windowing (thanks to the timestamps);
a sketch follows this list
a. Fixed time window
b. Sliding time window
c. Session window
d. Global window (for bounded PCollections)
e. Time skew and data lag (late data) are handled with triggers (time-based with watermarks,
data-based with counts, or composite)
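As an illustration, here is a minimal windowing sketch (assuming an unbounded events PCollection of the GameActionInfo type used later in this deck):
// Sketch: group an unbounded PCollection into one-hour fixed windows;
// each element's timestamp determines which window it falls into.
PCollection<GameActionInfo> windowed = events.apply(
    Window.<GameActionInfo>into(FixedWindows.of(Duration.standardHours(1))));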
Beam Programming Model - IO
1. IO Sources (read data as PCollections) and Sinks (write PCollections)
2. Support bounded and/or unbounded PCollections
3. Provided IO - File, BigQuery, BigTable, Avro, and more coming (Kafka, JMS, …);
usage is sketched after this list
4. Custom IO - extensible IO API to create custom sources & sinks
5. Should deal with timestamps, watermarks, deduplication, and parallelism (depending
on the needs)
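For example, the provided file-based IO acts as the root source and the terminal sink of a pipeline (a minimal sketch; the paths are placeholders):
// Sketch: TextIO as a bounded source (root) and as a sink.
PCollection<String> lines = pipeline.apply(TextIO.Read.from("/data/input*.txt"));
lines.apply(TextIO.Write.to("/data/output"));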
Apache Beam SDKs
1. API for Beam Programming Model (design pipelines, transforms, …)
2. Current SDKs
a. Java - First SDK and primary focus for refactoring and improvement
b. Python - Dataflow SDK preview for batch processing, will be migrated to Apache Beam once
the Java SDK has been stabilized (and APIs/interfaces redefined)
3. Coming (possible) SDKs/languages - Scala, Go, Ruby, etc.
4. DSLs - domain specific languages on top of the SDKs (Java fluent DSL on top
of Java SDK, …)
Java SDK
public static void main(String[] args) {
  // Create a pipeline parameterized by command-line flags.
  Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args));
  p.apply(TextIO.Read.from("/path/to..."))   // Read input.
   .apply(new CountWords())                  // Do some processing.
   .apply(TextIO.Write.to("/path/to..."));   // Write output.
  // Run the pipeline.
  p.run();
}
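CountWords above is a composite transform whose body the deck does not show; here is a hedged sketch of what it might look like (ExtractWordsFn is a hypothetical DoFn that splits lines into words):
// Hypothetical sketch of the CountWords composite transform used above.
static class CountWords extends PTransform<PCollection<String>, PCollection<String>> {
  @Override
  public PCollection<String> apply(PCollection<String> lines) {
    return lines
        .apply(ParDo.of(new ExtractWordsFn()))  // hypothetical: line -> words
        .apply(Count.<String>perElement())      // count occurrences of each word
        .apply(MapElements
            .via((KV<String, Long> wordCount) ->
                wordCount.getKey() + ": " + wordCount.getValue())
            .withOutputType(new TypeDescriptor<String>() {}));
  }
}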
Beam Programming Model
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(SessionWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
The Apache Beam Model (by way of the Dataflow model) includes many primitives and features which
are powerful but hard to express in other models and languages. (The trigger names above are the
model's shorthand rather than literal SDK classes; a concrete translation follows.)
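One possible translation of that shorthand into concrete SDK classes (a sketch; the allowed-lateness value is an assumption, since the shorthand does not specify one):
// Sketch: the same sessions/triggering expressed with real trigger classes.
PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(
            Sessions.withGapDuration(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardDays(1))  // assumption, not in the shorthand
        .accumulatingFiredPanes())
    .apply(Sum.<String>integersPerKey());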
Runners and Backends
● Runners “translate” the code to a target backend (the runner itself doesn’t
provide the backend)
● Many runners are tied to other top-level Apache projects, such as Apache Flink
and Apache Spark
● As a result, runners can run on-premises (on your local Flink cluster, for example) or
in a public cloud (using Google Cloud Dataproc or Amazon EMR)
● Apache Beam is focused on treating runners as a top-level use case (with APIs,
support, etc.) so runners can be developed with minimal friction, for maximum
pipeline portability; switching runner is a configuration choice, as sketched below
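A minimal sketch of that configuration choice, using the Dataflow-SDK-era class names (option and runner class names vary by runner and SDK version, so treat them as assumptions):
// Sketch: the backend is picked by setting the runner on the pipeline options.
DataflowPipelineOptions options = PipelineOptionsFactory
    .fromArgs(args).withValidation()
    .as(DataflowPipelineOptions.class);
options.setRunner(DataflowPipelineRunner.class);  // or a Flink/Spark runner class
Pipeline p = Pipeline.create(options);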
Beam Runners
● Google Cloud Dataflow
● Apache Flink*
● Apache Spark*
● Other runners* (local, OSGi, …)
[*] With varying levels of fidelity.
The Apache Beam site (http://beam.incubator.apache.org) will have more details soon.
Use Cases
Apache Beam is a great choice for both batch and stream processing and can
handle bounded and unbounded datasets
Batch can focus on ETL/ELT, catch-up processing, daily aggregations, and so on
Stream can focus on handling real-time processing on a record-by-record basis
Real use cases
● Mobile gaming data processing, both batch and stream processing
(https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/)
● Real-time event processing from IoT devices
Use Case - Gaming
● A game stores its results in a CSV file:
○ Player,team,score,timestamp
● Two pipelines:
○ UserScore (batch) - sums scores for each user
○ HourlyScore (batch) - similar to UserScore but with an hourly window: it calculates summed
scores per team on fixed windows
Use Case - Gaming - UserScore - Pipeline
Pipeline pipeline = Pipeline.create(options);
// Read events from a text file and parse them.
pipeline.apply(TextIO.Read.from(options.getInput()))
    .apply(ParDo.named("ParseGameEvent").of(new ParseEventFn()))
    // Extract and sum username/score pairs from the event data.
    .apply("ExtractUserScore", new ExtractAndSumScore("user"))
    .apply("WriteUserScoreSums",
        new WriteToBigQuery<KV<String, Integer>>(options.getTableName(),
            configureBigQueryWrite()));
// Run the batch pipeline.
pipeline.run();
Use Case - Gaming - UserScore - Avro Coder
@DefaultCoder(AvroCoder.class)
static class GameActionInfo {
  @Nullable String user;
  @Nullable String team;
  @Nullable Integer score;
  @Nullable Long timestamp;

  public GameActionInfo(String user, String team, Integer score, Long timestamp) {
    …
  }
  …
}
Use Case - Gaming - UserScore - Parse Event Fn
static class ParseEventFn extends DoFn<String, GameActionInfo> {
  // Log and count parse errors.
  private static final Logger LOG = LoggerFactory.getLogger(ParseEventFn.class);
  private final Aggregator<Long, Long> numParseErrors =
      createAggregator("ParseErrors", new Sum.SumLongFn());

  @Override
  public void processElement(ProcessContext c) {
    String[] components = c.element().split(",");
    try {
      String user = components[0].trim();
      String team = components[1].trim();
      Integer score = Integer.parseInt(components[2].trim());
      Long timestamp = Long.parseLong(components[3].trim());
      GameActionInfo gInfo = new GameActionInfo(user, team, score, timestamp);
      c.output(gInfo);
    } catch (ArrayIndexOutOfBoundsException | NumberFormatException e) {
      numParseErrors.addValue(1L);
      LOG.info("Parse error on " + c.element() + ", " + e.getMessage());
    }
  }
}
Use Case - Gaming - UserScore - Sum Score Transform
public static class ExtractAndSumScore
    extends PTransform<PCollection<GameActionInfo>, PCollection<KV<String, Integer>>> {

  private final String field;

  ExtractAndSumScore(String field) {
    this.field = field;
  }

  @Override
  public PCollection<KV<String, Integer>> apply(PCollection<GameActionInfo> gameInfo) {
    return gameInfo
        .apply(MapElements
            .via((GameActionInfo gInfo) -> KV.of(gInfo.getKey(field), gInfo.getScore()))
            .withOutputType(new TypeDescriptor<KV<String, Integer>>() {}))
        .apply(Sum.<String>integersPerKey());
  }
}
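The transform above relies on a getKey(field) accessor on GameActionInfo that the deck does not show; a plausible sketch (assumed, not from the original):
// Sketch: pick the grouping key ("user" or "team") by field name.
public String getKey(String field) {
  return "team".equals(field) ? this.team : this.user;
}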
Use Case - Gaming - HourlyScore - Pipeline
pipeline.apply(TextIO.Read.from(options.getInput()))
    .apply(ParDo.named("ParseGameEvent").of(new ParseEventFn()))
    // Filter with byPredicate to ignore some data.
    .apply("FilterStartTime", Filter.byPredicate((GameActionInfo gInfo)
        -> gInfo.getTimestamp() > startMinTimestamp.getMillis()))
    .apply("FilterEndTime", Filter.byPredicate((GameActionInfo gInfo)
        -> gInfo.getTimestamp() < stopMinTimestamp.getMillis()))
    // Assign event timestamps and use a fixed-time window.
    .apply("AddEventTimestamps", WithTimestamps.of(
        (GameActionInfo i) -> new Instant(i.getTimestamp())))
    .apply(Window.named("FixedWindowsTeam")
        .<GameActionInfo>into(FixedWindows.of(Duration.standardMinutes(60))))
    // Extract and sum teamname/score pairs from the event data.
    .apply("ExtractTeamScore", new ExtractAndSumScore("team"))
    // Write the result.
    .apply("WriteTeamScoreSums",
        new WriteWindowedToBigQuery<KV<String, Integer>>(options.getTableName(),
            configureWindowedTableWrite()));
pipeline.run();
Roadmap
● 02/01/2016 - Enter Apache Incubator
● 02/25/2016 - 1st commit to ASF repository
● Early 2016 - Design for use cases, begin refactoring
● Mid 2016 - Slight chaos
● Late 2016 - Multiple runners execute Beam pipelines
● End 2016 - Cloud Dataflow should run Beam pipelines
More information and get involved!
1: Read about Apache Beam
Apache Beam website - http://beam.incubator.apache.org
2: See what the Apache Beam team is doing
Apache Beam JIRA - https://issues.apache.org/jira/browse/BEAM
Apache Beam mailing lists - http://beam.incubator.apache.org/mailing_lists/
3: Contribute!
Apache Beam git repo - https://github.com/apache/incubator-beam
Q&A