APACHE BEAM –
THE DATA
ENGINEER’S
HOPE
Robert Mroczkowski,
Piotr Wikieł
Voyager 1 — "Pale Blue Dot". NASA, February 14, 1990
ABOUT US
▸ Data Platform Engineers at Allegro
▸ Maintaining probably one of the largest Hadoop clusters in Poland
▸ We use public clouds for data
processing on a daily basis
▸ Roots:
▸ Robert — sysop
▸ Piotr — dev
2
VegeTables
AGENDA
▸ ETL processes and Lambda Architecture
▸ Apache Beam framework foundations
▸ Transformations, windows, tags, etc.
▸ Batch and streaming
▸ Examples, use cases
3
BUT IN OUR PREVIOUS DB DATA HAD
BEEN ARRIVING SECONDS (NOT
HOURS) AFTER IT WAS PRODUCED…
Jane Doe, Department of Analytics,
Company Ltd.
5
LAMBDA ARCHITECTURE
▸ Kafka — source
▸ Hadoop — batch
▸ Flink — speed
▸ Druid — serving
6
LAMBDA ARCHITECTURE
▸ Complicated, huh?
▸ We have to build separate software for real-time and batch
computations
▸ … which have to be maintained, probably by different
teams
▸ Why not use one tool to rule them all?
7
APACHE
BEAM
APACHE BEAM
UNIFIED MODEL FOR EXECUTING BOTH BATCH
AND STREAM DATA PROCESSING PIPELINES
APACHE BEAM
UNIFIED MODEL FOR EXECUTING BOTH BATCH
AND STREAM DATA PROCESSING PIPELINES
[whip sound]
APACHE BEAM
▸ Born in Google, and then open-sourced
▸ Designed especially for ETL pipelines
▸ Used for both streaming and batch processing
▸ Heavily parallel processing
▸ Exactly-once semantics
11
IN CODE
▸ Backends (Spark, Flink, Apex, Dataflow, Gearpump, Direct)
▸ Java (rich) and Python (poor but pretty) SDKs
▸ Open-source Scala API (GitHub -> spotify/scio)
12
APACHE BEAM FRAMEWORK
▸ Pipeline
▸ Input/Output
▸ PCollection — distributed data representation (similar to Spark's RDD)
▸ Transformation — a set of operations on data, usually a single operation
13
TRANSFORMATIONS
▸ ParDo — like map in MapReduce
▸ Filter elements of PCollection
▸ Format values in PCollection
▸ Cast types
▸ Computations on each single element
▸ collection.apply(ParDo.of(new SomeDoFn()))
14
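A minimal sketch of a DoFn that the ParDo call above could use; the class name and logic are illustrative (not from the talk), assuming the Beam Java SDK:

import org.apache.beam.sdk.transforms.DoFn;

// Illustrative DoFn: upper-cases every element of a PCollection<String>.
public class UpperCaseFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(c.element().toUpperCase());
  }
}

// Usage: collection.apply(ParDo.of(new UpperCaseFn()))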
TRANSFORMATIONS
▸ GroupByKey
▸ Groups values of k/v pairs that share the same key
▸ Like the shuffle phase in MapReduce
▸ For streaming, windowing or triggers are necessary
15
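A sketch of GroupByKey on a keyed PCollection; the data and names are illustrative, imports omitted, and pipeline is an assumed existing Pipeline:

// (url, views) pairs grouped into (url, [views...]) per key (and per window in streaming).
PCollection<KV<String, Long>> views = pipeline.apply(Create.of(
    KV.of("/home", 1L), KV.of("/home", 1L), KV.of("/cart", 1L)));
PCollection<KV<String, Iterable<Long>>> viewsPerUrl =
    views.apply(GroupByKey.<String, Long>create());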
TRANSFORMATIONS
▸ CoGroupByKey
▸ Joins values of k/v pairs with the same key across separate PCollections
▸ .apply(CoGroupByKey.create())
▸ For streaming — windowing or triggers are necessary
16
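A sketch of CoGroupByKey joining two keyed PCollections; names and data are illustrative, imports omitted, and pipeline is an assumed existing Pipeline:

final TupleTag<String> emailTag = new TupleTag<>();
final TupleTag<String> phoneTag = new TupleTag<>();

PCollection<KV<String, String>> emails = pipeline.apply("Emails",
    Create.of(KV.of("ala", "ala@example.com")));
PCollection<KV<String, String>> phones = pipeline.apply("Phones",
    Create.of(KV.of("ala", "+48 123 456 789")));

// One CoGbkResult per key, holding the values from both inputs under their tags.
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(emailTag, emails)
        .and(phoneTag, phones)
        .apply(CoGroupByKey.create());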
TRANSFORMATIONS
▸ Combine
▸ Reduce from the MapReduce paradigm
▸ Combines all elements in a PCollection
▸ Combines elements for a specific key in k/v pairs
▸ For streaming, accumulates elements per window
17
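A sketch of Combine used globally and per key; data and names are illustrative, imports omitted:

PCollection<Integer> amounts = pipeline.apply(Create.of(10, 20, 30));
// Single combined result (per window in streaming).
PCollection<Integer> total = amounts.apply(Combine.globally(Sum.ofIntegers()));

PCollection<KV<String, Integer>> perUser = pipeline.apply(Create.of(
    KV.of("ala", 10), KV.of("ala", 20), KV.of("ola", 5)));
// One combined value per key.
PCollection<KV<String, Integer>> totalPerUser = perUser.apply(Sum.integersPerKey());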
TRANSFORMATIONS
▸ Flatten
▸ Merge several PCollections
▸ Partition
▸ Split PCollection
18
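A sketch of Flatten and Partition; firstBatch and secondBatch are assumed PCollection<String>s, imports omitted:

// Flatten: merge several PCollections of the same type into one.
PCollection<String> merged = PCollectionList.of(firstBatch).and(secondBatch)
    .apply(Flatten.<String>pCollections());

// Partition: split one PCollection into a fixed number of parts.
PCollectionList<String> parts = merged.apply(
    Partition.of(2, (element, numPartitions) ->
        Math.floorMod(element.hashCode(), numPartitions)));
PCollection<String> firstPart = parts.get(0);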
TRANSFORMATIONS
▸ Several predefined transformations, e.g.:
▸ Filter.By
▸ Count
▸ Custom transformations — their user code should be:
▸ Serializable
▸ Thread-compatible
▸ Idempotent
19
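A sketch of a custom composite transform whose user code follows the requirements above; the name CountTags and the tag-splitting logic are hypothetical:

import java.util.Arrays;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Composite transform: tweets in, (tag, count) pairs out.
public class CountTags extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
  @Override
  public PCollection<KV<String, Long>> expand(PCollection<String> tweets) {
    return tweets
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via((String tweet) -> Arrays.asList(tweet.split("\\s+"))))
        .apply(Count.perElement());
  }
}

// Usage: tweets.apply(new CountTags())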
TAGGED OUTPUT
20
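A sketch of what tagged (multi-) output looks like in code; tag names and the filtering logic are illustrative, imports omitted, and words is an assumed PCollection<String>:

final TupleTag<String> hashtagTag = new TupleTag<String>() {};
final TupleTag<String> plainTag = new TupleTag<String>() {};

PCollectionTuple tagged = words.apply(
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        if (c.element().startsWith("#")) {
          c.output(c.element());               // main output (hashtagTag)
        } else {
          c.output(plainTag, c.element());     // additional tagged output
        }
      }
    }).withOutputTags(hashtagTag, TupleTagList.of(plainTag)));

PCollection<String> hashtags = tagged.get(hashtagTag);
PCollection<String> plainWords = tagged.get(plainTag);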
SIDE INPUT – ENRICHMENT
▸ Additional data in ParDo
▸ Computed at runtime
21
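A sketch of enriching one PCollection with a side input computed from another; names are illustrative, imports omitted:

// Turn a (userId, name) PCollection into a map view, materialized at runtime.
final PCollectionView<Map<String, String>> namesView = userNames.apply(View.asMap());

PCollection<String> enriched = events.apply(
    ParDo.of(new DoFn<KV<String, String>, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Map<String, String> names = c.sideInput(namesView);
        String user = names.getOrDefault(c.element().getKey(), "unknown");
        c.output(user + ": " + c.element().getValue());
      }
    }).withSideInputs(namesView));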
IO
22
FILE       MESSAGING   DATABASE
HDFS       Kinesis     Cassandra
GCS        Kafka       HBase
S3         PubSub      Hive
Local      JMS         BigQuery
Avro       MQTT        BigTable
Text                   Datastore
TFRecord               Spanner
XML                    Mongo
Tika                   Redis
                       Solr
APACHE BEAM
TWEETS
Predictions
Tweets
READS
WRITES
#juzwiosna; #jug juz dzis; #java 10 GA released
23
APACHE BEAM
TWEETS
Predictions
Tweets
READS
WRITES
#juzwiosna; #jug juz dzis; #java 10 GA released
EXTRACT jug, java, juzwiosna
24
APACHE BEAM
TWEETS
Predictions
Tweets
READS
WRITES
#juzwiosna; #jug juz dzis; #java 10 GA released
EXTRACT jug, java, juzwiosna
COUNT jug -> 10k, java -> 4M, juzwiosna -> 100
25
APACHE BEAM
TWEETS
Predictions
Tweets
READS
WRITES
#juzwiosna; #jug juz dzis; #java 10 GA released
EXTRACT jug, java, juzwiosna
COUNT
EXPAND
jug -> 10k, java -> 4M, juzwiosna -> 100
{j -> [jug -> 10k, java -> 4M, juzwiosna -> 100],
 ju -> [jug -> 10k, juzwiosna -> 100]}
26
APACHE BEAM
TWEETS
Predictions
Tweets
READS
WRITES
#juzwiosna; #jug juz dzis; #java 10 GA released
{j->[java, jug, juzwiosna], ju->[jug, juzwiosna]}
EXTRACT jug, java, juzwiosna
COUNT
EXPAND
TOP(3)
jug -> 10k, java -> 4M, juzwiosna -> 100
{j -> [jug -> 10k, java -> 4M, juzwiosna -> 100],
 ju -> [jug -> 10k, juzwiosna -> 100]}
27
APACHE BEAM
TWEETS — BATCH
READS
WRITES
EXTRACT
COUNT
EXPAND
TOP(3)
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
p.begin()
    .apply(TextIO.read().from("..."))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.perElement())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(TextIO.write().to("..."));
p.run();
28
APACHE BEAM FRAMEWORK – STREAMING
▸ Windows
▸ One global window by default
▸ Applied for group, combine, or output transformations
▸ GroupByKey — data is grouped by both key and window
29
WINDOWS
FIXED TIME WINDOWS
30
WINDOWS
SLIDING WINDOWS
31
WINDOWS
SESSION WINDOWS
32
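Sketches of the three window types above; input is an assumed PCollection<String> whose elements carry event timestamps, imports omitted:

// Fixed: non-overlapping 5-minute windows.
input.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))));

// Sliding: 60-minute windows starting every 5 minutes (overlapping).
input.apply(Window.<String>into(
    SlidingWindows.of(Duration.standardMinutes(60)).every(Duration.standardMinutes(5))));

// Session: per-key windows closed after a 10-minute gap of inactivity.
input.apply(Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(10))));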
APACHE BEAM
TWEETS – BATCH
READS
WRITES
EXTRACT
COUNT
EXPAND
TOP(3)
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
p.begin()
    .apply(PubsubIO.readStrings().fromTopic("..."))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.perElement())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(PubsubIO.writeStrings().to("..."));
p.run();
33
APACHE BEAM
TWEETS – STREAMING
READS
WRITES
EXTRACT
COUNT
EXPAND
TOP(3)
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
p.begin()
    .apply(PubsubIO.readStrings().fromTopic("..."))
    .apply(Window.into(SlidingWindows.of(Duration.standardMinutes(60))))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.perElement())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(PubsubIO.writeStrings().to("..."));
p.run();
34
APACHE BEAM FRAMEWORK – STREAMING
▸ Watermarks
▸ Simply put: the lag between event time and processing time
▸ Beam keeps track of the watermark
▸ When the watermark passes the end of a window, data arriving for that window is considered late and, by default, discarded
▸ Allowed lateness can be configured to still accept it (sketch below)
35
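A sketch of allowing late data instead of discarding it; the durations are illustrative, imports omitted:

// Keep each window open for data arriving up to 2 hours behind the watermark.
input.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
    .withAllowedLateness(Duration.standardHours(2))
    .accumulatingFiredPanes());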
APACHE BEAM FRAMEWORK – STREAMING
▸ Triggers
▸ Change the default windowing behaviour
▸ Trade-off: completeness / latency / cost
▸ Based on event time / processing time / data (sketch below)
36
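A sketch of trading completeness, latency and cost with a trigger; the specific durations are illustrative, imports omitted:

input.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
    .triggering(AfterWatermark.pastEndOfWindow()
        // Latency: emit speculative results every minute of processing time.
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
        // Completeness: emit an update for every late element.
        .withLateFirings(AfterPane.elementCountAtLeast(1)))
    .withAllowedLateness(Duration.standardMinutes(30))
    .accumulatingFiredPanes());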
APACHE BEAM FRAMEWORK – STATEFUL PROCESSING
37
       (k1,w1)   (k2,w2)   (k3,w3)
"s1"   12        33        -5
"s2"   "kot"     "pies"    "okoń"
"s3"   0.03      0.12      0.33
"s4"   "ala"     "ma"      "kota"
CAPABILITY MATRIX
38
CAPABILITY MATRIX
39
APACHE BEAM – RUN
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner
40
APACHE BEAM – RUN
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" \
    -Pspark-runner
41
APACHE BEAM – RUN
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://bb/tmp --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://bb/counts" \
    -Pdataflow-runner
42
APACHE BEAM – USE CASES
▸ ETL
▸ Fraud detection
▸ Ads pricing (similar: Uber pricing)
▸ Sentiment analysis
43
LINKS
▸ Google Dataflow paper: https://research.google.com/pubs/pub43864.html
▸ Apache Beam: https://beam.apache.org/
44
Q&A
THANK YOU
Q&A
THANK YOU
We are hiring ;-)

https://goo.gl/zzqXLS

Apache Beam — a data engineer's ray of hope. Toruń JUG, 28.03.2018