APACHE BEAM –
THE DATA
ENGINEER’S
HOPE
Robert Mroczkowski,
Piotr Wikieł
Voyager 1 — "Pale Blue Dot". NASA, February 14, 1990
ABOUT US
▸ Data Platform Engineers at Allegro
▸ Maintaining probably one of the largest Hadoop clusters in Poland
▸ We use public clouds for data
processing on a daily basis
▸ Roots:
▸ Robert — sysop
▸ Piotr — dev
2
VegeTables
AGENDA
▸ ETL processes and Lambda Architecture
▸ Apache Beam framework foundations
▸ Transformations, windows, tags, etc.
▸ Batch and streaming
▸ Examples, use cases
3
BUT IN OUR PREVIOUS DB DATA HAD
BEEN ARRIVING SECONDS (NOT
HOURS) AFTER IT WAS PRODUCED…
Jane Doe, Department of Analytics,
Company Ltd.
5
LAMBDA ARCHITECTURE
▸ Kafka — source
▸ Hadoop — batch
▸ Flink — speed
▸ Druid — serving
6
LAMBDA ARCHITECTURE
▸ Complicated, huh?
▸ We have to build separate software for real-time and batch
computations
▸ … which have to be maintained, probably by different
teams
▸ Why not use one tool to rule them all?
7
APACHE
BEAM
APACHE BEAM
UNIFIED MODEL FOR EXECUTING BOTH BATCH
AND STREAM DATA PROCESSING PIPELINES
APACHE BEAM
UNIFIED MODEL FOR EXECUTING BOTH BATCH
AND STREAM DATA PROCESSING PIPELINES
[whip sound]
APACHE BEAM
▸ Born in Google, and then open-sourced
▸ Designed especially for ETL pipelines
▸ Used for both streaming and batch processing
▸ Heavily parallel processing
▸ Exactly-once semantics
11
IN CODE
▸ Backends (Spark, Flink, Apex, Dataflow, Gearpump, Direct)
▸ Java (rich) and Python (poor but pretty) SDKs
▸ Open-source Scala API (GitHub -> spotify/scio)
12
APACHE BEAM FRAMEWORK
▸ Pipeline
▸ Input/Output
▸ PCollection — distributed data representation (similar to Spark's RDD)
▸ Transformation — a set of operations on data, usually a single operation
13
TRANSFORMATIONS
▸ ParDo — like map in MapReduce
▸ Filter elements of PCollection
▸ Format values in PCollection
▸ Cast types
▸ Computations on each single element
▸ collection.apply(ParDo.of(new SomeDoFn()))
14
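A minimal sketch of a DoFn that the ParDo call above could use; the class name and logic are illustrative (not from the talk), assuming the Beam Java SDK:

import org.apache.beam.sdk.transforms.DoFn;

// Illustrative DoFn: upper-cases every element of a PCollection<String>.
public class UpperCaseFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(c.element().toUpperCase());
  }
}

// Usage: collection.apply(ParDo.of(new UpperCaseFn()))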
TRANSFORMATIONS
▸ GroupByKey
▸ Groups values of k/v pairs that share the same key
▸ Like the shuffle phase in MapReduce
▸ For streaming, windowing or triggers are necessary
15
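A sketch of GroupByKey on a keyed PCollection; the data and names are illustrative, imports omitted, and pipeline is an assumed existing Pipeline:

// (url, views) pairs grouped into (url, [views...]) per key (and per window in streaming).
PCollection<KV<String, Long>> views = pipeline.apply(Create.of(
    KV.of("/home", 1L), KV.of("/home", 1L), KV.of("/cart", 1L)));
PCollection<KV<String, Iterable<Long>>> viewsPerUrl =
    views.apply(GroupByKey.<String, Long>create());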
TRANSFORMATIONS
▸ CoGroupByKey
▸ Joins values of k/v pairs with the same key across separate PCollections
▸ .apply(CoGroupByKey.create())
▸ For streaming — windowing or triggers are necessary
16
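A sketch of CoGroupByKey joining two keyed PCollections; names and data are illustrative, imports omitted, and pipeline is an assumed existing Pipeline:

final TupleTag<String> emailTag = new TupleTag<>();
final TupleTag<String> phoneTag = new TupleTag<>();

PCollection<KV<String, String>> emails = pipeline.apply("Emails",
    Create.of(KV.of("ala", "ala@example.com")));
PCollection<KV<String, String>> phones = pipeline.apply("Phones",
    Create.of(KV.of("ala", "+48 123 456 789")));

// One CoGbkResult per key, holding the values from both inputs under their tags.
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(emailTag, emails)
        .and(phoneTag, phones)
        .apply(CoGroupByKey.create());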
TRANSFORMATIONS
▸ Combine
▸ Reduce from the MapReduce paradigm
▸ Combines all elements in a PCollection
▸ Combines elements for a specific key in k/v pairs
▸ For streaming, accumulates elements per window
17
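A sketch of Combine used globally and per key; data and names are illustrative, imports omitted:

PCollection<Integer> amounts = pipeline.apply(Create.of(10, 20, 30));
// Single combined result (per window in streaming).
PCollection<Integer> total = amounts.apply(Combine.globally(Sum.ofIntegers()));

PCollection<KV<String, Integer>> perUser = pipeline.apply(Create.of(
    KV.of("ala", 10), KV.of("ala", 20), KV.of("ola", 5)));
// One combined value per key.
PCollection<KV<String, Integer>> totalPerUser = perUser.apply(Sum.integersPerKey());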
TRANSFORMATIONS
▸ Flatten
▸ Merge several PCollections
▸ Partition
▸ Split PCollection
18
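A sketch of Flatten and Partition; firstBatch and secondBatch are assumed PCollection<String>s, imports omitted:

// Flatten: merge several PCollections of the same type into one.
PCollection<String> merged = PCollectionList.of(firstBatch).and(secondBatch)
    .apply(Flatten.<String>pCollections());

// Partition: split one PCollection into a fixed number of parts.
PCollectionList<String> parts = merged.apply(
    Partition.of(2, (element, numPartitions) ->
        Math.floorMod(element.hashCode(), numPartitions)));
PCollection<String> firstPart = parts.get(0);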
TRANSFORMATIONS
▸ Several predefined transformations, e.g.:
▸ Filter.By
▸ Count
▸ Custom transformations — their user code should be:
▸ Serializable
▸ Thread-compatible
▸ Idempotent
19
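A sketch of a custom composite transform whose user code follows the requirements above; the name CountTags and the tag-splitting logic are hypothetical:

import java.util.Arrays;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Composite transform: tweets in, (tag, count) pairs out.
public class CountTags extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
  @Override
  public PCollection<KV<String, Long>> expand(PCollection<String> tweets) {
    return tweets
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via((String tweet) -> Arrays.asList(tweet.split("\\s+"))))
        .apply(Count.perElement());
  }
}

// Usage: tweets.apply(new CountTags())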
TAGGED OUTPUT
20
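A sketch of what tagged (multi-) output looks like in code; tag names and the filtering logic are illustrative, imports omitted, and words is an assumed PCollection<String>:

final TupleTag<String> hashtagTag = new TupleTag<String>() {};
final TupleTag<String> plainTag = new TupleTag<String>() {};

PCollectionTuple tagged = words.apply(
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        if (c.element().startsWith("#")) {
          c.output(c.element());               // main output (hashtagTag)
        } else {
          c.output(plainTag, c.element());     // additional tagged output
        }
      }
    }).withOutputTags(hashtagTag, TupleTagList.of(plainTag)));

PCollection<String> hashtags = tagged.get(hashtagTag);
PCollection<String> plainWords = tagged.get(plainTag);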
SIDE INPUT – ENRICHMENT
▸ Additional data in ParDo
▸ Computed at runtime
21
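A sketch of enriching one PCollection with a side input computed from another; names are illustrative, imports omitted:

// Turn a (userId, name) PCollection into a map view, materialized at runtime.
final PCollectionView<Map<String, String>> namesView = userNames.apply(View.asMap());

PCollection<String> enriched = events.apply(
    ParDo.of(new DoFn<KV<String, String>, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Map<String, String> names = c.sideInput(namesView);
        String user = names.getOrDefault(c.element().getKey(), "unknown");
        c.output(user + ": " + c.element().getValue());
      }
    }).withSideInputs(namesView));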
IO
22
FILE       MESSAGING   DATABASE
HDFS       Kinesis     Cassandra
GCS        Kafka       HBase
S3         PubSub      Hive
Local      JMS         BigQuery
Avro       MQTT        BigTable
Text                   Datastore
TFRecord               Spanner
XML                    Mongo
Tika                   Redis
                       Solr
APACHE BEAM
TWEETS
Predictions
Tweets
READS
WRITES
#juzwiosna; #jug juz dzis; #java 10 GA released
23
APACHE BEAM
TWEETS
Predictions
Tweets
READS
WRITES
#juzwiosna; #jug juz dzis; #java 10 GA released
EXTRACT jug, java, juzwiosna
24
APACHE BEAM
TWEETS
Predictions
Tweets
READS
WRITES
#juzwiosna; #jug juz dzis; #java 10 GA released
EXTRACT jug, java, juzwiosna
COUNT jug -> 10k, java -> 4M, juzwiosna -> 100
25
APACHE BEAM
TWEETS
Predictions
Tweets
READS
WRITES
#juzwiosna; #jug juz dzis; #java 10 GA released
EXTRACT jug, java, juzwiosna
COUNT
EXPAND
jug -> 10k, java -> 4M, juzwiosna -> 100
{j -> [jug -> 10k, java -> 4M, juzwiosna -> 100],
 ju -> [jug -> 10k, juzwiosna -> 100]}
26
APACHE BEAM
TWEETS
Predictions
Tweets
READS
WRITES
#juzwiosna; #jug juz dzis; #java 10 GA released
{j->[java, jug, juzwiosna], ju->[jug, juzwiosna]}
EXTRACT jug, java, juzwiosna
COUNT
EXPAND
TOP(3)
jug -> 10k, java -> 4M, juzwiosna -> 100
{j -> [jug -> 10k, java -> 4M, juzwiosna -> 100],
 ju -> [jug -> 10k, juzwiosna -> 100]}
27
APACHE BEAM
TWEETS — BATCH
READS
WRITES
EXTRACT
COUNT
EXPAND
TOP(3)
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
p.begin()
    .apply(TextIO.read().from("..."))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.perElement())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(TextIO.write().to("..."));
p.run();
28
APACHE BEAM FRAMEWORK – STREAMING
▸ Windows
▸ One global window by default
▸ Applied for group, combine, or output transformations
▸ GroupByKey — data is grouped by both key and window
29
WINDOWS
FIXED TIME WINDOWS
30
WINDOWS
SLIDING WINDOWS
31
WINDOWS
SESSION WINDOWS
32
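Sketches of the three window types above; input is an assumed PCollection<String> whose elements carry event timestamps, imports omitted:

// Fixed: non-overlapping 5-minute windows.
input.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))));

// Sliding: 60-minute windows starting every 5 minutes (overlapping).
input.apply(Window.<String>into(
    SlidingWindows.of(Duration.standardMinutes(60)).every(Duration.standardMinutes(5))));

// Session: per-key windows closed after a 10-minute gap of inactivity.
input.apply(Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(10))));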
APACHE BEAM
TWEETS – BATCH
READS
WRITES
EXTRACT
COUNT
EXPAND
TOP(3)
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
p.begin()
    .apply(PubsubIO.readStrings().fromTopic("..."))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.perElement())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(PubsubIO.writeStrings().to("..."));
p.run();
33
APACHE BEAM
TWEETS – STREAMING
READS
WRITES
EXTRACT
COUNT
EXPAND
TOP(3)
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
p.begin()
    .apply(PubsubIO.readStrings().fromTopic("..."))
    .apply(Window.into(SlidingWindows.of(Duration.standardMinutes(60))))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.perElement())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(PubsubIO.writeStrings().to("..."));
p.run();
34
APACHE BEAM FRAMEWORK – STREAMING
▸ Watermarks
▸ Simply put: the lag between event time and processing time
▸ Beam keeps track of the watermark
▸ When the watermark passes the end of a window, data arriving for that window is considered late and, by default, discarded
▸ Allowed lateness can be configured to still accept it (sketch below)
35
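A sketch of allowing late data instead of discarding it; the durations are illustrative, imports omitted:

// Keep each window open for data arriving up to 2 hours behind the watermark.
input.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
    .withAllowedLateness(Duration.standardHours(2))
    .accumulatingFiredPanes());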
APACHE BEAM FRAMEWORK – STREAMING
▸ Triggers
▸ Change the default windowing behaviour
▸ Trade-off: completeness / latency / cost
▸ Based on event time / processing time / data (sketch below)
36
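A sketch of trading completeness, latency and cost with a trigger; the specific durations are illustrative, imports omitted:

input.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
    .triggering(AfterWatermark.pastEndOfWindow()
        // Latency: emit speculative results every minute of processing time.
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
        // Completeness: emit an update for every late element.
        .withLateFirings(AfterPane.elementCountAtLeast(1)))
    .withAllowedLateness(Duration.standardMinutes(30))
    .accumulatingFiredPanes());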
APACHE BEAM FRAMEWORK – STATEFUL PROCESSING
37
       (k1,w1)   (k2,w2)   (k3,w3)
"s1"   12        33        -5
"s2"   "kot"     "pies"    "okoń"
"s3"   0.03      0.12      0.33
"s4"   "ala"     "ma"      "kota"
CAPABILITY MATRIX
38
CAPABILITY MATRIX
39
APACHE BEAM – RUN
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner
40
APACHE BEAM – RUN
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" \
    -Pspark-runner
41
APACHE BEAM – RUN
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://bb/tmp --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://bb/counts" \
    -Pdataflow-runner
42
APACHE BEAM – USE CASES
▸ ETL
▸ Fraud detection
▸ Ads pricing (similar: Uber pricing)
▸ Sentiment analysis
43
LINKS
▸ Google Dataflow paper: https://research.google.com/pubs/pub43864.html
▸ Apache Beam: https://beam.apache.org/
44
Q&A
THANK YOU
Q&A
THANK YOU
We are hiring ;-)

https://goo.gl/zzqXLS

Apache Beam — a data engineer's ray of hope. Toruń JUG, 28.03.2018