In the microservices world, the Lambda Architecture has firmly taken root. Many companies build both streaming and batch processing. There are many frameworks on the market (if one can speak of a market in the open-source context), but each has traits that make work harder, especially on large projects. Some are meant for real-time processing, others do better with batch workloads. Some can be considered "rock-solid" only when run on Hadoop. The absence of these problems, however, is not Beam's main advantage. So what is? You'll find out at the talk! We'll cover topics such as the processing model, the use cases where Beam shines, and its runtime environments. You'll also see how to run Apache Beam jobs on Google Cloud Platform.
Confitura 2018 — Apache Beam — A Ray of Hope for the Data Engineer
1. APACHE BEAM – THE DATA ENGINEER’S HOPE
Robert Mroczkowski,
Piotr Wikieł
2. ABOUT US
▸ Data Platform Engineers at Allegro
▸ Maintaining probably one of the
largest Hadoop clusters in Poland
▸ We use public clouds for data
processing on a daily basis
▸ Both interested in ML
▸ Roots:
▸ Robert — sysop
▸ Piotr — dev
VegeTables
3. AGENDA
▸ ETL and Lambda Architecture
▸ Apache Beam framework foundations
▸ Transformations, windows, tags, etc.
▸ Batch and streaming
▸ Examples, use cases
6. LAMBDA ARCHITECTURE
▸ Complicated, huh?
▸ We have to build separate software for real-time and batch
computations
▸ … which have to be maintained, probably by different
teams
▸ Why not use one tool to rule them all?
9. APACHE BEAM
▸ Born in Google, and then open-sourced
▸ Designed especially for ETL pipelines
▸ Used for both streaming and batch processing
▸ Heavily parallel processing
▸ Exactly-once semantics
10. IN CODE
▸ Backends (Spark, Flink, Apex, Dataflow, Gearpump, Direct)
▸ Java (rich) and Python (pretty) SDKs, and a recently added Go SDK
▸ Experimental SQL on PCollections
▸ Open-source Scala API (GitHub: spotify/scio)
12. PCOLLECTION
▸ Elements may be of any type, but all of one type, and must be serializable
▸ Immutable
▸ Any size: bounded or unbounded
▸ Each element carries a timestamp
APACHE BEAM FRAMEWORK FOUNDATIONS
13. TRANSFORMATIONS
▸ ParDo — like map in MapReduce
▸ Filter elements of PCollection
▸ Format values in PCollection
▸ Cast types
▸ Computations on each single element
▸ collection.apply(ParDo.of(new SomeDoFn()))
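A minimal sketch of this ParDo pattern in the Java SDK; `ExtractLengthFn` and the variable names are illustrative (not from the talk), and the Beam 2.x SDK is assumed on the classpath:

```java
// Assumes a PCollection<String> named `words`; ExtractLengthFn is a hypothetical DoFn.
static class ExtractLengthFn extends DoFn<String, Integer> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // One computation per element — the same shape works for
    // filtering, formatting values, or casting types.
    c.output(c.element().length());
  }
}

PCollection<Integer> lengths = words.apply(ParDo.of(new ExtractLengthFn()));
```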
14. TRANSFORMATIONS
▸ GroupByKey
▸ groups values of k/v pairs with the same key
▸ like the Shuffle phase in MapReduce
▸ For streaming, windowing or triggers are necessary
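A sketch of GroupByKey in the Java SDK, assuming a Pipeline `p`; the keys and values are illustrative:

```java
// Bounded input of k/v pairs.
PCollection<KV<String, Integer>> scores = p.apply(
    Create.of(KV.of("user-a", 1), KV.of("user-a", 3), KV.of("user-b", 2)));

// Shuffle-like step: one output pair per distinct key,
// with all of that key's values collected into an Iterable.
PCollection<KV<String, Iterable<Integer>>> grouped =
    scores.apply(GroupByKey.<String, Integer>create());
```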
15. TRANSFORMATIONS
▸ CoGroupByKey
▸ joins values of k/v pairs with the same key across separate PCollections
▸ .apply(CoGroupByKey.create())
▸ For streaming, windowing or triggers are necessary
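A sketch of CoGroupByKey, assuming two hypothetical PCollections `clicks` and `purchases`, both keyed by user id:

```java
// TupleTags identify each input collection in the joined result.
final TupleTag<Integer> clicksTag = new TupleTag<Integer>() {};
final TupleTag<Integer> purchasesTag = new TupleTag<Integer>() {};

// Join: per key, one CoGbkResult holding the values from both inputs.
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(clicksTag, clicks)
        .and(purchasesTag, purchases)
        .apply(CoGroupByKey.<String>create());

// Downstream, each side is retrieved from the CoGbkResult by its tag:
// result.getAll(clicksTag), result.getAll(purchasesTag)
```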
16. TRANSFORMATIONS
▸ Combine
▸ Like Reduce in the MapReduce paradigm
▸ Combines elements for a specific key in k/v pairs, or an entire
PCollection
▸ Requires a commutative & associative function
▸ For streaming, accumulates elements per window
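Both Combine variants, sketched with Beam's built-in Sum (a commutative & associative function); `nums` and `scores` are assumed inputs:

```java
// Assumes PCollection<Integer> nums and PCollection<KV<String, Integer>> scores.

// Global combine: a single value for the entire PCollection.
PCollection<Integer> total = nums.apply(Sum.integersGlobally());

// Per-key combine: one value per key in the k/v pairs.
PCollection<KV<String, Integer>> perUser = scores.apply(Sum.integersPerKey());
```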
19. TRANSFORMATIONS – SPLITTABLE DOFN
▸ Splits processing of one element across many workers
▸ Possibly unbounded result of ParDo’ing one element
▸ Examples:
▸ tail -f logs-directory
▸ running jobs outside of Beam and processing their results within it
▸ Currently supported in Dataflow and Flink runners
21. SIDE INPUT – ENRICHMENT
▸ Additional data in ParDo
▸ Computed at runtime
▸ words.apply(ParDo.of(...).withSideInputs(dataView));
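A fuller sketch of side-input enrichment; `countries` (a k/v collection of code → name) and `events` (country codes) are illustrative assumptions:

```java
// The view is computed at runtime from another PCollection.
final PCollectionView<Map<String, String>> countryNames =
    countries.apply(View.asMap());

PCollection<String> enriched = events.apply(
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // Additional data available inside the ParDo via the side input.
        Map<String, String> names = c.sideInput(countryNames);
        c.output(c.element() + ":" + names.getOrDefault(c.element(), "unknown"));
      }
    }).withSideInputs(countryNames));
```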
22. IO
FILE       MESSAGING   DATABASE
HDFS       Kinesis     Cassandra
GCS        Kafka       HBase
S3         PubSub      Hive
Local      JMS         BigQuery
Avro       MQTT        BigTable
Text                   DataStore
TFRecord               Spanner
XML                    Mongo
Tika                   Redis
ParquetIO              Solr
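A minimal example of the file connectors from the table, using TextIO; the Pipeline `p` and the GCS paths are placeholders:

```java
// Read lines from files on Google Cloud Storage...
PCollection<String> lines =
    p.apply(TextIO.read().from("gs://my-bucket/input/*.txt"));

// ...and write them back out as sharded text files.
lines.apply(TextIO.write().to("gs://my-bucket/output/part"));
```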
30. APACHE BEAM FRAMEWORK – STREAMING
▸ Windows
▸ One global window by default
▸ Applied for group, combine or output transformations
▸ GroupByKey — data is grouped by both key and window
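The window-plus-group behaviour above can be sketched as follows; `scores` is an assumed unbounded PCollection with event timestamps, and Duration comes from joda-time:

```java
// Replace the default single global window with 5-minute fixed windows.
PCollection<KV<String, Integer>> windowed = scores.apply(
    Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(5))));

// GroupByKey now groups by key AND window, so results are emitted per window.
PCollection<KV<String, Iterable<Integer>>> grouped =
    windowed.apply(GroupByKey.<String, Integer>create());
```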
36. APACHE BEAM FRAMEWORK – STREAMING
▸ The watermark is the system’s estimate of the lag between event
timestamps and processing time
▸ Beam keeps track of the watermark and uses it to fire aggregates
▸ when the watermark passes the end of a window, data arriving
later is considered late and is discarded
▸ but... you can allow for lateness
▸ Window.into(FixedWindows.of(..))
.withAllowedLateness(Duration.standardDays(2))
37. APACHE BEAM FRAMEWORK – STREAMING
▸ Triggers
▸ Change default windowing behaviour
▸ Completeness / Latency / Cost
▸ Event Time / Processing Time / Data
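A sketch combining all three trigger families on one windowed collection; `scores` and the durations are illustrative assumptions:

```java
scores.apply(
    Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(5)))
        .triggering(
            // Event time: the main firing when the watermark passes the window end.
            AfterWatermark.pastEndOfWindow()
                // Processing time: speculative early results 30 s after the first element.
                .withEarlyFirings(
                    AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardSeconds(30)))
                // Data-driven: re-fire for every late element that still arrives.
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(10))
        // Each firing includes previously emitted elements (completeness over cost).
        .accumulatingFiredPanes());
```

Choosing between `accumulatingFiredPanes()` and `discardingFiredPanes()` is exactly the completeness / latency / cost trade-off from the slide.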