In the microservices world, the Lambda Architecture is here to stay. Many companies build both streaming and batch processing. The market (insofar as one can speak of a market for open source) offers plenty of frameworks, yet each has traits that, especially in large projects, make work harder. Some are built for real-time processing, others perform better in batch workloads. Some can be considered "rock-solid" only when run on Hadoop. However, it is not the absence of these problems that is Beam's main advantage. What is, then? You'll find out at the talk! We will cover the processing model, the use cases where Beam shines, and its execution environments. You will also see how to run Apache Beam jobs on Google Cloud Platform.
Apache Beam — the data engineer's ray of hope, Toruń JUG, 28.03.2018
1. APACHE BEAM – THE DATA ENGINEER'S HOPE
Robert Mroczkowski,
Piotr Wikieł
Voyager 1 — "Pale Blue Dot". NASA, February 14, 1990
2. ABOUT US
▸ Data Platform Engineers at Allegro
▸ Maintaining probably one of the largest Hadoop clusters in Poland
▸ We use public clouds for data
processing on a daily basis
▸ Roots:
▸ Robert — sysop
▸ Piotr — dev
VegeTables
3. AGENDA
▸ ETL processes and Lambda Architecture
▸ Apache Beam framework foundations
▸ Transformations, windows, tags, etc.
▸ Batch and streaming
▸ Examples, use cases
4.
5. BUT IN OUR PREVIOUS DB DATA HAD
BEEN ARRIVING SECONDS (NOT
HOURS) AFTER IT WAS PRODUCED…
Jane Doe, Department of Analytics,
Company Ltd.
7. LAMBDA ARCHITECTURE
▸ Complicated, huh?
▸ We have to build separate software for real-time and batch
computations
▸ … which have to be maintained, probably by different
teams
▸ Why not use one tool to rule them all?
10. APACHE BEAM
A UNIFIED MODEL FOR EXECUTING BOTH BATCH AND STREAM DATA PROCESSING PIPELINES
[whip sound]
11. APACHE BEAM
▸ Born at Google, then open-sourced
▸ Designed especially for ETL pipelines
▸ Used for both streaming and batch processing
▸ Heavily parallel processing
▸ Exactly once semantics
12. IN CODE
▸ Backends (Spark, Flink, Apex, Dataflow, Gearpump, Direct)
▸ Java (rich) and Python (poor but pretty) SDKs
▸ Open-source Scala API (GitHub: spotify/scio)
13. APACHE BEAM FRAMEWORK
▸ Pipeline
▸ Input/Output
▸ PCollection — distributed data representation (Spark RDD-like)
▸ Transformation — a set of operations on data, usually a single operation
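The four building blocks above can be sketched as a minimal pipeline. This is our own illustration, assuming the Beam Java SDK and the DirectRunner are on the classpath; class and variable names are ours, not from the talk:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalPipeline {
    static PipelineResult.State run() {
        // Pipeline — the container for the whole job graph
        Pipeline p = Pipeline.create();

        // Input: an in-memory source produces a PCollection
        PCollection<String> words =
            p.apply(Create.of("beam", "pipeline", "pcollection"));

        // Transformation: map each word to its length
        PCollection<Integer> lengths = words.apply(
            MapElements.into(TypeDescriptors.integers())
                       .via((String word) -> word.length()));

        // Output would normally be a sink (files, BigQuery, Kafka, ...)
        return p.run().waitUntilFinish();
    }

    public static void main(String[] args) {
        run();
    }
}
```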
14. TRANSFORMATIONS
▸ ParDo — like map in MapReduce
▸ Filter elements of PCollection
▸ Format values in PCollection
▸ Cast types
▸ Computations on each single element
▸ collection.apply(ParDo.of(new SomeDoFn()))
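A runnable sketch of the ParDo uses listed above, combining filtering and formatting in one DoFn. Assumes the Beam Java SDK and DirectRunner; the class names are ours:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class ParDoExample {
    // Filter out short words and upper-case the remaining ones
    static class FilterAndFormatFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(@Element String word, OutputReceiver<String> out) {
            if (word.length() >= 4) {           // filtering
                out.output(word.toUpperCase()); // formatting
            }
        }
    }

    static PipelineResult.State run() {
        Pipeline p = Pipeline.create();
        PCollection<String> result = p
            .apply(Create.of("beam", "is", "nice"))
            .apply(ParDo.of(new FilterAndFormatFn()));
        // "is" is dropped, the others are upper-cased
        PAssert.that(result).containsInAnyOrder("BEAM", "NICE");
        return p.run().waitUntilFinish();
    }

    public static void main(String[] args) { run(); }
}
```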
15. TRANSFORMATIONS
▸ GroupByKey
▸ groups values of k/v pairs with the same key
▸ like the Shuffle phase in MapReduce
▸ For streaming, windowing or triggers are necessary
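GroupByKey in action, as a sketch of our own (Beam Java SDK + DirectRunner assumed). A follow-up DoFn sums each key's grouped values so the result is easy to check:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class GroupByKeyExample {
    // Sum the grouped values per key
    static class SumFn extends DoFn<KV<String, Iterable<Integer>>, KV<String, Integer>> {
        @ProcessElement
        public void processElement(@Element KV<String, Iterable<Integer>> grouped,
                                   OutputReceiver<KV<String, Integer>> out) {
            int sum = 0;
            for (int v : grouped.getValue()) {
                sum += v;
            }
            out.output(KV.of(grouped.getKey(), sum));
        }
    }

    static PipelineResult.State run() {
        Pipeline p = Pipeline.create();
        PCollection<KV<String, Integer>> sums = p
            .apply(Create.of(KV.of("a", 1), KV.of("b", 2), KV.of("a", 3)))
            // shuffle: all values for one key end up together
            .apply(GroupByKey.<String, Integer>create())
            .apply(ParDo.of(new SumFn()));
        PAssert.that(sums).containsInAnyOrder(KV.of("a", 4), KV.of("b", 2));
        return p.run().waitUntilFinish();
    }

    public static void main(String[] args) { run(); }
}
```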
16. TRANSFORMATIONS
▸ CoGroupByKey
▸ joins values of k/v pairs with the same key across separate PCollections
▸ .apply(CoGroupByKey.create())
▸ For streaming — windowing or triggers are necessary
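A sketch of joining two PCollections with CoGroupByKey (our own example data and names; Beam Java SDK + DirectRunner assumed). Each input gets a TupleTag so the joined result can be pulled apart per source:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

public class CoGroupByKeyExample {
    // Tags identify which input each joined value came from
    static final TupleTag<String> EMAILS = new TupleTag<String>() {};
    static final TupleTag<String> PHONES = new TupleTag<String>() {};

    // Format the joined values into one line per key
    static class FormatFn extends DoFn<KV<String, CoGbkResult>, String> {
        @ProcessElement
        public void processElement(@Element KV<String, CoGbkResult> joined,
                                   OutputReceiver<String> out) {
            out.output(joined.getKey()
                + ": " + joined.getValue().getOnly(EMAILS)
                + ", " + joined.getValue().getOnly(PHONES));
        }
    }

    static PipelineResult.State run() {
        Pipeline p = Pipeline.create();
        PCollection<KV<String, String>> emails =
            p.apply("Emails", Create.of(KV.of("alice", "alice@example.com")));
        PCollection<KV<String, String>> phones =
            p.apply("Phones", Create.of(KV.of("alice", "555-0100")));

        PCollection<String> joined = KeyedPCollectionTuple
            .of(EMAILS, emails)
            .and(PHONES, phones)
            .apply(CoGroupByKey.<String>create())
            .apply(ParDo.of(new FormatFn()));

        PAssert.that(joined).containsInAnyOrder("alice: alice@example.com, 555-0100");
        return p.run().waitUntilFinish();
    }

    public static void main(String[] args) { run(); }
}
```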
17. TRANSFORMATIONS
▸ Combine
▸ Like Reduce in the MapReduce paradigm
▸ Combines all elements in a PCollection
▸ Combines elements for a specific key in k/v pairs
▸ For streaming, accumulates elements per window
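Both Combine flavours in one sketch, using the SDK's built-in Sum combiners (Beam Java SDK + DirectRunner assumed; example data is ours):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class CombineExample {
    static PipelineResult.State run() {
        Pipeline p = Pipeline.create();

        // Combine ALL elements of a PCollection into a single value
        PCollection<Integer> total = p
            .apply("Numbers", Create.of(1, 2, 3, 4))
            .apply(Sum.integersGlobally());
        PAssert.that(total).containsInAnyOrder(10);

        // Combine elements PER KEY in k/v pairs
        PCollection<KV<String, Integer>> perKey = p
            .apply("Pairs", Create.of(KV.of("a", 1), KV.of("a", 2), KV.of("b", 5)))
            .apply(Sum.integersPerKey());
        PAssert.that(perKey).containsInAnyOrder(KV.of("a", 3), KV.of("b", 5));

        return p.run().waitUntilFinish();
    }

    public static void main(String[] args) { run(); }
}
```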
29. APACHE BEAM FRAMEWORK – STREAMING
▸ Windows
▸ One global window by default
▸ Applied for group, combine or output transformations
▸ GroupByKey — data is grouped by both key and window
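The "grouped by both key and window" point can be made concrete with fixed one-minute windows over timestamped events. A sketch of ours, assuming the Beam Java SDK and DirectRunner; note the same key yields one count per window:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class WindowingExample {
    static PipelineResult.State run() {
        Pipeline p = Pipeline.create();

        // Three events with explicit event timestamps: two land in the
        // first one-minute window, one in the second
        PCollection<KV<String, Long>> counts = p
            .apply(Create.timestamped(
                TimestampedValue.of(KV.of("clicks", 1), new Instant(0)),
                TimestampedValue.of(KV.of("clicks", 1), new Instant(30_000)),
                TimestampedValue.of(KV.of("clicks", 1), new Instant(90_000))))
            .apply(Window.<KV<String, Integer>>into(
                FixedWindows.of(Duration.standardMinutes(1))))
            // grouping transforms now group by key AND window
            .apply(Count.perKey());

        PAssert.that(counts).containsInAnyOrder(
            KV.of("clicks", 2L),   // window [00:00, 01:00)
            KV.of("clicks", 1L));  // window [01:00, 02:00)
        return p.run().waitUntilFinish();
    }

    public static void main(String[] args) { run(); }
}
```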
35. APACHE BEAM FRAMEWORK – STREAMING
▸ Watermarks
▸ Simply put: the lag between event timestamp and processing time
▸ Beam keeps track of the watermark
▸ When the watermark has passed the end of a window, data arriving for that window is considered late and is discarded by default
▸ Allowed lateness can be configured
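Allowing for lateness is a windowing-strategy setting. A minimal sketch of ours (Beam Java SDK + DirectRunner assumed) that keeps each window open for late data instead of dropping it:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class AllowedLatenessExample {
    static PipelineResult.State run() {
        Pipeline p = Pipeline.create();
        p.apply(Create.timestamped(
                TimestampedValue.of(KV.of("clicks", 1), new Instant(0))))
         // Accept data for a one-minute window for up to ten minutes
         // after the watermark passes the window's end; without
         // withAllowedLateness, such records would be discarded.
         .apply(Window.<KV<String, Integer>>into(
                    FixedWindows.of(Duration.standardMinutes(1)))
                .withAllowedLateness(Duration.standardMinutes(10)))
         .apply(Count.perKey());
        return p.run().waitUntilFinish();
    }

    public static void main(String[] args) { run(); }
}
```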
36. APACHE BEAM FRAMEWORK – STREAMING
▸ Triggers
▸ Change default windowing behaviour
▸ Completeness / Latency / Cost
▸ Event Time / Processing Time / Data
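The three trigger dimensions listed above map onto Beam's composite trigger API. A sketch of ours (Beam Java SDK + DirectRunner assumed; on this tiny bounded input the triggers mostly illustrate the wiring):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class TriggerExample {
    static PipelineResult.State run() {
        Pipeline p = Pipeline.create();
        p.apply(Create.timestamped(
                TimestampedValue.of(KV.of("clicks", 1), new Instant(0))))
         .apply(Window.<KV<String, Integer>>into(
                    FixedWindows.of(Duration.standardMinutes(1)))
                // Event time: the main firing when the watermark passes
                .triggering(AfterWatermark.pastEndOfWindow()
                    // Processing time: speculative early panes (latency vs cost)
                    .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardSeconds(30)))
                    // Data-driven: re-fire for each late element (completeness)
                    .withLateFirings(AfterPane.elementCountAtLeast(1)))
                .withAllowedLateness(Duration.standardMinutes(10))
                .accumulatingFiredPanes())
         .apply(Count.perKey());
        return p.run().waitUntilFinish();
    }

    public static void main(String[] args) { run(); }
}
```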