In the microservices world, the Lambda Architecture has firmly taken root. Many companies build both streaming and batch processing. There are many frameworks on the market (if one can speak of a market in the open-source context), but each has traits that make work harder, especially on large projects. Some are meant for real-time processing, others do better with batch workloads. Some can be considered "rock-solid" only when run on Hadoop. The absence of these problems, however, is not Beam's main advantage. So what is? You'll find out at the talk! We'll cover topics such as the processing model, the use cases where Beam shines, and its runtime environments. You'll also see how to run Apache Beam jobs on Google Cloud Platform.
Confitura 2018 — Apache Beam — A Ray of Hope for the Data Engineer
1. APACHE BEAM – THE DATA ENGINEER’S HOPE
Robert Mroczkowski,
Piotr Wikieł
2. ABOUT US
▸ Data Platform Engineers at Allegro
▸ Maintaining probably one of the
largest Hadoop clusters in Poland
▸ We use public clouds for data
processing on a daily basis
▸ Both interested in ML
▸ Roots:
▸ Robert — sysop
▸ Piotr — dev
VegeTables
3. AGENDA
▸ ETL and Lambda Architecture
▸ Apache Beam framework foundations
▸ Transformations, windows, tags, etc.
▸ Batch and streaming
▸ Examples, use cases
6. LAMBDA ARCHITECTURE
▸ Complicated, huh?
▸ We have to build separate software for real-time and batch
computations
▸ … which have to be maintained, probably by different
teams
▸ Why not use one tool to rule them all?
9. APACHE BEAM
▸ Born in Google, and then open-sourced
▸ Designed especially for ETL pipelines
▸ Used for both streaming and batch processing
▸ Heavily parallel processing
▸ Exactly-once semantics
10. IN CODE
▸ Backends (Spark, Flink, Apex, Dataflow, Gearpump, Direct)
▸ Java (rich) and Python (pretty) SDKs, and a recently added Go SDK
▸ Experimental SQL on PCollections
▸ Open-source Scala API (GitHub: spotify/scio)
12. PCOLLECTION
▸ Elements may be of any type, but all of one type, and must be serializable
▸ Immutable
▸ Any size: bounded or unbounded
▸ Each element carries a timestamp
APACHE BEAM FRAMEWORK FOUNDATIONS
13. TRANSFORMATIONS
▸ ParDo — like map in MapReduce
▸ Filter elements of PCollection
▸ Format values in PCollection
▸ Cast types
▸ Computations on each single element
▸ collection.apply(ParDo.of(new SomeDoFn()))
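A minimal sketch of this ParDo pattern in the Java SDK; `ExtractLengthFn` and the variable names are illustrative (not from the talk), and the Beam 2.x SDK is assumed on the classpath:

```java
// Assumes a PCollection<String> named `words`; ExtractLengthFn is a hypothetical DoFn.
static class ExtractLengthFn extends DoFn<String, Integer> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // One computation per element — the same shape works for
    // filtering, formatting values, or casting types.
    c.output(c.element().length());
  }
}

PCollection<Integer> lengths = words.apply(ParDo.of(new ExtractLengthFn()));
```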
14. TRANSFORMATIONS
▸ GroupByKey
▸ groups values of k/v pairs with the same key
▸ like the Shuffle phase in MapReduce
▸ For streaming, windowing or triggers are necessary
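A sketch of GroupByKey in the Java SDK, assuming a Pipeline `p`; the keys and values are illustrative:

```java
// Bounded input of k/v pairs.
PCollection<KV<String, Integer>> scores = p.apply(
    Create.of(KV.of("user-a", 1), KV.of("user-a", 3), KV.of("user-b", 2)));

// Shuffle-like step: one output pair per distinct key,
// with all of that key's values collected into an Iterable.
PCollection<KV<String, Iterable<Integer>>> grouped =
    scores.apply(GroupByKey.<String, Integer>create());
```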
15. TRANSFORMATIONS
▸ CoGroupByKey
▸ joins values of k/v pairs with the same key across separate PCollections
▸ .apply(CoGroupByKey.create())
▸ For streaming, windowing or triggers are necessary
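A sketch of CoGroupByKey, assuming two hypothetical PCollections `clicks` and `purchases`, both keyed by user id:

```java
// TupleTags identify each input collection in the joined result.
final TupleTag<Integer> clicksTag = new TupleTag<Integer>() {};
final TupleTag<Integer> purchasesTag = new TupleTag<Integer>() {};

// Join: per key, one CoGbkResult holding the values from both inputs.
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(clicksTag, clicks)
        .and(purchasesTag, purchases)
        .apply(CoGroupByKey.<String>create());

// Downstream, each side is retrieved from the CoGbkResult by its tag:
// result.getAll(clicksTag), result.getAll(purchasesTag)
```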
16. TRANSFORMATIONS
▸ Combine
▸ Like Reduce in the MapReduce paradigm
▸ Combines elements for a specific key in k/v pairs, or an entire
PCollection
▸ Requires a commutative & associative function
▸ For streaming, accumulates elements per window
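Both Combine variants, sketched with Beam's built-in Sum (a commutative & associative function); `nums` and `scores` are assumed inputs:

```java
// Assumes PCollection<Integer> nums and PCollection<KV<String, Integer>> scores.

// Global combine: a single value for the entire PCollection.
PCollection<Integer> total = nums.apply(Sum.integersGlobally());

// Per-key combine: one value per key in the k/v pairs.
PCollection<KV<String, Integer>> perUser = scores.apply(Sum.integersPerKey());
```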
19. TRANSFORMATIONS – SPLITTABLE DOFN
▸ Splits processing of one element across many workers
▸ Possibly unbounded result of ParDo’ing one element
▸ Examples:
▸ tail -f logs-directory
▸ running jobs outside of Beam and processing their results within it
▸ Currently supported in Dataflow and Flink runners
21. SIDE INPUT – ENRICHMENT
▸ Additional data in ParDo
▸ Computed at runtime
▸ words.apply(ParDo.of(...).withSideInputs(dataView));
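A fuller sketch of side-input enrichment; `countries` (a k/v collection of code → name) and `events` (country codes) are illustrative assumptions:

```java
// The view is computed at runtime from another PCollection.
final PCollectionView<Map<String, String>> countryNames =
    countries.apply(View.asMap());

PCollection<String> enriched = events.apply(
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // Additional data available inside the ParDo via the side input.
        Map<String, String> names = c.sideInput(countryNames);
        c.output(c.element() + ":" + names.getOrDefault(c.element(), "unknown"));
      }
    }).withSideInputs(countryNames));
```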
22. IO
FILE       MESSAGING   DATABASE
HDFS       Kinesis     Cassandra
GCS        Kafka       HBase
S3         PubSub      Hive
Local      JMS         BigQuery
Avro       MQTT        BigTable
Text                   DataStore
TFRecord               Spanner
XML                    Mongo
Tika                   Redis
ParquetIO              Solr
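A minimal example of the file connectors from the table, using TextIO; the Pipeline `p` and the GCS paths are placeholders:

```java
// Read lines from files on Google Cloud Storage...
PCollection<String> lines =
    p.apply(TextIO.read().from("gs://my-bucket/input/*.txt"));

// ...and write them back out as sharded text files.
lines.apply(TextIO.write().to("gs://my-bucket/output/part"));
```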
30. APACHE BEAM FRAMEWORK – STREAMING
▸ Windows
▸ One global window by default
▸ Applied for group, combine or output transformations
▸ GroupByKey — data is grouped by both key and window
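The window-plus-group behaviour above can be sketched as follows; `scores` is an assumed unbounded PCollection with event timestamps, and Duration comes from joda-time:

```java
// Replace the default single global window with 5-minute fixed windows.
PCollection<KV<String, Integer>> windowed = scores.apply(
    Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(5))));

// GroupByKey now groups by key AND window, so results are emitted per window.
PCollection<KV<String, Iterable<Integer>>> grouped =
    windowed.apply(GroupByKey.<String, Integer>create());
```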
36. APACHE BEAM FRAMEWORK – STREAMING
▸ The watermark is the system’s estimate of the lag between event
timestamps and processing time
▸ Beam keeps track of the watermark and uses it to fire aggregates
▸ when the watermark passes the end of a window, data arriving
later is considered late and is discarded
▸ but... you can allow for lateness
▸ Window.into(FixedWindows.of(..))
.withAllowedLateness(Duration.standardDays(2))
37. APACHE BEAM FRAMEWORK – STREAMING
▸ Triggers
▸ Change default windowing behaviour
▸ Completeness / Latency / Cost
▸ Event Time / Processing Time / Data
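A sketch combining all three trigger families on one windowed collection; `scores` and the durations are illustrative assumptions:

```java
scores.apply(
    Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(5)))
        .triggering(
            // Event time: the main firing when the watermark passes the window end.
            AfterWatermark.pastEndOfWindow()
                // Processing time: speculative early results 30 s after the first element.
                .withEarlyFirings(
                    AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardSeconds(30)))
                // Data-driven: re-fire for every late element that still arrives.
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(10))
        // Each firing includes previously emitted elements (completeness over cost).
        .accumulatingFiredPanes());
```

Choosing between `accumulatingFiredPanes()` and `discardingFiredPanes()` is exactly the completeness / latency / cost trade-off from the slide.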