APACHE BEAM – THE DATA ENGINEER’S HOPE
Robert Mroczkowski,
Piotr Wikieł
Voyager 1, "Pale Blue Dot". NASA, 14 February 1990
ABOUT US
▸ Data Platform Engineers at Allegro
▸ Maintaining probably one of the
largest Hadoop clusters in Poland
▸ We use public clouds for data
processing on a daily basis
▸ Roots:
▸ Robert — sysop
▸ Piotr — dev
VegeTables
AGENDA
▸ ETL processes and Lambda Architecture
▸ Apache Beam framework foundations
▸ Transformations, windows, tags, etc.
▸ Batch and streaming
▸ Examples, use cases
BUT IN OUR PREVIOUS DB DATA HAD
BEEN ARRIVING SECONDS (NOT
HOURS) AFTER IT WAS PRODUCED…
Jane Doe, Department of Analytics,
Company Ltd.
LAMBDA ARCHITECTURE
▸ Kafka — source
▸ Hadoop — batch
▸ Flink — speed
▸ Druid — serving
LAMBDA ARCHITECTURE
▸ Complicated, huh?
▸ We have to build separate software for real-time and batch
computations
▸ … which has to be maintained, probably by different
teams
▸ Why not use one tool to rule them all?
APACHE BEAM
UNIFIED MODEL FOR EXECUTING BOTH BATCH
AND STREAM DATA PROCESSING PIPELINES
[whip sound]
APACHE BEAM
▸ Born at Google, then open-sourced
▸ Designed especially for ETL pipelines
▸ Used for both streaming and batch processing
▸ Heavily parallel processing
▸ Exactly-once semantics
IN CODE
▸ Backends (Spark, Flink, Apex, Dataflow, Gearpump, Direct)
▸ Java (rich) and Python (poor but pretty) SDKs
▸ Open-source Scala API (GitHub: spotify/scio)
APACHE BEAM FRAMEWORK
▸ Pipeline
▸ Input/Output
▸ PCollection: distributed data representation (like a Spark RDD)
▸ Transformation: a set of operations on data, usually a single operation
APACHE BEAM FRAMEWORK FOUNDATIONS
TRANSFORMATIONS
▸ ParDo: like map in MapReduce
▸ Filter elements of a PCollection
▸ Format values in a PCollection
▸ Cast types
▸ Computations on each single element
▸ collection.apply(ParDo.of(new SomeDoFn()))
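The per-element behaviour of ParDo listed above can be sketched in plain Java. This is only a model of the semantics, not the Beam API: the class `ParDoSketch` and its `parDo` helper are hypothetical names introduced for illustration. The point it shows is that one function handles each element and may emit zero, one, or many outputs, which covers filtering, formatting, and casting alike.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

// Plain-Java model of ParDo semantics; this is NOT the Beam API.
class ParDoSketch {

    // Applies fn to every element; fn may emit zero, one, or many outputs.
    static <I, O> List<O> parDo(List<I> input, BiConsumer<I, Consumer<O>> fn) {
        List<O> output = new ArrayList<>();
        for (I element : input) {
            fn.accept(element, output::add);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("beam", "", "flink");

        // Filter and format in one per-element function:
        // drop empty strings, upper-case the rest.
        List<String> result = ParDoSketch.<String, String>parDo(words, (word, out) -> {
            if (!word.isEmpty()) {
                out.accept(word.toUpperCase());
            }
        });

        System.out.println(result); // [BEAM, FLINK]
    }
}
```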
TRANSFORMATIONS
▸ GroupByKey
▸ Groups values of k/v pairs that share the same key
▸ Like the shuffle phase in MapReduce
▸ For streaming: windowing or triggers are necessary
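As a rough illustration of the grouping described above, here is a plain-Java model (not the Beam API). `GroupByKeySketch`, `kv`, and `groupByKey` are hypothetical names; the sketch ignores windows and simply collects every value under its key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of GroupByKey semantics; this is NOT the Beam API.
class GroupByKeySketch {

    // Hypothetical helper that builds one key/value pair.
    static <K, V> Map.Entry<K, V> kv(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Collects all values that share a key, like the shuffle phase in MapReduce.
    static <K extends Comparable<K>, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new TreeMap<>(); // TreeMap only for a stable print order
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs =
                Arrays.asList(kv("java", 1), kv("jug", 2), kv("java", 3));
        System.out.println(groupByKey(pairs)); // {java=[1, 3], jug=[2]}
    }
}
```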
TRANSFORMATIONS
▸ CoGroupByKey
▸ Joins values of k/v pairs with the same key across separate
PCollections
▸ .apply(CoGroupByKey.create())
▸ For streaming: windowing or triggers are necessary
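The join behaviour above can be modelled in plain Java as follows. This is a sketch of the semantics only, not the Beam API; `CoGroupByKeySketch`, `Joined`, and the sample data are hypothetical. For each key it keeps the values from both inputs side by side, and a key present in only one input gets an empty list for the other.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of CoGroupByKey semantics; this is NOT the Beam API.
class CoGroupByKeySketch {

    // Values found for one key in each of the two input collections.
    static final class Joined<A, B> {
        final List<A> first = new ArrayList<>();
        final List<B> second = new ArrayList<>();

        @Override
        public String toString() {
            return "(" + first + ", " + second + ")";
        }
    }

    // Hypothetical helper that builds one key/value pair.
    static <K, V> Map.Entry<K, V> kv(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Groups both inputs by key; missing sides stay empty.
    static <K extends Comparable<K>, A, B> Map<K, Joined<A, B>> coGroupByKey(
            List<Map.Entry<K, A>> first, List<Map.Entry<K, B>> second) {
        Map<K, Joined<A, B>> joined = new TreeMap<>();
        for (Map.Entry<K, A> pair : first) {
            joined.computeIfAbsent(pair.getKey(), k -> new Joined<A, B>()).first.add(pair.getValue());
        }
        for (Map.Entry<K, B> pair : second) {
            joined.computeIfAbsent(pair.getKey(), k -> new Joined<A, B>()).second.add(pair.getValue());
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> counts = Arrays.asList(kv("java", 4L), kv("jug", 10L));
        List<Map.Entry<String, String>> authors = Arrays.asList(kv("jug", "ala"), kv("jug", "ola"));
        System.out.println(coGroupByKey(counts, authors));
        // {java=([4], []), jug=([10], [ala, ola])}
    }
}
```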
TRANSFORMATIONS
▸ Combine
▸ Like reduce in the MapReduce paradigm
▸ Combines all elements in a PCollection
▸ Combines elements for a specific key in k/v pairs
▸ For streaming: accumulates elements per window
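Both flavours of Combine above (whole collection and per key) can be sketched in plain Java. This is not the Beam API; `CombineSketch` and its methods are hypothetical names. The combining function is assumed to be associative and commutative, since the engine is free to combine partial results in any order.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Plain-Java model of Combine semantics; this is NOT the Beam API.
class CombineSketch {

    // Hypothetical helper that builds one key/value pair.
    static <K, V> Map.Entry<K, V> kv(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Combines all elements into one value; empty input yields an empty Optional.
    static <V> Optional<V> combineGlobally(List<V> values, BinaryOperator<V> combiner) {
        return values.stream().reduce(combiner);
    }

    // Combines the values of each key separately.
    static <K extends Comparable<K>, V> Map<K, V> combinePerKey(
            List<Map.Entry<K, V>> pairs, BinaryOperator<V> combiner) {
        Map<K, V> combined = new TreeMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            combined.merge(pair.getKey(), pair.getValue(), combiner);
        }
        return combined;
    }

    public static void main(String[] args) {
        System.out.println(combineGlobally(Arrays.asList(1, 2, 3), Integer::sum)); // Optional[6]

        List<Map.Entry<String, Integer>> pairs =
                Arrays.asList(kv("java", 1), kv("jug", 2), kv("java", 3));
        System.out.println(combinePerKey(pairs, Integer::sum)); // {java=4, jug=2}
    }
}
```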
TRANSFORMATIONS
▸ Flatten
▸ Merge several PCollections
▸ Partition
▸ Split PCollection
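Flatten and Partition are inverses of a sort, which a small plain-Java model can make concrete. This is not the Beam API; `FlattenPartitionSketch` and the partitioning function are hypothetical. The partition function is assumed to return an index between 0 and `numPartitions - 1`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

// Plain-Java model of Flatten and Partition; this is NOT the Beam API.
class FlattenPartitionSketch {

    // Flatten: merge several collections of the same type into one.
    @SafeVarargs
    static <T> List<T> flatten(List<T>... collections) {
        List<T> merged = new ArrayList<>();
        for (List<T> collection : collections) {
            merged.addAll(collection);
        }
        return merged;
    }

    // Partition: split one collection into a fixed number of collections.
    // partitionFn must return an index in [0, numPartitions).
    static <T> List<List<T>> partition(List<T> input, int numPartitions, ToIntFunction<T> partitionFn) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (T element : input) {
            parts.get(partitionFn.applyAsInt(element)).add(element);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> merged = flatten(Arrays.asList(1, 2), Arrays.asList(3, 4));
        System.out.println(merged); // [1, 2, 3, 4]

        // Even numbers go to partition 0, odd numbers to partition 1.
        System.out.println(partition(merged, 2, n -> n % 2)); // [[2, 4], [1, 3]]
    }
}
```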
TRANSFORMATIONS
▸ Several predefined transformations:
▸ Filter.by
▸ Count
▸ Custom transformations should be:
▸ Serializable
▸ Thread-compatible
▸ Idempotent
TAGGED OUTPUT
SIDE INPUT – ENRICHMENT
▸ Additional data in ParDo
▸ Computed at runtime
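Enrichment with a side input can be pictured as a per-element lookup in extra data that only exists at runtime. The plain-Java sketch below models that idea and is not the Beam API; `SideInputSketch`, the category map, and the "unknown" fallback are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of enriching elements with a side input; this is NOT the Beam API.
class SideInputSketch {

    // Appends to each tag the category found in the side input, or "unknown" if absent.
    static List<String> enrich(List<String> tags, Map<String, String> categories) {
        List<String> enriched = new ArrayList<>();
        for (String tag : tags) {
            enriched.add(tag + ":" + categories.getOrDefault(tag, "unknown"));
        }
        return enriched;
    }

    public static void main(String[] args) {
        // The side input is built at runtime, for example from another collection.
        Map<String, String> categories = new HashMap<>();
        categories.put("java", "language");
        categories.put("jug", "community");

        System.out.println(enrich(Arrays.asList("java", "jug", "juzwiosna"), categories));
        // [java:language, jug:community, juzwiosna:unknown]
    }
}
```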
IO
FILE       MESSAGING   DATABASE
HDFS       Kinesis     Cassandra
GCS        Kafka       HBase
S3         PubSub      Hive
Local      JMS         BigQuery
Avro       MQTT        BigTable
Text                   DataStore
TFRecord               Spanner
XML                    Mongo
Tika                   Redis
                       Solr
APACHE BEAM
TWEETS
Tweets → READS → EXTRACT → COUNT → EXPAND → TOP(3) → WRITES → Predictions
▸ Input: #juzwiosna; #jug juz dzis; #java 10 GA released
▸ EXTRACT: jug, java, juzwiosna
▸ COUNT: jug -> 10k, java -> 4M, juzwiosna -> 100
▸ EXPAND: {j -> [jug -> 10k, java -> 4M, juzwiosna -> 100], ju -> [jug -> 10k, juzwiosna -> 100]}
▸ TOP(3): {j -> [java, jug, juzwiosna], ju -> [jug, juzwiosna]}
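The four steps of the tweets example (EXTRACT, COUNT, EXPAND, TOP) can be traced end to end in plain Java. This is a single-machine sketch of the logic, not the Beam pipeline itself; `TrendingTagsSketch`, the hashtag regex, and the choice to emit every prefix up to the full tag are assumptions made for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java walk-through of EXTRACT, COUNT, EXPAND, TOP; this is NOT the Beam API.
class TrendingTagsSketch {

    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    // EXTRACT: pull hashtags (without '#') out of one tweet.
    static List<String> extractTags(String tweet) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = HASHTAG.matcher(tweet);
        while (matcher.find()) {
            tags.add(matcher.group(1));
        }
        return tags;
    }

    // COUNT: number of occurrences per tag.
    static Map<String, Long> count(List<String> tags) {
        Map<String, Long> counts = new TreeMap<>();
        for (String tag : tags) {
            counts.merge(tag, 1L, Long::sum);
        }
        return counts;
    }

    // EXPAND: file each (tag, count) pair under every prefix of the tag.
    static Map<String, List<Map.Entry<String, Long>>> expandPrefixes(Map<String, Long> counts) {
        Map<String, List<Map.Entry<String, Long>>> byPrefix = new TreeMap<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            String tag = entry.getKey();
            for (int len = 1; len <= tag.length(); len++) {
                byPrefix.computeIfAbsent(tag.substring(0, len), p -> new ArrayList<>())
                        .add(new AbstractMap.SimpleEntry<>(tag, entry.getValue()));
            }
        }
        return byPrefix;
    }

    // TOP(n): keep the n most frequent tags per prefix, most frequent first.
    static Map<String, List<String>> top(Map<String, List<Map.Entry<String, Long>>> byPrefix, int n) {
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<Map.Entry<String, Long>>> entry : byPrefix.entrySet()) {
            List<Map.Entry<String, Long>> sorted = new ArrayList<>(entry.getValue());
            sorted.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
            List<String> best = new ArrayList<>();
            for (Map.Entry<String, Long> tag : sorted.subList(0, Math.min(n, sorted.size()))) {
                best.add(tag.getKey());
            }
            result.put(entry.getKey(), best);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tags = new ArrayList<>();
        for (String tweet : Arrays.asList("#juzwiosna", "#jug juz dzis", "#java 10 GA released", "#java")) {
            tags.addAll(extractTags(tweet));
        }
        Map<String, List<String>> predictions = top(expandPrefixes(count(tags)), 3);
        System.out.println(predictions.get("j"));  // [java, jug, juzwiosna]
        System.out.println(predictions.get("ju")); // [jug, juzwiosna]
    }
}
```

Ties are kept in alphabetical order here only because the sketch uses sorted maps and a stable sort; a real pipeline would need an explicit tie-breaking rule.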
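APACHE BEAM
TWEETS – BATCH
READS → EXTRACT → COUNT → EXPAND → TOP(3) → WRITES
// Slide-level sketch: ExtractTags and ExpandPrefixes are the speakers' DoFns (not shown),
// and a step formatting the results as strings before the write is omitted.
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
p.apply(TextIO.read().from("..."))          // READS
 .apply(ParDo.of(new ExtractTags()))        // EXTRACT
 .apply(Count.perElement())                 // COUNT
 .apply(ParDo.of(new ExpandPrefixes()))     // EXPAND
 .apply(Top.largestPerKey(3))               // TOP(3)
 .apply(TextIO.write().to("..."));          // WRITES
p.run();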
APACHE BEAM FRAMEWORK – STREAMING
▸ Windows
▸ One global window by default
▸ Applied for group, combine or output transformations
▸ GroupByKey: data is grouped by both key and window
WINDOWS
FIXED TIME WINDOWS
WINDOWS
SLIDING WINDOWS
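How an element's timestamp maps to fixed and sliding windows comes down to simple arithmetic, sketched here in plain Java. This is not the Beam API; `WindowAssignmentSketch` is a hypothetical name, windows are assumed to be aligned to the epoch, and each window is the half-open interval [start, start + size).

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of fixed and sliding window assignment; this is NOT the Beam API.
class WindowAssignmentSketch {

    // Start of the single fixed window that contains the timestamp.
    static long fixedWindowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    // Starts of all sliding windows that contain the timestamp, latest first.
    // A new window starts every periodMs and lasts sizeMs, so windows overlap.
    static List<Long> slidingWindowStarts(long timestampMs, long sizeMs, long periodMs) {
        List<Long> starts = new ArrayList<>();
        long latestStart = timestampMs - Math.floorMod(timestampMs, periodMs);
        for (long start = latestStart; start > timestampMs - sizeMs; start -= periodMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long timestampMs = 125_000L; // an event 125 seconds after the epoch

        // One-minute fixed windows: the event belongs to [120000, 180000).
        System.out.println(fixedWindowStart(timestampMs, 60_000L)); // 120000

        // One-minute windows starting every 30 seconds: the event belongs to two of them.
        System.out.println(slidingWindowStarts(timestampMs, 60_000L, 30_000L)); // [120000, 90000]
    }
}
```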
WINDOWS
SESSION WINDOWS
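Session windows have no fixed boundaries: they grow while events for a key keep arriving and close after a gap of inactivity. A plain-Java sketch for the events of a single key follows; it is not the Beam API, and `SessionWindowSketch` plus the sample timestamps are hypothetical. Each event is assumed to open a window [t, t + gap), and overlapping windows are merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java model of session windows for ONE key; this is NOT the Beam API.
class SessionWindowSketch {

    // Returns sessions as {start, end} pairs, where end is exclusive.
    static List<long[]> sessions(List<Long> timestampsMs, long gapMs) {
        List<Long> sorted = new ArrayList<>(timestampsMs);
        Collections.sort(sorted);

        List<long[]> sessions = new ArrayList<>();
        for (long timestamp : sorted) {
            long[] last = sessions.isEmpty() ? null : sessions.get(sessions.size() - 1);
            if (last != null && timestamp < last[1]) {
                last[1] = timestamp + gapMs; // falls into the open session: extend it
            } else {
                sessions.add(new long[] {timestamp, timestamp + gapMs}); // gap exceeded: new session
            }
        }
        return sessions;
    }

    public static void main(String[] args) {
        for (long[] session : sessions(Arrays.asList(0L, 5L, 8L, 30L, 32L), 10L)) {
            System.out.println("[" + session[0] + ", " + session[1] + ")");
        }
        // [0, 18)
        // [30, 42)
    }
}
```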
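APACHE BEAM
TWEETS – BATCH
READS → EXTRACT → COUNT → EXPAND → TOP(3) → WRITES
// Same pipeline with only the IO swapped for Pub/Sub. With an unbounded source,
// Count still needs windowing or triggers (see GroupByKey).
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
p.apply(PubsubIO.readStrings().fromTopic("..."))   // READS
 .apply(ParDo.of(new ExtractTags()))               // EXTRACT
 .apply(Count.perElement())                        // COUNT
 .apply(ParDo.of(new ExpandPrefixes()))            // EXPAND
 .apply(Top.largestPerKey(3))                      // TOP(3)
 .apply(PubsubIO.writeStrings().to("..."));        // WRITES
p.run();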
APACHE BEAM
TWEETS – STREAMING
READS → EXTRACT → COUNT → EXPAND → TOP(3) → WRITES
// The only change from the previous version is the Window.into step.
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
p.apply(PubsubIO.readStrings().fromTopic("..."))   // READS
 .apply(Window.<String>into(SlidingWindows.of(
     Duration.standardMinutes(60))))               // 60-minute sliding windows
 .apply(ParDo.of(new ExtractTags()))               // EXTRACT
 .apply(Count.perElement())                        // COUNT
 .apply(ParDo.of(new ExpandPrefixes()))            // EXPAND
 .apply(Top.largestPerKey(3))                      // TOP(3)
 .apply(PubsubIO.writeStrings().to("..."));        // WRITES
p.run();
APACHE BEAM FRAMEWORK – STREAMING
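▸ Watermarks
▸ Put simply: they track the lag between event time and processing time
▸ Beam keeps track of the watermark
▸ Once the watermark passes the end of a window, data arriving later for that window is considered late and is discarded by default
▸ Allowed lateness can be configured to accept late data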
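The late-data rule above can be written as a small decision function. This plain-Java sketch is not the Beam API, and the exact boundary conditions in a real runner may differ; `LatenessSketch`, the `Fate` enum, and the sample numbers are hypothetical.

```java
// Plain-Java model of how the watermark decides the fate of an arriving element.
// This is NOT the Beam API.
class LatenessSketch {

    enum Fate { ON_TIME, LATE_BUT_ACCEPTED, DROPPED }

    // windowEndMs: end of the window the element belongs to (by event time)
    // watermarkMs: current watermark when the element arrives
    // allowedLatenessMs: how long after the window end late data is still accepted
    static Fate classify(long windowEndMs, long watermarkMs, long allowedLatenessMs) {
        if (watermarkMs < windowEndMs) {
            return Fate.ON_TIME; // the window is still open
        }
        if (watermarkMs < windowEndMs + allowedLatenessMs) {
            return Fate.LATE_BUT_ACCEPTED; // late, but within the allowed lateness
        }
        return Fate.DROPPED; // too late: discarded
    }

    public static void main(String[] args) {
        System.out.println(classify(60_000L, 55_000L, 0L));      // ON_TIME
        System.out.println(classify(60_000L, 65_000L, 10_000L)); // LATE_BUT_ACCEPTED
        System.out.println(classify(60_000L, 65_000L, 0L));      // DROPPED
    }
}
```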
APACHE BEAM FRAMEWORK – STREAMING
▸ Triggers
▸ Change the default windowing behaviour
▸ Trade-off: completeness / latency / cost
▸ Types: event time / processing time / data-driven
APACHE BEAM FRAMEWORK – STATEFUL PROCESSING
        (k1,w1)   (k2,w2)   (k3,w3)
"s1"    12        33        -5
"s2"    "cat"     "dog"     "perch"
"s3"    0.03      0.12      0.33
"s4"    "Ala"     "has"     "a cat"
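The table above reads as: every (key, window) pair owns its own set of state cells. A minimal plain-Java model with one counter cell shows the idea. It is not the Beam state API; `StatefulSketch`, the string-based composite key, and the counter itself are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of per-(key, window) state; this is NOT the Beam API.
class StatefulSketch {

    // One counter cell per (key, window) pair.
    private final Map<String, Integer> counters = new HashMap<>();

    // Processes one element: reads the cell for (key, window), increments it,
    // and returns the new value. Other (key, window) pairs are not affected.
    int process(String key, String window) {
        return counters.merge(key + "|" + window, 1, Integer::sum);
    }

    public static void main(String[] args) {
        StatefulSketch state = new StatefulSketch();
        System.out.println(state.process("k1", "w1")); // 1
        System.out.println(state.process("k1", "w1")); // 2
        System.out.println(state.process("k1", "w2")); // 1 (same key, different window)
    }
}
```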
APACHE BEAM – BACKENDS
CAPABILITY MATRIX
APACHE BEAM – RUN
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner
APACHE BEAM – RUN
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" \
  -Pspark-runner
APACHE BEAM – RUN
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://bb/tmp \
               --inputFile=gs://apache-beam-samples/shakespeare/* \
               --output=gs://bb/counts" \
  -Pdataflow-runner
APACHE BEAM – USE CASES
▸ ETL
▸ Fraud detection
▸ Ads pricing (similar: Uber pricing)
▸ Sentiment analysis
LINKS
▸ Google Dataflow paper: https://research.google.com/pubs/pub43864.html
▸ Apache Beam: https://beam.apache.org/
Q&A
THANK YOU
We are hiring ;-)

https://goo.gl/zzqXLS

Apache Beam: the data engineer's ray of hope (Toruń JUG, 28.03.2018)
