APACHE BEAM – THE DATA ENGINEER’S HOPE
Robert Mroczkowski, Piotr Wikieł
ABOUT US
▸ Data Platform Engineers at Allegro
▸ Maintaining probably one of the largest Hadoop clusters in Poland
▸ We use public clouds for data processing on a daily basis
▸ Both interested in ML
▸ Roots:
▸ Robert — sysop
▸ Piotr — dev
VegeTables
AGENDA
▸ ETL and Lambda Architecture
▸ Apache Beam framework foundations
▸ Transformations, windows, tags, etc.
▸ Batch and streaming
▸ Examples, use cases
LAMBDA ARCHITECTURE
[Diagram: DATA (Kafka) flows into the BATCH layer (Spark) and the SPEED layer (Spark/Flink); the SERVING layer (Druid) answers QUERY requests from an Analyst and a Microservice]
LAMBDA ARCHITECTURE
▸ Complicated, huh?
▸ We have to build separate software for real-time and batch
computations
▸ … which have to be maintained, probably by different
teams
▸ Why not use one tool to rule them all?
APACHE BEAM
UNIFIED MODEL FOR EXECUTING BOTH BATCH AND STREAM DATA PROCESSING PIPELINES
APACHE BEAM
▸ Born at Google, then open-sourced
▸ Designed especially for ETL pipelines
▸ Used for both streaming and batch processing
▸ Heavily parallel processing
▸ Exactly-once semantics
IN CODE
▸ Backends (Spark, Flink, Apex, Dataflow, Gearpump, Direct)
▸ Java (rich) and Python (pretty) SDKs, plus a recently added Go SDK
▸ Experimental SQL on PCollections
▸ Open-source Scala API (GitHub: spotify/scio)
APACHE BEAM FRAMEWORK
▸ Pipeline
▸ Input/Output
▸ PCollection: distributed data representation (Spark RDD-like)
▸ Transformation: operation applied to a PCollection
PCOLLECTION
▸ Elements of any type, but all of the same type, and serializable
▸ Immutable
▸ Any size: bounded or unbounded
▸ Every element has a timestamp
APACHE BEAM FRAMEWORK FOUNDATIONS
TRANSFORMATIONS
▸ ParDo: like map in MapReduce
  ▸ Filter elements of a PCollection
  ▸ Format values in a PCollection
  ▸ Cast types
  ▸ Computations on each single element
  ▸ collection.apply(ParDo.of(new SomeDoFn()))
TRANSFORMATIONS
▸ GroupByKey
  ▸ Groups values of k/v pairs that share the same key
  ▸ Like the shuffle phase in MapReduce
  ▸ For streaming, windowing or triggers are necessary
TRANSFORMATIONS
▸ CoGroupByKey
  ▸ Joins values of k/v pairs that share the same key across separate PCollections
  ▸ .apply(CoGroupByKey.create())
  ▸ For streaming, windowing or triggers are necessary
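The co-grouping idea can be sketched without Beam. The following is a plain-Java illustration only, not the Beam API: the class `CoGroupSketch`, its `Joined` holder and the sample data in the usage note are hypothetical names chosen for this sketch. For every key it collects the values found in each of two inputs, which is roughly the shape of result that a co-group of two keyed collections produces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of co-grouping two keyed inputs; not the Beam API.
final class CoGroupSketch {
    /** Values found for one key in each of the two inputs. */
    static final class Joined<A, B> {
        final List<A> left = new ArrayList<>();
        final List<B> right = new ArrayList<>();
    }

    /** Groups both inputs by key; a key missing from one side gets an empty list there. */
    static <K extends Comparable<K>, A, B> Map<K, Joined<A, B>> coGroup(
            List<Map.Entry<K, A>> left, List<Map.Entry<K, B>> right) {
        Map<K, Joined<A, B>> out = new TreeMap<>();
        for (Map.Entry<K, A> e : left) {
            out.computeIfAbsent(e.getKey(), k -> new Joined<A, B>()).left.add(e.getValue());
        }
        for (Map.Entry<K, B> e : right) {
            out.computeIfAbsent(e.getKey(), k -> new Joined<A, B>()).right.add(e.getValue());
        }
        return out;
    }
}
```

For example, co-grouping a list of (user, email) pairs with a list of (user, phone) pairs yields one entry per user, holding all emails on the left and all phones on the right.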
TRANSFORMATIONS
▸ Combine
  ▸ Reduce from the MapReduce paradigm
  ▸ Combines all elements in a PCollection, or the elements for each key in k/v pairs
  ▸ The combining function must be commutative and associative
  ▸ For streaming, accumulates elements per window
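Why the combining function has to be commutative and associative can be shown with a small plain-Java sketch (not the Beam API; `CombineSketch` and `combineInChunks` are hypothetical names). It folds each chunk of data separately, as independent workers would, and then folds the partial results. For a function such as sum or max the answer does not depend on how the data was chunked.

```java
import java.util.List;
import java.util.function.BinaryOperator;

// Plain-Java sketch of chunked combining; not the Beam API.
final class CombineSketch {
    /** Folds every chunk on its own, then folds the partial results together. */
    static <T> T combineInChunks(List<List<T>> chunks, T identity, BinaryOperator<T> fn) {
        T total = identity;
        for (List<T> chunk : chunks) {
            T partial = identity;
            for (T value : chunk) {
                partial = fn.apply(partial, value);
            }
            total = fn.apply(total, partial);
        }
        return total;
    }

    public static void main(String[] args) {
        // The same four numbers, split across "workers" in two different ways.
        List<List<Long>> twoAndTwo = List.of(List.of(1L, 2L), List.of(3L, 4L));
        List<List<Long>> oneAndThree = List.of(List.of(1L), List.of(2L, 3L, 4L));
        System.out.println(combineInChunks(twoAndTwo, 0L, Long::sum));
        System.out.println(combineInChunks(oneAndThree, 0L, Long::sum));
    }
}
```

A function that is not associative, for instance `(a, b) -> 2 * a + b`, gives different results for different chunkings, which is why it would not be a safe combiner in a parallel runner.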
TRANSFORMATIONS
▸ Flatten
  ▸ Merges several PCollections
▸ Partition
  ▸ Splits a PCollection
  ▸ Partitioning function(element, numberOfPartitions)
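A partitioning function of the form (element, numberOfPartitions) can be sketched in plain Java as follows. This is an illustration of the idea only, not the Beam API: `PartitionSketch` and the score data are hypothetical. The function returns an index in the range 0 to numberOfPartitions - 1, and each element is routed to the output with that index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntBiFunction;

// Plain-Java sketch of splitting one collection into N outputs; not the Beam API.
final class PartitionSketch {
    static <T> List<List<T>> partition(
            List<T> elements, int numPartitions, ToIntBiFunction<T, Integer> partitionFn) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (T element : elements) {
            int index = partitionFn.applyAsInt(element, numPartitions);
            if (index < 0 || index >= numPartitions) {
                throw new IllegalArgumentException("partition index out of range: " + index);
            }
            partitions.get(index).add(element);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Hypothetical data: scores in 0..99 split into a lower and an upper half.
        List<Integer> scores = List.of(5, 55, 95, 40);
        System.out.println(partition(scores, 2, (score, n) -> score * n / 100));
    }
}
```

The number of partitions is fixed up front and the function only picks an index, so the shape of the output is known before any element is processed.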
TRANSFORMATIONS
▸ Several predefined transformations:
  ▸ Filter.by
  ▸ Count
▸ Custom Transformations
  ▸ Serializable
  ▸ Thread-compatible
  ▸ Idempotent
TRANSFORMATIONS – SPLITTABLE DOFN
▸ Splits the processing of one element across many workers
▸ Possibly unbounded result of ParDo’ing one element
▸ Examples:
  ▸ tail -f logs-directory
  ▸ running jobs outside of Beam and processing their results within it
▸ Currently supported in Dataflow and Flink runners
TAGGED OUTPUT
SIDE INPUT – ENRICHMENT
▸ Additional data in ParDo
▸ Computed at runtime
▸ words.apply(ParDo.of(...).withSideInputs(dataView));
IO
FILE       MESSAGING   DATABASE
HDFS       Kinesis     Cassandra
GCS        Kafka       HBase
S3         PubSub      Hive
Local      JMS         BigQuery
Avro       MQTT        BigTable
Text                   DataStore
TFRecord               Spanner
XML                    Mongo
Tika                   Redis
ParquetIO              Solr
SEEN SO FAR
RUNNER
SDK
TRANSFORMS, IO
USER CODE
MODEL
APACHE BEAM
TWEETS HASHTAGS AUTOCOMPLETE
[Diagram: Tweets -> READS -> EXTRACT -> COUNT -> EXPAND -> TOP(3) -> WRITES -> Predictions]
READS: #juzwiosna; #jug w Warszawie; #java 10 GA released
EXTRACT: jug, java, juzwiosna
COUNT: jug -> 10k, java -> 4M, juzwiosna -> 100
EXPAND: {j -> [jug -> 10k, java -> 4M, juzwiosna -> 100], ju -> [jug -> 10k, juzwiosna -> 100]}
TOP(3): {j -> [java, jug, juzwiosna], ju -> [jug, juzwiosna]}
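The EXPAND and TOP(3) steps can be tried out on the slide's numbers with a plain-Java sketch. It is not Beam code, and `HashtagAutocomplete`, `prefixes` and `topPerPrefix` are hypothetical names used only here. Each counted tag is expanded into all of its prefixes, and for every prefix the most frequent tags are kept, most frequent first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java sketch of the EXPAND and TOP(n) steps; not the Beam API.
final class HashtagAutocomplete {
    /** EXPAND: every non-empty prefix of a tag, shortest first. */
    static List<String> prefixes(String tag) {
        List<String> out = new ArrayList<>();
        for (int end = 1; end <= tag.length(); end++) {
            out.add(tag.substring(0, end));
        }
        return out;
    }

    /** EXPAND + TOP(n): for each prefix, the n most frequent tags, most frequent first. */
    static Map<String, List<String>> topPerPrefix(Map<String, Long> counts, int n) {
        // prefix -> all (tag, count) entries whose tag starts with that prefix
        Map<String, List<Map.Entry<String, Long>>> byPrefix = new TreeMap<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            for (String prefix : prefixes(entry.getKey())) {
                byPrefix.computeIfAbsent(prefix, k -> new ArrayList<>()).add(entry);
            }
        }
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<Map.Entry<String, Long>>> e : byPrefix.entrySet()) {
            result.put(e.getKey(), e.getValue().stream()
                    .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                    .limit(n)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList()));
        }
        return result;
    }
}
```

Called with the counts jug -> 10,000, java -> 4,000,000 and juzwiosna -> 100 and n = 3, the entry for "j" is [java, jug, juzwiosna] and the entry for "ju" is [jug, juzwiosna], matching the TOP(3) line above.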
APACHE BEAM
TWEETS — BATCH
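READS -> EXTRACT -> COUNT -> EXPAND -> TOP(3) -> WRITES
// PipelineOptions is an interface, so it is created through its factory.
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
p.begin()
 .apply(TextIO.read().from("..."))          // READS
 .apply(ParDo.of(new ExtractTags()))        // EXTRACT
 .apply(Count.perElement())                 // COUNT
 .apply(ParDo.of(new ExpandPrefixes()))     // EXPAND
 .apply(Top.largestPerKey(3))               // TOP(3)
 .apply(TextIO.write().to("..."));          // WRITES
p.run();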
APACHE BEAM FRAMEWORK – STREAMING
▸ Windows
▸ One global window by default
▸ Applied for group, combine or output transformations
▸ GroupByKey — data is grouped by both key and window
WINDOWS
FIXED TIME WINDOWS
SLIDING WINDOWS
SESSION WINDOWS
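How an element's timestamp maps to fixed and sliding windows can be sketched in plain Java. This is an illustration under simplifying assumptions (windows aligned to the epoch, timestamps in milliseconds), not the Beam implementation; `WindowAssignment` and its methods are hypothetical names. A fixed window is found by rounding the timestamp down to a multiple of the window size, while a sliding window of a given size and period puts every element into several overlapping windows. Session windows depend on gaps between neighbouring elements and are not covered here.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of window assignment; not the Beam API.
final class WindowAssignment {
    /** Start of the fixed window [start, start + size) that contains the timestamp. */
    static long fixedWindowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    /** Starts of all sliding windows [start, start + size) containing the timestamp, latest first. */
    static List<Long> slidingWindowStarts(long timestampMillis, long sizeMillis, long periodMillis) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMillis - Math.floorMod(timestampMillis, periodMillis);
        for (long start = lastStart; start > timestampMillis - sizeMillis; start -= periodMillis) {
            starts.add(start);
        }
        return starts;
    }
}
```

With one-minute windows sliding every 30 seconds, an element at 125,000 ms belongs to the windows starting at 120,000 ms and 90,000 ms; with one-minute fixed windows it belongs only to the window starting at 120,000 ms.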
APACHE BEAM
TWEETS – BATCH
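READS -> EXTRACT -> COUNT -> EXPAND -> TOP(3) -> WRITES
// PipelineOptions is an interface, so it is created through its factory.
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
p.begin()
 .apply(PubsubIO.readStrings().fromTopic("..."))   // READS
 .apply(ParDo.of(new ExtractTags()))               // EXTRACT
 .apply(Count.perElement())                        // COUNT
 .apply(ParDo.of(new ExpandPrefixes()))            // EXPAND
 .apply(Top.largestPerKey(3))                      // TOP(3)
 .apply(PubsubIO.writeStrings().to("..."));        // WRITES
p.run();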
APACHE BEAM
TWEETS – STREAMING
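READS -> EXTRACT -> COUNT -> EXPAND -> TOP(3) -> WRITES
// PipelineOptions is an interface, so it is created through its factory.
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
p.begin()
 .apply(PubsubIO.readStrings().fromTopic("..."))   // READS
 // Window the unbounded input so the aggregations below can emit results.
 .apply(Window.into(SlidingWindows.of(Duration.standardMinutes(60))))
 .apply(ParDo.of(new ExtractTags()))               // EXTRACT
 .apply(Count.perElement())                        // COUNT
 .apply(ParDo.of(new ExpandPrefixes()))            // EXPAND
 .apply(Top.largestPerKey(3))                      // TOP(3)
 .apply(CassandraIO.<Hashtag>write());             // WRITES
p.run();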
APACHE BEAM FRAMEWORK – STREAMING
▸ The watermark is the system's estimate of event-time progress; it reflects the approximate lag between event timestamps and processing time
▸ Beam keeps track of the watermark and uses it to fire aggregates
▸ Once the watermark passes the end of a window, data arriving for that window is considered late and is discarded
▸ but... you can allow for lateness
▸ Window.into(FixedWindows.of(..))
  .withAllowedLateness(Duration.standardDays(2))
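The late-data rule can be written down as a small decision function. The following plain-Java sketch is a simplified model, not Beam's actual implementation: `LatenessCheck`, `Status` and `classify` are hypothetical names, and the exact boundary handling in a real runner may differ. It compares the current watermark with the end of the window that an arriving element belongs to, plus the allowed lateness.

```java
// Simplified model of late-data handling; not the Beam API.
final class LatenessCheck {
    enum Status { ON_TIME, LATE_BUT_ACCEPTED, DROPPED }

    /** Decides what happens to an element that arrives for a window ending at windowEndMillis. */
    static Status classify(long windowEndMillis, long watermarkMillis, long allowedLatenessMillis) {
        if (watermarkMillis < windowEndMillis) {
            return Status.ON_TIME;            // the watermark has not passed the window yet
        }
        if (watermarkMillis < windowEndMillis + allowedLatenessMillis) {
            return Status.LATE_BUT_ACCEPTED;  // late, but still within the allowed lateness
        }
        return Status.DROPPED;                // too late, the element is discarded
    }
}
```

With an allowed lateness of zero, the middle case never applies and every element arriving after the watermark has passed its window is dropped, which matches the default behaviour described above.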
APACHE BEAM FRAMEWORK – STREAMING
▸ Triggers
▸ Change default windowing behaviour
▸ Completeness / Latency / Cost
▸ Event Time / Processing Time / Data
APACHE BEAM FRAMEWORK – STATEFUL PROCESSING
       (k1,w1)   (k2,w2)   (k3,w3)
"s1"   12        33        -5
"s2"   "cat"     "dog"     "perch"
"s3"   0.03      0.12      0.33
"s4"   "Ala"     "has"     "a cat"
CAPABILITY MATRIX
APACHE BEAM – BACKENDS
APACHE BEAM – RUN
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner
APACHE BEAM – RUN
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" \
  -Pspark-runner
APACHE BEAM – RUN
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://bb/tmp \
    --inputFile=gs://apache-beam-samples/shakespeare/* \
    --output=gs://bb/counts" \
  -Pdataflow-runner
APACHE BEAM – USE CASES
▸ ETL
▸ Fraud detection
▸ Ads pricing (similar: Uber pricing)
▸ Sentiment analysis
LINKS
▸ Google Dataflow paper: https://research.google.com/pubs/pub43864.html
▸ Apache Beam: https://beam.apache.org/
▸ Design documents: https://wtanaka.com/beam/design-doc
Q&A
THANK YOU
We are hiring ;-)

https://goo.gl/zzqXLS

Confitura 2018: Apache Beam, a Data Engineer's Ray of Hope (original title: Promyk Nadziei Data Engineera)

  • 1. APACHE BEAM – THE DATA ENGINEER’S HOPE
    Robert Mroczkowski, Piotr Wikieł
  • 2. ABOUT US
    ▸ Data Platform Engineers at Allegro
    ▸ Maintaining probably one of the largest Hadoop clusters in Poland
    ▸ We use public clouds for data processing on a daily basis
    ▸ Both interested in ML
    ▸ Roots:
      ▸ Robert: sysop
      ▸ Piotr: dev
    VegeTables
  • 3. AGENDA
    ▸ ETL and Lambda Architecture
    ▸ Apache Beam framework foundations
    ▸ Transformations, windows, tags, etc.
    ▸ Batch and streaming
    ▸ Examples, use cases
  • 5. LAMBDA ARCHITECTURE
    Diagram labels: DATA, BATCH, SPEED, SERVING, QUERY (twice)
    Technologies and consumers shown: Spark, Druid, Spark/Flink, Kafka, Analyst, Microservice
  • 6. LAMBDA ARCHITECTURE
    ▸ Complicated, huh?
    ▸ We have to build separate software for real-time and batch computations
    ▸ … which has to be maintained, probably by different teams
    ▸ Why not use one tool to rule them all?
  • 8. APACHE BEAM: UNIFIED MODEL FOR EXECUTING BOTH BATCH AND STREAM DATA PROCESSING PIPELINES
  • 9. APACHE BEAM
    ▸ Born at Google, and then open-sourced
    ▸ Designed especially for ETL pipelines
    ▸ Used for both streaming and batch processing
    ▸ Heavily parallel processing
    ▸ Exactly-once semantics
  • 10. IN CODE
    ▸ Backends (Spark, Flink, Apex, Dataflow, Gearpump, Direct)
    ▸ Java (rich) and Python (pretty) SDKs, and a recently added Go SDK
    ▸ Experimental SQL on PCollections
    ▸ Open-source Scala API (GitHub: spotify/scio)
  • 11. APACHE BEAM FRAMEWORK
    ▸ Pipeline
    ▸ Input/Output
    ▸ PCollection: distributed data representation (Spark RDD-like)
    ▸ Transformation: operation applied to a PCollection
  • 12. PCOLLECTION
    ▸ Any element type, but all elements of one type; must be serializable
    ▸ Immutable
    ▸ Any size: bounded or unbounded
    ▸ Elements carry timestamps
    (section footer: APACHE BEAM FRAMEWORK FOUNDATIONS)
  • 13. TRANSFORMATIONS
    ▸ ParDo: like map in MapReduce
      ▸ Filter elements of a PCollection
      ▸ Format values in a PCollection
      ▸ Cast types
      ▸ Computations on each single element
      ▸ collection.apply(ParDo.of(new SomeDoFn()))
  • 14. TRANSFORMATIONS
    ▸ GroupByKey
      ▸ Groups the values of k/v pairs that share the same key
      ▸ Like the shuffle phase in MapReduce
      ▸ For streaming, windowing or triggers are necessary
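To make the GroupByKey bullet concrete, here is a minimal plain-Java sketch of the result it computes. It is a model of the semantics only, not the Beam API; the class and method names (`GroupByKeySketch`, `groupByKey`) are hypothetical, and the predictable output order is a property of this demo, not something Beam is assumed to guarantee.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of what GroupByKey computes; this is not the Beam API.
final class GroupByKeySketch {

    // Gathers the values of all (key, value) pairs that share a key.
    // LinkedHashMap keeps the output order predictable for this demo only.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("java", 1));
        pairs.add(new SimpleEntry<>("jug", 2));
        pairs.add(new SimpleEntry<>("java", 3));
        System.out.println(groupByKey(pairs)); // prints {java=[1, 3], jug=[2]}
    }
}
```

On a cluster the same regrouping requires moving data between workers, which is why the slide compares it to the shuffle phase.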
  • 15. TRANSFORMATIONS
    ▸ CoGroupByKey
      ▸ Joins the values of k/v pairs that share the same key across separate PCollections
      ▸ .apply(CoGroupByKey.create())
      ▸ For streaming, windowing or triggers are necessary
  • 16. TRANSFORMATIONS
    ▸ Combine
      ▸ The reduce step of the MapReduce paradigm
      ▸ Combines the elements of an entire PCollection, or the values for each key in k/v pairs
      ▸ The combining function must be commutative and associative
      ▸ For streaming, accumulates elements per window
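The commutative and associative requirement is easier to see with an accumulator. The sketch below is a plain-Java model, not Beam's combine API: each worker is assumed to fold its own slice of the data into a partial accumulator, and the partial results are then merged in whatever order they arrive. `CombineSketch`, `MeanAccumulator`, and `combineBundle` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model of a combiner; this is not the Beam API.
final class CombineSketch {

    // Accumulator for an average: a running sum and a count.
    static final class MeanAccumulator {
        long sum;
        long count;

        void add(long value) {
            sum += value;
            count++;
        }

        // Merging is commutative and associative, so partial results from
        // different workers can be merged in any order.
        MeanAccumulator merge(MeanAccumulator other) {
            MeanAccumulator merged = new MeanAccumulator();
            merged.sum = sum + other.sum;
            merged.count = count + other.count;
            return merged;
        }

        double result() {
            return count == 0 ? Double.NaN : (double) sum / count;
        }
    }

    // Folds one bundle (the slice of data seen by one worker).
    static MeanAccumulator combineBundle(List<Long> bundle) {
        MeanAccumulator acc = new MeanAccumulator();
        for (long value : bundle) {
            acc.add(value);
        }
        return acc;
    }

    public static void main(String[] args) {
        MeanAccumulator a = combineBundle(Arrays.asList(10L, 20L));
        MeanAccumulator b = combineBundle(Arrays.asList(30L));
        System.out.println(a.merge(b).result()); // prints 20.0
        System.out.println(b.merge(a).result()); // prints 20.0
    }
}
```

Averaging the per-bundle averages directly would give a different answer, which is why the accumulator carries a sum and a count rather than a mean.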
  • 17. TRANSFORMATIONS
    ▸ Flatten
      ▸ Merges several PCollections into one
    ▸ Partition
      ▸ Splits a PCollection into a fixed number of PCollections
      ▸ Partitioning function(element, numberOfPartitions) returns the index of the target partition
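A minimal sketch of the partitioning contract, in plain Java rather than Beam: the function receives an element and the number of partitions, and is assumed to return an index in the range 0 to numberOfPartitions - 1. The names `PartitionSketch` and `partition`, and the word-length rule in the demo, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntBiFunction;

// Plain-Java model of Partition; this is not the Beam API.
final class PartitionSketch {

    // Routes each element to the partition whose index the function returns.
    static <T> List<List<T>> partition(List<T> input, int numberOfPartitions,
                                       ToIntBiFunction<T, Integer> partitionFn) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < numberOfPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (T element : input) {
            int index = partitionFn.applyAsInt(element, numberOfPartitions);
            if (index < 0 || index >= numberOfPartitions) {
                throw new IllegalArgumentException("partition index out of range: " + index);
            }
            parts.get(index).add(element);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("jug", "java", "juzwiosna");
        // Demo rule: even-length words go to partition 0, odd-length to 1.
        System.out.println(partition(tags, 2, (word, n) -> word.length() % n));
        // prints [[java], [jug, juzwiosna]]
    }
}
```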
  • 18. TRANSFORMATIONS
    ▸ Several already defined transformations:
      ▸ Filter.by
      ▸ Count
    ▸ Custom transformations should be:
      ▸ Serializable
      ▸ Thread-compatible
      ▸ Idempotent
  • 19. TRANSFORMATIONS – SPLITTABLE DOFN
    ▸ Splits the processing of one element across many workers
    ▸ ParDo'ing one element may produce an unbounded result
    ▸ Examples:
      ▸ tail -f logs-directory
      ▸ running jobs outside of Beam and processing their results within it
    ▸ Currently supported in the Dataflow and Flink runners
  • 20. TAGGED OUTPUT
  • 21. SIDE INPUT – ENRICHMENT
    ▸ Additional data in ParDo
    ▸ Computed at runtime
    ▸ words.apply(ParDo.of(...).withSideInputs(dataView));
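The enrichment idea can be sketched in plain Java as follows. This models the concept only and does not use Beam: a view is computed at runtime from a second data set, and the element-wise step reads that same view for every element. `SideInputSketch`, `buildView`, and `enrich` are hypothetical names, and the "word:count" output format is just for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a side input; this is not the Beam API.
final class SideInputSketch {

    // Computes the side-input view at runtime from a second data set:
    // here, how often each tag has been seen before.
    static Map<String, Integer> buildView(List<String> knownTags) {
        Map<String, Integer> view = new HashMap<>();
        for (String tag : knownTags) {
            view.merge(tag, 1, Integer::sum);
        }
        return view;
    }

    // Element-wise step: every element is enriched from the same view.
    static List<String> enrich(List<String> words, Map<String, Integer> dataView) {
        List<String> out = new ArrayList<>();
        for (String word : words) {
            out.add(word + ":" + dataView.getOrDefault(word, 0));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> dataView = buildView(Arrays.asList("java", "java", "jug"));
        System.out.println(enrich(Arrays.asList("java", "scala"), dataView));
        // prints [java:2, scala:0]
    }
}
```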
  • 22. IO
    ▸ FILE: HDFS, GCS, S3, Local, Avro, Text, TFRecord, XML, Tika, ParquetIO
    ▸ MESSAGING: Kinesis, Kafka, PubSub, JMS, MQTT
    ▸ DATABASE: Cassandra, HBase, Hive, BigQuery, BigTable, DataStore, Spanner, Mongo, Redis, Solr
  • 24. APACHE BEAM TWEETS HASHTAGS AUTOCOMPLETE
    Diagram stages: Tweets, READS, WRITES, Predictions
    Input tweets: #juzwiosna; #jug w Warszawie; #java 10 GA released
  • 25. APACHE BEAM TWEETS HASHTAGS AUTOCOMPLETE
    Stage added: EXTRACT
    EXTRACT output: jug, java, juzwiosna
  • 26. APACHE BEAM TWEETS HASHTAGS AUTOCOMPLETE
    Stage added: COUNT
    COUNT output: jug -> 10k, java -> 4M, juzwiosna -> 100
  • 27. APACHE BEAM TWEETS HASHTAGS AUTOCOMPLETE
    Stage added: EXPAND
    EXPAND output: {j -> [jug -> 10k, java -> 4M, juzwiosna -> 100], ju -> [jug -> 10k, juzwiosna -> 100]}
  • 28. APACHE BEAM TWEETS HASHTAGS AUTOCOMPLETE
    Stage added: TOP(3)
    TOP(3) output, written as predictions: {j -> [java, jug, juzwiosna], ju -> [jug, juzwiosna]}
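The four stages on these slides (EXTRACT, COUNT, EXPAND, TOP(3)) can be walked through in plain Java. This is a sketch of the data flow only, not Beam code. The hashtag regular expression, the class name `HashtagAutocompleteSketch`, and the method names are assumptions made for illustration; EXPAND and TOP are folded into one method for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java walk-through of the four stages; this is not Beam code.
final class HashtagAutocompleteSketch {
    // Assumed hashtag syntax: '#' followed by word characters.
    private static final Pattern TAG = Pattern.compile("#(\\w+)");

    // EXTRACT: pull every hashtag out of every tweet.
    static List<String> extractTags(List<String> tweets) {
        List<String> tags = new ArrayList<>();
        for (String tweet : tweets) {
            Matcher m = TAG.matcher(tweet);
            while (m.find()) {
                tags.add(m.group(1));
            }
        }
        return tags;
    }

    // COUNT: number of occurrences per hashtag.
    static Map<String, Long> count(List<String> tags) {
        Map<String, Long> counts = new TreeMap<>();
        for (String tag : tags) {
            counts.merge(tag, 1L, Long::sum);
        }
        return counts;
    }

    // EXPAND + TOP(n): index each tag under all of its prefixes, then keep
    // the n most frequent tags per prefix, most frequent first.
    static Map<String, List<String>> topPerPrefix(Map<String, Long> counts, int n) {
        Map<String, List<String>> byPrefix = new TreeMap<>();
        for (String tag : counts.keySet()) {
            for (int len = 1; len <= tag.length(); len++) {
                byPrefix.computeIfAbsent(tag.substring(0, len), p -> new ArrayList<>()).add(tag);
            }
        }
        for (Map.Entry<String, List<String>> entry : byPrefix.entrySet()) {
            List<String> tags = entry.getValue();
            tags.sort((a, b) -> Long.compare(counts.get(b), counts.get(a)));
            entry.setValue(new ArrayList<>(tags.subList(0, Math.min(n, tags.size()))));
        }
        return byPrefix;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList("#juzwiosna", "#jug w Warszawie", "#java 10 GA released");
        Map<String, Long> counts = count(extractTags(tweets));
        System.out.println(topPerPrefix(counts, 3).get("ju")); // prints [jug, juzwiosna]
    }
}
```

Feeding `topPerPrefix` the counts shown on the slides (jug 10k, java 4M, juzwiosna 100) yields java, jug, juzwiosna for the prefix j, matching the TOP(3) output above.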
  • 29. APACHE BEAM TWEETS – BATCH
    Stages: READS, EXTRACT, COUNT, EXPAND, TOP(3), WRITES
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    p.begin()
    .apply(TextIO.read().from("..."))           // READS
    .apply(ParDo.of(new ExtractTags()))         // EXTRACT
    .apply(Count.perElement())                  // COUNT
    .apply(ParDo.of(new ExpandPrefixes()))      // EXPAND
    .apply(Top.largestPerKey(3))                // TOP(3)
    .apply(TextIO.write().to("..."));           // WRITES
    p.run();
  • 30. APACHE BEAM FRAMEWORK – STREAMING
    ▸ Windows
      ▸ One global window by default
      ▸ Take effect in grouping, combining, and output transformations
      ▸ GroupByKey: data is grouped by both key and window
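Grouping by both key and window can be sketched in plain Java, assuming fixed windows aligned to multiples of the window size. This is not the Beam API: `WindowedGroupingSketch`, its methods, and the "key@windowStart" cell naming are hypothetical, and timestamps are plain milliseconds.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of grouping by (key, window); this is not the Beam API.
final class WindowedGroupingSketch {

    // Start of the fixed window [start, start + size) containing the timestamp.
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    // Counts events per (key, window); each cell is named "key@windowStart".
    static Map<String, Integer> countPerKeyAndWindow(
            List<Map.Entry<String, Long>> events, long sizeMillis) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Long> event : events) {
            String cell = event.getKey() + "@" + windowStart(event.getValue(), sizeMillis);
            counts.merge(cell, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> events = new ArrayList<>();
        events.add(new SimpleEntry<>("java", 5_000L));
        events.add(new SimpleEntry<>("java", 59_999L));
        events.add(new SimpleEntry<>("java", 60_000L));
        System.out.println(countPerKeyAndWindow(events, 60_000L));
        // prints {java@0=2, java@60000=1}
    }
}
```

The same key lands in two separate groups once its events fall into different windows, which is what lets an unbounded stream produce finite aggregates.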
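  • 34. APACHE BEAM TWEETS – BATCH
    Stages: READS, EXTRACT, COUNT, EXPAND, TOP(3), WRITES
    Only the IO changes from the file-based version; an unbounded source still needs windowing before COUNT.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    p.begin()
    .apply(PubsubIO.readStrings().fromTopic("..."))   // READS
    .apply(ParDo.of(new ExtractTags()))               // EXTRACT
    .apply(Count.perElement())                        // COUNT
    .apply(ParDo.of(new ExpandPrefixes()))            // EXPAND
    .apply(Top.largestPerKey(3))                      // TOP(3)
    .apply(PubsubIO.writeStrings().to("..."));        // WRITES
    p.run();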
  • 35. APACHE BEAM TWEETS – STREAMING
    Stages: READS, EXTRACT, COUNT, EXPAND, TOP(3), WRITES
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    p.begin()
    .apply(PubsubIO.readStrings().fromTopic("..."))   // READS
    .apply(Window.into(SlidingWindows.of(
        Duration.standardMinutes(60))))               // 60-minute sliding windows
    .apply(ParDo.of(new ExtractTags()))               // EXTRACT
    .apply(Count.perElement())                        // COUNT
    .apply(ParDo.of(new ExpandPrefixes()))            // EXPAND
    .apply(Top.largestPerKey(3))                      // TOP(3)
    .apply(CassandraIO.<Hashtag>write());             // WRITES
    p.run();
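With sliding windows, one element belongs to several overlapping windows. The plain-Java sketch below shows that assignment under two assumptions that are not on the slide: window starts are aligned to multiples of the period, and the period is passed explicitly (the slide sets only the 60-minute size). `SlidingWindowSketch` and `windowStarts` are hypothetical names; this is not the Beam API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of sliding-window assignment; this is not the Beam API.
final class SlidingWindowSketch {

    // Start times of every window [start, start + size) that contains the
    // timestamp, newest first. Starts are multiples of the period.
    static List<Long> windowStarts(long timestamp, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - Math.floorMod(timestamp, period);
        for (long start = lastStart; start > timestamp - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Units are minutes here: 60-minute windows sliding every 15 minutes.
        System.out.println(windowStarts(125, 60, 15)); // prints [120, 105, 90, 75]
    }
}
```

With a period equal to the size, each element falls into exactly one window, which is the fixed-window case.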
  • 36. APACHE BEAM FRAMEWORK – STREAMING
    ▸ The watermark is the system's estimate of how far event time has progressed; it reflects the approximate lag between event timestamps and processing time
    ▸ Beam keeps track of the watermark and uses it to fire aggregates
    ▸ Once the watermark passes the end of a window, data arriving for that window is considered late and is discarded
    ▸ but... you can allow for lateness
    ▸ Window.into(FixedWindows.of(..))
      .withAllowedLateness(Duration.standardDays(2))
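The late-data rule can be written as a small decision function. This plain-Java sketch is a simplified model, not Beam's implementation: times are plain milliseconds, the exact boundary handling (inclusive or exclusive comparisons) is an assumption, and the names `LatenessSketch`, `Status`, and `classify` are hypothetical.

```java
// Plain-Java model of the lateness decision; this is not the Beam API.
final class LatenessSketch {

    enum Status { ON_TIME, LATE_ACCEPTED, DROPPED }

    // Classifies an element for a window, given where the watermark stands
    // when the element arrives.
    static Status classify(long windowEnd, long watermark, long allowedLateness) {
        if (watermark < windowEnd) {
            return Status.ON_TIME;           // window is still open
        }
        if (watermark < windowEnd + allowedLateness) {
            return Status.LATE_ACCEPTED;     // late, but within allowed lateness
        }
        return Status.DROPPED;               // too late: discarded
    }

    public static void main(String[] args) {
        long twoDays = java.time.Duration.ofDays(2).toMillis();
        long oneHour = java.time.Duration.ofHours(1).toMillis();
        long windowEnd = 1_000_000L;
        System.out.println(classify(windowEnd, windowEnd - 1, twoDays));           // prints ON_TIME
        System.out.println(classify(windowEnd, windowEnd + oneHour, twoDays));     // prints LATE_ACCEPTED
        System.out.println(classify(windowEnd, windowEnd + 3 * twoDays, twoDays)); // prints DROPPED
    }
}
```

With an allowed lateness of zero the middle case disappears, which matches the default behaviour described on the slide: anything behind the watermark is dropped.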
  • 37. APACHE BEAM FRAMEWORK – STREAMING
    ▸ Triggers
      ▸ Change the default windowing behaviour
      ▸ Trade-offs: completeness / latency / cost
      ▸ Trigger types: event time / processing time / data-driven
  • 38. APACHE BEAM FRAMEWORK – STATEFUL PROCESSING
    State cells per (key, window):
    state   (k1,w1)   (k2,w2)   (k3,w3)
    "s1"    12        33        -5
    "s2"    "kot"     "pies"    "okoń"
    "s3"    0,03      0,12      0,33
    "s4"    "ala"     "ma"      "kota"
  • 41. APACHE BEAM – RUN
    mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner
  • 42. APACHE BEAM – RUN
    mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner
  • 43. APACHE BEAM – RUN
    mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://bb/tmp --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://bb/counts" -Pdataflow-runner
  • 44. APACHE BEAM – USE CASES
    ▸ ETL
    ▸ Fraud detection
    ▸ Ads pricing (similar: Uber pricing)
    ▸ Sentiment analysis
  • 45. LINKS
    ▸ Google Dataflow paper: https://research.google.com/pubs/pub43864.html
    ▸ Apache Beam: https://beam.apache.org/
    ▸ Design documents: https://wtanaka.com/beam/design-doc
  • 47. Q&A
    THANK YOU
    We are hiring ;-)
    https://goo.gl/zzqXLS