© Cloudera, Inc. All rights reserved.
Marton Balassi | Solutions Architect
Flink PMC member
@MartonBalassi | mbalassi@cloudera.com
The Flink - Apache Bigtop integration
Outline
• Short introduction to Bigtop
• An even shorter intro to Flink
• From Flink source to linux packages
• Implementing BigPetStore
• From linux packages to Cloudera parcels
• Summary
Short introduction to Bigtop
What is Bigtop?
Apache project for standardizing testing, packaging and integration of
leading big data components.
Components as building blocks
And many more …
Dependency hell
[Figure: dependencies among hdfs, zookeeper, hbase, kafka, spark, …, mapred, oozie, hive, etc.]
Build all the things!
Early value added
• Bigtop has been around since the 0.20 days of Hadoop
• Provides a common foundation for proper integration of a growing number of Hadoop family components
• The foundation provides a solid base for validating applications running on top of the stack(s)
• Neutral packaging and deployment/config
Early mission accomplished
• Foundation for commercial Hadoop distros/services
• Leveraged by app providers
…
Adding more components
…
New focus and target groups
• Going way beyond just building debs/rpms
• Data engineers vs distro builders
• Enhance Operations/Deployment
• Reference implementations & tutorials
An even shorter intro to Flink
The Flink stack
DataStream API
Stream Processing
DataSet API
Batch Processing
Runtime
Distributed Streaming Data Flow
Libraries
Streaming and batch as first class citizens.
Flink in the wild
• 30 billion events daily
• 2 billion events in 10 1Gb machines
• Picked Flink for "Saiki", a data integration & distribution platform
• See talks by [logos] at [logo]
• Runs their fork of Flink on 1000+ nodes
From Flink source to linux packages
The Bigtop component build
• Bigtop builds the component (potentially after patching it)
• Lays out the files in a linux distro friendly way (/etc/flink/conf, …)
• Adds users, groups, systemd services for the components
• Sets up the paths and alternatives for convenient access
• Builds the debs/rpms and takes care of the dependencies
http://jayunit100.blogspot.com/2014/04/how-bigtop-packages-hadoop.html
Implementing BigPetStore
BigPetStore Outline
• BigPetStore model
• Data generator with the DataSet API
• ETL with the DataSet and Table APIs
• Matrix factorization with FlinkML
• Recommendation with the DataStream API
BigPetStore
• Blueprints for Big Data applications
• Consists of:
  • Data Generators
  • Examples using tools in Big Data ecosystem to process data
  • Build system and tests for integrating tools and multiple JVM languages
• Part of the Bigtop project
BigPetStore model
• Customers visit pet stores and generate location-based transactions
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on, pp. 49-55, 3-5 Dec. 2014
Data generation
• Use RJ Nowling’s Java generator classes
• Write transactions to JSON
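import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
// getData() and getCurrentMillis() are helpers not shown on the slide
val (stores, products, customers) = getData()
val startTime = getCurrentMillis()
val transactions = env.fromCollection(customers)
  .flatMap(new TransactionGenerator(products))
  // make the stores available to the generator as a broadcast set
  .withBroadcastSet(stores, "stores")
  // shift the generated timestamps by the start time
  .map { t => t.setDateTime(t.getDateTime + startTime); t }
transactions.writeAsText(output)
env.execute()  // file sinks are lazy; this triggers the job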
ETL with the DataSet API
• Read the dirty JSON
• Output (customer, product) pairs for the recommender
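import org.apache.flink.api.scala._
import org.apache.flink.api.scala.utils._  // provides zipWithUniqueId

val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
// give every distinct product a unique numeric id: (id, product)
val productsWithIndex = transactions.flatMap(_.getProducts)
  .distinct
  .zipWithUniqueId
val customerAndProductPairs = transactions
  .flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p)))
  // join on the product to replace it with its numeric id
  .join(productsWithIndex).where(_._2).equalTo(_._2)
  .map(pair => (pair._1._1, pair._2._1))
  .distinct
customerAndProductPairs.writeAsCsv(output)
env.execute()  // file sinks are lazy; this triggers the job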
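To make the dataflow above easier to follow, here is a minimal sketch of the same logic on plain Scala collections, without Flink. The `Purchase` case class and the `EtlSketch` object are hypothetical stand-ins for the parsed transactions, not part of BigPetStore. Note that `zipWithIndex` yields consecutive indices, whereas Flink's `zipWithUniqueId` only guarantees unique ones.

```scala
// Hypothetical stand-in for a parsed transaction:
// a customer id and the names of the purchased products.
final case class Purchase(customerId: Int, products: Seq[String])

object EtlSketch {
  // Assign an integer index to every distinct product name.
  def indexProducts(purchases: Seq[Purchase]): Map[String, Int] =
    purchases.flatMap(_.products).distinct.zipWithIndex.toMap

  // Emit distinct (customerId, productIndex) pairs,
  // the input format expected by the recommender.
  def customerProductPairs(purchases: Seq[Purchase]): Seq[(Int, Int)] = {
    val index = indexProducts(purchases)
    purchases
      .flatMap(p => p.products.map(name => (p.customerId, index(name))))
      .distinct
  }
}
```

The map lookup plays the role of the join: each product name is swapped for its numeric index before duplicates are dropped.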
ETL with Table API
• Read the dirty JSON
• SQL style queries (SQL coming in Flink 1.1)
val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
val table = transactions.map(toCaseClass(_)).toTable
val storeTransactionCount = table.groupBy('storeId)
  .select('storeId, 'storeName, 'storeId.count as 'count)
val bestStores = table.groupBy('storeId)
  .select('storeId.max as 'max)
  .join(storeTransactionCount)
  .where("count = max")
  .select('storeId, 'storeName, 'storeId.count as 'count)
  .toDataSet[StoreCount]
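Reading the query above as "count the transactions per store, then keep the stores with the highest count" (an interpretation, since the slide does not spell it out), the same result can be sketched on plain Scala collections. `StoreVisit`, the fields of `StoreCount` and the `BestStoresSketch` object are assumptions made for illustration.

```scala
// Hypothetical record: one row per transaction, carrying its store.
final case class StoreVisit(storeId: Int, storeName: String)
// Mirrors the StoreCount result type named above; its fields are assumed.
final case class StoreCount(storeId: Int, storeName: String, count: Int)

object BestStoresSketch {
  // Count the transactions per store (the groupBy + count step).
  def transactionCounts(visits: Seq[StoreVisit]): Seq[StoreCount] =
    visits
      .groupBy(v => (v.storeId, v.storeName))
      .map { case ((id, name), rows) => StoreCount(id, name, rows.size) }
      .toSeq
      .sortBy(_.storeId)

  // Keep the stores whose count equals the maximum (the "count = max" step).
  def bestStores(visits: Seq[StoreVisit]): Seq[StoreCount] = {
    val counts = transactionCounts(visits)
    if (counts.isEmpty) Seq.empty
    else {
      val max = counts.map(_.count).max
      counts.filter(_.count == max)
    }
  }
}
```

Ties are kept on purpose: several stores can share the maximum count.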
A little recommender theory
[Figure: user-item matrix R (users U by items I) with user factors P and item factors Q, plus user side information and item side information]
• R is potentially huge, approximate it with P∗Q
• Prediction is TopK(user’s row ∗ Q)
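As a small worked sketch of the prediction step: a user's score for an item is the dot product of the user's row of P with that item's factor vector in Q, and the recommendation is the k items with the highest scores. The `TopKSketch` object and its method names are hypothetical, and ties are broken by item id as an arbitrary choice.

```scala
object TopKSketch {
  type Vec = Array[Double]

  // Dot product of two factor vectors of equal length.
  def dot(a: Vec, b: Vec): Double = {
    require(a.length == b.length, "factor vectors must have the same length")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // Score every item against the user's row of P and return the k best item ids.
  // itemFactors maps an item id to that item's factor vector in Q.
  def topK(userFactors: Vec, itemFactors: Map[Int, Vec], k: Int): Seq[Int] =
    itemFactors.toSeq
      .map { case (item, q) => (item, dot(userFactors, q)) }
      // highest score first; equal scores are ordered by item id
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(k)
      .map(_._1)
}
```

With user factors (2, 1) and items 10 -> (1, 0), 20 -> (0, 1), 30 -> (1, 1), the scores are 2, 1 and 3, so the top 2 are items 30 and 10.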
Matrix factorization with FlinkML
• Read the (customer, product) pairs
• Write P and Q to file
val env = ExecutionEnvironment.getExecutionEnvironment
// every observed (customer, product) pair gets a rating of 1.0
val input = env.readCsvFile[(Int,Int)](inputFile)
  .map(pair => (pair._1, pair._2, 1.0))
val model = ALS()
  .setNumFactors(numFactors)
  .setIterations(iterations)
  .setLambda(lambda)
model.fit(input)
// factorsOption is only defined after fit()
val (p, q) = model.factorsOption.get
p.writeAsText(pOut)
q.writeAsText(qOut)
Recommendation with the DataStream API
• Give the TopK recommendation for a user
• (Could be optimized)
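// GetUserVector, PartialTopK and GlobalTopK are user functions not shown
// on the slide; the comments describe their apparent roles.
StreamExecutionEnvironment env = StreamExecutionEnvironment
    .getExecutionEnvironment();
env.socketTextStream("localhost", 9999)
    .map(new GetUserVector())   // requested user -> the user's factor vector
    .broadcast()                // send it to every parallel instance
    .map(new PartialTopK())     // rank the items held by this instance
    .keyBy(0)                   // group the partial results by their first field
    .flatMap(new GlobalTopK())  // merge the partial results
    .print();
env.execute();                  // the streaming job only starts on execute()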
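The pipeline above appears to split the ranking into two stages: every parallel instance ranks only the items it holds, and a later step merges those partial lists. The following plain Scala sketch illustrates that idea without Flink. The `DistributedTopKSketch` object, the sharding of Q into maps and the tie-breaking by item id are assumptions for illustration, not the code behind `PartialTopK` and `GlobalTopK`.

```scala
object DistributedTopKSketch {
  type Scored = (Int, Double) // (item id, predicted score)

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Highest score first; equal scores are ordered by item id.
  private def best(scored: Seq[Scored], k: Int): Seq[Scored] =
    scored
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(k)

  // Each parallel instance holds a shard of Q and ranks only its own items.
  def partialTopK(user: Array[Double],
                  shard: Map[Int, Array[Double]],
                  k: Int): Seq[Scored] =
    best(shard.toSeq.map { case (item, q) => (item, dot(user, q)) }, k)

  // A single downstream step merges the partial lists into the final ranking.
  def globalTopK(partials: Seq[Seq[Scored]], k: Int): Seq[Scored] =
    best(partials.flatten, k)
}
```

Because each shard already returns its own k best items, the merge step sees at most k entries per shard, which keeps the final sort small.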
From linux packages to Cloudera parcels
Why parcels?
• We have linux packages, why a new format?
• Cloudera Manager needs to update parcels without root privileges
• A big, single bundle for the whole ecosystem
• Plays well with the CM services and monitoring
• Package signing
https://github.com/cloudera/cm_ext
Managing the Flink parcel from CM
Next steps – Flink operations
• Flink does not offer a HistoryServer yet
  Running on YARN is inconvenient like this
  Follow [FLINK-4136] for resolution
• The stand-alone cluster mode runs multiple jobs in the same JVM
  In practice users fire up clusters per job
  Alibaba has a multitenant fork and aims to contribute it
  https://www.youtube.com/watch?v=_Nw8NTdIq9A
Next steps – CM services, monitoring
Summary
• Flink is a dataflow engine with batch and streaming as first class citizens
• Bigtop offers unified packaging, testing and integration
• BigPetStore gives you a blueprint for a range of apps
• It is straightforward to build a CM parcel based on Bigtop
Big thanks to
• Clouderans supporting the project:
  Sean Owen
  Alexander Bartfeld
  Justin Kestelyn
• The BigPetStore folks:
  Suneel Marthi
  Ronald J. Nowling
  Jay Vyas
• Bigtop people answering my silly questions:
  Konstantin Boudnik
  Roman Shaposhnik
  Nate D'Amico
• Squirrels pushing the integration:
  Robert Metzger
  Fabian Hueske
Check out the code
github.com/mbalassi/bigpetstore-flink
github.com/mbalassi/flink-parcel
Feel free to give me feedback.
Come to Flink Forward
Thank you
@MartonBalassi
mbalassi@cloudera.com

More Related Content

What's hot

Cloud stack networking shapeblue technical deep dive
Cloud stack networking   shapeblue technical deep diveCloud stack networking   shapeblue technical deep dive
Cloud stack networking shapeblue technical deep diveShapeBlue
 
The road to enterprise ready open stack storage as service
The road to enterprise ready open stack storage as serviceThe road to enterprise ready open stack storage as service
The road to enterprise ready open stack storage as serviceSean Cohen
 
OpenStack Toronto: Juno Community Update
OpenStack Toronto: Juno Community UpdateOpenStack Toronto: Juno Community Update
OpenStack Toronto: Juno Community UpdateStephen Gordon
 
[OpenStack Day in Korea 2015] Track 3-1 - OpenStack Storage Infrastructure & ...
[OpenStack Day in Korea 2015] Track 3-1 - OpenStack Storage Infrastructure & ...[OpenStack Day in Korea 2015] Track 3-1 - OpenStack Storage Infrastructure & ...
[OpenStack Day in Korea 2015] Track 3-1 - OpenStack Storage Infrastructure & ...OpenStack Korea Community
 
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...NETWAYS
 
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...NETWAYS
 
Cloud stack overview
Cloud stack overviewCloud stack overview
Cloud stack overviewhowie YU
 
Hacking apache cloud stack
Hacking apache cloud stackHacking apache cloud stack
Hacking apache cloud stackNitin Mehta
 
Contrail Virtual Execution Platform
Contrail Virtual Execution PlatformContrail Virtual Execution Platform
Contrail Virtual Execution PlatformNETWAYS
 
Kubernetes on Bare Metal at the Kitchener-Waterloo Kubernetes and Cloud Nativ...
Kubernetes on Bare Metal at the Kitchener-Waterloo Kubernetes and Cloud Nativ...Kubernetes on Bare Metal at the Kitchener-Waterloo Kubernetes and Cloud Nativ...
Kubernetes on Bare Metal at the Kitchener-Waterloo Kubernetes and Cloud Nativ...CloudOps2005
 
Using next gen storage in Cloudstack
Using next gen storage in CloudstackUsing next gen storage in Cloudstack
Using next gen storage in CloudstackShapeBlue
 
High Availability in OpenStack Cloud
High Availability in OpenStack CloudHigh Availability in OpenStack Cloud
High Availability in OpenStack CloudQiming Teng
 
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...VMworld
 
[OpenStack Day in Korea] Keynote#2 - Bringing OpenStack to the Enterprise Dat...
[OpenStack Day in Korea] Keynote#2 - Bringing OpenStack to the Enterprise Dat...[OpenStack Day in Korea] Keynote#2 - Bringing OpenStack to the Enterprise Dat...
[OpenStack Day in Korea] Keynote#2 - Bringing OpenStack to the Enterprise Dat...Sungjin Kang
 
Muli Ben-Yehuda, Stratoscale - The Road to a Hyper-Converged OpenStack, OpenS...
Muli Ben-Yehuda, Stratoscale - The Road to a Hyper-Converged OpenStack, OpenS...Muli Ben-Yehuda, Stratoscale - The Road to a Hyper-Converged OpenStack, OpenS...
Muli Ben-Yehuda, Stratoscale - The Road to a Hyper-Converged OpenStack, OpenS...Cloud Native Day Tel Aviv
 
MetalK8s 2.x 'Moonshot' - LOADays 2019, Antwerp
MetalK8s 2.x 'Moonshot' - LOADays 2019, AntwerpMetalK8s 2.x 'Moonshot' - LOADays 2019, Antwerp
MetalK8s 2.x 'Moonshot' - LOADays 2019, AntwerpNicolas Trangez
 
Cf Summit East 2018 Scaling ColdFusion
Cf Summit East 2018 Scaling ColdFusionCf Summit East 2018 Scaling ColdFusion
Cf Summit East 2018 Scaling ColdFusionmcollinsCF
 
Introduction to CloudStack Storage Subsystem
Introduction to CloudStack Storage SubsystemIntroduction to CloudStack Storage Subsystem
Introduction to CloudStack Storage Subsystembuildacloud
 

What's hot (20)

Cloud stack networking shapeblue technical deep dive
Cloud stack networking   shapeblue technical deep diveCloud stack networking   shapeblue technical deep dive
Cloud stack networking shapeblue technical deep dive
 
LinuxTag 2013
LinuxTag 2013LinuxTag 2013
LinuxTag 2013
 
The road to enterprise ready open stack storage as service
The road to enterprise ready open stack storage as serviceThe road to enterprise ready open stack storage as service
The road to enterprise ready open stack storage as service
 
OpenStack Toronto: Juno Community Update
OpenStack Toronto: Juno Community UpdateOpenStack Toronto: Juno Community Update
OpenStack Toronto: Juno Community Update
 
[OpenStack Day in Korea 2015] Track 3-1 - OpenStack Storage Infrastructure & ...
[OpenStack Day in Korea 2015] Track 3-1 - OpenStack Storage Infrastructure & ...[OpenStack Day in Korea 2015] Track 3-1 - OpenStack Storage Infrastructure & ...
[OpenStack Day in Korea 2015] Track 3-1 - OpenStack Storage Infrastructure & ...
 
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
 
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
 
Cloud stack overview
Cloud stack overviewCloud stack overview
Cloud stack overview
 
Hacking apache cloud stack
Hacking apache cloud stackHacking apache cloud stack
Hacking apache cloud stack
 
Contrail Virtual Execution Platform
Contrail Virtual Execution PlatformContrail Virtual Execution Platform
Contrail Virtual Execution Platform
 
Geode on Docker
Geode on DockerGeode on Docker
Geode on Docker
 
Kubernetes on Bare Metal at the Kitchener-Waterloo Kubernetes and Cloud Nativ...
Kubernetes on Bare Metal at the Kitchener-Waterloo Kubernetes and Cloud Nativ...Kubernetes on Bare Metal at the Kitchener-Waterloo Kubernetes and Cloud Nativ...
Kubernetes on Bare Metal at the Kitchener-Waterloo Kubernetes and Cloud Nativ...
 
Using next gen storage in Cloudstack
Using next gen storage in CloudstackUsing next gen storage in Cloudstack
Using next gen storage in Cloudstack
 
High Availability in OpenStack Cloud
High Availability in OpenStack CloudHigh Availability in OpenStack Cloud
High Availability in OpenStack Cloud
 
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
 
[OpenStack Day in Korea] Keynote#2 - Bringing OpenStack to the Enterprise Dat...
[OpenStack Day in Korea] Keynote#2 - Bringing OpenStack to the Enterprise Dat...[OpenStack Day in Korea] Keynote#2 - Bringing OpenStack to the Enterprise Dat...
[OpenStack Day in Korea] Keynote#2 - Bringing OpenStack to the Enterprise Dat...
 
Muli Ben-Yehuda, Stratoscale - The Road to a Hyper-Converged OpenStack, OpenS...
Muli Ben-Yehuda, Stratoscale - The Road to a Hyper-Converged OpenStack, OpenS...Muli Ben-Yehuda, Stratoscale - The Road to a Hyper-Converged OpenStack, OpenS...
Muli Ben-Yehuda, Stratoscale - The Road to a Hyper-Converged OpenStack, OpenS...
 
MetalK8s 2.x 'Moonshot' - LOADays 2019, Antwerp
MetalK8s 2.x 'Moonshot' - LOADays 2019, AntwerpMetalK8s 2.x 'Moonshot' - LOADays 2019, Antwerp
MetalK8s 2.x 'Moonshot' - LOADays 2019, Antwerp
 
Cf Summit East 2018 Scaling ColdFusion
Cf Summit East 2018 Scaling ColdFusionCf Summit East 2018 Scaling ColdFusion
Cf Summit East 2018 Scaling ColdFusion
 
Introduction to CloudStack Storage Subsystem
Introduction to CloudStack Storage SubsystemIntroduction to CloudStack Storage Subsystem
Introduction to CloudStack Storage Subsystem
 

Similar to The Flink - Apache Bigtop integration

Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021InfluxData
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaGrant Henke
 
A GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKSA GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKSWeaveworks
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...NETWAYS
 
Kubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slidesKubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slidesWeaveworks
 
Warsaw MuleSoft Meetup - Runtime Fabric
Warsaw MuleSoft Meetup - Runtime FabricWarsaw MuleSoft Meetup - Runtime Fabric
Warsaw MuleSoft Meetup - Runtime FabricPatryk Bandurski
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...GetInData
 
Intro to GitOps with Weave GitOps, Flagger and Linkerd
Intro to GitOps with Weave GitOps, Flagger and LinkerdIntro to GitOps with Weave GitOps, Flagger and Linkerd
Intro to GitOps with Weave GitOps, Flagger and LinkerdWeaveworks
 
Anypoint Tools and MuleSoft Automation (DRAFT).pptx
Anypoint Tools and MuleSoft Automation (DRAFT).pptxAnypoint Tools and MuleSoft Automation (DRAFT).pptx
Anypoint Tools and MuleSoft Automation (DRAFT).pptxAkshata Sawant
 
MuleSoft Meetup #9 - Anypoint Tools and MuleSoft Automation (FINAL).pptx
MuleSoft Meetup #9 - Anypoint Tools and MuleSoft Automation (FINAL).pptxMuleSoft Meetup #9 - Anypoint Tools and MuleSoft Automation (FINAL).pptx
MuleSoft Meetup #9 - Anypoint Tools and MuleSoft Automation (FINAL).pptxSteve Clarke
 
OpenStack + Cloud Foundry for the OpenStack Boston Meetup
OpenStack + Cloud Foundry for the OpenStack Boston MeetupOpenStack + Cloud Foundry for the OpenStack Boston Meetup
OpenStack + Cloud Foundry for the OpenStack Boston Meetupragss
 
3 Reasons to Select Time Series Platforms for Cloud Native Applications Monit...
3 Reasons to Select Time Series Platforms for Cloud Native Applications Monit...3 Reasons to Select Time Series Platforms for Cloud Native Applications Monit...
3 Reasons to Select Time Series Platforms for Cloud Native Applications Monit...DevOps.com
 
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius SchumacherOSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius SchumacherNETWAYS
 
Building managedprivatecloud kvh_vancouversummit
Building managedprivatecloud kvh_vancouversummitBuilding managedprivatecloud kvh_vancouversummit
Building managedprivatecloud kvh_vancouversummitmatsunota
 
Java mission control and java flight recorder
Java mission control and java flight recorderJava mission control and java flight recorder
Java mission control and java flight recorderWolfgang Weigend
 
Free GitOps Workshop
Free GitOps WorkshopFree GitOps Workshop
Free GitOps WorkshopWeaveworks
 

Similar to The Flink - Apache Bigtop integration (20)

Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache Kafka
 
A GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKSA GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKS
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
 
Kubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slidesKubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slides
 
OpenStack Murano
OpenStack MuranoOpenStack Murano
OpenStack Murano
 
Warsaw MuleSoft Meetup - Runtime Fabric
Warsaw MuleSoft Meetup - Runtime FabricWarsaw MuleSoft Meetup - Runtime Fabric
Warsaw MuleSoft Meetup - Runtime Fabric
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
 
Galera Cluster 4 for MySQL 8 Release Webinar slides
Galera Cluster 4 for MySQL 8 Release Webinar slidesGalera Cluster 4 for MySQL 8 Release Webinar slides
Galera Cluster 4 for MySQL 8 Release Webinar slides
 
Intro to GitOps with Weave GitOps, Flagger and Linkerd
Intro to GitOps with Weave GitOps, Flagger and LinkerdIntro to GitOps with Weave GitOps, Flagger and Linkerd
Intro to GitOps with Weave GitOps, Flagger and Linkerd
 
Anypoint Tools and MuleSoft Automation (DRAFT).pptx
Anypoint Tools and MuleSoft Automation (DRAFT).pptxAnypoint Tools and MuleSoft Automation (DRAFT).pptx
Anypoint Tools and MuleSoft Automation (DRAFT).pptx
 
MuleSoft Meetup #9 - Anypoint Tools and MuleSoft Automation (FINAL).pptx
MuleSoft Meetup #9 - Anypoint Tools and MuleSoft Automation (FINAL).pptxMuleSoft Meetup #9 - Anypoint Tools and MuleSoft Automation (FINAL).pptx
MuleSoft Meetup #9 - Anypoint Tools and MuleSoft Automation (FINAL).pptx
 
OpenStack + Cloud Foundry for the OpenStack Boston Meetup
OpenStack + Cloud Foundry for the OpenStack Boston MeetupOpenStack + Cloud Foundry for the OpenStack Boston Meetup
OpenStack + Cloud Foundry for the OpenStack Boston Meetup
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
3 Reasons to Select Time Series Platforms for Cloud Native Applications Monit...
3 Reasons to Select Time Series Platforms for Cloud Native Applications Monit...3 Reasons to Select Time Series Platforms for Cloud Native Applications Monit...
3 Reasons to Select Time Series Platforms for Cloud Native Applications Monit...
 
intro-kafka
intro-kafkaintro-kafka
intro-kafka
 
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius SchumacherOSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
 
Building managedprivatecloud kvh_vancouversummit
Building managedprivatecloud kvh_vancouversummitBuilding managedprivatecloud kvh_vancouversummit
Building managedprivatecloud kvh_vancouversummit
 
Java mission control and java flight recorder
Java mission control and java flight recorderJava mission control and java flight recorder
Java mission control and java flight recorder
 
Free GitOps Workshop
Free GitOps WorkshopFree GitOps Workshop
Free GitOps Workshop
 

Recently uploaded

complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdfCaalaaAbdulkerim
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...Amil Baba Dawood bangali
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm Systemirfanmechengr
 
The SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teamsThe SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teamsDILIPKUMARMONDAL6
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfRajuKanojiya4
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Vishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documentsVishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documentsSachinPawar510423
 

Recently uploaded (20)

complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdf
 
US Department of Education FAFSA Week of Action

The Flink - Apache Bigtop integration

  • 1. © Cloudera, Inc. All rights reserved. Marton Balassi | Solutions Architect | Flink PMC member | @MartonBalassi | mbalassi@cloudera.com | The Flink - Apache Bigtop integration
  • 2. Outline
      • Short introduction to Bigtop
      • An even shorter intro to Flink
      • From Flink source to Linux packages
      • Implementing BigPetStore
      • From Linux packages to Cloudera parcels
      • Summary
  • 3. Short introduction to Bigtop
  • 4. What is Bigtop? An Apache project for standardizing the testing, packaging and integration of leading big data components.
  • 5. Components as building blocks. And many more …
  • 6. Dependency hell: hdfs, zookeeper, hbase, kafka, spark, …, mapred, oozie, hive, etc. Build all the Things!!!
  • 7. Early value added
      • Bigtop has been around since the 0.20 days of Hadoop
      • Provides a common foundation for proper integration of a growing number of Hadoop family components
      • The foundation provides a solid base for validating applications running on top of the stack(s)
      • Neutral packaging and deployment/config
  • 8. Early mission accomplished
      • Foundation for commercial Hadoop distros/services
      • Leveraged by app providers …
  • 9. Adding more components …
  • 10. New focus and target groups
      • Going way beyond just building debs/rpms
      • Data engineers vs distro builders
      • Enhance Operations/Deployment
      • Reference implementations & tutorials
  • 11. An even shorter intro to Flink
  • 12. The Flink stack
      • Libraries
      • DataStream API: stream processing
      • DataSet API: batch processing
      • Runtime: distributed streaming data flow
      • Streaming and batch as first class citizens.
  • 13. Flink in the wild
      • 30 billion events daily
      • 2 billion events in 10 1Gb machines
      • Picked Flink for "Saiki" data integration & distribution platform
      • See talks by at
      • Runs their fork of Flink on 1000+ nodes
  • 14. From Flink source to Linux packages
  • 15. The Bigtop component build
      • Bigtop builds the component (potentially after patching it)
      • Breaks up the files in a Linux distro friendly way (/etc/flink/conf, …)
      • Adds users, groups and systemd services for the components
      • Sets up the paths and alternatives for convenient access
      • Builds the debs/rpms and takes care of the dependencies
      • http://jayunit100.blogspot.com/2014/04/how-bigtop-packages-hadoop.html
  • 16. Implementing BigPetStore
  • 17. BigPetStore Outline
      • BigPetStore model
      • Data generator with the DataSet API
      • ETL with the DataSet and Table APIs
      • Matrix factorization with FlinkML
      • Recommendation with the DataStream API
  • 18. BigPetStore
      • Blueprints for Big Data applications
      • Consists of:
          • Data Generators
          • Examples using tools in Big Data ecosystem to process data
          • Build system and tests for integrating tools and multiple JVM languages
      • Part of the Bigtop project
  • 19. BigPetStore model
      • Customers visiting pet stores generating transactions, location based
      • Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on, pp. 49-55, 3-5 Dec. 2014
  • 20. Data generation
      • Use RJ Nowling's Java generator classes
      • Write transactions to JSON
        val env = ExecutionEnvironment.getExecutionEnvironment
        val (stores, products, customers) = getData()
        val startTime = getCurrentMillis()
        val transactions = env.fromCollection(customers)
          .flatMap(new TransactionGenerator(products))
          .withBroadcastSet(stores, "stores")
          .map{t => t.setDateTime(t.getDateTime + startTime); t}
        transactions.writeAsText(output)
  • 21. ETL with the DataSet API
      • Read the dirty JSON
      • Output (customer, product) pairs for the recommender
        val env = ExecutionEnvironment.getExecutionEnvironment
        val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
        val productsWithIndex = transactions.flatMap(_.getProducts)
          .distinct
          .zipWithUniqueId
        val customerAndProductPairs = transactions
          .flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p)))
          .join(productsWithIndex).where(_._2).equalTo(_._2)
          .map(pair => (pair._1._1, pair._2._1))
          .distinct
        customerAndProductPairs.writeAsCsv(output)
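The ETL step above can be sketched without a cluster to show what it computes. The following is a minimal plain-Java illustration, not the talk's Flink job: `Transaction` is a hypothetical stand-in for `FlinkTransaction`, products are plain strings, and ids are assigned consecutively in first-seen order (Flink's `zipWithUniqueId` only promises unique ids, not consecutive ones).

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Plain-Java sketch of the ETL step; Transaction is a hypothetical stand-in type. */
final class CustomerProductPairs {

    static final class Transaction {
        final int customerId;
        final List<String> products;

        Transaction(int customerId, List<String> products) {
            this.customerId = customerId;
            this.products = products;
        }
    }

    /** Gives every distinct product a numeric id, in first-seen order. */
    static Map<String, Long> indexProducts(List<Transaction> transactions) {
        Map<String, Long> index = new LinkedHashMap<>();
        for (Transaction t : transactions) {
            for (String product : t.products) {
                index.putIfAbsent(product, (long) index.size());
            }
        }
        return index;
    }

    /** Emits each (customerId, productId) pair once, however often it was bought. */
    static Set<Map.Entry<Integer, Long>> pairs(List<Transaction> transactions) {
        Map<String, Long> index = indexProducts(transactions);
        Set<Map.Entry<Integer, Long>> result = new LinkedHashSet<>();
        for (Transaction t : transactions) {
            for (String product : t.products) {
                result.add(new SimpleImmutableEntry<Integer, Long>(t.customerId, index.get(product)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Transaction> transactions = Arrays.asList(
            new Transaction(1, Arrays.asList("dog food", "leash")),
            new Transaction(2, Arrays.asList("leash")),
            new Transaction(1, Arrays.asList("dog food")));
        for (Map.Entry<Integer, Long> pair : pairs(transactions)) {
            System.out.println(pair.getKey() + "," + pair.getValue());
        }
    }
}
```

The set plays the role of the final `.distinct` in the Flink version: repeat purchases by the same customer collapse into a single pair before the recommender sees them.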
  • 22. ETL with Table API
      • Read the dirty JSON
      • SQL style queries (SQL coming in Flink 1.1)
        val env = ExecutionEnvironment.getExecutionEnvironment
        val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
        val table = transactions.map(toCaseClass(_)).toTable
        val storeTransactionCount = table.groupBy('storeId)
          .select('storeId, 'storeName, 'storeId.count as 'count)
        val bestStores = table.groupBy('storeId)
          .select('storeId.max as 'max)
          .join(storeTransactionCount)
          .where("count = max")
          .select('storeId, 'storeName, 'storeId.count as 'count)
          .toDataSet[StoreCount]
  • 23. A little recommender theory
      • Diagram: User-Item matrix R, User factors P, Item factors Q, with User side information and Item side information (labels U, I)
      • R is potentially huge, approximate it with P∗Q
      • Prediction is TopK(user's row ∗ Q)
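The prediction rule TopK(user's row ∗ Q) can be made concrete with a small worked example. This is a minimal plain-Java sketch under stated assumptions: the class and method names are hypothetical, `q[i]` holds the factor vector of item i, and the factor values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: score every item for one user as (user's row of P) * Q and keep the K best. */
final class TopKRecommender {

    /** Dot product of a user's factor vector with one item's factor vector. */
    static double score(double[] userFactors, double[] itemFactors) {
        double sum = 0.0;
        for (int f = 0; f < userFactors.length; f++) {
            sum += userFactors[f] * itemFactors[f];
        }
        return sum;
    }

    /** Returns the indices of the k highest-scoring items, best first (ties: lower index first). */
    static List<Integer> topK(double[] userFactors, double[][] q, int k) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < q.length; i++) {
            items.add(i);
        }
        items.sort((a, b) -> {
            int byScore = Double.compare(score(userFactors, q[b]), score(userFactors, q[a]));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return new ArrayList<>(items.subList(0, Math.min(k, items.size())));
    }

    public static void main(String[] args) {
        double[] user = {1.0, 0.5};                                   // one row of P
        double[][] q = {{0.2, 0.1}, {0.9, 0.4}, {0.1, 0.8}, {0.5, 0.5}}; // item factor vectors
        System.out.println(topK(user, q, 2));
    }
}
```

With these made-up factors the item scores are 0.25, 1.1, 0.5 and 0.75, so the two recommended items are items 1 and 3. Sorting all items is fine for a sketch; a bounded heap avoids the full sort when the catalogue is large.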
  • 24. Matrix factorization with FlinkML
      • Read the (customer, product) pairs
      • Write P and Q to file
        val env = ExecutionEnvironment.getExecutionEnvironment
        val input = env.readCsvFile[(Int,Int)](inputFile)
          .map(pair => (pair._1, pair._2, 1.0))
        val model = ALS()
          .setNumFactors(numFactors)
          .setIterations(iterations)
          .setLambda(lambda)
        model.fit(input)
        val (p, q) = model.factorsOption.get
        p.writeAsText(pOut)
        q.writeAsText(qOut)
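The slide does not state what ALS optimizes. As a hedged reference point, the FlinkML documentation describes a regularized least-squares objective of roughly the following form. The symbols are assumptions mapped onto the slide's notation: p_u and q_i are the factor vectors of user u and item i, λ corresponds to setLambda, and n_u, n_i count the ratings of user u and of item i.

```latex
\min_{P,Q} \sum_{(u,i)\,:\,r_{ui} \neq 0} \left( r_{ui} - p_u^{\top} q_i \right)^2
  + \lambda \left( \sum_{u} n_u \lVert p_u \rVert^2 + \sum_{i} n_i \lVert q_i \rVert^2 \right)
```

Alternating least squares fixes Q and solves for P, then fixes P and solves for Q, repeating for the configured number of iterations. In the snippet above every observed (customer, product) pair is given the rating 1.0.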
  • 25. Recommendation with the DataStream API
      • Give the TopK recommendation for a user
      • (Could be optimized)
        StreamExecutionEnvironment env = StreamExecutionEnvironment
          .getExecutionEnvironment();
        env.socketTextStream("localhost", 9999)
          .map(new GetUserVector())
          .broadcast()
          .map(new PartialTopK())
          .keyBy(0)
          .flatMap(new GlobalTopK())
          .print();
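The slide does not show the bodies of PartialTopK and GlobalTopK. The sketch below is one plausible reading of the two-stage pattern, written in plain Java without Flink: each partition of the item factors keeps its own k best items, and a final step merges the partial lists. All class and method names here are hypothetical and are not taken from the talk's repository.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of a two-stage top-K: per-partition candidates first, then a global merge. */
final class DistributedTopK {

    /** An item id together with its predicted score for the user. */
    static final class Scored {
        final int item;
        final double score;

        Scored(int item, double score) {
            this.item = item;
            this.score = score;
        }
    }

    /** Partial step: keep only this partition's k best items, using a min-heap of size k. */
    static List<Scored> partialTopK(List<Scored> partition, int k) {
        PriorityQueue<Scored> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Scored s) -> s.score));
        for (Scored s : partition) {
            heap.add(s);
            if (heap.size() > k) {
                heap.poll(); // drop the current lowest score
            }
        }
        return new ArrayList<>(heap);
    }

    /** Global step: merge the partial results and return the k best item ids, best first. */
    static List<Integer> globalTopK(List<List<Scored>> partials, int k) {
        List<Scored> all = new ArrayList<>();
        for (List<Scored> partial : partials) {
            all.addAll(partial);
        }
        all.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        List<Integer> ids = new ArrayList<>();
        for (Scored s : all.subList(0, Math.min(k, all.size()))) {
            ids.add(s.item);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Scored> a = Arrays.asList(new Scored(0, 0.3), new Scored(1, 0.9), new Scored(2, 0.1));
        List<Scored> b = Arrays.asList(new Scored(3, 0.7), new Scored(4, 0.95), new Scored(5, 0.2));
        System.out.println(globalTopK(Arrays.asList(partialTopK(a, 2), partialTopK(b, 2)), 2));
    }
}
```

The merge is correct because any item in the global top k must also be in the top k of its own partition, so each partition only needs to forward k candidates.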
  • 26. From Linux packages to Cloudera parcels
  • 27. Why parcels?
      • We have Linux packages, why a new format?
      • Cloudera Manager needs to update parcels without root privileges
      • A big, single bundle for the whole ecosystem
      • Plays well with the CM services and monitoring
      • Package signing
      • https://github.com/cloudera/cm_ext
  • 28. Managing the Flink parcel from CM
  • 29. Next steps – Flink operations
      • Flink does not offer a HistoryServer yet
          • Running on YARN is inconvenient like this
          • Follow [FLINK-4136] for resolution
      • The stand-alone cluster mode runs multiple jobs in the same JVM
          • In practice users fire up clusters per job
          • Alibaba has a multitenant fork, aim is to contribute
          • https://www.youtube.com/watch?v=_Nw8NTdIq9A
  • 30. Next steps – CM services, monitoring
  • 31. Summary
  • 32. Summary
      • Flink is a dataflow engine with batch and streaming as first class citizens
      • Bigtop offers unified packaging, testing and integration
      • BigPetStore gives you a blueprint for a range of apps
      • It is straightforward to build a CM parcel based on Bigtop
  • 33. Big thanks to
      • Clouderans supporting the project: Sean Owen, Alexander Bartfeld, Justin Kestelyn
      • The BigPetStore folks: Suneel Marthi, Ronald J. Nowling, Jay Vyas
      • Bigtop people answering my silly questions: Konstantin Boudnik, Roman Shaposhnik, Nate D'Amico
      • Squirrels pushing the integration: Robert Metzger, Fabian Hueske
  • 34. Check out the code
      • github.com/mbalassi/bigpetstore-flink
      • github.com/mbalassi/flink-parcel
      • Feel free to give me feedback.
  • 35. Come to Flink Forward
  • 36. Thank you. @MartonBalassi | mbalassi@cloudera.com