The Flink - Apache Bigtop integration

© Cloudera, Inc. All rights reserved.
Marton Balassi | Solutions Architect
Flink PMC member
@MartonBalassi | mbalassi@cloudera.com

Outline
• Short introduction to Bigtop
• An even shorter intro to Flink
• From Flink source to Linux packages
• Implementing BigPetStore
• From Linux packages to Cloudera parcels
• Summary

Short introduction to Bigtop

What is Bigtop?
Apache project for standardizing testing, packaging and integration of
leading big data components.

Components as building blocks
And many more …

Dependency hell
[Figure: dependency graph between hdfs, zookeeper, hbase, kafka, spark, …, mapred, oozie, hive, etc.]
Build all the Things!!!

Early value added
• Bigtop has been around since the 0.20 days of Hadoop
• Provides a common foundation for properly integrating a growing number of Hadoop family components
• The foundation provides a solid base for validating applications running on top of the stack(s)
• Neutral packaging and deployment/config

Early mission accomplished
• Foundation for commercial Hadoop distros/services
• Leveraged by app providers
…

Adding more components
…

New focus and target groups
• Going way beyond just building debs/rpms
• Data engineers vs distro builders
• Enhance Operations/Deployment
• Reference implementations & tutorials

An even shorter intro to Flink

The Flink stack
• Libraries
• DataStream API (Stream Processing) and DataSet API (Batch Processing)
• Runtime: Distributed Streaming Data Flow
Streaming and batch as first class citizens.

Flink in the wild
• 30 billion events daily
• 2 billion events in 10 1Gb machines
• Picked Flink for "Saiki" data integration & distribution platform
• See talks by … at …
• Runs their fork of Flink on 1000+ nodes

From Flink source to Linux packages

The Bigtop component build
• Bigtop builds the component (potentially after patching it)
• Splits the files into a Linux distro friendly layout (/etc/flink/conf, …)
• Adds users, groups, systemd services for the components
• Sets up the paths and alternatives for convenient access
• Builds the debs/rpms and takes care of the dependencies
http://jayunit100.blogspot.com/2014/04/how-bigtop-packages-hadoop.html

Implementing BigPetStore

BigPetStore Outline
• BigPetStore model
• Data generator with the DataSet API
• ETL with the DataSet and Table APIs
• Matrix factorization with FlinkML
• Recommendation with the DataStream API

BigPetStore
• Blueprints for Big Data applications
• Consists of:
  • Data Generators
  • Examples using tools in Big Data ecosystem to process data
  • Build system and tests for integrating tools and multiple JVM languages
• Part of the Bigtop project

BigPetStore model
• Customers visit pet stores and generate transactions; the model is location based
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on, pp. 49-55, 3-5 Dec. 2014

Data generation
• Use RJ Nowling’s Java generator classes
• Write transactions to JSON
val env = ExecutionEnvironment.getExecutionEnvironment
val (stores, products, customers) = getData()
val startTime = getCurrentMillis()
// Generate the transactions of every customer in parallel
val transactions = env.fromCollection(customers)
  .flatMap(new TransactionGenerator(products))
  .withBroadcastSet(stores, "stores")
  // Shift the generated timestamps to start at the current time
  .map { t => t.setDateTime(t.getDateTime + startTime); t }
transactions.writeAsText(output)

ETL with the DataSet API
• Read the dirty JSON
• Output (customer, product) pairs for the recommender
// Parse each JSON line into a transaction
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
// Give every distinct product a unique id: (id, product)
val productsWithIndex = transactions.flatMap(_.getProducts)
  .distinct
  .zipWithUniqueId
// Join the purchases with the product ids to get (customerId, productId)
val customerAndProductPairs = transactions
  .flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p)))
  .join(productsWithIndex).where(_._2).equalTo(_._2)
  .map(pair => (pair._1._1, pair._2._1))
  .distinct
customerAndProductPairs.writeAsCsv(output)
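
The join in this ETL step can be traced on ordinary Scala collections. The following is only a sketch of the same logic without Flink: `zipWithIndex` stands in for `zipWithUniqueId`, the object name `EtlJoinSketch` is hypothetical, and the sample purchases are made up for illustration.

```scala
object EtlJoinSketch {
  // Hypothetical sample input: (customerId, productName) purchases, one duplicated.
  val purchases: Seq[(Int, String)] =
    Seq((1, "leash"), (1, "kibble"), (2, "kibble"), (2, "kibble"))

  // Assign an index to every distinct product, as (index, product) pairs.
  val productsWithIndex: Seq[(Long, String)] =
    purchases.map(_._2).distinct.zipWithIndex.map { case (p, i) => (i.toLong, p) }

  // Join on the product name and keep deduplicated (customerId, productIndex) pairs.
  val customerAndProductPairs: Seq[(Int, Long)] = {
    val indexOf = productsWithIndex.map { case (i, p) => (p, i) }.toMap
    purchases.map { case (c, p) => (c, indexOf(p)) }.distinct
  }
}
```

The result contains only integer ids on both sides, which is the shape the recommender input expects.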

ETL with Table API
• Read the dirty JSON
• SQL style queries (SQL coming in Flink 1.1)
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
val table = transactions.map(toCaseClass(_)).toTable
val storeTransactionCount = table.groupBy('storeId)
  .select('storeId, 'storeName, 'storeId.count as 'count)
val bestStores = table.groupBy('storeId)
  .select('storeId.max as 'max)
  .join(storeTransactionCount)
  .where("count = max")
  .select('storeId, 'storeName, 'storeId.count as 'count)
  .toDataSet[StoreCount]

A little recommender theory
[Figure: user-item matrix R (users U, items I), user factors P, item factors Q, user side information, item side information]
• R is potentially huge, approximate it with P∗Q
• Prediction is TopK(user’s row ∗ Q)
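
The prediction rule can be written down in a few lines of plain Scala. This is a minimal sketch, assuming dense factor vectors held in memory; the names `TopKSketch`, `dot` and `topK` are hypothetical and not part of Flink or BigPetStore.

```scala
object TopKSketch {
  type Vec = Array[Double]

  // Dot product of two factor vectors of equal length.
  def dot(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Score every item against one user's row of P and return the ids of the k best items.
  def topK(userFactors: Vec, itemFactors: Map[Int, Vec], k: Int): Seq[Int] =
    itemFactors.toSeq
      .map { case (itemId, q) => (itemId, dot(userFactors, q)) }
      .sortBy { case (_, score) => -score }
      .take(k)
      .map(_._1)
}
```

Only the user's row of P and the item factors Q are needed at prediction time, so the full matrix R never has to be materialized.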

Matrix factorization with FlinkML
• Read the (customer, product) pairs
• Write P and Q to file
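import org.apache.flink.api.scala._
import org.apache.flink.ml.recommendation.ALS

// Every observed (customer, product) pair gets a rating of 1.0
val input = env.readCsvFile[(Int, Int)](inputFile)
  .map(pair => (pair._1, pair._2, 1.0))
val model = ALS()
  .setNumFactors(numFactors)
  .setIterations(iterations)
  .setLambda(lambda)
model.fit(input)
// P holds the user factors, Q the item factors
val (p, q) = model.factorsOption.get
p.writeAsText(pOut)
q.writeAsText(qOut)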

Recommendation with the DataStream API
• Give the TopK recommendation for a user
• (Could be optimized)
StreamExecutionEnvironment env = StreamExecutionEnvironment
    .getExecutionEnvironment();
env.socketTextStream("localhost", 9999)
    .map(new GetUserVector())
    .broadcast()
    .map(new PartialTopK())
    .keyBy(0)
    .flatMap(new GlobalTopK())
    .print();
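
The two-stage shape of this pipeline (a partial TopK per parallel instance, then a global merge) can be sketched in plain Scala without Flink. The function names mirror the operators on the slide, but their bodies here are assumptions for illustration, not the code from the repository.

```scala
object PartialTopKSketch {
  type Scored = (Int, Double) // (itemId, score)

  // Each parallel instance ranks only the items in its own shard of Q.
  def partialTopK(shard: Seq[Scored], k: Int): Seq[Scored] =
    shard.sortBy(-_._2).take(k)

  // The merge step sees at most k candidates per shard and picks the overall best k.
  def globalTopK(partials: Seq[Seq[Scored]], k: Int): Seq[Scored] =
    partials.flatten.sortBy(-_._2).take(k)
}
```

Because every shard already drops everything outside its own top k, the merge step handles at most k times the number of shards candidates, regardless of how many items exist.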

From Linux packages to Cloudera parcels

Why parcels?
• We have Linux packages, why a new format?
• Cloudera Manager needs to update parcels without root privileges
• A big, single bundle for the whole ecosystem
• Plays well with the CM services and monitoring
• Package signing
https://github.com/cloudera/cm_ext

Managing the Flink parcel from CM

Next steps – Flink operations
• Flink does not offer a HistoryServer yet
    This makes running on YARN inconvenient
    Follow [FLINK-4136] for resolution
• The stand-alone cluster mode runs multiple jobs in the same JVM
    In practice users fire up clusters per job
    Alibaba has a multitenant fork and aims to contribute it
    https://www.youtube.com/watch?v=_Nw8NTdIq9A

Next steps – CM services, monitoring

Summary
• Flink is a dataflow engine with batch and streaming as first class citizens
• Bigtop offers unified packaging, testing and integration
• BigPetStore gives you a blueprint for a range of apps
• It is straightforward to build a CM parcel based on Bigtop

Big thanks to
• Clouderans supporting the project:
    Sean Owen
    Alexander Bartfeld
    Justin Kestelyn
• The BigPetStore folks:
    Suneel Marthi
    Ronald J. Nowling
    Jay Vyas
• Bigtop people answering my silly questions:
    Konstantin Boudnik
    Roman Shaposhnik
    Nate D'Amico
• Squirrels pushing the integration:
    Robert Metzger
    Fabian Hueske

Check out the code
github.com/mbalassi/bigpetstore-flink
github.com/mbalassi/flink-parcel
Feel free to give me feedback.

Come to Flink Forward

Thank you
@MartonBalassi
mbalassi@cloudera.com
