1© Cloudera, Inc. All rights reserved.
Marton Balassi | Solutions Architect
Flink PMC member
@MartonBalassi | mbalassi@cloudera.com
The Flink - Apache Bigtop integration
2© Cloudera, Inc. All rights reserved.
Outline
• Short introduction to Bigtop
• An even shorter intro to Flink
• From Flink source to linux packages
• Implementing BigPetStore
• From linux packages to Cloudera parcels
• Summary
3© Cloudera, Inc. All rights reserved.
Short introduction to Bigtop
4© Cloudera, Inc. All rights reserved.
What is Bigtop?
Apache project for standardizing testing, packaging and integration of
leading big data components.
5© Cloudera, Inc. All rights reserved.
Components as building blocks
And many more …
6© Cloudera, Inc. All rights reserved.
Dependency hell
---------------------------------------------------------------
----------hdfs
zookeeper
hbase
kafka
spark
.
.
.
mapred
oozie
hive
etc
---------------------------------------------
-------------
---------------------------------------------
-------------
---------------------------------------------
-------------
---------------------------------------------
-------------
---------------------------------------------
-------------
---------------------------------------------
-------------
Build all the
Things!!!
7© Cloudera, Inc. All rights reserved.
Early value added
• Bigtop has been around since the 0.20 days of Hadoop
• Provide a common foundation for proper integration of growing number of
Hadoop family components
• Foundation provides solid base for validating applications running on top of the
stack(s)
• Neutral packaging and deployment/config
8© Cloudera, Inc. All rights reserved.
Early mission accomplished
• Foundation for commercial Hadoop distros/services
• Leveraged by app providers
…
9© Cloudera, Inc. All rights reserved.
Adding more components
…
10© Cloudera, Inc. All rights reserved.
New focus and target groups
• Going way beyond just building debs/rpms
• Data engineers vs distro builders
• Enhance Operations/Deployment
• Reference implementations & tutorials
11© Cloudera, Inc. All rights reserved.
An even shorter intro to Flink
12© Cloudera, Inc. All rights reserved.
The Flink stack
DataStream API
Stream Processing
DataSet API
Batch Processing
Runtime
Distributed Streaming Data Flow
Libraries
Streaming and batch as first class citizens.
13© Cloudera, Inc. All rights reserved.
Flink in the wild
30 billion events daily 2 billion events in
10 1Gb machines
Picked Flink for "Saiki"
data integration &
distribution platform
See talks by at
Runs their fork of Flink on
1000+ nodes
14© Cloudera, Inc. All rights reserved.
From Flink source
to linux packages
15© Cloudera, Inc. All rights reserved.
The Bigtop component build
• Bigtop builds the component (potentially after patching it)
• Breaks up the files to linux distro friendly way (/etc/flink/conf, …)
• Adds users, groups, systemd services for the components
• Sets up the paths and alternatives for convenient access
• Builds the debs/rpm, takes care of the dependencies
http://jayunit100.blogspot.com/2014/04/how-bigtop-packages-hadoop.html
16© Cloudera, Inc. All rights reserved.
Implementing BigPetStore
17© Cloudera, Inc. All rights reserved.
BigPetStore Outline
• BigPetStore model
• Data generator with the DataSet API
• ETL with the DataSet and Table APIs
• Matrix factorization with FlinkML
• Recommendation with the DataStream API
18© Cloudera, Inc. All rights reserved.
BigPetStore
• Blueprints for Big Data
applications
• Consists of:
• Data Generators
• Examples using tools in Big Data ecosystem
to process data
• Build system and tests for integrating tools
and multiple JVM languages
• Part of the Bigtop project
19© Cloudera, Inc. All rights reserved.
BigPetStore model
• Customers visiting pet stores generating transactions, location based
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth
International Conference on , vol., no., pp.49-55, 3-5 Dec. 2014
20© Cloudera, Inc. All rights reserved.
Data generation
• Use RJ Nowling’s Java generator classes
• Write transactions to JSON
val env = ExecutionEnvironment.getExecutionEnvironment
val (stores, products, customers) = getData()
val startTime = getCurrentMillis()
val transactions = env.fromCollection(customers)
.flatMap(new TransactionGenerator(products))
.withBroadcastSet(stores, ”stores”)
.map{t => t.setDateTime(t.getDateTime + startTime); t}
transactions.writeAsText(output)
21© Cloudera, Inc. All rights reserved.
ETL with the DataSet API
• Read the dirty JSON
• Output (customer, product) pairs for the recommender
val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
val productsWithIndex = transactions.flatMap(_.getProducts)
.distinct
.zipWithUniqueId
val customerAndProductPairs = transactions
.flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p)))
.join(productsWithIndex).where(_._2).equalTo(_._2)
.map(pair => (pair._1._1, pair._2._1))
.distinct
customerAndProductPairs.writeAsCsv(output)
22© Cloudera, Inc. All rights reserved.
ETL with Table API
• Read the dirty JSON
• SQL style queries (SQL coming in Flink 1.1)
val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
val table = transactions.map(toCaseClass(_)).toTable
val storeTransactionCount = table.groupBy('storeId)
.select('storeId, 'storeName, 'storeId.count as 'count)
val bestStores = table.groupBy('storeId)
.select('storeId.max as 'max)
.join(storeTransactionCount)
.where(”count = max”)
.select('storeId, 'storeName, 'storeId.count as 'count)
.toDataSet[StoreCount]
23© Cloudera, Inc. All rights reserved.
A little recommender theory
Item
factors
User side
information User-Item matrixUser factors
Item side
information
U
I
P
Q
R
• R is potentially huge, approximate it with P∗Q
• Prediction is TopK(user’s row ∗ Q)
24© Cloudera, Inc. All rights reserved.
• Read the (customer, product) pairs
• Write P and Q to file
Matrix factorization with FlinkML
val env = ExecutionEnvironment.getExecutionEnvironment
val input = env.readCsvFile[(Int,Int)](inputFile)
.map(pair => (pair._1, pair._2, 1.0))
val model = ALS()
.setNumfactors(numFactors)
.setIterations(iterations)
.setLambda(lambda)
model.fit(input)
val (p, q) = model.factorsOption.get
p.writeAsText(pOut)
q.writeAsText(qOut)
25© Cloudera, Inc. All rights reserved.
Recommendation with the DataStream API
• Give the TopK recommendation for a user
• (Could be optimized)
StreamExecutionEnvironment env = StreamExecutionEnvironment
.getExecutionEnvironment();
env.socketTextStream(”localhost”, 9999)
.map(new GetUserVector())
.broadcast()
.map(new PartialTopK())
.keyBy(0)
.flatMap(new GlobalTopK())
.print();
26© Cloudera, Inc. All rights reserved.
From linux packages
to Cloudera parcels
27© Cloudera, Inc. All rights reserved.
Why parcels?
• We have linux packages, why a new format?
• Cloudera Manager needs to update parcel without root privileges
• A big, single bundle for the whole ecosystem
• Plays well with the CM services and monitoring
• Package signing
https://github.com/cloudera/cm_ext
28© Cloudera, Inc. All rights reserved.
Managing the Flink parcel from CM
29© Cloudera, Inc. All rights reserved.
Next steps – Flink operations
• Flink does not offer a HistoryServer yet
Running on YARN is inconvenient like this
Follow [FLINK-4136] for resulotion
• The stand-alone cluster mode runs multiple jobs in the JVM
In practice users fire up clusters per job
Alibaba has a multitenant fork, aim is to contribute
https://www.youtube.com/watch?v=_Nw8NTdIq9A
30© Cloudera, Inc. All rights reserved.
Next steps – CM services, monitoring
31© Cloudera, Inc. All rights reserved.
Summary
32© Cloudera, Inc. All rights reserved.
Summary
• Flink is a dataflow engine with batch and streaming as first class citizens
• Bigtop offers unified packaging, testing and integration
• BigPetStore gives you a blueprint for a range of apps
• It is straight-forward to CM Parcel based on Bigtop
33© Cloudera, Inc. All rights reserved.
Big thanks to
• Clouderans supporting the project:
Sean Owen
Alexander Bartfeld
Justin Kestelyn
• The BigPetStore folks:
Suneel Marthi
Ronald J. Nowling
Jay Vyas
• Bigtop people answering my silly
questions:
Konstantin Boudnik
Roman Shaposhnik
Nate D'Amico
• Squirrels pushing the integration:
Robert Metzger
Fabian Hueske
34© Cloudera, Inc. All rights reserved.
Check out the code
github.com/mbalassi/bigpetstore-flink
github.com/mbalassi/flink-parcel
Feel free to give me feedback.
35© Cloudera, Inc. All rights reserved.
Come to Flink Forward
36© Cloudera, Inc. All rights reserved.
Thank you
@MartonBalassi
mbalassi@cloudera.com

The Flink - Apache Bigtop integration

  • 1.
    1© Cloudera, Inc.All rights reserved. Marton Balassi | Solutions Architect Flink PMC member @MartonBalassi | mbalassi@cloudera.com The Flink - Apache Bigtop integration
  • 2.
    2© Cloudera, Inc.All rights reserved. Outline • Short introduction to Bigtop • An even shorter intro to Flink • From Flink source to linux packages • Implementing BigPetStore • From linux packages to Cloudera parcels • Summary
  • 3.
    3© Cloudera, Inc.All rights reserved. Short introduction to Bigtop
  • 4.
    4© Cloudera, Inc.All rights reserved. What is Bigtop? Apache project for standardizing testing, packaging and integration of leading big data components.
  • 5.
    5© Cloudera, Inc.All rights reserved. Components as building blocks And many more …
  • 6.
    6© Cloudera, Inc.All rights reserved. Dependency hell --------------------------------------------------------------- ----------hdfs zookeeper hbase kafka spark . . . mapred oozie hive etc --------------------------------------------- ------------- --------------------------------------------- ------------- --------------------------------------------- ------------- --------------------------------------------- ------------- --------------------------------------------- ------------- --------------------------------------------- ------------- Build all the Things!!!
  • 7.
    7© Cloudera, Inc.All rights reserved. Early value added • Bigtop has been around since the 0.20 days of Hadoop • Provide a common foundation for proper integration of growing number of Hadoop family components • Foundation provides solid base for validating applications running on top of the stack(s) • Neutral packaging and deployment/config
  • 8.
    8© Cloudera, Inc.All rights reserved. Early mission accomplished • Foundation for commercial Hadoop distros/services • Leveraged by app providers …
  • 9.
    9© Cloudera, Inc.All rights reserved. Adding more components …
  • 10.
    10© Cloudera, Inc.All rights reserved. New focus and target groups • Going way beyond just building debs/rpms • Data engineers vs distro builders • Enhance Operations/Deployment • Reference implementations & tutorials
  • 11.
    11© Cloudera, Inc.All rights reserved. An even shorter intro to Flink
  • 12.
    12© Cloudera, Inc.All rights reserved. The Flink stack DataStream API Stream Processing DataSet API Batch Processing Runtime Distributed Streaming Data Flow Libraries Streaming and batch as first class citizens.
  • 13.
    13© Cloudera, Inc.All rights reserved. Flink in the wild 30 billion events daily 2 billion events in 10 1Gb machines Picked Flink for "Saiki" data integration & distribution platform See talks by at Runs their fork of Flink on 1000+ nodes
  • 14.
    14© Cloudera, Inc.All rights reserved. From Flink source to linux packages
  • 15.
    15© Cloudera, Inc.All rights reserved. The Bigtop component build • Bigtop builds the component (potentially after patching it) • Breaks up the files to linux distro friendly way (/etc/flink/conf, …) • Adds users, groups, systemd services for the components • Sets up the paths and alternatives for convenient access • Builds the debs/rpm, takes care of the dependencies http://jayunit100.blogspot.com/2014/04/how-bigtop-packages-hadoop.html
  • 16.
    16© Cloudera, Inc.All rights reserved. Implementing BigPetStore
  • 17.
    17© Cloudera, Inc.All rights reserved. BigPetStore Outline • BigPetStore model • Data generator with the DataSet API • ETL with the DataSet and Table APIs • Matrix factorization with FlinkML • Recommendation with the DataStream API
  • 18.
    18© Cloudera, Inc.All rights reserved. BigPetStore • Blueprints for Big Data applications • Consists of: • Data Generators • Examples using tools in Big Data ecosystem to process data • Build system and tests for integrating tools and multiple JVM languages • Part of the Bigtop project
  • 19.
    19© Cloudera, Inc.All rights reserved. BigPetStore model • Customers visiting pet stores generating transactions, location based Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on , vol., no., pp.49-55, 3-5 Dec. 2014
  • 20.
    20© Cloudera, Inc.All rights reserved. Data generation • Use RJ Nowling’s Java generator classes • Write transactions to JSON val env = ExecutionEnvironment.getExecutionEnvironment val (stores, products, customers) = getData() val startTime = getCurrentMillis() val transactions = env.fromCollection(customers) .flatMap(new TransactionGenerator(products)) .withBroadcastSet(stores, ”stores”) .map{t => t.setDateTime(t.getDateTime + startTime); t} transactions.writeAsText(output)
  • 21.
    21© Cloudera, Inc.All rights reserved. ETL with the DataSet API • Read the dirty JSON • Output (customer, product) pairs for the recommender val env = ExecutionEnvironment.getExecutionEnvironment val transactions = env.readTextFile(json).map(new FlinkTransaction(_)) val productsWithIndex = transactions.flatMap(_.getProducts) .distinct .zipWithUniqueId val customerAndProductPairs = transactions .flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p))) .join(productsWithIndex).where(_._2).equalTo(_._2) .map(pair => (pair._1._1, pair._2._1)) .distinct customerAndProductPairs.writeAsCsv(output)
  • 22.
    22© Cloudera, Inc.All rights reserved. ETL with Table API • Read the dirty JSON • SQL style queries (SQL coming in Flink 1.1) val env = ExecutionEnvironment.getExecutionEnvironment val transactions = env.readTextFile(json).map(new FlinkTransaction(_)) val table = transactions.map(toCaseClass(_)).toTable val storeTransactionCount = table.groupBy('storeId) .select('storeId, 'storeName, 'storeId.count as 'count) val bestStores = table.groupBy('storeId) .select('storeId.max as 'max) .join(storeTransactionCount) .where(”count = max”) .select('storeId, 'storeName, 'storeId.count as 'count) .toDataSet[StoreCount]
  • 23.
    23© Cloudera, Inc.All rights reserved. A little recommender theory Item factors User side information User-Item matrixUser factors Item side information U I P Q R • R is potentially huge, approximate it with P∗Q • Prediction is TopK(user’s row ∗ Q)
  • 24.
    24© Cloudera, Inc.All rights reserved. • Read the (customer, product) pairs • Write P and Q to file Matrix factorization with FlinkML val env = ExecutionEnvironment.getExecutionEnvironment val input = env.readCsvFile[(Int,Int)](inputFile) .map(pair => (pair._1, pair._2, 1.0)) val model = ALS() .setNumfactors(numFactors) .setIterations(iterations) .setLambda(lambda) model.fit(input) val (p, q) = model.factorsOption.get p.writeAsText(pOut) q.writeAsText(qOut)
  • 25.
    25© Cloudera, Inc.All rights reserved. Recommendation with the DataStream API • Give the TopK recommendation for a user • (Could be optimized) StreamExecutionEnvironment env = StreamExecutionEnvironment .getExecutionEnvironment(); env.socketTextStream(”localhost”, 9999) .map(new GetUserVector()) .broadcast() .map(new PartialTopK()) .keyBy(0) .flatMap(new GlobalTopK()) .print();
  • 26.
    26© Cloudera, Inc.All rights reserved. From linux packages to Cloudera parcels
  • 27.
    27© Cloudera, Inc.All rights reserved. Why parcels? • We have linux packages, why a new format? • Cloudera Manager needs to update parcel without root privileges • A big, single bundle for the whole ecosystem • Plays well with the CM services and monitoring • Package signing https://github.com/cloudera/cm_ext
  • 28.
    28© Cloudera, Inc.All rights reserved. Managing the Flink parcel from CM
  • 29.
    29© Cloudera, Inc.All rights reserved. Next steps – Flink operations • Flink does not offer a HistoryServer yet Running on YARN is inconvenient like this Follow [FLINK-4136] for resulotion • The stand-alone cluster mode runs multiple jobs in the JVM In practice users fire up clusters per job Alibaba has a multitenant fork, aim is to contribute https://www.youtube.com/watch?v=_Nw8NTdIq9A
  • 30.
    30© Cloudera, Inc.All rights reserved. Next steps – CM services, monitoring
  • 31.
    31© Cloudera, Inc.All rights reserved. Summary
  • 32.
    32© Cloudera, Inc.All rights reserved. Summary • Flink is a dataflow engine with batch and streaming as first class citizens • Bigtop offers unified packaging, testing and integration • BigPetStore gives you a blueprint for a range of apps • It is straight-forward to CM Parcel based on Bigtop
  • 33.
    33© Cloudera, Inc.All rights reserved. Big thanks to • Clouderans supporting the project: Sean Owen Alexander Bartfeld Justin Kestelyn • The BigPetStore folks: Suneel Marthi Ronald J. Nowling Jay Vyas • Bigtop people answering my silly questions: Konstantin Boudnik Roman Shaposhnik Nate D'Amico • Squirrels pushing the integration: Robert Metzger Fabian Hueske
  • 34.
    34© Cloudera, Inc.All rights reserved. Check out the code github.com/mbalassi/bigpetstore-flink github.com/mbalassi/flink-parcel Feel free to give me feedback.
  • 35.
    35© Cloudera, Inc.All rights reserved. Come to Flink Forward
  • 36.
    36© Cloudera, Inc.All rights reserved. Thank you @MartonBalassi mbalassi@cloudera.com