The Flink - Apache Bigtop integration

Flink Bay Area meetup talk on July 19th 2016.


  1. The Flink - Apache Bigtop integration. Marton Balassi | Solutions Architect | Flink PMC member | @MartonBalassi | mbalassi@cloudera.com. © Cloudera, Inc. All rights reserved.
  2. Outline
     • Short introduction to Bigtop
     • An even shorter intro to Flink
     • From Flink source to Linux packages
     • Implementing BigPetStore
     • From Linux packages to Cloudera parcels
     • Summary
  3. Short introduction to Bigtop
  4. What is Bigtop? An Apache project for standardizing the testing, packaging and integration of leading big data components.
  5. Components as building blocks. And many more …
  6. Dependency hell: hdfs, zookeeper, hbase, kafka, spark, …, mapred, oozie, hive, etc. Build all the Things!!!
  7. Early value added
     • Bigtop has been around since the 0.20 days of Hadoop
     • Provides a common foundation for proper integration of a growing number of Hadoop family components
     • The foundation provides a solid base for validating applications running on top of the stack(s)
     • Neutral packaging and deployment/config
  8. Early mission accomplished
     • Foundation for commercial Hadoop distros/services
     • Leveraged by app providers …
  9. Adding more components …
  10. New focus and target groups
      • Going way beyond just building debs/rpms
      • Data engineers vs distro builders
      • Enhance Operations/Deployment
      • Reference implementations & tutorials
  11. An even shorter intro to Flink
  12. The Flink stack
      • DataStream API: Stream Processing
      • DataSet API: Batch Processing
      • Runtime: Distributed Streaming Data Flow
      • Libraries
      Streaming and batch as first class citizens.
  13. Flink in the wild
      • 30 billion events daily
      • 2 billion events in 10 1Gb machines
      • Picked Flink for "Saiki" data integration & distribution platform
      • See talks by at
      • Runs their fork of Flink on 1000+ nodes
  14. From Flink source to Linux packages
  15. The Bigtop component build
      • Bigtop builds the component (potentially after patching it)
      • Breaks up the files in a Linux distro friendly way (/etc/flink/conf, …)
      • Adds users, groups, systemd services for the components
      • Sets up the paths and alternatives for convenient access
      • Builds the debs/rpms, takes care of the dependencies
      http://jayunit100.blogspot.com/2014/04/how-bigtop-packages-hadoop.html
  16. Implementing BigPetStore
  17. BigPetStore Outline
      • BigPetStore model
      • Data generator with the DataSet API
      • ETL with the DataSet and Table APIs
      • Matrix factorization with FlinkML
      • Recommendation with the DataStream API
  18. BigPetStore
      • Blueprints for Big Data applications
      • Consists of:
        • Data Generators
        • Examples using tools in Big Data ecosystem to process data
        • Build system and tests for integrating tools and multiple JVM languages
      • Part of the Bigtop project
  19. BigPetStore model
      • Customers visiting pet stores generating transactions, location based
      Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on, pp. 49-55, 3-5 Dec. 2014
  20. Data generation
      • Use RJ Nowling’s Java generator classes
      • Write transactions to JSON

          import org.apache.flink.api.scala._

          val env = ExecutionEnvironment.getExecutionEnvironment
          val (stores, products, customers) = getData()
          val startTime = getCurrentMillis()
          // Generate transactions per customer; stores are shipped as a broadcast set
          val transactions = env.fromCollection(customers)
            .flatMap(new TransactionGenerator(products))
            .withBroadcastSet(stores, "stores")
            // Shift the generated timestamps so that they start at the current time
            .map { t => t.setDateTime(t.getDateTime + startTime); t }
          transactions.writeAsText(output)
  21. ETL with the DataSet API
      • Read the dirty JSON
      • Output (customer, product) pairs for the recommender

          import org.apache.flink.api.scala._
          import org.apache.flink.api.scala.utils._  // provides zipWithUniqueId

          val env = ExecutionEnvironment.getExecutionEnvironment
          val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
          // Assign a unique numeric id to every distinct product: (id, product)
          val productsWithIndex = transactions.flatMap(_.getProducts)
            .distinct
            .zipWithUniqueId
          // Join (customerId, product) with (id, product) on the product
          val customerAndProductPairs = transactions
            .flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p)))
            .join(productsWithIndex).where(_._2).equalTo(_._2)
            .map(pair => (pair._1._1, pair._2._1))
            .distinct
          customerAndProductPairs.writeAsCsv(output)
  22. ETL with Table API
      • Read the dirty JSON
      • SQL style queries (SQL coming in Flink 1.1)

          val env = ExecutionEnvironment.getExecutionEnvironment
          val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
          val table = transactions.map(toCaseClass(_)).toTable
          // Number of transactions per store
          val storeTransactionCount = table.groupBy('storeId)
            .select('storeId, 'storeName, 'storeId.count as 'count)
          // Keep the stores whose transaction count equals the maximum
          val bestStores = table.groupBy('storeId)
            .select('storeId.max as 'max)
            .join(storeTransactionCount)
            .where("count = max")
            .select('storeId, 'storeName, 'storeId.count as 'count)
            .toDataSet[StoreCount]
  23. A little recommender theory
      (Figure: user-item matrix R, user factors P, item factors Q, with side information for users (U) and items (I))
      • R is potentially huge, approximate it with P∗Q
      • Prediction is TopK(user’s row ∗ Q)
  24. Matrix factorization with FlinkML
      • Read the (customer, product) pairs
      • Write P and Q to file

          import org.apache.flink.api.scala._
          import org.apache.flink.ml.recommendation.ALS

          val env = ExecutionEnvironment.getExecutionEnvironment
          // Every observed (customer, product) pair gets an implicit rating of 1.0
          val input = env.readCsvFile[(Int, Int)](inputFile)
            .map(pair => (pair._1, pair._2, 1.0))
          val model = ALS()
            .setNumFactors(numFactors)
            .setIterations(iterations)
            .setLambda(lambda)
          model.fit(input)
          // The fitted model exposes the user and item factor matrices
          val (p, q) = model.factorsOption.get
          p.writeAsText(pOut)
          q.writeAsText(qOut)
  25. Recommendation with the DataStream API
      • Give the TopK recommendation for a user
      • (Could be optimized)

          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
          env.socketTextStream("localhost", 9999)
            .map(new GetUserVector())
            // Send every user vector to all parallel instances
            .broadcast()
            .map(new PartialTopK())
            .keyBy(0)
            .flatMap(new GlobalTopK())
            .print();
          // A streaming job only starts once execute() is called
          env.execute();
  26. From Linux packages to Cloudera parcels
  27. Why parcels?
      • We have Linux packages, why a new format?
      • Cloudera Manager needs to update parcel without root privileges
      • A big, single bundle for the whole ecosystem
      • Plays well with the CM services and monitoring
      • Package signing
      https://github.com/cloudera/cm_ext
  28. Managing the Flink parcel from CM
  29. Next steps – Flink operations
      • Flink does not offer a HistoryServer yet
        Running on YARN is inconvenient like this
        Follow [FLINK-4136] for resolution
      • The stand-alone cluster mode runs multiple jobs in the same JVM
        In practice users fire up clusters per job
        Alibaba has a multitenant fork, aim is to contribute
      https://www.youtube.com/watch?v=_Nw8NTdIq9A
  30. Next steps – CM services, monitoring
  31. Summary
  32. Summary
      • Flink is a dataflow engine with batch and streaming as first class citizens
      • Bigtop offers unified packaging, testing and integration
      • BigPetStore gives you a blueprint for a range of apps
      • It is straightforward to build a CM parcel based on Bigtop
  33. Big thanks to
      • Clouderans supporting the project: Sean Owen, Alexander Bartfeld, Justin Kestelyn
      • The BigPetStore folks: Suneel Marthi, Ronald J. Nowling, Jay Vyas
      • Bigtop people answering my silly questions: Konstantin Boudnik, Roman Shaposhnik, Nate D'Amico
      • Squirrels pushing the integration: Robert Metzger, Fabian Hueske
  34. Check out the code
      github.com/mbalassi/bigpetstore-flink
      github.com/mbalassi/flink-parcel
      Feel free to give me feedback.
  35. Come to Flink Forward
  36. Thank you
      @MartonBalassi
      mbalassi@cloudera.com
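
The prediction rule from the recommender theory slide, TopK(user’s row ∗ Q), can be illustrated with a minimal sketch in plain Java that has no Flink dependency. The class name TopKSketch, its method names and the sample factor values are assumptions made up for this illustration; they are not taken from the bigpetstore-flink code, and Q is assumed to be laid out as numFactors x numItems.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: scores items as (user's row of P) * Q and keeps the K best. */
final class TopKSketch {

    /** Multiplies the user's factor row with Q (numFactors x numItems), giving one score per item. */
    static double[] scores(double[] userRow, double[][] q) {
        int numItems = q[0].length;
        double[] result = new double[numItems];
        for (int f = 0; f < userRow.length; f++) {
            for (int i = 0; i < numItems; i++) {
                result[i] += userRow[f] * q[f][i];
            }
        }
        return result;
    }

    /** Returns the indices of the k highest scores, best first; ties go to the lower index. */
    static List<Integer> topK(double[] scores, int k) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            items.add(i);
        }
        // Sort by descending score, then by ascending item index
        items.sort(Comparator.comparingDouble((Integer i) -> -scores[i])
                .thenComparingInt(i -> i));
        return new ArrayList<>(items.subList(0, Math.min(k, items.size())));
    }

    public static void main(String[] args) {
        // Hypothetical factors: 2 latent factors, 3 items
        double[] userRow = {1.0, 0.5};
        double[][] q = {
            {0.2, 0.9, 0.4},
            {0.8, 0.1, 0.6}
        };
        System.out.println(topK(scores(userRow, q), 2));
    }
}
```

Because the global top K is always contained in the union of the top K of each partition, the same scoring can be run on slices of Q in parallel and the partial results merged afterwards, which is the shape of the PartialTopK and GlobalTopK steps on the DataStream slide.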