Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Implementing BigPetStore
A blueprint for Flink users
Márton Balassi
mbalassi@apache.org / @MartonBalassi
Hungarian Academy...
Outline
• BigPetStore model
• Data generator with the DataSet API
• ETL with the DataSet & Table API
• Matrix factorizatio...
BigPetStore
Blueprints for Big Data
applications
Consists of:
• Data Generators
• Examples using tools in Big Data
ecosyst...
BigPetStore model
• Customers visiting pet stores generating
transactions, location based
Nowling, R.J.; Vyas, J., "A Doma...
Data generation
val env = ExecutionEnvironment.getExecutionEnvironment
val (stores, products, customers) = getData()
val s...
ETL with the DataSet API
val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json)....
ETL with the Table API
val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).ma...
A little Recommeder theory
Item
factors
User side
information
User-Item matrixUser factors
Item side
information
U
I
P
Q
R...
Matrix factorization with FlinkML
val env = ExecutionEnvironment.getExecutionEnvironment
val input = env.readCsvFile[(Int,...
Recommendation with the DataStream API
StreamExecutionEnvironment env = StreamExecutionEnvironment
.getExecutionEnvironmen...
Summary
• Go beyond WordCount with BigPetStore
• Feel free to mix the DataSet, DataStream, FlinkML,
Table APIs in your Fli...
Big thanks to
• The BigPetStore folks:
Suneel Marthi
Ronald J. Nowling
Jay Vyas
• Squirrels helping with the code:
Gyula F...
Check out the code
https://github.com/mbalassi/bigpetstore-flink
Márton Balassi
mbalassi@apache.org / @MartonBalassi
Hunga...
Upcoming SlideShare
Loading in …5
×

Implementing BigPetStore with Apache Flink

596 views

Published on

Talk at the Berlin Flink meetup group

Published in: Engineering
  • Be the first to comment

Implementing BigPetStore with Apache Flink

  1. 1. Implementing BigPetStore A blueprint for Flink users Márton Balassi mbalassi@apache.org / @MartonBalassi Hungarian Academy of Sciences
  2. 2. Outline • BigPetStore model • Data generator with the DataSet API • ETL with the DataSet & Table API • Matrix factorization with FlinkML • Recommendation with the DataStream API • Summary
  3. 3. BigPetStore Blueprints for Big Data applications Consists of: • Data Generators • Examples using tools in Big Data ecosystem to process data • Build system and tests for integrating tools and multiple JVM languages Part of the Apache BigTop project
  4. 4. BigPetStore model • Customers visiting pet stores generating transactions, location based Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on , vol., no., pp.49-55, 3-5 Dec. 2014
  5. 5. Data generation val env = ExecutionEnvironment.getExecutionEnvironment val (stores, products, customers) = getData() val startTime = getCurrentMillis() val transactions = env.fromCollection(customers) .flatMap(new TransactionGenerator(products)) .withBroadcastSet(stores, ”stores”) .map{t => t.setDateTime(t.getDateTime + startTime); t} transactions.writeAsText(output) • Use RJ Nowling’s Java generator classes • Write transactions to JSON
  6. 6. ETL with the DataSet API val env = ExecutionEnvironment.getExecutionEnvironment val transactions = env.readTextFile(json).map(new FlinkTransaction(_)) val productsWithIndex = transactions.flatMap(_.getProducts) .distinct .zipWithUniqueId val customerAndProductPairs = transactions .flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p))) .join(productsWithIndex).where(_._2).equalTo(_._2) .map(pair => (pair._1._1, pair._2._1)) .distinct customerAndProductPairs.writeAsCsv(output) • Read the dirty JSON • Output (customer, product) pairs for the recommender
  7. 7. ETL with the Table API val env = ExecutionEnvironment.getExecutionEnvironment val transactions = env.readTextFile(json).map(new FlinkTransaction(_)) val table = transactions.map(toCaseClass(_)).toTable val storeTransactionCount = table.groupBy('storeId) .select('storeId, 'storeName, 'storeId.count as 'count) val bestStores = table.groupBy('storeId) .select('storeId.max as 'max) .join(storeTransactionCount) .where(”count = max”) .select('storeId, 'storeName, 'storeId.count as 'count) .toDataSet[StoreCount] • Read the dirty JSON • SQL style queries
  8. 8. A little Recommeder theory Item factors User side information User-Item matrixUser factors Item side information U I P Q R • R is potentially huge, approximate it with P∗Q • Prediction is TopK(user’s row ∗ Q)
  9. 9. Matrix factorization with FlinkML val env = ExecutionEnvironment.getExecutionEnvironment val input = env.readCsvFile[(Int,Int)](inputFile) .map(pair => (pair._1, pair._2, 1.0)) val model = ALS() .setNumfactors(numFactors) .setIterations(iterations) .setLambda(lambda) model.fit(input) val (p, q) = model.factorsOption.get p.writeAsText(pOut) q.writeAsText(qOut) • Read the (customer, product) pairs • Write P and Q to file
  10. 10. Recommendation with the DataStream API StreamExecutionEnvironment env = StreamExecutionEnvironment .getExecutionEnvironment(); env.socketTextStream(”localhost”, 9999) .map(new GetUserVector()) .broadcast() .map(new PartialTopK()) .keyBy(0) .flatMap(new GlobalTopK()) .print(); • Get the user’s row for a userID • Compute the distributed TopK of the user’s row ∗ Q
  11. 11. Summary • Go beyond WordCount with BigPetStore • Feel free to mix the DataSet, DataStream, FlinkML, Table APIs in your Flink workflows • Data generation, cleaning, ETL, Machine learning, streaming prediction on top of one engine with under 500 lines of code • Java and Scala APIs work well together • A Flink pet project is always fun. No pun intended.
  12. 12. Big thanks to • The BigPetStore folks: Suneel Marthi Ronald J. Nowling Jay Vyas • Squirrels helping with the code: Gyula Fóra Gábor Gevay Gábor Hermann Fabian Hueske Aljoscha Krettek • And to the whole Flink community 
  13. 13. Check out the code https://github.com/mbalassi/bigpetstore-flink Márton Balassi mbalassi@apache.org / @MartonBalassi Hungarian Academy of Sciences

×