
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015

How to compute recommendations at extreme scale using Apache Flink. We use matrix factorization to calculate a latent factor model for collaborative filtering. The implemented alternating least squares (ALS) algorithm can handle data sizes on the scale of Netflix.



  1. Computing Recommendations at Extreme Scale with Apache Flink. Till Rohrmann, Flink committer / data Artisans. trohrmann@apache.org, @stsffap
  2. Recommendations: Collaborative Filtering
  3. Recommendations
     § Omnipresent nowadays
     § Important for user experience and sales
  4. Recommending Websites
  5. Collaborative Filtering
     § Recommend items based on users with similar preferences
     § Latent factor models capture underlying characteristics of items and preferences of users
     § Predicted preference: $\hat{r}_{u,i} = x_u^T y_i$
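The predicted preference on this slide is simply the dot product of a user's and an item's latent factor vector. A minimal plain-Scala sketch (the factor values below are invented for illustration):

```scala
// Toy illustration of the predicted preference r̂_{u,i} = x_u^T y_i.
// The latent factor values are made-up example numbers.
object PredictRating {
  def predict(xu: Array[Double], yi: Array[Double]): Double =
    xu.zip(yi).map { case (x, y) => x * y }.sum

  def main(args: Array[String]): Unit = {
    val xu = Array(0.5, 1.2, -0.3) // latent factors of user u
    val yi = Array(1.0, 0.8, 0.4)  // latent factors of item i
    println(predict(xu, yi))
  }
}
```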
  6. Rating Matrix
     § Explicit or implicit ratings
     § Prediction goal: rating for unseen items
     $R = \begin{pmatrix} 10 & 5 & ? \\ ? & 2 & 10 \\ 7 & ? & 5 \end{pmatrix}$ (rows: users; columns: items such as "Princess", "Facebook")
  7. Matrix Factorization
     § Calculate a low-rank approximation $R \approx X Y$ to obtain latent factors
     $\min_{X,Y} \sum_{r_{u,i} \neq 0} \left( r_{u,i} - x_u^T y_i \right)^2 + \lambda \left( \sum_u n_u \| x_u \|^2 + \sum_i n_i \| y_i \|^2 \right)$
     § $r_{u,i}$ = rating of user u for item i
     § $x_u$ = latent factors of user u
     § $y_i$ = latent factors of item i
     § $\lambda$ = regularization constant
     § $n_u$ = number of items rated by user u
     § $n_i$ = number of ratings for item i
  8. Alternating Least Squares
     § Hard to optimize directly since we have two variables X, Y
     § Fixing one variable gives a quadratic problem
     $x_u = \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T$, where $S^u_{ii} = 1$ if $r_{u,i} \neq 0$, else $0$
     § We only need the item vectors rated by user u
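To make the user update concrete, here is a hedged plain-Scala sketch with two latent factors that solves the small normal equation $(Y S^u Y^T + \lambda n_u I)\, x_u = Y r_u^T$ by Cramer's rule (no linear algebra library; all inputs are invented example values):

```scala
// Hedged toy sketch of one ALS user-vector update with 2 latent factors.
object AlsUserUpdate {
  // Solve the 2x2 system A x = b by Cramer's rule.
  def solve2x2(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
    val det = a(0)(0) * a(1)(1) - a(0)(1) * a(1)(0)
    Array(
      (b(0) * a(1)(1) - a(0)(1) * b(1)) / det,
      (a(0)(0) * b(1) - b(0) * a(1)(0)) / det
    )
  }

  // ratedItems: the item vectors y_i rated by user u, paired with the ratings r_{u,i}.
  // Computes x_u = (Y S^u Y^T + lambda * n_u * I)^{-1} Y r_u^T.
  def updateUser(ratedItems: Seq[(Array[Double], Double)], lambda: Double): Array[Double] = {
    val nu = ratedItems.size
    val a = Array.ofDim[Double](2, 2)
    val b = Array.ofDim[Double](2)
    for ((y, r) <- ratedItems) {
      for (i <- 0 until 2; j <- 0 until 2) a(i)(j) += y(i) * y(j) // accumulate Y S^u Y^T
      for (i <- 0 until 2) b(i) += y(i) * r                       // accumulate Y r_u^T
    }
    for (i <- 0 until 2) a(i)(i) += lambda * nu                   // regularization term
    solve2x2(a, b)
  }
}
```

The loop only touches the item vectors the user actually rated, which is exactly the point the slide makes about $S^u$.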
  9. ALS Algorithm
     § Update user matrix: keep Y fixed, calculate the update for X in $R \approx X Y$
  10. ALS Algorithm contd.
     § Update item matrix: keep X fixed, calculate the update for Y
  11. ALS Algorithm contd.
     § Repeat the update steps until convergence
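The full alternation can be sketched on a toy case: with a single latent factor the per-user and per-item updates reduce to scalar divisions, so the whole loop fits in a few lines of plain Scala (a hedged illustration, not the distributed implementation):

```scala
// Hedged toy ALS with one latent factor: alternate scalar least-squares
// updates for users and items. Assumes every row and column of the rating
// matrix has at least one non-zero (i.e. observed) entry.
object ToyAls {
  // r: dense rating matrix; 0.0 marks an unrated entry.
  def run(r: Array[Array[Double]], lambda: Double, iters: Int): (Array[Double], Array[Double]) = {
    val users = r.length
    val items = r(0).length
    var x = Array.fill(users)(1.0) // user factors
    var y = Array.fill(items)(1.0) // item factors
    for (_ <- 0 until iters) {
      // Update user factors with item factors kept fixed.
      x = Array.tabulate(users) { u =>
        val rated = (0 until items).filter(i => r(u)(i) != 0.0)
        val num = rated.map(i => y(i) * r(u)(i)).sum
        val den = rated.map(i => y(i) * y(i)).sum + lambda * rated.size
        num / den
      }
      // Update item factors with user factors kept fixed.
      y = Array.tabulate(items) { i =>
        val rated = (0 until users).filter(u => r(u)(i) != 0.0)
        val num = rated.map(u => x(u) * r(u)(i)).sum
        val den = rated.map(u => x(u) * x(u)).sum + lambda * rated.size
        num / den
      }
    }
    (x, y)
  }
}
```

For a rank-1 rating matrix and no regularization this converges essentially immediately; the real algorithm does the same alternation, only with factor vectors and distributed data.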
  12. Apache Flink
  13. What is Apache Flink?
     § Apache Flink deep-dive by Stephan Ewen: tomorrow, 12:20 to 13:00 on Stage 3
     (Stack diagram: libraries such as Gelly, Table, FlinkML, SAMOA, MRQL and Cascading (WiP) on top of the DataSet (Java/Scala/Python) and DataStream (Java/Scala) APIs, all running on the streaming dataflow runtime; deployable locally, remotely, embedded, or on YARN and Tez)
  14. Why Use Flink for ALS?
     § Expressive API
     § Pipelined stream processor
     § Closed-loop iterations
     § Operations on managed memory
  15. Expressive APIs
      § DataSet: abstraction for distributed data
      § Computation specified as a sequence of lazily evaluated transformations

      case class Word(word: String, frequency: Int)

      val lines: DataSet[String] = env.readTextFile(…)

      lines.flatMap(line => line.split(" ").map(word => Word(word, 1)))
        .groupBy("word").sum("frequency")
        .print()
  16. Program Execution

      case class Path(from: Long, to: Long)
      val tc = edges.iterate(10) { paths: DataSet[Path] =>
        val next = paths
          .join(edges)
          .where("to")
          .equalTo("from") { (path, edge) => Path(path.from, edge.to) }
          .union(paths)
          .distinct()
        next
      }

      (Diagram: pre-flight phase on the client with type extraction and optimization; the master schedules tasks, deploys the dataflow graph and tracks intermediate results; workers execute the operators)
  17. Pipelined Stream Processor
      § Avoiding materialization of intermediate results
  18. Iterate by Looping
      § for/while loop in the client submits one job per iteration step
      § Data reuse by caching in memory and/or on disk
  19. Iterate in the Dataflow
  20. Memory Management
  21. Memory Management
  22. ALS Implementations with Apache Flink
  23. Naïve Implementation
      1. Join item vectors with ratings
      2. Group on user ID
      3. Compute new user vectors: $x_u = \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T$
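The three steps of the naïve variant can be mimicked on ordinary Scala collections. This hedged sketch (single latent factor, so the solve is a scalar division; all identifiers are illustrative, not the Flink job itself) mirrors the join/group/compute structure:

```scala
// Hedged sketch of one naïve ALS user update on plain collections.
object NaiveAlsStep {
  // ratings: (userId, itemId, rating); itemVectors: itemId -> latent vector (length 1 here)
  def newUserVectors(
      ratings: Seq[(Int, Int, Double)],
      itemVectors: Map[Int, Array[Double]],
      lambda: Double): Map[Int, Array[Double]] = {
    // Step 1: "join" each rating with its item vector.
    val joined = ratings.map { case (u, i, r) => (u, itemVectors(i), r) }
    // Step 2: group on user ID.
    joined.groupBy(_._1).map { case (u, group) =>
      // Step 3: solve the (here scalar) normal equation for user u.
      val nu = group.size
      val a = group.map { case (_, y, _) => y(0) * y(0) }.sum + lambda * nu
      val b = group.map { case (_, y, r) => y(0) * r }.sum
      u -> Array(b / a)
    }
  }
}
```

In the distributed version the join and grouping are shuffles over the network, which is exactly the cost discussed on the next slide.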
  24. Pros and Cons of Naïve ALS
      § Pros
        • Easy to implement
      § Cons
        • Item vectors are sent redundantly over the network
        • Two shuffle steps make the execution expensive
  25. Blocked ALS Implementation
      1. Create user and item rating blocks
      2. Cache them on worker nodes
      3. Send all item vectors needed by a user rating block in one bundle
      4. Compute a block of user vectors
      Based on Spark's MLlib implementation
  26. Pros and Cons of Blocked ALS
      § Pros
        • Reduces network load by avoiding data duplication
        • Caching ratings: only one shuffle step needed
      § Cons
        • Duplicates the rating matrix (user block / item block partitioning)
  27. Performance Comparison
      • 40-node GCE cluster, highmem-8 instances
      • 10 ALS iterations with 50 latent factors
      • Rating matrix has 28 billion non-zero entries: the scale of Netflix or Spotify
  28. Machine Learning with FlinkML
      § FlinkML contains blocked ALS
      § Support for many other tasks
        • Clustering
        • Regression
        • Classification
      § scikit-learn like pipeline support

      val als = ALS()

      val ratingDS = env.readCsvFile[(Int, Int, Double)](ratingData)

      val parameters = ParameterMap()
        .add(ALS.Iterations, 10)
        .add(ALS.NumFactors, 50)
        .add(ALS.Lambda, 1.5)

      als.fit(ratingDS, parameters)

      val testingDS = env.readCsvFile[(Int, Int)](testingData)

      val predictions = als.predict(testingDS)
  29. Closing
  30. What Have You Seen?
      § How to use collaborative filtering to make recommendations
      § Apache Flink, a powerful parallel stream processing engine
      § How to use Apache Flink and alternating least squares to factorize really large matrices
  32. flink.apache.org, @ApacheFlink
