Till Rohrmann
Flink committer / data Artisans
trohrmann@apache.org
@stsffap
Computing Recommendations at Extreme Scale with Apache Flink
Recommendations: Collaborative Filtering
Recommendations
§  Omnipresent nowadays
§  Important for user experience and sales
Recommending Websites
Collaborative Filtering
§  Recommend items based on users with similar preferences
§  Latent factor models capture underlying characteristics of items and preferences of users
§  Predicted preference: $\hat{r}_{u,i} = x_u^T y_i$
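The predicted preference is a plain dot product of two latent factor vectors. A minimal sketch in plain Scala, assuming both vectors are stored as arrays of equal length; the object and method names are illustrative and not part of Flink:

```scala
object PredictRating {
  // Predicted preference: dot product of the user's and the item's latent factors.
  def predict(xu: Array[Double], yi: Array[Double]): Double = {
    require(xu.length == yi.length, "factor vectors must have the same length")
    xu.zip(yi).map { case (a, b) => a * b }.sum
  }
}
```

For example, `PredictRating.predict(Array(1.0, 2.0), Array(3.0, 0.5))` multiplies the factors pairwise and sums the products.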
Rating Matrix
§  Explicit or implicit ratings
§  Prediction goal: Rating for unseen items
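$$R = \begin{pmatrix} 10 & 5 & ? \\ ? & 2 & 10 \\ 7 & ? & 5 \end{pmatrix}$$
(rows: users, columns: items; "?" marks an unseen item; the labels "Princess" and "Facebook" appear in the figure)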
Matrix Factorization
§  Calculate low rank approximation to obtain latent factors
$$\min_{X,Y} \sum_{r_{u,i} \neq 0} \left( r_{u,i} - x_u^T y_i \right)^2 + \lambda \left( \sum_u n_u \lVert x_u \rVert^2 + \sum_i n_i \lVert y_i \rVert^2 \right)$$
$$\begin{pmatrix} 10 & 5 & ? \\ ? & 2 & 10 \\ 7 & ? & 5 \end{pmatrix} \approx X^T Y$$
(rows: users, columns: items; the vectors $x_u$ and $y_i$ are the columns of X and Y)
§  $r_{u,i}$ = rating of user u for item i
§  $x_u$ = latent factors of user u
§  $y_i$ = latent factors of item i
§  $\lambda$ = regularization constant
§  $n_u$ = number of rated items for user u
§  $n_i$ = number of ratings for item i
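To make the objective concrete, the following sketch evaluates it for given factors on local collections. It assumes ratings are stored as (user, item, value) triples containing only the observed (non-zero) entries; the names `Rating` and `objective` are illustrative and not taken from the slides:

```scala
object AlsObjective {
  // One observed rating: (user index, item index, rating value).
  final case class Rating(user: Int, item: Int, value: Double)

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (p, q) => p * q }.sum

  // x(u) holds the latent factors of user u, y(i) those of item i.
  def objective(ratings: Seq[Rating],
                x: Array[Array[Double]],
                y: Array[Array[Double]],
                lambda: Double): Double = {
    // Squared error over the observed ratings only.
    val squaredError = ratings.map { r =>
      val err = r.value - dot(x(r.user), y(r.item))
      err * err
    }.sum
    // n_u and n_i: number of observed ratings per user and per item.
    val nu = ratings.groupBy(_.user).map { case (u, rs) => u -> rs.size }
    val ni = ratings.groupBy(_.item).map { case (i, rs) => i -> rs.size }
    val userPenalty = nu.map { case (u, n) => n * dot(x(u), x(u)) }.sum
    val itemPenalty = ni.map { case (i, n) => n * dot(y(i), y(i)) }.sum
    squaredError + lambda * (userPenalty + itemPenalty)
  }
}
```

Users and items without any rating contribute nothing to the penalty, because their counts are zero.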
Alternating Least Squares
§  Hard to optimize since we have two variables, X and Y
§  Fixing one variable gives a quadratic problem
$$x_u = \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T$$
$$S^u_{ii} = \begin{cases} 1 & \text{if } r_{u,i} \neq 0 \\ 0 & \text{else} \end{cases}$$
We only need the vectors of the items rated by user u
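The user update can be sketched for a single user in plain Scala. This is an illustration under stated assumptions, not the FlinkML implementation: the input contains only the items the user has rated (which is what the selector matrix achieves), and the linear system is solved with a small hand-written Gaussian elimination instead of a numerical library.

```scala
object AlsUserUpdate {
  // Solves A * x = b by Gaussian elimination with partial pivoting.
  def solve(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
    val n = b.length
    // Work on an augmented copy [A | b] so the inputs stay untouched.
    val m = Array.tabulate(n, n + 1)((r, c) => if (c < n) a(r)(c) else b(r))
    for (col <- 0 until n) {
      val pivot = (col until n).maxBy(r => math.abs(m(r)(col)))
      val tmp = m(col); m(col) = m(pivot); m(pivot) = tmp
      for (r <- col + 1 until n) {
        val f = m(r)(col) / m(col)(col)
        for (c <- col to n) m(r)(c) -= f * m(col)(c)
      }
    }
    // Back substitution.
    val x = new Array[Double](n)
    for (r <- n - 1 to 0 by -1) {
      val s = (r + 1 until n).map(c => m(r)(c) * x(c)).sum
      x(r) = (m(r)(n) - s) / m(r)(r)
    }
    x
  }

  // rated: (item vector y_i, rating r_ui) pairs for the items user u has rated.
  def updateUser(rated: Seq[(Array[Double], Double)], lambda: Double): Array[Double] = {
    val k = rated.head._1.length
    val nu = rated.size
    // A = sum_i y_i y_i^T + lambda * n_u * I, i.e. Y S^u Y^T + lambda n_u I
    val a = Array.tabulate(k, k) { (p, q) =>
      rated.map { case (y, _) => y(p) * y(q) }.sum + (if (p == q) lambda * nu else 0.0)
    }
    // b = sum_i r_ui * y_i, i.e. Y r_u^T restricted to the rated items
    val b = Array.tabulate(k)(p => rated.map { case (y, r) => y(p) * r }.sum)
    solve(a, b)
  }
}
```

With k latent factors, each user update solves one k-by-k system, so the cost per user depends on the number of factors and on how many items that user has rated.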
  
ALS Algorithm
§  Update user matrix
[Figure: factorization of the rating matrix into X and Y; the item matrix Y is kept fixed while the update of the user matrix X is calculated]
ALS Algorithm contd.
§  Update item matrix
[Figure: factorization of the rating matrix into X and Y; the user matrix X is kept fixed while the update of the item matrix Y is calculated]
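By symmetry with the user update, fixing X should give an analogous closed-form update for each item vector. The slides do not state it, so the following is a sketch; the symbols $r_i$ (the column of ratings for item i) and $S^i$ (a diagonal selector over users) are notation introduced here:

```latex
y_i = \left( X S^i X^T + \lambda n_i I \right)^{-1} X r_i,
\qquad
S^i_{uu} =
\begin{cases}
  1 & \text{if } r_{u,i} \neq 0 \\
  0 & \text{else}
\end{cases}
```

Under this reading, only the vectors of the users who rated item i are needed to update $y_i$.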
  
ALS Algorithm contd.
§  Repeat update step until convergence
[Figure: factorization of the rating matrix into X and Y; in each step one factor is kept fixed while the update of the other is calculated]
Apache Flink
What is Apache Flink?
Apache Flink deep-dive by Stephan Ewen: tomorrow, 12:20 to 13:00 on Stage 3
[Figure: Flink stack. Libraries and compatibility layers: Gelly, Table, FlinkML, SAMOA, Hadoop M/R, Dataflow, Dataflow (WiP), MRQL, Table, Cascading (WiP). Core APIs: DataSet (Java/Scala/Python), DataStream (Java/Scala). Engine: streaming dataflow runtime. Deployment: Local, Remote, Yarn, Tez, Embedded.]
Why Use Flink for ALS?
§  Expressive API
§  Pipelined stream processor
§  Closed loop iterations
§  Operations on managed memory
Expressive APIs
§  DataSet: Abstraction for distributed data
§  Computation specified as sequence of lazily evaluated transformations

    import org.apache.flink.api.scala._

    case class Word(word: String, frequency: Int)

    // env is the job's ExecutionEnvironment
    val lines: DataSet[String] = env.readTextFile(…)

    // split each line into words, then sum the frequencies per word
    lines.flatMap(line => line.split(" ").map(word => Word(word, 1)))
      .groupBy("word").sum("frequency")
      .print()
Program Execution
    case class Path(from: Long, to: Long)

    // edges: DataSet[Path]; run 10 iterations
    val tc = edges.iterate(10) {
      paths: DataSet[Path] =>
        // extend every known path by one edge, then merge with the known paths
        val next = paths
          .join(edges)
          .where("to")
          .equalTo("from") {
            (path, edge) =>
              Path(path.from, edge.to)
          }
          .union(paths)
          .distinct()
        next
    }
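[Figure: program execution. Pre-flight (client): the program passes through the type extraction stack and the optimizer and becomes a dataflow graph. Master: task scheduling and dataflow metadata; deploys operators and tracks intermediate results. Workers: run the dataflow; the example shows two data sources (orders.tbl, lineitem.tbl), Filter, Map, a hybrid hash join (build hash table, probe) on inputs hash-partitioned on [0], and a sorted GroupReduce with forward shipping.]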
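The semantics of the transitive closure program can be illustrated without a cluster. The sketch below mirrors the join, union and distinct steps on local Scala sets; it is an analogy under the assumption that a set stands in for a distinct DataSet, not Flink code, and `closure` is a name chosen for illustration:

```scala
object TransitiveClosureSketch {
  final case class Path(from: Long, to: Long)

  // Repeats the step "extend each path by one edge, then merge" a fixed number of times.
  def closure(edges: Set[Path], maxIterations: Int): Set[Path] =
    (1 to maxIterations).foldLeft(edges) { (paths, _) =>
      // Join paths with edges where path.to == edge.from.
      val extended = for {
        path <- paths
        edge <- edges
        if path.to == edge.from
      } yield Path(path.from, edge.to)
      // Union with the known paths; a Set removes duplicates.
      extended ++ paths
    }
}
```

Unlike this local loop, the Flink program describes the whole iteration as part of a single dataflow, which the client submits once.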
Pipelined Stream Processor
Avoiding materialization of intermediate results
Iterate by looping
§  for/while loop in client submits one job per iteration step
§  Data reuse by caching in memory and/or disk
[Figure: the client submits a separate job for each of five iteration steps]
Iterate in the Dataflow
Memory Management
ALS implementations with Apache Flink
Naïve Implementation
1.  Join item vectors with ratings
2.  Group on user ID
3.  Compute new user vectors
$$x_u = \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T$$
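The three steps can be sketched on local Scala collections to show the data movement. This is an illustration, not the Flink job: the per-user solver is passed in as a hypothetical `solveUser` function, and the other names are chosen for this example.

```scala
object NaiveAlsStep {
  final case class Rating(user: Int, item: Int, value: Double)

  // Steps 1 and 2: join ratings with item vectors, then group the pairs by user ID.
  def itemVectorsPerUser(
      ratings: Seq[Rating],
      itemVectors: Map[Int, Array[Double]]): Map[Int, Seq[(Array[Double], Double)]] =
    ratings
      .map(r => (r.user, (itemVectors(r.item), r.value))) // join on item ID
      .groupBy(_._1)                                      // group on user ID
      .map { case (user, pairs) => user -> pairs.map(_._2) }

  // Step 3: compute a new vector for every user from that user's group.
  def updateUsers(
      ratings: Seq[Rating],
      itemVectors: Map[Int, Array[Double]],
      solveUser: Seq[(Array[Double], Double)] => Array[Double]): Map[Int, Array[Double]] =
    itemVectorsPerUser(ratings, itemVectors).map { case (u, rated) => u -> solveUser(rated) }
}
```

The join attaches one item vector to every rating, so an item rated by many users has its vector copied once per rating.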
Pros and Cons of Naïve ALS
§  Pros
•  Easy to implement
§  Cons
•  Item vectors are sent redundantly to network nodes
•  Two shuffle steps make execution expensive
Blocked ALS Implementation
1.  Create user and item rating blocks
2.  Cache them on worker nodes
3.  Send all item vectors needed by a user rating block bundled together
4.  Compute block of item vectors
Based on Spark's MLlib implementation
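The saving from bundling can be illustrated by counting how many item vectors have to be shipped. The sketch assumes that users are assigned to blocks by user ID modulo the number of blocks and that user IDs are non-negative; this assignment and all names are assumptions for illustration, not the partitioning used by FlinkML or MLlib.

```scala
object BlockedAlsRouting {
  final case class Rating(user: Int, item: Int, value: Double)

  // Assumption: block assignment by user ID modulo the block count.
  def userBlock(user: Int, numBlocks: Int): Int = user % numBlocks

  // For each user block, the set of item IDs whose vectors must be sent to it.
  def itemsNeededPerBlock(ratings: Seq[Rating], numBlocks: Int): Map[Int, Set[Int]] =
    ratings
      .groupBy(r => userBlock(r.user, numBlocks))
      .map { case (block, rs) => block -> rs.map(_.item).toSet }

  // Item vectors shipped: once per rating (naive) vs once per block that needs them.
  def naiveTransfers(ratings: Seq[Rating]): Int = ratings.size

  def blockedTransfers(ratings: Seq[Rating], numBlocks: Int): Int =
    itemsNeededPerBlock(ratings, numBlocks).values.map(_.size).sum
}
```

The more users in a block share the same items, the larger the gap between the two counts.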
  
Pros and Cons of Blocked ALS
§  Pros
•  Reduces network load by avoiding data duplication
•  Caching ratings: Only one shuffle step needed
§  Cons
•  Duplicates the rating matrix (user block/item block partitioning)
Performance Comparison
•  40 node GCE cluster, highmem-8
•  10 ALS iterations with 50 latent factors
•  Rating matrix has 28 billion non-zero entries: scale of Netflix or Spotify
Machine Learning with FlinkML
§  FlinkML contains blocked ALS
§  Support for many other tasks
•  Clustering
•  Regression
•  Classification
§  scikit-learn like pipeline support

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.common.ParameterMap
    import org.apache.flink.ml.recommendation.ALS

    val als = ALS()

    // (user ID, item ID, rating) triples
    val ratingDS =
      env.readCsvFile[(Int, Int, Double)](ratingData)

    val parameters = ParameterMap()
      .add(ALS.Iterations, 10)
      .add(ALS.NumFactors, 50)
      .add(ALS.Lambda, 1.5)

    als.fit(ratingDS, parameters)

    // (user ID, item ID) pairs to predict ratings for
    val testingDS = env.readCsvFile[(Int, Int)](testingData)

    val predictions = als.predict(testingDS)
Closing
What Have You Seen?
§  How to use collaborative filtering to make recommendations
§  Apache Flink, a powerful parallel stream processing engine
§  How to use Apache Flink and alternating least squares to factorize really large matrices
flink.apache.org
@ApacheFlink

Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015
