Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015

Till Rohrmann
Flink committer / data Artisans
trohrmann@apache.org
@stsffap

Recommendations: Collaborative Filtering

Recommendations
§  Omnipresent nowadays
§  Important for user experience and sales

Collaborative Filtering
§  Recommend items based on users with similar preferences
§  Latent factor models capture underlying characteristics of items and preferences of users
§  Predicted preference: $\hat{r}_{u,i} = x_u^T y_i$
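
The predicted preference is just the dot product of a user's and an item's latent factor vectors. A minimal sketch in plain Scala, assuming hypothetical 3-dimensional factors whose values are made up for illustration:

```scala
// Dot product of two latent factor vectors of equal length.
def dot(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "factor vectors must have the same length")
  a.zip(b).map { case (p, q) => p * q }.sum
}

// Hypothetical factors (not taken from the talk).
val userFactors = Array(1.0, 0.5, 2.0) // x_u
val itemFactors = Array(2.0, 4.0, 1.0) // y_i

// Predicted preference: 1*2 + 0.5*4 + 2*1 = 6.0
val predictedRating = dot(userFactors, itemFactors)
println(predictedRating)
```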

Rating Matrix
§  Explicit or implicit ratings
§  Prediction goal: Rating for unseen items

$$R = \begin{pmatrix} 10 & 5 & ? \\ ? & 2 & 10 \\ 7 & ? & 5 \end{pmatrix}$$

(rows: users, columns: items, ?: unseen item)
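
Since most entries of a real rating matrix are unknown, it is usually kept as a list of (user, item, rating) triples rather than a dense array. The sketch below encodes the 3 x 3 example that way and lists the entries to be predicted; the zero-based IDs are an assumption made for illustration.

```scala
// The example matrix as (user, item, rating) triples; "?" entries are simply absent.
val ratings: Seq[(Int, Int, Double)] = Seq(
  (0, 0, 10.0), (0, 1, 5.0),
  (1, 1, 2.0), (1, 2, 10.0),
  (2, 0, 7.0), (2, 2, 5.0)
)

val numUsers = 3
val numItems = 3

// The prediction targets are exactly the (user, item) pairs without a rating.
val rated = ratings.map { case (u, i, _) => (u, i) }.toSet
val unseen = for {
  u <- 0 until numUsers
  i <- 0 until numItems
  if !rated((u, i))
} yield (u, i)

unseen.foreach(println)
```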

Matrix Factorization
§  Calculate low rank approximation to obtain latent factors

$$\min_{X,Y} \sum_{r_{u,i} \neq 0} \left( r_{u,i} - x_u^T y_i \right)^2 + \lambda \left( \sum_u n_u \lVert x_u \rVert^2 + \sum_i n_i \lVert y_i \rVert^2 \right)$$

$$\begin{pmatrix} 10 & 5 & ? \\ ? & 2 & 10 \\ 7 & ? & 5 \end{pmatrix} \approx X \cdot Y$$

(rows: users, columns: items)

§  $r_{u,i}$ = rating of user u for item i
§  $x_u$ = latent factors of user u
§  $y_i$ = latent factors of item i
§  $\lambda$ = regularization constant
§  $n_u$ = number of rated items for user u
§  $n_i$ = number of ratings for item i
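
To make the objective concrete, here is a small plain-Scala sketch that evaluates it for given factors. The function name `alsLoss` and the map-based factor storage are assumptions for illustration; only observed ratings contribute to the squared error, and the penalty is weighted by the rating counts n_u and n_i as in the formula.

```scala
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (p, q) => p * q }.sum

// Regularized squared error over the observed ratings only.
// ratings: (user, item, rating) triples; x and y map IDs to factor vectors.
def alsLoss(
    ratings: Seq[(Int, Int, Double)],
    x: Map[Int, Array[Double]],
    y: Map[Int, Array[Double]],
    lambda: Double): Double = {
  val squaredError = ratings.map { case (u, i, r) =>
    val e = r - dot(x(u), y(i))
    e * e
  }.sum
  // n_u and n_i: number of ratings per user and per item.
  val nu = ratings.groupBy(_._1).map { case (u, rs) => u -> rs.size }
  val ni = ratings.groupBy(_._2).map { case (i, rs) => i -> rs.size }
  val userPenalty = x.map { case (u, xu) => nu.getOrElse(u, 0) * dot(xu, xu) }.sum
  val itemPenalty = y.map { case (i, yi) => ni.getOrElse(i, 0) * dot(yi, yi) }.sum
  squaredError + lambda * (userPenalty + itemPenalty)
}

// One rating r = 3 with x = (1), y = (2): error (3 - 2)^2 = 1, penalty 0.5 * (1 + 4) = 2.5.
val lossValue = alsLoss(Seq((0, 0, 3.0)), Map(0 -> Array(1.0)), Map(0 -> Array(2.0)), 0.5)
println(lossValue)
```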

Alternating Least Squares
§  Hard to optimize since we have two variables, X and Y
§  Fixing one variable gives a quadratic problem

$$x_u = \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T$$

$$S^u_{ii} = \begin{cases} 1 & \text{if } r_{u,i} \neq 0 \\ 0 & \text{else} \end{cases}$$

We only need the item vectors rated by user u
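
The update for $x_u$ follows from setting the gradient of the objective to zero while Y is held fixed. A short derivation sketch, using the same symbols as the objective:

```latex
\begin{align*}
f(x_u) &= \sum_{i:\, r_{u,i} \neq 0} \left( r_{u,i} - x_u^T y_i \right)^2
          + \lambda n_u \lVert x_u \rVert^2 \\
\nabla f(x_u) &= -2 \sum_{i:\, r_{u,i} \neq 0} \left( r_{u,i} - x_u^T y_i \right) y_i
          + 2 \lambda n_u x_u = 0 \\
\Rightarrow\quad
\Bigl( \sum_{i:\, r_{u,i} \neq 0} y_i y_i^T + \lambda n_u I \Bigr) x_u
       &= \sum_{i:\, r_{u,i} \neq 0} r_{u,i}\, y_i \\
\Rightarrow\quad
x_u &= \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T
\end{align*}
```

The last step uses $\sum_{i:\, r_{u,i} \neq 0} y_i y_i^T = Y S^u Y^T$ and $\sum_{i:\, r_{u,i} \neq 0} r_{u,i}\, y_i = Y r_u^T$, which is why only the item vectors rated by user u are needed.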

ALS Algorithm
§  Update user matrix: keep Y fixed, calculate the update for X

$$\begin{pmatrix} 10 & 5 & ? \\ ? & 2 & 10 \\ 7 & ? & 5 \end{pmatrix} \approx X \cdot Y$$

(rows: users, columns: items)

ALS Algorithm contd.
§  Update item matrix: keep X fixed, calculate the update for Y

$$\begin{pmatrix} 10 & 5 & ? \\ ? & 2 & 10 \\ 7 & ? & 5 \end{pmatrix} \approx X \cdot Y$$

(rows: users, columns: items)

ALS Algorithm contd.
§  Repeat update step until convergence: alternately keep one factor matrix fixed and calculate the update for the other

$$\begin{pmatrix} 10 & 5 & ? \\ ? & 2 & 10 \\ 7 & ? & 5 \end{pmatrix} \approx X \cdot Y$$

(rows: users, columns: items)
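
The alternating update steps can be sketched on a single machine in plain Scala. This is an illustrative toy, not the Flink implementation discussed later: the names `solve`, `updateFactors` and `als`, the fixed random seed, and the fixed iteration count (instead of a convergence check) are all assumptions.

```scala
type Vec = Array[Double]
type Factors = Map[Int, Vec]

// Solve A x = b by Gaussian elimination with partial pivoting (A is n x n).
def solve(a: Array[Array[Double]], b: Vec): Vec = {
  val n = b.length
  val m = Array.tabulate(n)(r => a(r) :+ b(r)) // augmented matrix, n x (n + 1)
  for (col <- 0 until n) {
    val pivot = (col until n).maxBy(r => math.abs(m(r)(col)))
    val tmp = m(col); m(col) = m(pivot); m(pivot) = tmp
    for (r <- col + 1 until n) {
      val f = m(r)(col) / m(col)(col)
      for (c <- col to n) m(r)(c) -= f * m(col)(c)
    }
  }
  val x = new Array[Double](n)
  for (r <- n - 1 to 0 by -1) {
    val s = (r + 1 until n).map(c => m(r)(c) * x(c)).sum
    x(r) = (m(r)(n) - s) / m(r)(r)
  }
  x
}

// One half-step: recompute every "left" vector while the "right" factors stay fixed.
// ratings are (leftId, rightId, rating) triples; solves (sum y y^T + lambda n I) x = sum r y.
def updateFactors(ratings: Seq[(Int, Int, Double)], fixed: Factors, k: Int, lambda: Double): Factors =
  ratings.groupBy(_._1).map { case (id, rs) =>
    val a = Array.tabulate(k, k)((p, q) =>
      rs.map { case (_, j, _) => fixed(j)(p) * fixed(j)(q) }.sum +
        (if (p == q) lambda * rs.size else 0.0))
    val b = Array.tabulate(k)(p => rs.map { case (_, j, r) => r * fixed(j)(p) }.sum)
    id -> solve(a, b)
  }

def als(ratings: Seq[(Int, Int, Double)], k: Int, lambda: Double, iterations: Int): (Factors, Factors) = {
  val rnd = new scala.util.Random(42)
  val swapped = ratings.map { case (u, i, r) => (i, u, r) }
  var y: Factors = ratings.map(_._2).distinct.map(i => i -> Array.fill(k)(rnd.nextDouble())).toMap
  var x: Factors = Map.empty
  for (_ <- 1 to iterations) {
    x = updateFactors(ratings, y, k, lambda) // keep Y fixed, update X
    y = updateFactors(swapped, x, k, lambda) // keep X fixed, update Y
  }
  (x, y)
}

// The 3 x 3 example matrix from the slides as (user, item, rating) triples.
val exampleRatings = Seq(
  (0, 0, 10.0), (0, 1, 5.0), (1, 1, 2.0), (1, 2, 10.0), (2, 0, 7.0), (2, 2, 5.0))
val (userFactors, itemFactors) = als(exampleRatings, k = 2, lambda = 0.1, iterations = 10)

// Prediction for the unseen entry (user 0, item 2).
val prediction = userFactors(0).zip(itemFactors(2)).map { case (p, q) => p * q }.sum
println(prediction)
```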

What is Apache Flink?
Apache Flink deep-dive by Stephan Ewen: tomorrow, 12:20–13:00 on Stage 3

[Figure: the Flink stack]
§  Libraries and compatibility layers: Gelly, Table, FlinkML, SAMOA, Hadoop M/R, Dataflow, Dataflow (WiP), MRQL, Table, Cascading (WiP)
§  APIs: DataSet (Java/Scala/Python), DataStream (Java/Scala)
§  Core: Streaming dataflow runtime
§  Execution environments: Local, Remote, Yarn, Tez, Embedded

Why Use Flink for ALS?
§  Expressive API
§  Pipelined stream processor
§  Closed loop iterations
§  Operations on managed memory

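Expressive APIs
§  DataSet: Abstraction for distributed data
§  Computation specified as sequence of lazily evaluated transformations

    import org.apache.flink.api.scala._

    case class Word(word: String, frequency: Int)

    // Entry point for building a DataSet program
    val env = ExecutionEnvironment.getExecutionEnvironment
    val lines: DataSet[String] = env.readTextFile(…)

    // Split each line into words, then count the occurrences per word
    lines.flatMap(line => line.split(" ").map(word => Word(word, 1)))
      .groupBy("word").sum("frequency")
      .print()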

Program Execution

    case class Path(from: Long, to: Long)

    val tc = edges.iterate(10) {
      paths: DataSet[Path] =>
        val next = paths
          .join(edges)
          .where("to")
          .equalTo("from") {
            (path, edge) => Path(path.from, edge.to)
          }
          .union(paths)
          .distinct()
        next
    }

[Figure: program execution]
§  Pre-flight (Client): the program passes through the type extraction stack and the optimizer, which produce the dataflow graph
§  Master: task scheduling and dataflow metadata; deploys operators and tracks intermediate results
§  Workers: run the dataflow; the example plan reads the data sources orders.tbl (followed by Filter and Map) and lineitem.tbl, joins them with a hybrid hash join (build HT / probe) on hash-partitioned inputs (hash-part [0]), and ends in a sorted GroupReduce with forward shipping

Pipelined Stream Processor
Avoiding materialization of intermediate results

Iterate by looping
§  for/while loop in client submits one job per iteration step
§  Data reuse by caching in memory and/or disk

[Figure: the client submits a sequence of separate Step jobs]

ALS implementations with Apache Flink

Naïve Implementation
1.  Join item vectors with ratings
2.  Group on user ID
3.  Compute new user vectors

$$x_u = \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T$$
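
The three steps can be mimicked with ordinary Scala collections. The sketch below is a stand-in for the distributed join and group operations, not Flink code, and it assumes a single latent factor (k = 1) so that the matrix inverse in the update reduces to a scalar division; the factor values and `lambda` are made up.

```scala
// (user, item, rating) triples and one latent factor per item (k = 1), all illustrative.
val ratings = Seq(
  (0, 0, 10.0), (0, 1, 5.0), (1, 1, 2.0), (1, 2, 10.0), (2, 0, 7.0), (2, 2, 5.0))
val itemFactors = Map(0 -> 2.0, 1 -> 1.0, 2 -> 3.0)
val lambda = 0.1

// 1. Join item vectors with ratings: every rating carries its own copy of y_i.
val joined = ratings.map { case (u, i, r) => (u, r, itemFactors(i)) }

// 2. Group on user ID.
val byUser = joined.groupBy(_._1)

// 3. Compute new user vectors: x_u = (sum r * y) / (sum y^2 + lambda * n_u) for k = 1.
val userFactors = byUser.map { case (u, rows) =>
  val numerator = rows.map { case (_, r, y) => r * y }.sum
  val denominator = rows.map { case (_, _, y) => y * y }.sum + lambda * rows.size
  u -> numerator / denominator
}

userFactors.toSeq.sortBy(_._1).foreach(println)
```

Note that `joined` holds one item-vector copy per rating, which is the redundancy the naïve approach pays for in network traffic.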

Pros and Cons of Naïve ALS
§  Pros
•  Easy to implement
§  Cons
•  Item vectors are sent redundantly to network nodes
•  Two shuffle steps make execution expensive

Blocked ALS Implementation
1.  Create user and item rating blocks
2.  Cache them on worker nodes
3.  Send all item vectors needed by user rating block bundled
4.  Compute block of item vectors

Based on Spark's MLlib implementation
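
The effect of bundling can be illustrated by counting item-vector copies. In this simplified sketch, `blockOf` is a hypothetical partitioner that puts two consecutive users into one block, and the count ignores whether sender and receiver happen to sit on the same node.

```scala
// Illustrative (user, item, rating) triples for 4 users and 3 items.
val ratings = Seq(
  (0, 0, 10.0), (0, 1, 5.0), (1, 0, 4.0), (1, 1, 2.0),
  (2, 1, 7.0), (2, 2, 5.0), (3, 1, 1.0), (3, 2, 9.0))

// Hypothetical partitioner: users 0 and 1 form block 0, users 2 and 3 form block 1.
def blockOf(user: Int): Int = user / 2

// Without blocking: one item-vector copy travels with every rating.
val naiveCopies = ratings.size

// Blocked: each user block receives every item vector it needs exactly once.
val itemsPerBlock = ratings
  .groupBy { case (u, _, _) => blockOf(u) }
  .map { case (block, rs) => block -> rs.map(_._2).toSet }
val blockedCopies = itemsPerBlock.values.map(_.size).sum

println(s"naive: $naiveCopies copies, blocked: $blockedCopies copies")
```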

Pros and Cons of Blocked ALS
§  Pros
•  Reduces network load by avoiding data duplication
•  Caching ratings: Only one shuffle step needed
§  Cons
•  Duplicates the rating matrix (user block/item block partitioning)

Performance Comparison
•  40 node GCE cluster, highmem-8
•  10 ALS iterations with 50 latent factors
•  Rating matrix has 28 billion non-zero entries: scale of Netflix or Spotify
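
For a rough sense of the raw data volume behind 28 billion non-zero entries, the arithmetic below assumes each rating is stored as an (Int, Int, Double) triple of 4 + 4 + 8 = 16 bytes, with no serialization or object overhead. That layout is an assumption for illustration, not a figure reported for the benchmark.

```latex
\begin{align*}
28 \times 10^{9}\ \text{ratings} \times 16\ \text{bytes}
  &= 448 \times 10^{9}\ \text{bytes} \approx 448\ \text{GB} \\
2 \times 448\ \text{GB} &= 896\ \text{GB}
  \quad \text{(rating matrix kept twice: user blocks and item blocks)}
\end{align*}
```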

Machine Learning with FlinkML
§  FlinkML contains blocked ALS
§  Support for many other tasks
•  Clustering
•  Regression
•  Classification
§  scikit-learn like pipeline support

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.common.ParameterMap
    import org.apache.flink.ml.recommendation.ALS

    val als = ALS()

    // (user, item, rating) triples
    val ratingDS = env.readCsvFile[(Int, Int, Double)](ratingData)

    val parameters = ParameterMap()
      .add(ALS.Iterations, 10)
      .add(ALS.NumFactors, 50)
      .add(ALS.Lambda, 1.5)

    als.fit(ratingDS, parameters)

    // (user, item) pairs to predict ratings for
    val testingDS = env.readCsvFile[(Int, Int)](testingData)

    val predictions = als.predict(testingDS)

What Have You Seen?
§  How to use collaborative filtering to make recommendations
§  Apache Flink, a powerful parallel stream processing engine
§  How to use Apache Flink and alternating least squares to factorize really large matrices

flink.apache.org
@ApacheFlink
