Factorization Machines
- Introduction
Bartłomiej Twardowski
18.10.2016
Warsaw Data Science Meetup
Polish English?
• Support Vector Machines => “maszyna wektorów nośnych”
• Matrix Factorization => “faktoryzacja macierzy”
• Factorization Machines => “maszyna faktoryzująca”?
• LMGTFY :-) Let’s stick to the English name then!
Motivation
• one of the most successful models, with a great deal of expressiveness
• a great place to begin with context-aware recommendations
• considered a base toolbox for advertisers/kagglers
• the FFM presentation at RecSys 2016 reused years-old material ( still, almost nothing new in it :-( )
• it seemed like a fun and original subject for a meetup
2015.10.6 - meetup about recommender systems
Not motivated enough?
Success stories.
1. competitions
2. more often appears in DS job offers
Factorization Machines
• S. Rendle, 2010 [1]
• combines the advantages of Support Vector Machines (SVM) with factorization models
• generic (real-valued features)
• incredibly good for sparse data
• model expressiveness
MF - quick recap
Simplest problem formulation [3]:
• U - user set, I - item set
• the matrix R \in \mathbb{R}^{|U| \times |I|} contains user ratings
• find the best representation in a k-dimensional latent space for users P (|U| × k) and items Q (|I| × k), so that the matrix \hat{R} is defined as:

\hat{R} = P Q^T

• to predict a rating:

\hat{r}_{ui} = p_u^T q_i
MF - quick recap
With regularization [4]:

\min_{P, Q} \sum_{(u,i) \in S} \left( r_{ui} - p_u^T q_i \right)^2 + \lambda \left( \| p_u \|^2 + \| q_i \|^2 \right)
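In practice this objective is typically minimized with SGD over the observed ratings. A minimal numpy sketch (function name and hyperparameters are illustrative, not from the slides):

import numpy as np

def mf_sgd(ratings, n_users, n_items, k=10, lr=0.01, reg=0.02, epochs=20):
    """SGD for regularized MF: r_ui ~ p_u . q_i."""
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, k))   # user factors, |U| x k
    Q = 0.1 * rng.standard_normal((n_items, k))   # item factors, |I| x k
    for _ in range(epochs):
        for u, i, r in ratings:                   # (user, item, rating) triples
            e = r - P[u] @ Q[i]                   # prediction error
            pu = P[u].copy()                      # keep old p_u for Q's update
            P[u] += lr * (e * Q[i] - reg * P[u])  # gradient step + L2 shrinkage
            Q[i] += lr * (e * pu - reg * Q[i])
    return P, Q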
Linear & Poly2 models
Simple linear regression model:

\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i

Adding two-way interactions (Poly2):

\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} v_{i,j} x_i x_j
FM Model
For two-way interactions:

\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j

Model parameters: w_0 \in \mathbb{R}, \; w \in \mathbb{R}^n, \; V \in \mathbb{R}^{n \times k}.
For each x_i we have a dedicated vector v_i with k latent features. Instead of a separate weight w_{ij} for every feature interaction we use the dot product:

\langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i,f} \, v_{j,f}

Wait, it’s O(kn^2)! Not linear!
Making it O(kn)
\begin{bmatrix}
x_{11} & x_{12} & x_{13} & \dots & x_{1n} \\
x_{21} & x_{22} & x_{23} & \dots & x_{2n} \\
x_{31} & x_{32} & x_{33} & \dots & x_{3n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{d1} & x_{d2} & x_{d3} & \dots & x_{dn}
\end{bmatrix}
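The trick (Lemma 3.1 in [1]) is to rewrite the pairwise term so that no pair ever has to be enumerated:

\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j
= \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f} x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2 \right]

Each inner sum is a single pass over the n features, repeated k times: O(kn).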
Simplified version
For k = 1, n = 2:

(a + b)^2 = a^2 + 2ab + b^2

so:

ab = \frac{1}{2} \left[ (a + b)^2 - a^2 - b^2 \right]

Let v_1 x_1 = a and v_2 x_2 = b; then:

v_1 v_2 \, x_1 x_2 = \frac{1}{2} \left[ (v_1 x_1 + v_2 x_2)^2 - (v_1 x_1)^2 - (v_2 x_2)^2 \right]

And now it looks very familiar :-)
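The same reformulation translates directly into vectorized code. A minimal numpy sketch of O(kn) FM prediction (names are illustrative):

import numpy as np

def fm_predict(x, w0, w, V):
    """FM prediction in O(kn) via the reformulated pairwise term.

    x  : (n,) feature vector
    w0 : global bias
    w  : (n,) linear weights
    V  : (n, k) latent factor matrix
    """
    linear = w0 + w @ x
    s = V.T @ x                    # (k,) sums: sum_i v_{i,f} x_i
    s_sq = (V ** 2).T @ (x ** 2)   # (k,) sums of squares: sum_i v_{i,f}^2 x_i^2
    return linear + 0.5 * np.sum(s ** 2 - s_sq)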
FM vs SVM
• FM combines the advantages of SVM and factorization models
• a general predictor working on real-valued features (like SVM)
• estimates interactions well under huge sparsity, where SVMs fail (e.g. recommender systems)
• the model equation of FMs can be calculated in linear time
• comparable to a polynomial kernel in SVM, but works for very sparse data and works fast
Use case: Context-Aware Recommender Systems
• U = {Alice (A), Bob (B), Charlie (C), ...}
• I = {Titanic (TI), Notting Hill (NH), Star Wars (SW), Star Trek (ST), ...}
• S = {(A, TI, 2010-1, 5), (A, NH, 2010-2, 3), (A, SW, 2010-4, 1), (B, SW, 2009-5, 4), (B, ST, 2009-8, 5), (C, TI, 2009-9, 1), (C, SW, 2009-12, 5)}
• Example from [1]
• Example from [1]
Example of input data preparation
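Each transaction in S is flattened into one sparse row: a one-hot user block, a one-hot item block, plus context such as time. A minimal sketch of that encoding (the full example in [1] also adds other-movies-rated and last-movie-rated blocks; the time encoding below is illustrative):

import numpy as np

users = ["A", "B", "C"]
items = ["TI", "NH", "SW", "ST"]
S = [("A", "TI", "2010-1", 5), ("A", "NH", "2010-2", 3), ("A", "SW", "2010-4", 1),
     ("B", "SW", "2009-5", 4), ("B", "ST", "2009-8", 5),
     ("C", "TI", "2009-9", 1), ("C", "SW", "2009-12", 5)]

u_idx = {u: i for i, u in enumerate(users)}
i_idx = {it: j for j, it in enumerate(items)}
n = len(users) + len(items) + 1                        # user block | item block | time

X = np.zeros((len(S), n))
y = np.zeros(len(S))
for row, (u, it, t, r) in enumerate(S):
    X[row, u_idx[u]] = 1.0                             # one-hot user
    X[row, len(users) + i_idx[it]] = 1.0               # one-hot item
    year, month = t.split("-")
    X[row, -1] = (int(year) - 2009) * 12 + int(month)  # months since 2009-1
    y[row] = r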
Why use FM for this?
The drawback of tensor factorization models, and even more so of specialized factorization models, is that [1]:
(1) they are not applicable to standard prediction data (e.g. a real-valued feature vector),
(2) specialized models are usually derived individually for a specific task, requiring effort in modeling and in the design of a learning algorithm.
How about ranking?
Go for a pairwise approach!
http://www.tongji.edu.cn/~qiliu/lor_vs.html
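For example, with a BPR-style pairwise objective (an illustrative choice; the slide only links to an overview of learning-to-rank approaches), every observed preference "i ranks above j" contributes:

L(i, j) = -\ln \sigma\!\left( \hat{y}(x_i) - \hat{y}(x_j) \right)

so the FM score function stays the same and only the loss changes.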
Model expressiveness
FM ~ MF
Given only two binary indicator blocks in x (a one-hot vector for the active user u and a one-hot vector for the active item i), the model will then mimic a biased MF:

\hat{y}(x) = w_0 + w_u + w_i + \langle v_u, v_i \rangle
FM ~ PITF
Given user × item × tag interactions encoded as three one-hot blocks (user u, item i, tag t), FM will mimic a pairwise interaction tensor factorization model (PITF) [7]:

\hat{y}(x) = w_0 + w_u + w_i + w_t + \langle v_u, v_i \rangle + \langle v_u, v_t \rangle + \langle v_i, v_t \rangle
And others
(e.g. factorized NN, KNN++, SVD++, …)
presented in [2].
Field-aware FM
• Has been used to win two CTR competitions [5].
• Introduces grouped features - fields, e.g. user, color, time.
• Learns a different set of latent factors for every pair of fields:

\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_{i,f(j)}, v_{j,f(i)} \rangle x_i x_j

where f(i) is the field of feature i.
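The trade-off: FFM keeps n·f·k latent parameters (f = number of fields) instead of FM’s n·k, and the field-aware pairwise sum no longer collapses to the O(kn) form, so it is evaluated over all pairs of active features.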
Available implementations
• libfm (http://www.libfm.org/), SGD/ALS/MCMC
• FM for Julia (https://github.com/btwardow/
FactorizationMachines.jl)
• fastFM (https://github.com/ibayer/fastFM)
• DiFacto (https://github.com/dmlc/difacto)
• lightfm
• spark-libFM, libffm
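Training one of these takes a few lines. A minimal sketch assuming fastFM’s scikit-learn-style API (hyperparameters and the toy data are illustrative):

import numpy as np
import scipy.sparse as sp
from fastFM import als

# X: sparse design matrix (e.g. one-hot user/item blocks), y: ratings
X = sp.csc_matrix(np.array([[1, 0, 0, 1, 0, 0, 0],
                            [1, 0, 0, 0, 1, 0, 0],
                            [0, 1, 0, 0, 0, 1, 0]], dtype=np.float64))
y = np.array([5.0, 3.0, 4.0])

fm = als.FMRegression(n_iter=100, rank=4, l2_reg_w=0.1, l2_reg_V=0.1)
fm.fit(X, y)
pred = fm.predict(X)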
My experiments with FM on GPU
The same implementation moved from numpy to Theano ran ~7x faster, without using any special GPU tricks.
Going for click prediction?
• feature engineering (counting features, like historical CTR)
• hashing trick (see the sketch below)
• L1 regularization, FTRL - e.g. using vw (Vowpal Wabbit)
• making new features - e.g. decision tree encoding
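A minimal sketch of the hashing trick: arbitrary (field, value) strings are hashed into a fixed-size index space, so the model never needs a growing feature dictionary (the dimension and hash function below are illustrative):

import zlib

D = 2 ** 20  # fixed dimensionality of the hashed feature space

def hash_features(pairs):
    """Map (field, value) string pairs to a sparse {index: value} dict."""
    x = {}
    for field, value in pairs:
        idx = zlib.crc32(f"{field}={value}".encode()) % D
        x[idx] = x.get(idx, 0.0) + 1.0   # colliding features simply add up
    return x

# e.g. one ad impression
x = hash_features([("user", "A"), ("site", "example.com"), ("hour", "14")])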
How about now? :-)
References
[1] Rendle, Steffen. "Factorization machines." 2010 IEEE International
Conference on Data Mining. IEEE, 2010.
[2] Rendle, Steffen. "Factorization machines with libfm." ACM
Transactions on Intelligent Systems and Technology (TIST) 3.3 (2012): 57.
[3] Takács, Gábor, et al. "Matrix factorization and neighbor based algorithms for the Netflix prize problem." Proceedings of the 2008 ACM conference on Recommender systems. ACM, 2008.
[4] Paterek, Arkadiusz. "Improving regularized singular value
decomposition for collaborative filtering." Proceedings of KDD cup and
workshop. Vol. 2007. 2007.
References
[5] http://www.csie.ntu.edu.tw/~r01922136/slides/ffm.pdf
[6] Srebro, Nathan, Jason D. M. Rennie, and Tommi S. Jaakkola. "Maximum-margin matrix factorization." Advances in Neural Information Processing Systems 17. MIT Press, 2005. 1329-1336.
[7] Rendle, Steffen, and Lars Schmidt-Thieme. "Pairwise interaction tensor factorization for personalized tag recommendation." Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM '10). ACM, 2010. 81-90.
Q&A
@btwardow, Bartłomiej Twardowski
B.Twardowski@ii.pw.edu.pl
