Factorization Machines
- Introduction
Bartłomiej Twardowski
18.10.2016
Warsaw Data Science Meetup
Polish English?
• Support Vector Machines => “maszyna wektorów nośnych”
• Matrix Factorization => “faktoryzacja macierzy”
• Factorization Machines => “maszyna faktoryzująca”?
• LMGTFY :-) Let’s stick to the English name then!
Motivation
• one of the most successful models, with a great deal of expressiveness
• a great place to begin with context-aware recommendations
• considered a base toolbox for advertisers/kagglers
• the FFM presentation at RecSys 2016 reused years-old material ( still, almost nothing new in it :-( )
• it seemed like a fun and original subject for a meetup
2015.10.6 - meetup about recommender systems
Not motivated enough?
Success stories.
1. competitions
2. more often appears in DS job offers
Factorization Machines
• S. Rendle, 2010 [1]
• combines the advantages of Support Vector Machines (SVM) with factorization models
• generic (real-valued features)
• incredibly good for sparse data
• model expressiveness
MF - quick recap
Simplest problem formulation [3]:
• U - user set, I - item set
• the matrix R \in \mathbb{R}^{|U| \times |I|} contains user ratings
• find the best representation in a k-dimensional latent space for users P (|U| × k) and items Q (|I| × k), so that the matrix \hat{R} is defined as:

\hat{R} = P Q^T

• to predict a rating:

\hat{r}_{ui} = p_u^T q_i
MF - quick recap
With regularization [4]:

\min_{P, Q} \sum_{(u,i) \in S} \left( r_{ui} - p_u^T q_i \right)^2 + \lambda \left( \| p_u \|^2 + \| q_i \|^2 \right)
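In practice this objective is typically minimized with SGD over the observed ratings. A minimal numpy sketch (function name and hyperparameters are illustrative, not from the slides):

import numpy as np

def mf_sgd(ratings, n_users, n_items, k=10, lr=0.01, reg=0.02, epochs=20):
    """SGD for regularized MF: r_ui ~ p_u . q_i."""
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, k))   # user factors, |U| x k
    Q = 0.1 * rng.standard_normal((n_items, k))   # item factors, |I| x k
    for _ in range(epochs):
        for u, i, r in ratings:                   # (user, item, rating) triples
            e = r - P[u] @ Q[i]                   # prediction error
            pu = P[u].copy()                      # keep old p_u for Q's update
            P[u] += lr * (e * Q[i] - reg * P[u])  # gradient step + L2 shrinkage
            Q[i] += lr * (e * pu - reg * Q[i])
    return P, Q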
Linear & Poly2 models
Simple linear regression model:

\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i

Adding two-way interactions (Poly2):

\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} v_{i,j} x_i x_j
FM Model
For two-way interactions:

\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j

Model parameters: w_0 \in \mathbb{R}, \; w \in \mathbb{R}^n, \; V \in \mathbb{R}^{n \times k}.
For each x_i we have a dedicated vector v_i with k latent features. Instead of a separate weight w_{ij} for every feature interaction we use the dot product:

\langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i,f} \, v_{j,f}

Wait, it’s O(kn^2)! Not linear!
Making it O(kn)
\begin{bmatrix}
x_{11} & x_{12} & x_{13} & \dots & x_{1n} \\
x_{21} & x_{22} & x_{23} & \dots & x_{2n} \\
x_{31} & x_{32} & x_{33} & \dots & x_{3n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{d1} & x_{d2} & x_{d3} & \dots & x_{dn}
\end{bmatrix}
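The trick (Lemma 3.1 in [1]) is to rewrite the pairwise term so that no pair ever has to be enumerated:

\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j
= \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f} x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2 \right]

Each inner sum is a single pass over the n features, repeated k times: O(kn).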
Simplified version
For k = 1, n = 2:

(a + b)^2 = a^2 + 2ab + b^2

so:

ab = \frac{1}{2} \left[ (a + b)^2 - a^2 - b^2 \right]

Let v_1 x_1 = a and v_2 x_2 = b; then:

v_1 v_2 \, x_1 x_2 = \frac{1}{2} \left[ (v_1 x_1 + v_2 x_2)^2 - (v_1 x_1)^2 - (v_2 x_2)^2 \right]

And now it looks very familiar :-)
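The same reformulation translates directly into vectorized code. A minimal numpy sketch of O(kn) FM prediction (names are illustrative):

import numpy as np

def fm_predict(x, w0, w, V):
    """FM prediction in O(kn) via the reformulated pairwise term.

    x  : (n,) feature vector
    w0 : global bias
    w  : (n,) linear weights
    V  : (n, k) latent factor matrix
    """
    linear = w0 + w @ x
    s = V.T @ x                    # (k,) sums: sum_i v_{i,f} x_i
    s_sq = (V ** 2).T @ (x ** 2)   # (k,) sums of squares: sum_i v_{i,f}^2 x_i^2
    return linear + 0.5 * np.sum(s ** 2 - s_sq)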
FM vs SVM
• FM combines the advantages of SVM and factorization models
• a general predictor working on real-valued features (like SVM)
• estimates interactions well under huge sparsity, where SVMs fail (e.g. recommender systems)
• the model equation of FMs can be calculated in linear time
• comparable to a polynomial kernel in SVM, but works for very sparse data and works fast
Use case: Context-Aware Recommender Systems
• U = {Alice (A), Bob (B), Charlie (C), ...}
• I = {Titanic (TI), Notting Hill (NH), Star Wars (SW), Star Trek (ST), ...}
• S = {(A, TI, 2010-1, 5), (A, NH, 2010-2, 3), (A, SW, 2010-4, 1), (B, SW, 2009-5, 4), (B, ST, 2009-8, 5), (C, TI, 2009-9, 1), (C, SW, 2009-12, 5)}
• Example from [1]
• Example from [1]
Example of input data preparation
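Each transaction in S is flattened into one sparse row: a one-hot user block, a one-hot item block, plus context such as time. A minimal sketch of that encoding (the full example in [1] also adds other-movies-rated and last-movie-rated blocks; the time encoding below is illustrative):

import numpy as np

users = ["A", "B", "C"]
items = ["TI", "NH", "SW", "ST"]
S = [("A", "TI", "2010-1", 5), ("A", "NH", "2010-2", 3), ("A", "SW", "2010-4", 1),
     ("B", "SW", "2009-5", 4), ("B", "ST", "2009-8", 5),
     ("C", "TI", "2009-9", 1), ("C", "SW", "2009-12", 5)]

u_idx = {u: i for i, u in enumerate(users)}
i_idx = {it: j for j, it in enumerate(items)}
n = len(users) + len(items) + 1                        # user block | item block | time

X = np.zeros((len(S), n))
y = np.zeros(len(S))
for row, (u, it, t, r) in enumerate(S):
    X[row, u_idx[u]] = 1.0                             # one-hot user
    X[row, len(users) + i_idx[it]] = 1.0               # one-hot item
    year, month = t.split("-")
    X[row, -1] = (int(year) - 2009) * 12 + int(month)  # months since 2009-1
    y[row] = r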
Why use FM for this?
The drawback of tensor factorization models, and even more so of specialized factorization models, is that [1]:
(1) they are not applicable to standard prediction data (e.g. a real-valued feature vector),
(2) specialized models are usually derived individually for a specific task, requiring effort in modeling and in the design of a learning algorithm.
How about ranking?
Go for a pairwise approach!
http://www.tongji.edu.cn/~qiliu/lor_vs.html
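For example, with a BPR-style pairwise objective (an illustrative choice; the slide only links to an overview of learning-to-rank approaches), every observed preference "i ranks above j" contributes:

L(i, j) = -\ln \sigma\!\left( \hat{y}(x_i) - \hat{y}(x_j) \right)

so the FM score function stays the same and only the loss changes.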
Model expressiveness
FM ~ MF
Given only two binary indicator blocks in x (a one-hot vector for the active user u and a one-hot vector for the active item i), the model will then mimic a biased MF:

\hat{y}(x) = w_0 + w_u + w_i + \langle v_u, v_i \rangle
FM ~ PITF
Given user × item × tag interactions encoded as three one-hot blocks (user u, item i, tag t), FM will mimic a pairwise interaction tensor factorization model (PITF) [7]:

\hat{y}(x) = w_0 + w_u + w_i + w_t + \langle v_u, v_i \rangle + \langle v_u, v_t \rangle + \langle v_i, v_t \rangle
And others
(e.g. factorized NN, KNN++, SVD++, …)
presented in [2].
Field-aware FM
• Has been used to win two CTR competitions [5].
• Introduces grouped features - fields, e.g. user, color, time.
• Learns a different set of latent factors for every pair of fields:

\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_{i,f(j)}, v_{j,f(i)} \rangle x_i x_j

where f(i) is the field of feature i.
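The trade-off: FFM keeps n·f·k latent parameters (f = number of fields) instead of FM’s n·k, and the field-aware pairwise sum no longer collapses to the O(kn) form, so it is evaluated over all pairs of active features.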
Available implementations
• libfm (http://www.libfm.org/), SGD/ALS/MCMC
• FM for Julia (https://github.com/btwardow/
FactorizationMachines.jl)
• fastFM (https://github.com/ibayer/fastFM)
• DiFacto (https://github.com/dmlc/difacto)
• lightfm
• spark-libFM, libffm
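Training one of these takes a few lines. A minimal sketch assuming fastFM’s scikit-learn-style API (hyperparameters and the toy data are illustrative):

import numpy as np
import scipy.sparse as sp
from fastFM import als

# X: sparse design matrix (e.g. one-hot user/item blocks), y: ratings
X = sp.csc_matrix(np.array([[1, 0, 0, 1, 0, 0, 0],
                            [1, 0, 0, 0, 1, 0, 0],
                            [0, 1, 0, 0, 0, 1, 0]], dtype=np.float64))
y = np.array([5.0, 3.0, 4.0])

fm = als.FMRegression(n_iter=100, rank=4, l2_reg_w=0.1, l2_reg_V=0.1)
fm.fit(X, y)
pred = fm.predict(X)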
My experiments with FM on GPU
The same implementation moved from numpy to Theano ran ~7x faster, without using any special GPU tricks.
Going for click prediction?
• feature engineering (counting features, like historical CTR)
• hashing trick (see the sketch below)
• L1 regularization, FTRL - e.g. using vw (Vowpal Wabbit)
• making new features - e.g. decision tree encoding
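A minimal sketch of the hashing trick: arbitrary (field, value) strings are hashed into a fixed-size index space, so the model never needs a growing feature dictionary (the dimension and hash function below are illustrative):

import zlib

D = 2 ** 20  # fixed dimensionality of the hashed feature space

def hash_features(pairs):
    """Map (field, value) string pairs to a sparse {index: value} dict."""
    x = {}
    for field, value in pairs:
        idx = zlib.crc32(f"{field}={value}".encode()) % D
        x[idx] = x.get(idx, 0.0) + 1.0   # colliding features simply add up
    return x

# e.g. one ad impression
x = hash_features([("user", "A"), ("site", "example.com"), ("hour", "14")])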
How about now? :-)
References
[1] Rendle, Steffen. "Factorization machines." 2010 IEEE International
Conference on Data Mining. IEEE, 2010.
[2] Rendle, Steffen. "Factorization machines with libfm." ACM
Transactions on Intelligent Systems and Technology (TIST) 3.3 (2012): 57.
[3] Takács, Gábor, et al. "Matrix factorization and neighbor based algorithms for the Netflix prize problem." Proceedings of the 2008 ACM conference on Recommender systems. ACM, 2008.
[4] Paterek, Arkadiusz. "Improving regularized singular value
decomposition for collaborative filtering." Proceedings of KDD cup and
workshop. Vol. 2007. 2007.
References
[5] http://www.csie.ntu.edu.tw/~r01922136/slides/ffm.pdf
[6] Srebro, Nathan, Jason D. M. Rennie, and Tommi S. Jaakkola. "Maximum-margin matrix factorization." Advances in Neural Information Processing Systems 17. MIT Press, 2005. 1329-1336.
[7] Rendle, Steffen, and Lars Schmidt-Thieme. "Pairwise interaction tensor factorization for personalized tag recommendation." Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM '10). ACM, 2010. 81-90.
Q&A
@btwardow, Bartłomiej Twardowski
B.Twardowski@ii.pw.edu.pl
