2. Accepted in KDD ’15
• Sungjin Ahn, UC Irvine (currently a postdoctoral fellow working with Yoshua Bengio)
• Anoop Korattikara, Google
• Max Welling, University of Amsterdam
7. Bayesian PMF
• PMF (Probabilistic Matrix Factorization) considered as a generative process
• Pick user u's latent factor: $L_u = \{L_{u1}, L_{u2}, \cdots, L_{uk}\}$
• Pick movie v's latent factor: $R_v = \{R_{v1}, R_{v2}, \cdots, R_{vk}\}$
• For each observed (user, movie) pair, pick the rating as: $L_u^T R_v + \text{noise}$
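A minimal sketch of this generative process in numpy; the sizes, noise levels, and observed pairs below are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, k = 6, 4, 2        # illustrative sizes
sigma_u, sigma_v, sigma = 1.0, 1.0, 0.5

# Pick each user u's latent factor L_u and each movie v's latent factor R_v.
L = rng.normal(0.0, sigma_u, size=(n_users, k))    # row u is L_u
R = rng.normal(0.0, sigma_v, size=(n_movies, k))   # row v is R_v

# For each observed (user, movie) pair, pick the rating as L_u . R_v + noise.
observed = [(0, 1), (2, 3), (4, 0)]                # hypothetical observed pairs
ratings = {(u, v): L[u] @ R[v] + rng.normal(0.0, sigma) for (u, v) in observed}
print(ratings)
```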
10. Bayesian PMF
• PMF learning
• a solution can be found by gradient descent in U and V
• needs manual control of the regularization parameters to avoid overfitting
11. Bayesian PMF
• Bayesian PMF model
• introduces priors for the parameters
• allows model complexity to be controlled automatically
12. Bayesian PMF
• Bayesian PMF prediction
• predictive distribution: $p(R_{ij}^* \mid R) = \iint p(R_{ij}^* \mid U_i, V_j)\, p(U, V \mid R)\, dU\, dV$
• in contrast with MAP estimation in PMF: $p(R_{ij}^* \mid \{U, V\}_{MAP}(R))$
• integrates over the uncertainty in the model parameters
13. Bayesian PMF
• Bayesian PMF evaluation
• predictive distribution: approximated by averaging predictions over posterior samples
• MCMC method: the samples are generated by a Markov chain whose stationary distribution is the posterior distribution over the model parameters:
$p(U, V, \Theta_U, \Theta_V \mid R, \Theta_0) \propto p(U, V \mid R, \Theta_U, \Theta_V)\, p(\Theta_U, \Theta_V \mid \Theta_0)$
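In practice the predictive distribution is evaluated by Monte Carlo: average the prediction over the draws the chain produces. A minimal sketch, assuming `samples` is a list of posterior draws (U, V) with factors stored as rows:

```python
import numpy as np

def predict_rating(samples, i, j):
    """Approximate predictive mean of R*_ij by averaging over the
    K posterior samples (U^(k), V^(k)) generated by the Markov chain."""
    return np.mean([U[i] @ V[j] for (U, V) in samples])
```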
14. Bayesian PMF
• Bayesian PMF inference algorithm
• Gibbs sampling algorithm: sample each conditional in turn
$p(U_i \mid R, V, \Theta_U, \Theta_0) \propto p(U_i \mid \Theta_U) \prod_j p(R_{ij} \mid U_i, V_j)$
$p(V_i \mid R, U, \Theta_V, \Theta_0) \propto \cdots$ (analogous)
• each conditional is Gaussian; drawing from it requires inverting a $D \times D$ matrix, which costs $O(D^3)$ (see the sketch below)
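A sketch of one such Gibbs update for a single user factor U_i, following the Gaussian conditional in Salakhutdinov & Mnih (2008); the argument names (Lambda_U, mu_U, tau) are assumptions for illustration. The D×D solve/inversion is where the O(D³) cost arises:

```python
import numpy as np

def gibbs_sample_user(V_i, r_i, Lambda_U, mu_U, tau, rng):
    """Draw U_i from its Gaussian conditional.
    V_i: (n_i, D) factors of the movies user i rated; r_i: those n_i ratings;
    Lambda_U, mu_U: prior precision and mean; tau: rating precision."""
    prec = Lambda_U + tau * V_i.T @ V_i                               # D x D posterior precision
    mean = np.linalg.solve(prec, Lambda_U @ mu_U + tau * V_i.T @ r_i)
    cov = np.linalg.inv(prec)                                         # the O(D^3) inversion
    return rng.multivariate_normal(mean, cov)
```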
15. Drawback of MCMC
• each iteration of MCMC requires computation over the whole dataset
• thus each round of sampling is expensive
16. Stochastic Gradient Langevin Dynamics
• stochastic optimization on the posterior distribution to find the MAP parameters operates as follows:
$\Delta\theta_t = \frac{\epsilon_t}{2} \Big( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{i=1}^{n} \nabla \log p(x_{ti} \mid \theta_t) \Big)$
where $\nabla \log p(\theta_t)$ is the prior-distribution term, the sum is the likelihood over a data subset of size $n$ (out of $N$ points), and $\epsilon_t$ is the step size
• the general idea: use a data subset to compute a gradient that approximates the full-data gradient
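A minimal sketch of one SGLD step per Welling & Teh (2011): the SGD-for-MAP update above plus injected Gaussian noise with variance ε_t. The callable names grad_log_prior and grad_log_lik are illustrative assumptions:

```python
import numpy as np

def sgld_step(theta, batch, N, eps_t, grad_log_prior, grad_log_lik, rng):
    """One SGLD update on a mini-batch of size n drawn from N data points."""
    n = len(batch)
    grad = grad_log_prior(theta) + (N / n) * sum(grad_log_lik(theta, x) for x in batch)
    noise = rng.normal(0.0, np.sqrt(eps_t), size=theta.shape)  # LD noise, variance eps_t
    return theta + 0.5 * eps_t * grad + noise
```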
19. Stochastic Gradient Langevin Dynamics
• based on a rigorous proof by [Qi He, Jack Xin 2012]: when the step size $\epsilon_t \to 0$ appropriately (i.e. $\sum_t \epsilon_t = \infty$ and $\sum_t \epsilon_t^2 < \infty$), the sequence generated will converge to the true posterior distribution.
20. Stochastic Gradient Langevin Dynamics
• whether the algorithm behaves like SG (stochastic gradient optimization) or LD (Langevin dynamics) depends on whether the SG noise or the LD noise dominates the stochasticity
• when $\epsilon_t$ is large: SG noise dominates
• when $\epsilon_t$ is small: LD noise dominates
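The reason for the regime switch: the stochastic-gradient term enters the update scaled by ε_t/2, so its noise standard deviation scales like ε_t, while the injected Langevin noise has standard deviation √ε_t. A small numeric illustration, where sigma_g (the std of the mini-batch gradient estimate) is an assumed value:

```python
import numpy as np

sigma_g = 10.0                            # assumed std of the stochastic gradient estimate
for eps_t in [1e-1, 1e-3, 1e-5]:
    sg_noise = 0.5 * eps_t * sigma_g      # scales like eps_t
    ld_noise = np.sqrt(eps_t)             # scales like sqrt(eps_t)
    winner = "SG" if sg_noise > ld_noise else "LD"
    print(f"eps_t={eps_t:g}: SG noise {sg_noise:g} vs LD noise {ld_noise:g} -> {winner} dominates")
```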
21. Distributed BPMF using SGLD
• we will focus only on sampling from: $p(U, V \mid R, \Theta)$
• recall the factorization of this posterior from the previous slides
22. Distributed BPMF using SGLD
• suppose the rating matrix is represented by its observed (user, movie, rating) entries
• the gradient of the log-posterior w.r.t. U is computed as sketched below:
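A sketch of that gradient for one user factor U_i under the Gaussian PMF model; tau (rating precision) and lam (prior precision) are illustrative names:

```python
def grad_log_post_U_i(U_i, V, ratings_i, tau, lam):
    """Full-data gradient of the log-posterior w.r.t. U_i.
    ratings_i: dict {movie j: R_ij} of all observed ratings of user i;
    V: (n_movies, D) numpy array of movie factors."""
    g = -lam * U_i                                   # prior term
    for j, r_ij in ratings_i.items():
        g += tau * (r_ij - U_i @ V[j]) * V[j]        # likelihood term
    return g
```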
23. Distributed BPMF using SGLD
• we want to update only the parameters of users who have ratings in the mini-batch data
• we can find an unbiased estimate of this gradient that needs to update only the users appearing in the mini-batch (see the sketch below)
• this way, when updating parameters with one block of data, only the user parameters within that block need to be computed
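A sketch of one way to get such an unbiased estimate, in the spirit of the rescaling used in the KDD ’15 paper: keep only the ratings of user i that land in the mini-batch and rescale their likelihood contribution by N_i/|M_i|, where N_i is user i's total rating count. Names continue from the sketch above:

```python
def grad_U_i_minibatch(U_i, V, batch_ratings_i, N_i, tau, lam):
    """Unbiased gradient estimate that touches only users in the mini-batch.
    batch_ratings_i: user i's ratings falling in the mini-batch;
    N_i: total number of observed ratings of user i."""
    g = -lam * U_i
    scale = N_i / len(batch_ratings_i)               # N_i / |M_i| rescaling
    for j, r_ij in batch_ratings_i.items():
        g += scale * tau * (r_ij - U_i @ V[j]) * V[j]
    return g
```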
25. Distributed BPMF using SGLD
• run two chains in parallel with parameters $\theta_1 = \{U^1, V^1\}$ and $\theta_2 = \{U^2, V^2\}$
• assume latent features of dimension 2 (6 users, 4 movies):
$U^c = \begin{pmatrix} U^c_{11} & U^c_{12} \\ \vdots & \vdots \\ U^c_{61} & U^c_{62} \end{pmatrix}, \quad V^c = \begin{pmatrix} V^c_{11} & \cdots & V^c_{14} \\ V^c_{21} & \cdots & V^c_{24} \end{pmatrix}, \quad c \in \{1, 2\}$
26. Distributed BPMF using SGLD
• divide the rating matrix into 4 blocks:
the gray blocks form the 1st group
the white blocks form the 2nd group
• start 4 workers, one for each block (see the sketch below):
worker1 and worker2 share chain 1's parameters $\theta_1$
worker3 and worker4 share chain 2's parameters $\theta_2$
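A sketch of this partition for the 6-user × 4-movie example. Which blocks are gray is not recoverable from the slide, so the assignment below assumes the gray group is the two diagonal blocks, which is consistent with worker2 owning users 4-6 and movies 3-4 on slide 29:

```python
def assign(u, v):
    """Map a rating (u, v) (1-indexed: u in 1..6, v in 1..4) to (worker, chain)."""
    row, col = int(u > 3), int(v > 2)          # block row / block column
    if row == col:                             # diagonal ("gray") blocks: group 1
        return (1 if row == 0 else 2, 1)       # worker1 / worker2 share chain 1
    return (3 if row == 0 else 4, 2)           # worker3 / worker4 share chain 2
```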
28. Distributed BPMF using SGLD
• each worker (e.g. worker1) works as follows:
1. sample a mini-batch M from worker1's block of rating data (assume the mini-batch size is 2)
2. for each user i and movie j in M, update $U_i$ and $V_j$ in parallel using the SGLD update rules
[figure: worker1's rating block and the sampled mini-batch M of size 2]
29. Distributed BPMF using SGLD
• each worker (e.g. worker1) works as follows:
1. worker1 updates U2, U3, V1, V2 using its mini-batch data M
2. similarly, worker2 updates U4, U5, U6, V3, V4 using its mini-batch data
[figure: worker1's rating block]
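Putting the pieces together, a sketch of one worker's step: sample a mini-batch from its own block and apply SGLD updates only to the user factors appearing in it (movie factors are handled symmetrically). All parameter names follow the earlier sketches and are illustrative:

```python
import numpy as np

def worker_step(U, V, block, counts, eps_t, tau, lam, rng, batch_size=2):
    """block: list of (user i, movie j, rating r) triples owned by this worker;
    counts[i]: total number of observed ratings of user i."""
    idx = rng.choice(len(block), size=batch_size, replace=False)
    batch = [block[t] for t in idx]
    for i in {i for (i, _, _) in batch}:          # only users in the mini-batch
        mine = [(j, r) for (bi, j, r) in batch if bi == i]
        scale = counts[i] / len(mine)             # N_i / |M_i| rescaling
        g = -lam * U[i]
        for j, r in mine:
            g += scale * tau * (r - U[i] @ V[j]) * V[j]
        U[i] = U[i] + 0.5 * eps_t * g + rng.normal(0.0, np.sqrt(eps_t), U[i].shape)
    # V_j for the movies in the mini-batch is updated symmetrically (omitted)
```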
30. Distributed BPMF using SGLD
• experiments compared five different algorithms:
• dataset:
32. References
• Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo. Ruslan Salakhutdinov, Andriy Mnih. University of Toronto. ICML 2008.
• Probabilistic Matrix Factorization. Emily Fox. University of Washington. Machine Learning for Big Data, 2014.
• Bayesian Learning via Stochastic Gradient Langevin Dynamics. Max Welling, Yee Whye Teh. ICML 2011.
• Large-Scale Distributed Bayesian Matrix Factorization using Stochastic Gradient MCMC. Sungjin Ahn et al. KDD 2015.
• Bayesian Posterior Inference in the Big Data Arena. Max Welling. ICML 2014 tutorial.
• Hybrid Deterministic-Stochastic Gradient Langevin Dynamics for Bayesian Learning. Qi He, Jack Xin. Communications in Information and Systems, 2012.