2. Accepted in KDD ’15
• Sungjin Ahn, UC Irvine (currently a postdoctoral fellow working with Yoshua Bengio)
• Anoop Korattikara, Google
• Max Welling, University of Amsterdam
7. Bayesian PMF
• PMF (Probabilistic Matrix Factorization) considered as a generative process
• Pick user u's latent factor: $L_u = \{L_{u1}, L_{u2}, \cdots, L_{uk}\}$
• Pick movie v's latent factor: $R_v = \{R_{v1}, R_{v2}, \cdots, R_{vk}\}$
• For each observed (user, movie) pair, pick the rating as: $L_u^T R_v + \text{noise}$
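A minimal sketch of this generative process in numpy; the sizes, noise levels, and observed pairs below are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, k = 6, 4, 2        # illustrative sizes
sigma_u, sigma_v, sigma = 1.0, 1.0, 0.5

# Pick each user u's latent factor L_u and each movie v's latent factor R_v.
L = rng.normal(0.0, sigma_u, size=(n_users, k))    # row u is L_u
R = rng.normal(0.0, sigma_v, size=(n_movies, k))   # row v is R_v

# For each observed (user, movie) pair, pick the rating as L_u . R_v + noise.
observed = [(0, 1), (2, 3), (4, 0)]                # hypothetical observed pairs
ratings = {(u, v): L[u] @ R[v] + rng.normal(0.0, sigma) for (u, v) in observed}
print(ratings)
```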
10. Bayesian PMF
• PMF learning
• a solution can be found by gradient descent in U and V
• needs manual control of the regularization parameters to avoid overfitting
11. Bayesian PMF
• Bayesian PMF model
• introduces priors for the parameters
• allows model complexity to be controlled automatically
12. Bayesian PMF
• Bayesian PMF prediction
• predictive distribution: $p(R_{ij}^* \mid R) = \iint p(R_{ij}^* \mid U_i, V_j)\, p(U, V \mid R)\, dU\, dV$
• in contrast with MAP estimation in PMF: $p(R_{ij}^* \mid \{U, V\}_{MAP}(R))$
• integrates over the uncertainty in the model parameters
13. Bayesian PMF
• Bayesian PMF evaluation
• predictive distribution: approximated by averaging predictions over posterior samples
• MCMC method: the samples are generated by a Markov chain whose stationary distribution is the posterior distribution over the model parameters:
$p(U, V, \Theta_U, \Theta_V \mid R, \Theta_0) \propto p(U, V \mid R, \Theta_U, \Theta_V)\, p(\Theta_U, \Theta_V \mid \Theta_0)$
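In practice the predictive distribution is evaluated by Monte Carlo: average the prediction over the draws the chain produces. A minimal sketch, assuming `samples` is a list of posterior draws (U, V) with factors stored as rows:

```python
import numpy as np

def predict_rating(samples, i, j):
    """Approximate predictive mean of R*_ij by averaging over the
    K posterior samples (U^(k), V^(k)) generated by the Markov chain."""
    return np.mean([U[i] @ V[j] for (U, V) in samples])
```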
14. Bayesian PMF
• Bayesian PMF inference algorithm
• Gibbs sampling algorithm: sample each conditional in turn
$p(U_i \mid R, V, \Theta_U, \Theta_0) \propto p(U_i \mid \Theta_U) \prod_j p(R_{ij} \mid U_i, V_j)$
$p(V_i \mid R, U, \Theta_V, \Theta_0) \propto \cdots$ (analogous)
• each conditional is Gaussian; drawing from it requires inverting a $D \times D$ matrix, which costs $O(D^3)$ (see the sketch below)
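A sketch of one such Gibbs update for a single user factor U_i, following the Gaussian conditional in Salakhutdinov & Mnih (2008); the argument names (Lambda_U, mu_U, tau) are assumptions for illustration. The D×D solve/inversion is where the O(D³) cost arises:

```python
import numpy as np

def gibbs_sample_user(V_i, r_i, Lambda_U, mu_U, tau, rng):
    """Draw U_i from its Gaussian conditional.
    V_i: (n_i, D) factors of the movies user i rated; r_i: those n_i ratings;
    Lambda_U, mu_U: prior precision and mean; tau: rating precision."""
    prec = Lambda_U + tau * V_i.T @ V_i                               # D x D posterior precision
    mean = np.linalg.solve(prec, Lambda_U @ mu_U + tau * V_i.T @ r_i)
    cov = np.linalg.inv(prec)                                         # the O(D^3) inversion
    return rng.multivariate_normal(mean, cov)
```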
15. Drawback of MCMC
• each iteration of MCMC requires computation over the whole dataset
• thus each round of sampling is expensive
16. Stochastic Gradient Langevin Dynamics
• stochastic optimization on the posterior distribution to find the MAP parameters operates as follows:
$\Delta\theta_t = \frac{\epsilon_t}{2} \Big( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{i=1}^{n} \nabla \log p(x_{ti} \mid \theta_t) \Big)$
where $\nabla \log p(\theta_t)$ is the prior-distribution term, the sum is the likelihood over a data subset of size $n$ (out of $N$ points), and $\epsilon_t$ is the step size
• the general idea: use a data subset to compute a gradient that approximates the full-data gradient
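A minimal sketch of one SGLD step per Welling & Teh (2011): the SGD-for-MAP update above plus injected Gaussian noise with variance ε_t. The callable names grad_log_prior and grad_log_lik are illustrative assumptions:

```python
import numpy as np

def sgld_step(theta, batch, N, eps_t, grad_log_prior, grad_log_lik, rng):
    """One SGLD update on a mini-batch of size n drawn from N data points."""
    n = len(batch)
    grad = grad_log_prior(theta) + (N / n) * sum(grad_log_lik(theta, x) for x in batch)
    noise = rng.normal(0.0, np.sqrt(eps_t), size=theta.shape)  # LD noise, variance eps_t
    return theta + 0.5 * eps_t * grad + noise
```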
19. Stochastic Gradient Langevin Dynamics
• based on a rigorous proof by [Qi He, Jack Xin 2012]: when the step size $\epsilon_t \to 0$ appropriately (i.e. $\sum_t \epsilon_t = \infty$ and $\sum_t \epsilon_t^2 < \infty$), the sequence generated will converge to the true posterior distribution.
20. Stochastic Gradient Langevin Dynamics
• whether the algorithm behaves like SG (stochastic gradient optimization) or LD (Langevin dynamics) depends on whether the SG noise or the LD noise dominates the stochasticity
• when $\epsilon_t$ is large: SG noise dominates
• when $\epsilon_t$ is small: LD noise dominates
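The reason for the regime switch: the stochastic-gradient term enters the update scaled by ε_t/2, so its noise standard deviation scales like ε_t, while the injected Langevin noise has standard deviation √ε_t. A small numeric illustration, where sigma_g (the std of the mini-batch gradient estimate) is an assumed value:

```python
import numpy as np

sigma_g = 10.0                            # assumed std of the stochastic gradient estimate
for eps_t in [1e-1, 1e-3, 1e-5]:
    sg_noise = 0.5 * eps_t * sigma_g      # scales like eps_t
    ld_noise = np.sqrt(eps_t)             # scales like sqrt(eps_t)
    winner = "SG" if sg_noise > ld_noise else "LD"
    print(f"eps_t={eps_t:g}: SG noise {sg_noise:g} vs LD noise {ld_noise:g} -> {winner} dominates")
```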
21. Distributed BPMF using SGLD
• we will focus only on sampling from: $p(U, V \mid R, \Theta)$
• recall the factorization of this posterior from the previous slides
22. Distributed BPMF using SGLD
• suppose the rating matrix is represented by its observed (user, movie, rating) entries
• the gradient of the log-posterior w.r.t. U is computed as sketched below:
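A sketch of that gradient for one user factor U_i under the Gaussian PMF model; tau (rating precision) and lam (prior precision) are illustrative names:

```python
def grad_log_post_U_i(U_i, V, ratings_i, tau, lam):
    """Full-data gradient of the log-posterior w.r.t. U_i.
    ratings_i: dict {movie j: R_ij} of all observed ratings of user i;
    V: (n_movies, D) numpy array of movie factors."""
    g = -lam * U_i                                   # prior term
    for j, r_ij in ratings_i.items():
        g += tau * (r_ij - U_i @ V[j]) * V[j]        # likelihood term
    return g
```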
23. Distributed BPMF using SGLD
• we want to update only the parameters of users who have ratings in the mini-batch data
• we can find an unbiased estimate of this gradient that needs to update only the users appearing in the mini-batch (see the sketch below)
• this way, when updating parameters with one block of data, only the user parameters within that block need to be computed
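A sketch of one way to get such an unbiased estimate, in the spirit of the rescaling used in the KDD ’15 paper: keep only the ratings of user i that land in the mini-batch and rescale their likelihood contribution by N_i/|M_i|, where N_i is user i's total rating count. Names continue from the sketch above:

```python
def grad_U_i_minibatch(U_i, V, batch_ratings_i, N_i, tau, lam):
    """Unbiased gradient estimate that touches only users in the mini-batch.
    batch_ratings_i: user i's ratings falling in the mini-batch;
    N_i: total number of observed ratings of user i."""
    g = -lam * U_i
    scale = N_i / len(batch_ratings_i)               # N_i / |M_i| rescaling
    for j, r_ij in batch_ratings_i.items():
        g += scale * tau * (r_ij - U_i @ V[j]) * V[j]
    return g
```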
25. Distributed BPMF using SGLD
• run two chains in parallel with parameters $\theta_1 = \{U^1, V^1\}$ and $\theta_2 = \{U^2, V^2\}$
• assume latent features of dimension 2 (6 users, 4 movies):
$U^c = \begin{pmatrix} U^c_{11} & U^c_{12} \\ \vdots & \vdots \\ U^c_{61} & U^c_{62} \end{pmatrix}, \quad V^c = \begin{pmatrix} V^c_{11} & \cdots & V^c_{14} \\ V^c_{21} & \cdots & V^c_{24} \end{pmatrix}, \quad c \in \{1, 2\}$
26. Distributed BPMF using SGLD
• divide the rating matrix into 4 blocks:
the gray blocks form the 1st group
the white blocks form the 2nd group
• start 4 workers, one for each block (see the sketch below):
worker1 and worker2 share chain 1's parameters $\theta_1$
worker3 and worker4 share chain 2's parameters $\theta_2$
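A sketch of this partition for the 6-user × 4-movie example. Which blocks are gray is not recoverable from the slide, so the assignment below assumes the gray group is the two diagonal blocks, which is consistent with worker2 owning users 4-6 and movies 3-4 on slide 29:

```python
def assign(u, v):
    """Map a rating (u, v) (1-indexed: u in 1..6, v in 1..4) to (worker, chain)."""
    row, col = int(u > 3), int(v > 2)          # block row / block column
    if row == col:                             # diagonal ("gray") blocks: group 1
        return (1 if row == 0 else 2, 1)       # worker1 / worker2 share chain 1
    return (3 if row == 0 else 4, 2)           # worker3 / worker4 share chain 2
```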
28. Distributed BPMF using SGLD
• each worker (e.g. worker1) works as follows:
1. sample a mini-batch M from worker1's block of rating data (assume the mini-batch size is 2)
2. for each user i and movie j in M, update $U_i$ and $V_j$ in parallel using the SGLD update rules
[figure: worker1's rating block and the sampled mini-batch M of size 2]
29. Distributed BPMF using SGLD
• each worker (e.g. worker1) works as follows:
1. worker1 updates U2, U3, V1, V2 using its mini-batch data M
2. similarly, worker2 updates U4, U5, U6, V3, V4 using its mini-batch data
[figure: worker1's rating block]
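Putting the pieces together, a sketch of one worker's step: sample a mini-batch from its own block and apply SGLD updates only to the user factors appearing in it (movie factors are handled symmetrically). All parameter names follow the earlier sketches and are illustrative:

```python
import numpy as np

def worker_step(U, V, block, counts, eps_t, tau, lam, rng, batch_size=2):
    """block: list of (user i, movie j, rating r) triples owned by this worker;
    counts[i]: total number of observed ratings of user i."""
    idx = rng.choice(len(block), size=batch_size, replace=False)
    batch = [block[t] for t in idx]
    for i in {i for (i, _, _) in batch}:          # only users in the mini-batch
        mine = [(j, r) for (bi, j, r) in batch if bi == i]
        scale = counts[i] / len(mine)             # N_i / |M_i| rescaling
        g = -lam * U[i]
        for j, r in mine:
            g += scale * tau * (r - U[i] @ V[j]) * V[j]
        U[i] = U[i] + 0.5 * eps_t * g + rng.normal(0.0, np.sqrt(eps_t), U[i].shape)
    # V_j for the movies in the mini-batch is updated symmetrically (omitted)
```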
30. Distributed BPMF using SGLD
• experiments compared five different algorithms:
• dataset:
32. References
• Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo. Ruslan Salakhutdinov, Andriy Mnih. University of Toronto. ICML 2008.
• Probabilistic Matrix Factorization. Emily Fox. University of Washington. Machine Learning for Big Data, 2014.
• Bayesian Learning via Stochastic Gradient Langevin Dynamics. Max Welling, Yee Whye Teh. ICML 2011.
• Large-Scale Distributed Bayesian Matrix Factorization using Stochastic Gradient MCMC. Sungjin Ahn et al. KDD 2015.
• Bayesian Posterior Inference in the Big Data Arena. Max Welling. ICML 2014 tutorial.
• Hybrid Deterministic-Stochastic Gradient Langevin Dynamics for Bayesian Learning. Qi He, Jack Xin. Communications in Information and Systems, 2012.