1. Bayesian Dark Knowledge and Matrix Factorization
Masatoshi Uehara
Mentor: Oono Kenta, Brian Vogel
October 27, 2016
2. Contents
1 Introduction
2 Bayesian Dark Knowledge with various SG-MCMC methods
3 Matrix Factorization
(JPN) Masatoshi October 27, 2016 2 / 18
3. Introduction
Introduction
SG-MCMC is a family of sampling algorithms for large datasets.
We apply a variety of SG-MCMC methods to Bayesian Dark
Knowledge.
We combine GANs with Bayesian Dark Knowledge.
We apply SG-MCMC and neural networks to matrix factorization.
4. Introduction
SGLD
SGLD
SGLD is a method combining SGD with the Langevin algorithm (a gradient-based sampling method):
θ_{t+1} ← θ_t − ε_t D ∇Ũ(θ_t) + N(0, 2 ε_t D)
In the case of a Bayesian neural network, the update is as follows:
Δθ_t = (ε_t / 2) [ ∇log p(θ_t) + (N/n) Σ_{i=1}^{n} ∇log p(y_{ti} | x_{ti}, θ_t) ] + η_t,   η_t ∼ N(0, ε_t)
Note that SGD is recovered by removing the noise term η_t.
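As a sketch, one SGLD step for a Bayesian neural network can be written as follows (a minimal NumPy illustration; the function name and the toy gradients are assumptions, not the slide's implementation):

```python
import numpy as np

def sgld_step(theta, grad_log_prior, grad_log_lik_sum, N, n, eps, rng):
    # Delta theta_t = (eps_t/2) [ grad log p(theta_t)
    #                 + (N/n) * sum_i grad log p(y_ti | x_ti, theta_t) ] + eta_t
    drift = 0.5 * eps * (grad_log_prior + (N / n) * grad_log_lik_sum)
    eta = rng.normal(0.0, np.sqrt(eps), size=theta.shape)  # eta_t ~ N(0, eps_t)
    return theta + drift + eta

# Toy usage: standard-normal prior gradient, zero minibatch likelihood gradient.
rng = np.random.default_rng(0)
theta = np.zeros(3)
theta = sgld_step(theta, grad_log_prior=-theta,
                  grad_log_lik_sum=np.zeros(3), N=1000, n=10,
                  eps=1e-4, rng=rng)
```

Running the step repeatedly yields (approximate) posterior samples of θ; dropping `eta` turns the update back into plain SGD.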
5. Bayesian Dark Knowledge with various SG-MCMC methods
Bayesian Dark Knowledge Overview
Overview
Bayesian Dark Knowledge is a method combining SGLD with the concept of distillation.
SGLD is a useful method for learning Bayesian deep networks.
The problem is that SGLD needs to store many copies of the parameters.
The motivation is to replace this ensemble of networks with a single deep network.
We can then estimate predictive confidence even when the dataset is small.
6. Bayesian Dark Knowledge with various SG-MCMC methods
Method
The teacher network (posterior predictive) is denoted p(y|x, D_N).
The student network is denoted S(y|x, ω).
In the distillation phase, the following loss is minimized.
Distillation loss
L(ω) = − ∫ p(x) Σ_y p(y|x, D_N) log S(y|x, ω) dx
     ≈ − (1/|Θ|) (1/|D'|) Σ_{θ∈Θ} Σ_{x'∈D'} Σ_y p(y|x', θ) log S(y|x', ω)
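A sketch of the Monte Carlo estimate of this loss: the average cross-entropy between teacher predictive probabilities and the student's log-probabilities (the function and array names are assumptions for illustration):

```python
import numpy as np

def distillation_loss(teacher_probs, student_log_probs):
    """Monte Carlo distillation loss.

    teacher_probs: (S, D, K) -- S posterior samples theta in Theta,
        D unlabeled inputs x' in D', K classes; entries p(y | x', theta).
    student_log_probs: (D, K) -- log S(y | x', omega).
    """
    S, D, _ = teacher_probs.shape
    # -(1/|Theta|)(1/|D'|) sum_theta sum_x' sum_y p(y|x',theta) log S(y|x',omega)
    return -np.sum(teacher_probs * student_log_probs[None, :, :]) / (S * D)

# Toy check: a uniform student against a fully confident teacher.
teacher = np.tile(np.array([1.0, 0.0]), (2, 3, 1))   # (S=2, D=3, K=2)
student = np.log(np.full((3, 2), 0.5))               # log S(y|x',omega) = log 0.5
loss = distillation_loss(teacher, student)           # = -log 0.5 ≈ 0.693
```

Minimizing this in ω trains the student to match the averaged teacher predictions.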
7. Bayesian Dark Knowledge with various SG-MCMC methods
Algorithm
Algorithm
Note that the student network is trained online, so we do not have to
store many copies of the parameters.
8. Bayesian Dark Knowledge with various SG-MCMC methods
How to improve?
We want to obtain a more diverse set of teachers.
→ Use other SG-MCMC methods.
How do we obtain an unlabeled data set for distillation?
→ Use GANs.
9. Bayesian Dark Knowledge with various SG-MCMC methods
SG-HMC and SG-NHT
SG-HMC
θ_{t+1} ← θ_t + ε_t M^{-1} r_t
r_{t+1} ← r_t − ε_t ∇Ũ(θ_t) − ε_t C M^{-1} r_t + N(0, ε_t (2C − ε_t B̂_t))
SG-NHT
θ_{t+1} ← θ_t + ε_t r_t
r_{t+1} ← r_t − ε_t ∇Ũ(θ_t) − ε_t ζ_t r_t + N(0, ε_t (2C − ε_t B̂_t))
ζ_{t+1} ← ζ_t + ε_t ( (1/d) r_t^T r_t − 1 )
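One SG-NHT step can be sketched as follows (a minimal illustration following the slide's order of updates; the gradient-noise estimate B̂_t is taken to be 0, and the function names are assumptions):

```python
import numpy as np

def sgnht_step(theta, r, zeta, grad_U, eps, C, rng):
    """One SG-NHT update: position, momentum with thermostat friction
    zeta, then the thermostat variable itself (B_hat assumed 0)."""
    theta = theta + eps * r
    noise = rng.normal(0.0, np.sqrt(2.0 * C * eps), size=r.shape)
    r = r - eps * grad_U(theta) - eps * zeta * r + noise
    zeta = zeta + eps * ((r @ r) / r.size - 1.0)
    return theta, r, zeta

# Toy usage: sample from N(0, I), i.e. U(theta) = ||theta||^2 / 2.
rng = np.random.default_rng(0)
theta, r, zeta = np.zeros(2), np.zeros(2), 1.0
for _ in range(100):
    theta, r, zeta = sgnht_step(theta, r, zeta, lambda t: t, 0.01, 1.0, rng)
```

The thermostat ζ adapts the friction so the momentum's kinetic energy stays near its target, which compensates for unknown stochastic-gradient noise.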
10. Bayesian Dark Knowledge with various SG-MCMC methods
Bayesian Dark Knowledge with GANs
GANs can mimic the empirical
distribution.
In the distillation phase, we use GANs
as a simulator.
How do we remove poor generated images?
11. Bayesian Dark Knowledge with various SG-MCMC methods
Anomaly detection by GANs
(Figures: uLSIF, GAN)
12. Bayesian Dark Knowledge with various SG-MCMC methods
Result : MNIST
Setting: 800 labeled samples from MNIST; epochs: 2000; burn-in interval: 200; thinning interval: 5.
Network: 784-1200-1200-10; activation: ReLU.
Result
13. Matrix Factorization
Matrix Factorization
A rating matrix is given.
u_i: user feature vector, v_j: item feature vector, R_{ij}: rating matrix entry.
For learning, use SGD.
u_i ← u_i − η ∇_{u_i} [ (R_{ij} − u_i^T v_j)^2 + λ ‖u_i‖^2 ]
v_j ← v_j − η ∇_{v_j} [ (R_{ij} − u_i^T v_j)^2 + λ ‖v_j‖^2 ]
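A sketch of one SGD step on the per-rating objective (step size η and the function name are illustrative assumptions):

```python
import numpy as np

def mf_sgd_step(u_i, v_j, r_ij, lr, lam):
    """One SGD step on (r_ij - u_i^T v_j)^2 + lam * (||u_i||^2 + ||v_j||^2)."""
    err = r_ij - u_i @ v_j
    grad_u = -2.0 * err * v_j + 2.0 * lam * u_i
    grad_v = -2.0 * err * u_i + 2.0 * lam * v_j
    return u_i - lr * grad_u, v_j - lr * grad_v

# Toy usage: one step should reduce the squared error on this rating.
u, v = np.array([0.1, 0.1]), np.array([0.1, 0.1])
u2, v2 = mf_sgd_step(u, v, r_ij=1.0, lr=0.05, lam=0.01)
```

In practice the step is applied over observed (i, j) pairs sampled from the rating matrix.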
14. Matrix Factorization
Matrix Factorization with SGLD
p(R | U, V, τ) = Π_{i=1}^{L} Π_{j=1}^{M} [ N(R_{ij} | U_i^T V_j, τ^{-1}) ]^{I_{ij}}
p(U | λ_U) = Π_{i=1}^{L} N(U_i | 0, λ_U^{-1} I)
p(V | λ_V) = Π_{j=1}^{M} N(V_j | 0, λ_V^{-1} I)
λ_{U_d} ∼ Gamma(α_0, β_0),   λ_{V_d} ∼ Gamma(α_0, β_0)
Use Gibbs sampling for the hyperparameters.
When updating U and V, SGLD is used.
λ is tuned automatically.
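The Gibbs step for the precisions is conjugate: with U_{id} ∼ N(0, λ_{U_d}^{-1}) and a Gamma(α_0, β_0) prior (shape/rate), the conditional is Gamma(α_0 + L/2, β_0 + ½ Σ_i U_{id}²). A sketch (function name is an assumption):

```python
import numpy as np

def gibbs_lambda_U(U, alpha0, beta0, rng):
    """Sample each per-dimension precision lambda_{U_d} from its
    conjugate conditional Gamma(alpha0 + L/2, beta0 + 0.5 * sum_i U_id^2)."""
    L, D = U.shape
    shape = alpha0 + 0.5 * L
    rate = beta0 + 0.5 * np.sum(U ** 2, axis=0)
    return rng.gamma(shape, 1.0 / rate)  # numpy's gamma takes scale = 1/rate

# Toy usage: features drawn with true precision 1 should give lambda near 1.
rng = np.random.default_rng(0)
U = rng.normal(0.0, 1.0, size=(5000, 2))
lam = gibbs_lambda_U(U, alpha0=1.0, beta0=1.0, rng=rng)
```

Sampling λ this way is what makes the regularization strength self-tuning during the chain.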
15. Matrix Factorization
Neural Network Matrix Factorization
Estimate X_{n,m} by the equation X̂_{n,m} = f_θ(U_n, V_m).
Cost function (summed over observed entries):
(X_{n,m} − X̂_{n,m})^2 + λ [ ‖U_n‖_2^2 + ‖V_m‖_2^2 ]
Update θ, U_n, and V_m at the same time.
NNMF is reported to reach state-of-the-art accuracy.
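A sketch of the prediction and per-entry cost; the one-hidden-layer ReLU architecture for f_θ is an assumption, since the slide does not specify it:

```python
import numpy as np

def nnmf_predict(u_n, v_m, W1, b1, w2, b2):
    """X_hat_{n,m} = f_theta(U_n, V_m): an MLP on the concatenated features
    (hypothetical architecture for illustration)."""
    h = np.maximum(0.0, W1 @ np.concatenate([u_n, v_m]) + b1)  # ReLU hidden layer
    return float(w2 @ h + b2)

def nnmf_loss(x, x_hat, u_n, v_m, lam):
    # (X_{n,m} - X_hat_{n,m})^2 + lam * (||U_n||_2^2 + ||V_m||_2^2)
    return (x - x_hat) ** 2 + lam * (u_n @ u_n + v_m @ v_m)

# Toy usage: 2-dim features per user/item, 4 hidden units.
rng = np.random.default_rng(0)
u, v = rng.normal(size=2), rng.normal(size=2)
W1, b1 = rng.normal(size=(4, 4)), np.zeros(4)
w2, b2 = rng.normal(size=4), 0.0
x_hat = nnmf_predict(u, v, W1, b1, w2, b2)
loss = nnmf_loss(3.0, x_hat, u, v, lam=0.01)
```

Gradients of this loss flow into θ (the MLP weights) and into U_n, V_m simultaneously, which is the "update at the same time" above.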
16. Matrix Factorization
Results
Use the ML-100K and ML-1M data sets.
Evaluate by root mean squared error (RMSE).
Unfortunately, the state-of-the-art accuracy was not reproduced.
17. Matrix Factorization
Discussion
Does data generated by GANs help classifiers?
What is a good method of combining Neural Networks with matrix
factorization?
18. Matrix Factorization
References
Large-Scale Distributed Bayesian Matrix Factorization using
Stochastic Gradient MCMC
Neural Network Matrix Factorization
A Complete Recipe for Stochastic Gradient MCMC
Bayesian Dark Knowledge
Probabilistic Matrix Factorization