4. Variational Inference
¤ Consider an approximate distribution q(w | φ) (written q_φ(w) in the equations below) and minimize its KL divergence D_KL[q(w | φ) || p(w | D)] from the true posterior.
¤ This is equivalent to maximizing the following variational lower bound:

$$\mathcal{L}(\phi) = \mathcal{L}_D(\phi) - D_{KL}(q_\phi(w) \,\|\, p(w)) \to \max_{\phi \in \Phi} \tag{1}$$

$$\mathcal{L}_D(\phi) = \sum_{n=1}^{N} \mathbb{E}_{q_\phi(w)}\left[\log p(y_n \mid x_n, w)\right] \tag{2}$$

The bound consists of two parts: the expected log-likelihood L_D(φ) and the KL divergence D_KL(q_φ(w) || p(w)), which acts as a regularization term. For complex models the expectations in (1) and (2) are intractable, so the bound and its gradients cannot be computed exactly. They can still be estimated by sampling, and the bound can then be optimized with stochastic optimization.
¤ With the reparameterization trick (Kingma & Welling, 2013), the variational lower bound becomes differentiable with respect to φ.
¤ On a minibatch (x̃_m, ỹ_m), m = 1, ..., M, unbiased estimators of the lower bound and of its gradient are

$$\mathcal{L}(\phi) \simeq \mathcal{L}^{SGVB}(\phi) = \mathcal{L}_D^{SGVB}(\phi) - D_{KL}(q_\phi(w) \,\|\, p(w)) \tag{3}$$

$$\mathcal{L}_D(\phi) \simeq \mathcal{L}_D^{SGVB}(\phi) = \frac{N}{M} \sum_{m=1}^{M} \log p(\tilde{y}_m \mid \tilde{x}_m, f(\phi, \epsilon_m)) \tag{4}$$

$$\nabla_\phi \mathcal{L}_D(\phi) \simeq \frac{N}{M} \sum_{m=1}^{M} \nabla_\phi \log p(\tilde{y}_m \mid \tilde{x}_m, f(\phi, \epsilon_m)) \tag{5}$$

where w = f(φ, ε) represents the parametric noise q_φ(w) as a deterministic, differentiable function of a non-parametric noise ε ~ p(ε).
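To make estimator (4) concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's code: the linear-Gaussian likelihood, the diagonal Gaussian q with parameters (mu, log_sigma), and all function names are hypothetical choices made for this example.

```python
import numpy as np


def reparameterize(mu, log_sigma, eps):
    """w = f(phi, eps): deterministic and differentiable in phi = (mu, log_sigma)."""
    return mu + np.exp(log_sigma) * eps


def log_lik(y, x, w, noise_std=1.0):
    """log p(y | x, w) for the assumed model y ~ N(x . w, noise_std^2)."""
    resid = y - x @ w
    return -0.5 * np.log(2.0 * np.pi * noise_std**2) - 0.5 * (resid / noise_std) ** 2


def sgvb_log_lik(x_batch, y_batch, mu, log_sigma, n_total, rng):
    """Minibatch estimator (4): (N / M) * sum_m log p(y_m | x_m, f(phi, eps_m))."""
    m = len(x_batch)
    total = 0.0
    for x_m, y_m in zip(x_batch, y_batch):
        eps_m = rng.standard_normal(mu.shape)  # eps_m ~ p(eps) = N(0, I)
        total += log_lik(y_m, x_m, reparameterize(mu, log_sigma, eps_m))
    return n_total / m * total


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3))          # N = 8 toy inputs
w_true = np.array([1.0, -2.0, 0.5])
y = x @ w_true
# Estimate the expected log-likelihood from a minibatch of M = 4 examples.
estimate = sgvb_log_lik(x[:4], y[:4], w_true, np.full(3, -1.0), n_total=8, rng=rng)
print(estimate)
```

Because the noise enters only through eps_m, an automatic-differentiation framework could differentiate this estimate with respect to mu and log_sigma, which is what (5) expresses.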
5. Dropout
¤ Consider a single fully-connected layer with I input neurons and O output neurons before a non-linearity, with input matrix A (M × I), weight matrix W (I × O) and output matrix B (M × O), whose elements are indexed as a_mi, w_ij and b_mj, so that B = AW. Dropout injects multiplicative random noise Ξ into the layer input A at each training iteration (Hinton et al., 2012):

$$B = (A \odot \Xi) W, \quad \xi_{mi} \sim p(\xi) \tag{6}$$

¤ The noise is sampled from a Bernoulli or a Gaussian distribution.
¤ Bernoulli (Binary) Dropout uses ξ_mi ~ Bernoulli(1 − p): each element of the input matrix is set to zero with probability p, the dropout rate (Hinton et al., 2012).
¤ Gaussian Dropout uses continuous noise ξ_mi ~ N(1, α) with α = p / (1 − p). It works as well as Binary Dropout with dropout rate p and behaves similarly (Srivastava et al., 2014).
¤ Continuous noise is beneficial because multiplying the inputs by Gaussian noise is equivalent to putting Gaussian noise on the weights, which can be used to obtain a posterior distribution over the model's weights (Wang & Manning, 2013; Kingma et al., 2015). Putting multiplicative Gaussian noise ξ_ij ~ N(1, α) on a weight w_ij is equivalent to sampling w_ij from

$$q(w_{ij} \mid \theta_{ij}, \alpha) = \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha\theta_{ij}^2)$$

¤ The random variable w_ij is then parameterized by θ_ij as follows:

$$w_{ij} = \theta_{ij}\xi_{ij} = \theta_{ij}(1 + \sqrt{\alpha}\,\epsilon_{ij}) \sim \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha\theta_{ij}^2), \quad \epsilon_{ij} \sim \mathcal{N}(0, 1) \tag{7}$$
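The relations in (6) and (7) can be checked numerically with a small sketch. The function names and the toy values below are assumptions made for illustration; only the formulas come from the slide.

```python
import numpy as np


def gaussian_alpha(p):
    """Noise variance alpha = p / (1 - p) that matches a Binary Dropout rate p."""
    return p / (1.0 - p)


def dropout_forward(a, w, xi):
    """Equation (6): B = (A * Xi) W with element-wise noise Xi on the inputs."""
    return (a * xi) @ w


def sample_weights(theta, alpha, rng):
    """Equation (7): w = theta * (1 + sqrt(alpha) * eps), eps ~ N(0, 1)."""
    eps = rng.standard_normal(theta.shape)
    return theta * (1.0 + np.sqrt(alpha) * eps)


rng = np.random.default_rng(0)
alpha = gaussian_alpha(0.2)              # dropout rate p = 0.2
theta = np.full(200_000, 2.0)            # many copies of one weight mean
w = sample_weights(theta, alpha, rng)
# Empirical mean and variance; (7) predicts theta and alpha * theta^2.
print(w.mean(), w.var())

# Binary Dropout on a toy layer input, using a Bernoulli(1 - p) mask.
a = np.ones((4, 3))
mask = rng.binomial(1, 0.8, size=a.shape)
print(dropout_forward(a, np.ones((3, 2)), mask))
```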
6. Variational Dropout
¤ If q(W | θ, α) is regarded as an approximate posterior with parameters (θ, α), these parameters can be fitted by variational inference (Variational Dropout).
¤ When α is fixed, Variational Dropout and Gaussian Dropout are equivalent.
¤ This is because the KL term then does not depend on θ, so it is a constant in the objective.
¤ In Variational Dropout, α is a learnable parameter!
¤ In other words, α can be determined automatically during training.
¤ However, in prior work (Kingma et al., 2015) α is restricted to be at most 1.
¤ With too much noise, the variance of the gradients becomes large.
¤ Still, being able to set α all the way up to infinity (= dropout rate 1) looks likely to give more interesting results.

Gaussian Dropout training is equivalent to stochastic optimization of the expected log-likelihood (2) when we use the reparameterization trick and draw a single sample W ~ q(W | θ, α) per minibatch to estimate the expectation. Variational Dropout extends this technique and explicitly uses q(W | θ, α) as an approximate posterior distribution for a model with a special prior on the weights. The parameters θ and α of the distribution q(W | θ, α) are tuned via stochastic variational inference, i.e. φ = (θ, α) are the variational parameters, as denoted in Section 3.2. The prior distribution p(W) is chosen to be the improper log-scale uniform prior, which makes Variational Dropout with fixed α equivalent to Gaussian Dropout (Kingma et al., 2015):

$$p(\log |w_{ij}|) = \text{const} \iff p(|w_{ij}|) \propto \frac{1}{|w_{ij}|} \tag{8}$$

In this model, it is the only prior distribution that makes variational inference consistent with Gaussian Dropout (Kingma et al., 2015). When the parameter α is fixed, the D_KL(q(W | θ, α) || p(W)) term in the variational lower bound (1) does not depend on θ (Kingma et al., 2015). Maximization of the variational lower bound (1) then becomes equivalent to maximization of the expected log-likelihood (2) with fixed parameter α. It means that Gaussian Dropout training is exactly equivalent to Variational Dropout with fixed α.
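The claim that the KL term does not depend on θ under the log-uniform prior (8) can be checked with a small Monte Carlo sketch. This is an illustration only: the function name is hypothetical, and the quantity is computed up to the additive constant that the improper prior leaves undetermined. It uses the Gaussian entropy 0.5 · log(2πe · αθ²) together with log p(w) = −log|w| + const.

```python
import numpy as np


def neg_kl_mc(theta, alpha, eps):
    """Monte Carlo estimate of -KL(q || p), up to an additive constant.

    q(w) = N(theta, alpha * theta^2) and p(|w|) is proportional to 1 / |w|, so
    -KL = H(q) + E_q[log p(w)] = 0.5 * log(2*pi*e*alpha*theta^2) - E_q[log|w|] + const.
    """
    w = theta * (1.0 + np.sqrt(alpha) * eps)           # samples from q
    entropy = 0.5 * np.log(2.0 * np.pi * np.e * alpha * theta**2)
    return entropy - np.mean(np.log(np.abs(w)))


rng = np.random.default_rng(0)
eps = rng.standard_normal(100_000)                      # shared noise for every theta
values = [neg_kl_mc(theta, alpha=0.5, eps=eps) for theta in (0.01, 1.0, 50.0)]
print(values)                                           # the log|theta| terms cancel
```

The log|θ| contribution of the entropy cancels against the one inside E_q[log|w|], which is why only α remains in the KL term.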
7. Additive Noise Reparameterization
¤ Training neural networks with Variational Dropout is difficult when the dropout rates α_ij are large, because the stochastic gradients have a huge variance caused by the multiplicative noise (Kingma et al., 2015). The gradient of L^SGVB w.r.t. θ_ij factorizes as

$$\frac{\partial \mathcal{L}^{SGVB}}{\partial \theta_{ij}} = \frac{\partial \mathcal{L}^{SGVB}}{\partial w_{ij}} \cdot \frac{\partial w_{ij}}{\partial \theta_{ij}} \tag{9}$$

and in the original parameterization (θ, α) the second factor in (9) is very noisy when α_ij is large:

$$w_{ij} = \theta_{ij}(1 + \sqrt{\alpha_{ij}} \cdot \epsilon_{ij}), \quad \frac{\partial w_{ij}}{\partial \theta_{ij}} = 1 + \sqrt{\alpha_{ij}} \cdot \epsilon_{ij}, \quad \epsilon_{ij} \sim \mathcal{N}(0, 1) \tag{10}$$

¤ Therefore, apply the following change of variables: replace the multiplicative noise term 1 + √α_ij · ε_ij with an exactly equivalent additive noise term σ_ij · ε_ij, where σ_ij² = α_ij θ_ij² is treated as a new independent variable, and optimize the variational lower bound w.r.t. (θ, σ).
¤ Then

$$w_{ij} = \theta_{ij}(1 + \sqrt{\alpha_{ij}} \cdot \epsilon_{ij}) = \theta_{ij} + \sigma_{ij} \cdot \epsilon_{ij}, \quad \frac{\partial w_{ij}}{\partial \theta_{ij}} = 1, \quad \epsilon_{ij} \sim \mathcal{N}(0, 1) \tag{11}$$

so the variance of the gradient can be reduced drastically!
¤ This makes it possible to let α grow all the way to ∞.
¤ α is still used throughout, since it has a nice interpretation as a dropout rate.
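The effect of the change of variables can be seen in a short numerical sketch that compares the factor ∂w/∂θ under parameterizations (10) and (11). The function names and the toy values of θ and α are assumptions for illustration.

```python
import numpy as np


def multiplicative(theta, alpha, eps):
    """Original parameterization (10): returns w and dw/dtheta."""
    factor = 1.0 + np.sqrt(alpha) * eps
    return theta * factor, factor


def additive(theta, sigma, eps):
    """Additive parameterization (11): returns w and dw/dtheta, which is always 1."""
    return theta + sigma * eps, np.ones_like(eps)


rng = np.random.default_rng(0)
eps = rng.standard_normal(100_000)
theta, alpha = 0.3, 100.0                   # large alpha, i.e. heavy noise
sigma = np.sqrt(alpha) * theta              # sigma^2 = alpha * theta^2 (theta > 0 here)

w_mul, grad_mul = multiplicative(theta, alpha, eps)
w_add, grad_add = additive(theta, sigma, eps)
print(np.allclose(w_mul, w_add))            # the sampled weights coincide
print(grad_mul.var(), grad_add.var())       # roughly alpha versus zero
```

The sampled weights are identical, so the model is unchanged. Only the path through which θ receives gradient loses its noise, because σ is treated as independent of θ.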
The KL-divergence term can be decomposed into a sum:

$$D_{KL}(q(W \mid \theta, \alpha) \,\|\, p(W)) = \sum_{ij} D_{KL}(q(w_{ij} \mid \theta_{ij}, \alpha_{ij}) \,\|\, p(w_{ij})) \tag{12}$$

The log-scale uniform prior distribution is an improper prior, so the KL divergence can only be calculated up to an additive constant C (Kingma et al., 2015).

$$-D_{KL}(q(w_{ij} \mid \theta_{ij}, \alpha_{ij}) \,\|\, p(w_{ij})) = \frac{1}{2} \log \alpha_{ij} - \mathbb{E}_{\epsilon \sim \mathcal{N}(1, \alpha_{ij})} \log |\epsilon| + C \tag{13}$$

In the Variational Dropout model this term is intractable, as the expectation E_{ε ~ N(1, α_ij)} log |ε| in (13) cannot be computed analytically (Kingma et al., 2015). However, this term can be sampled and then approximated. Two different approximations were provided in the original paper, but they are accurate only for small values of the dropout rate α (α ≤ 1). We propose another approximation (14) that is tight for all values of α. Here σ(·) denotes the sigmoid function. Different approximations and the true value of −D_KL are presented in Fig. 1. The original −D_KL was obtained by averaging over 10^7 samples of ε with less than 2 × 10^{-3} variance of the estimation.

$$-D_{KL}(q(w_{ij} \mid \theta_{ij}, \alpha_{ij}) \,\|\, p(w_{ij})) \approx k_1 \sigma(k_2 + k_3 \log \alpha_{ij}) - 0.5 \log(1 + \alpha_{ij}^{-1}) + C \tag{14}$$

$$k_1 = 0.63576, \quad k_2 = 1.87320, \quad k_3 = 1.48695$$
We used the following intuition to obtain this formula. The
negative KL-divergence goes to a constant as log ↵ij goes
However, as $\alpha$ grows, this term also grows.
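8. About the KL Term
¤ The approximations of the KL term (regularization term) proposed in [Kingma+15] hold only for $\alpha \le 1$.
¤ This work proposes a KL-term approximation that is applicable for all values of $\alpha$.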
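The paper's approximation, $-D_{KL} \approx k_1 \sigma(k_2 + k_3 \log \alpha_{ij}) - 0.5 \log(1 + \alpha_{ij}^{-1}) + C$ with $k_1 = 0.63576$, $k_2 = 1.87320$, $k_3 = 1.48695$, can be checked against a sampling estimate of $\frac{1}{2}\log \alpha_{ij} - \mathbb{E}_{\epsilon \sim \mathcal{N}(1, \alpha_{ij})} \log|\epsilon| + C$. The following is only a minimal sketch: the function names, sample count, and default $C = 0$ are assumptions made for illustration, and the code is not numerically hardened for extreme values of $\log \alpha$.

```python
import math
import random

K1, K2, K3 = 0.63576, 1.87320, 1.48695  # constants quoted in the paper's approximation


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neg_kl_approx(log_alpha, c=0.0):
    """Approximate -D_KL as a function of log(alpha); c is the additive constant C."""
    # log(1 + 1/alpha) is evaluated as log1p(exp(-log_alpha)).
    return K1 * sigmoid(K2 + K3 * log_alpha) - 0.5 * math.log1p(math.exp(-log_alpha)) + c


def neg_kl_monte_carlo(log_alpha, n_samples=100000, seed=0, c=0.0):
    """Sampling estimate of 0.5*log(alpha) - E[log|eps|] + C with eps ~ N(1, alpha)."""
    rng = random.Random(seed)
    std = math.exp(0.5 * log_alpha)
    total = sum(math.log(abs(rng.gauss(1.0, std))) for _ in range(n_samples))
    return 0.5 * log_alpha - total / n_samples + c


if __name__ == "__main__":
    for log_alpha in (-4.0, 0.0, 4.0):
        print(log_alpha, neg_kl_approx(log_alpha), neg_kl_monte_carlo(log_alpha))
```

With this form the approximation tends to $0.5 \log \alpha + C$ for small $\alpha$ and to the constant $k_1 + C$ for large $\alpha$, so it stays usable over the whole range rather than only for $\alpha \le 1$.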
Variational Dropout Sparsifies Deep Neural Networks
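9. Computing Sparse Variational Dropout
¤ When training the lower bound, the Local Reparameterization Trick [Kingma+15] is applied in addition to the proposed Additive Noise Reparameterization to keep the variance down.
¤ For the Local Reparameterization Trick, see the earlier reading-group slides.
¤ It can be applied not only to fully-connected layers but also to convolutional layers.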
lower bound (3) with our approximation of KL-divergence (14). We apply Sparse Variational Dropout to both convolutional and fully-connected layers. To reduce the variance of $\mathcal{L}^{SGVB}$ we use a combination of the Local Reparameterization Trick and Additive Noise Reparameterization. In order to improve convergence, optimization is performed w.r.t. $(\theta, \log \sigma^2)$.

For a fully connected layer we use the same notation as in Section 3.3. In this case, Sparse Variational Dropout with the Local Reparameterization Trick and Additive Noise Reparameterization can be computed as follows:

$$b_{mj} \sim \mathcal{N}(\gamma_{mj}, \delta_{mj}), \qquad \gamma_{mj} = \sum_{i=1}^{I} a_{mi}\theta_{ij}, \qquad \delta_{mj} = \sum_{i=1}^{I} a_{mi}^2 \sigma_{ij}^2 \tag{17}$$
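The fully-connected case can be sketched in a few lines: the output mean is $\sum_i a_{mi}\theta_{ij}$, the output variance is $\sum_i a_{mi}^2 \sigma_{ij}^2$, and one Gaussian draw is made per output unit. This is an illustrative sketch using plain Python lists, not the authors' implementation; the names `sparse_vd_dense` and `log_alpha` are hypothetical, and storing $\log \sigma^2$ follows the parameterization mentioned in the text.

```python
import math
import random


def sparse_vd_dense(a, theta, log_sigma2, rng=None):
    """Sample pre-activations b for inputs a (M x I) and weight means theta (I x J).

    Returns (b, gamma, delta): samples, output means, and output variances.
    """
    rng = rng or random.Random(0)
    n_in, n_out = len(theta), len(theta[0])
    b, gamma, delta = [], [], []
    for row in a:
        g = [sum(row[i] * theta[i][j] for i in range(n_in)) for j in range(n_out)]
        d = [sum(row[i] ** 2 * math.exp(log_sigma2[i][j]) for i in range(n_in))
             for j in range(n_out)]
        gamma.append(g)
        delta.append(d)
        # One noise draw per output unit rather than per weight.
        b.append([g[j] + math.sqrt(d[j]) * rng.gauss(0.0, 1.0) for j in range(n_out)])
    return b, gamma, delta


def log_alpha(theta, log_sigma2):
    """Recover log(alpha_ij) = log(sigma_ij^2) - log(theta_ij^2) for inspection."""
    return [[s - math.log(t * t) for t, s in zip(t_row, s_row)]
            for t_row, s_row in zip(theta, log_sigma2)]
```

Sampling the activations instead of the weights keeps the number of noise draws proportional to the layer output size, which is the point of the Local Reparameterization Trick.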
Now consider a convolutional layer. Take a single input tensor $A_m^{H \times W \times C}$, a single filter $w_k^{h \times w \times C}$ and corresponding output matrix $b_{mk}^{H' \times W'}$. This filter has corresponding variational parameters $\theta_k^{h \times w \times C}$ and $\sigma_k^{h \times w \times C}$. Note that in this case $A_m$, $\theta_k$ and $\sigma_k$ are tensors. Because of linearity of convolutional layers, it is possible to apply the Local Reparameterization Trick. Sparse Variational Dropout for convolutional layers then can be expressed in a way, similar to (17). Here we use $(\cdot)^2$ as an element-wise operation, $*$ denotes the convolution operation, $\mathrm{vec}(\cdot)$ denotes reshaping of a matrix/tensor into a vector.

$$\mathrm{vec}(b_{mk}) \sim \mathcal{N}(\gamma_{mk}, \delta_{mk}), \qquad \gamma_{mk} = \mathrm{vec}(A_m * \theta_k), \qquad \delta_{mk} = \mathrm{diag}(\mathrm{vec}(A_m^2 * \sigma_k^2)) \tag{18}$$
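The convolutional case follows the same pattern: convolve the input with $\theta_k$ for the mean, and the element-wise squared input with $\sigma_k^2$ for the variance. The sketch below makes several simplifying assumptions that are not from the paper: a single channel, "valid" padding, and cross-correlation (the usual deep-learning convention) in place of a flipped-kernel convolution. The function names are hypothetical.

```python
import math
import random


def correlate2d_valid(image, kernel):
    """'Valid' 2D cross-correlation of one single-channel image with one kernel."""
    h, w = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - h + 1, len(image[0]) - w + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(h) for j in range(w))
             for c in range(out_w)]
            for r in range(out_h)]


def sparse_vd_conv(image, theta, log_sigma2, rng=None):
    """Sample an output map with mean A * theta and variance A^2 * sigma^2."""
    rng = rng or random.Random(0)
    image_sq = [[x * x for x in row] for row in image]       # element-wise square
    sigma2 = [[math.exp(v) for v in row] for row in log_sigma2]
    gamma = correlate2d_valid(image, theta)                  # output means
    delta = correlate2d_valid(image_sq, sigma2)              # output variances
    b = [[g + math.sqrt(d) * rng.gauss(0.0, 1.0) for g, d in zip(g_row, d_row)]
         for g_row, d_row in zip(gamma, delta)]
    return b, gamma, delta
```

Because both the mean and the variance are produced by the same sliding-window operation, a framework's convolution primitive can be reused twice per layer instead of sampling every filter weight.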
These formulae can be used for the implementation of Sparse Variational Dropout layers. We will provide a reference implementation using Theano (Bergstra et al., 2010)
11. Verifying Additive Noise Reparameterization
¤ Check whether Additive Noise Reparameterization actually suppresses the variance.
¤ Compare sparsity and the lower bound against training without the proposed technique.
Figure 2. Original parameterization vs Additive Noise Reparameterization.
The proposed method becomes sparse faster.
The lower bound of the proposed method converges faster.
12. MNIST
¤ Train LeNet on MNIST.
¤ LeNet-300-100 (fully connected) and LeNet-5-Caffe (convolutional).
¤ Compared with Pruning [Han+ 15], Dynamic Network Surgery [Guo+ 16], and Soft Weight Sharing [Ullrich+ 17].
Table 1. Comparison of different sparsity-inducing techniques (Pruning (Han et al., 2015b;a), DNS (Guo et al., 2016), SWS (Ullrich et al., 2017)) on LeNet architectures. Our method provides the highest level of sparsity with a similar accuracy.

Network        Method            Error %  Sparsity per Layer %   |W| / |W≠0|
LeNet-300-100  Original          1.64                            1
               Pruning           1.59     92.0 - 91.0 - 74.0     12
               DNS               1.99     98.2 - 98.2 - 94.5     56
               SWS               1.94                            23
               (ours) Sparse VD  1.92     98.9 - 97.2 - 62.0     68
LeNet-5-Caffe  Original          0.80                            1
               Pruning           0.77     34 - 88 - 92.0 - 81    12
               DNS               0.91     86 - 97 - 99.3 - 96    111
               SWS               0.97                            200
               (ours) Sparse VD  0.75     67 - 98 - 99.8 - 95    280
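The last column, $|W| / |W \neq 0|$, can be related to the per-layer sparsity figures with a short calculation. The sketch below assumes the usual LeNet-300-100 layout (784-300-100-10, biases ignored), which is not stated in the slides, and uses a hypothetical helper name. Because the tabulated percentages are rounded, the result only approximately matches the table.

```python
def compression_ratio(layer_sizes, sparsity_percent):
    """|W| / |W != 0| from per-layer weight counts and per-layer sparsity in percent."""
    total = sum(layer_sizes)
    nonzero = sum(n * (1.0 - s / 100.0) for n, s in zip(layer_sizes, sparsity_percent))
    return total / nonzero


# Assumed weight counts for LeNet-300-100 (784-300-100-10), biases ignored.
LENET_300_100 = [784 * 300, 300 * 100, 100 * 10]

if __name__ == "__main__":
    print("Pruning  :", compression_ratio(LENET_300_100, [92.0, 91.0, 74.0]))
    print("Sparse VD:", compression_ratio(LENET_300_100, [98.9, 97.2, 62.0]))
```

Under these assumptions the ratios come out at roughly 12 for Pruning and roughly 70 for Sparse VD, in line with the tabulated 12 and 68. The first layer dominates the count, so its sparsity largely determines the overall ratio.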
from a random initialization and without data augmenta-
The proposed method is the sparsest.
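14. Learning Random Labels
¤ [Zhang+ 16] showed that CNNs end up fitting even randomly assigned labels.
¤ Ordinary dropout cannot resolve this problem.
¤ With the proposed method (Sparse VD), all weights collapsed to a single value after training, and the network made only a constant prediction.
¤ Moreover, the sparsity reached 100%.
¤ When the sparsity reaches 100%, the weights become 0 (see Section 4.3).
¤ Does the proposed method penalize memorization and thereby promote generalization?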
Figure 3. Accuracy and sparsity level for VGG-like architectures of different sizes. [...] networks were trained with Binary Dropout, and Sparse VD networks were trained [...] overall sparsity level, achieved by our method, is reported as a dashed line. The [...] sparsity level is high, especially in larger networks.
Table 2. Experiments with random labeling. Sparse Variational Dropout (Sparse VD) removes all weights from the model and fails to overfit where Binary Dropout networks (BD) learn the random labeling perfectly.

Dataset   Architecture          Train acc.  Test acc.  Sparsity
MNIST     FC + BD               1.0         0.1        -
MNIST     FC + Sparse VD        0.1         0.1        100%
CIFAR-10  VGG-like + BD         1.0         0.1        -
CIFAR-10  VGG-like + Sparse VD  0.1         0.1        100%
5.5. Random Labels
Recently it was shown that the CNNs are capable of memorizing the data even with random labeling (Zhang et al., 2016). The standard dropout as well as other regulariza-