
(Seminar Reading) Weight Uncertainty in Neural Networks

Reading date: 2015/11/04



  1. Weight Uncertainty in Neural Networks (04/11/2015, Masahiro Suzuki)
  2. Paper Information Title: Weight Uncertainty in Neural Networks (ICML 2015) Authors: Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, Daan Wierstra ◦ Google DeepMind They proposed Bayes by Backprop Motivation ◦ I'd like to know how to treat model "uncertainty" in the deep learning approach. ◦ I like the Bayesian approach.
  3. Overfitting and Uncertainty Overfitting ◦ Plain feedforward neural networks are prone to overfitting. Uncertainty ◦ NNs are often incapable of correctly assessing the uncertainty in the training data. → Overly confident decisions [Figure: sketch of softmax input f(x) and softmax output σ(f(x)) for an idealised binary classification problem (Gal et al., 2015). Training data lies between dashed grey lines; the function point estimate is a solid line, function uncertainty a shaded area. Ignoring function uncertainty, a point x* far from the training data is classified as class 1 with probability 1; with uncertainty, the prediction better reflects classification uncertainty.]
  4. 4. How to prevent overfitting Various regularization schemes have been proposed. ◦ Early stopping ◦ Weight decay ◦ Dropout This paper addresses this problem by using variational Bayesian learning to introduce uncertainty in the weights of the network. ↓ Bayes by Backprop
  5. Contribution They proposed Bayes by Backprop ◦ A simple approximate learning algorithm similar to backpropagation. ◦ All weights are represented by probability distributions over possible values. It achieves good results in several domains. ◦ Classification ◦ Regression ◦ Bandit problems [Figure 1 from the paper: left, classical backpropagation assigns each weight a fixed value; right, Bayes by Backprop assigns each weight a distribution.] The approach is related to recent methods in deep generative modelling (Kingma and Welling, 2014; Rezende et al., 2014; Gregor et al., 2014), where variational inference is applied to the stochastic hidden units of an autoencoder; the number of weights in a neural network is easily two orders of magnitude larger than the number of hidden units, making the optimisation problem much larger in scale.
  6. Related Work Variational approximation ◦ [Graves 2011] → the gradients of this method can be made unbiased, and it can be used with non-Gaussian priors. Uncertainty in the hidden units ◦ [Kingma and Welling, 2014] [Rezende et al., 2014] [Gregor et al., 2014] ◦ Variational autoencoders → the number of weights in a neural network is easily two orders of magnitude larger than the number of hidden units Contextual bandit problems using Thompson sampling ◦ [Thompson, 1933] [Chapelle and Li, 2011] [Agrawal and Goyal, 2012] [May et al., 2012] → Weights with greater uncertainty introduce more variability into the decisions made by the network, leading naturally to exploration.
  7. Point Estimates of NN Neural network: P(y|x, w) ◦ Input: x ∈ ℝ^p ◦ Output: y ∈ 𝒴 ◦ The set of parameters: w ◦ Cross-entropy (categorical distribution), squared loss (Gaussian distribution) Learning from D = {(x_i, y_i)} ◦ MLE: w_MLE = arg max_w log P(D|w) = arg max_w Σ_i log P(y_i|x_i, w) ◦ MAP: w_MAP = arg max_w log P(w|D) = arg max_w log P(D|w) + log P(w)
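A minimal sketch (our toy example, not from the paper) of the MLE/MAP distinction above: for a 1-D linear model with Gaussian likelihood and a Gaussian prior N(w|0, 1/λ) on the weight, the MAP objective log P(D|w) + log P(w) reduces to ridge regression, while MLE reduces to ordinary least squares. The prior precision λ here is an arbitrary illustrative value.

```python
import numpy as np

# Toy data: y = 2*x + noise, with a single scalar weight to estimate.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 0.1 * rng.normal(size=100)

lam = 10.0                               # prior precision (assumed value)
w_mle = (x @ y) / (x @ x)                # arg max_w log P(D|w)
w_map = (x @ y) / (x @ x + lam)          # arg max_w log P(D|w) + log P(w)

# The Gaussian prior shrinks the MAP estimate toward zero,
# which is exactly the weight-decay regularisation of slide 4.
print(abs(w_map) < abs(w_mle))  # True
```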
  8. Being Bayesian The predictive distribution: P(ŷ|x̂) ◦ an unknown label: ŷ ◦ a test data item: x̂ P(ŷ|x̂) = ∫ P(ŷ|x̂, w) P(w|D) dw = E_{P(w|D)}[P(ŷ|x̂, w)] Taking this expectation = an ensemble of an uncountably infinite number of NNs ↓ Intractable
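The intractable integral above can be approximated by Monte Carlo: average the predictive probabilities of finitely many networks with weights drawn from (an approximation to) the posterior. A hedged sketch, where the "posterior" is a stand-in Gaussian rather than a trained one, and the "network" is a single sigmoid unit:

```python
import numpy as np

rng = np.random.default_rng(1)

def predict_prob(x, w):
    # toy one-weight "network": class-1 probability via a sigmoid
    return 1.0 / (1.0 + np.exp(-w * x))

x_star = 0.5
w_samples = rng.normal(loc=1.0, scale=0.5, size=100_000)  # w ~ q(w|theta), stand-in posterior
p_ensemble = predict_prob(x_star, w_samples).mean()       # MC estimate of E_{P(w|D)}[P(y|x, w)]
p_point = predict_prob(x_star, 1.0)                       # single point-estimate network
```

Averaging over weight uncertainty pulls the probability toward 0.5: the ensemble is less confident than the point estimate, which is the behaviour the Figure 1 sketch on slide 3 illustrates.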
  9. Variational Learning A variational approximation to the Bayesian posterior distribution on the weights: q(w|θ) ◦ parameters: θ The posterior distribution given the training data: P(w|D) Find the parameters θ that minimise the KL divergence KL[q(w|θ) || P(w|D)], equivalently (up to an additive constant): θ* = arg min_θ F(D, θ) where F(D, θ) = KL[q(w|θ) || P(w)] − E_{q(w|θ)}[log P(D|w)] ◦ prior-dependent part (complexity cost) ◦ data-dependent part (likelihood cost)
  10. Unbiased Monte Carlo gradients Proposition 1. Let ε be a random variable having a probability density given by q(ε) and let w = t(θ, ε) where t(θ, ε) is a deterministic function. Suppose further that the marginal probability density of w, q(w|θ), is such that q(ε)dε = q(w|θ)dw. Then for a function f with derivatives in w: ∂/∂θ E_{q(w|θ)}[f(w, θ)] = E_{q(ε)}[ ∂f(w, θ)/∂w · ∂w/∂θ + ∂f(w, θ)/∂θ ] A generalisation of the Gaussian reparameterization trick
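Proposition 1 can be checked numerically for a concrete case (our example, not the paper's): with q(w|θ) = N(μ, σ²), the transform w = t(θ, ε) = μ + σε, ε ~ N(0, 1), and f(w) = w², we have E_q[f(w)] = μ² + σ², whose exact gradient w.r.t. μ is 2μ. The reparameterised Monte Carlo gradient E_ε[∂f/∂w · ∂w/∂μ] = E_ε[2(μ + σε)] matches it:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.5, 0.7
eps = rng.normal(size=200_000)
w = mu + sigma * eps                  # w = t(theta, eps)
grad_mu_mc = (2.0 * w).mean()         # unbiased MC estimate of d/dmu E_q[f(w)]
grad_mu_exact = 2.0 * mu              # analytic gradient of mu^2 + sigma^2
```

Because the expectation is rewritten over the fixed density q(ε), the gradient passes inside it, which is what makes backpropagation through the sampling step possible.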
  11. Bayes by Backprop We approximate the expected lower bound as F(D, θ) ≈ Σ_i [ log q(w⁽ⁱ⁾|θ) − log P(w⁽ⁱ⁾) − log P(D|w⁽ⁱ⁾) ] where w⁽ⁱ⁾ ~ q(w|θ) (Monte Carlo) ◦ Every term of this approximate cost depends upon the particular weights drawn from the variational posterior.
  12. Gaussian variational posterior 1. Sample ε ~ N(0, I). 2. Let w = μ + log(1 + exp(ρ)) ∘ ε. 3. Let θ = (μ, ρ). 4. Let f(w, θ) = log q(w|θ) − log P(w) − log P(D|w). 5. Calculate the gradient with respect to the mean: Δ_μ = ∂f(w, θ)/∂w + ∂f(w, θ)/∂μ. 6. Calculate the gradient with respect to the standard-deviation parameter ρ: Δ_ρ = ∂f(w, θ)/∂w · ε/(1 + exp(−ρ)) + ∂f(w, θ)/∂ρ. 7. Update the variational parameters: μ ← μ − αΔ_μ, ρ ← ρ − αΔ_ρ.
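The seven steps above can be run end to end for a single scalar weight. This is a hedged sketch, not the paper's setup: the model is y = w·x with known Gaussian noise, the prior is N(0, 1) rather than the scale mixture, and the learning rate and iteration count are arbitrary choices. σ = log(1 + exp(ρ)) keeps the posterior standard deviation positive, and all partial derivatives are written out by hand following steps 5 and 6.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 2.0 * x + 0.1 * rng.normal(size=50)
s2 = 0.1 ** 2                                  # known observation-noise variance

mu, rho, alpha = 0.0, -3.0, 1e-4
for _ in range(3000):
    sigma = np.log1p(np.exp(rho))
    eps = rng.normal()                         # 1. sample eps ~ N(0, I)
    w = mu + sigma * eps                       # 2. w = mu + log(1 + exp(rho)) * eps
    # Partials of f(w, theta) = log q(w|theta) - log P(w) - log P(D|w):
    df_dw = (-(w - mu) / sigma**2              #   d log q / dw
             + w                               # - d log P(w) / dw, prior N(0, 1)
             - (x * (y - w * x)).sum() / s2)   # - d log P(D|w) / dw
    df_dmu = (w - mu) / sigma**2               #   d log q / dmu (w held fixed)
    df_drho = ((-1 / sigma + (w - mu)**2 / sigma**3)  # d log q / dsigma ...
               / (1 + np.exp(-rho)))                  # ... times dsigma/drho
    delta_mu = df_dw + df_dmu                  # 5. gradient w.r.t. the mean
    delta_rho = df_dw * eps / (1 + np.exp(-rho)) + df_drho  # 6. w.r.t. rho
    mu -= alpha * delta_mu                     # 7. update variational parameters
    rho -= alpha * delta_rho

# Conjugate sanity check: for this Gaussian model the exact posterior
# mean is (sum x*y / s2) / (sum x^2 / s2 + 1); mu should land near it,
# and the posterior width should have shrunk from its initial value.
post_mean = (x @ y / s2) / (x @ x / s2 + 1.0)
```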
  13. Scale mixture prior They proposed using a scale mixture of two Gaussian densities as the prior ◦ Combined with a diagonal Gaussian posterior ◦ Two degrees of freedom per weight only increases the number of parameters to optimise by a factor of two. P(w) = Π_j [ π N(w_j|0, σ₁²) + (1 − π) N(w_j|0, σ₂²) ] where σ₁ > σ₂ and σ₂ ≪ 1 Empirically they found that optimising the parameters of the prior P(w) was not useful and yielded worse results. ◦ It can be easier to change the prior parameters than the posterior parameters. ◦ The prior parameters try to capture the empirical distribution of the weights at the beginning of learning. → pick a fixed-form prior and don't adjust its hyperparameters
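The scale-mixture log-density can be evaluated stably with a log-sum-exp over the two components. A sketch with illustrative values of π, σ₁, σ₂ (not the paper's tuned settings):

```python
import numpy as np

def log_gauss(w, sigma):
    # element-wise log N(w | 0, sigma^2)
    return -0.5 * np.log(2 * np.pi * sigma**2) - w**2 / (2 * sigma**2)

def log_scale_mixture_prior(w, pi=0.5, sigma1=1.0, sigma2=1e-3):
    # log P(w) = sum_j log[ pi N(w_j|0, s1^2) + (1 - pi) N(w_j|0, s2^2) ],
    # computed via log-sum-exp so the narrow component cannot underflow.
    a = np.log(pi) + log_gauss(w, sigma1)
    b = np.log(1 - pi) + log_gauss(w, sigma2)
    m = np.maximum(a, b)
    return np.sum(m + np.log(np.exp(a - m) + np.exp(b - m)))

w = np.array([0.0, 0.5, -1.2])
lp = log_scale_mixture_prior(w)
```

The narrow component (σ₂ ≪ 1) concentrates mass near zero, which is what encourages the many low-signal weights pruned later on slide 22.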
  14. Minibatches and KL re-weighting The minibatch cost for minibatch i = 1, 2, ..., M: F_i^π(D_i, θ) = π_i KL[q(w|θ) || P(w)] − E_{q(w|θ)}[log P(D_i|w)] where π ∈ [0, 1]^M and Σ_{i=1}^M π_i = 1 ◦ Then E_M[Σ_{i=1}^M F_i^π(D_i, θ)] = F(D, θ) π_i = 2^{M−i} / (2^M − 1) works well ◦ the first few minibatches are heavily influenced by the complexity cost ◦ the later minibatches are largely influenced by the data
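The geometric weighting is easy to verify: the weights sum to one and halve from one minibatch to the next, so the KL complexity cost is paid mostly by the first minibatches of each epoch. A small sketch:

```python
# pi_i = 2^(M-i) / (2^M - 1) for minibatches i = 1..M
M = 8
pi = [2 ** (M - i) / (2 ** M - 1) for i in range(1, M + 1)]

# The weights are a normalised geometric series: each successive
# minibatch carries half the complexity cost of the previous one.
total = sum(pi)          # == 1 exactly (sum of 2^0..2^(M-1) is 2^M - 1)
ratio = pi[0] / pi[1]    # == 2
```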
  15. Contextual Bandits Contextual bandits ◦ Simple reinforcement learning problems Example: Clinical Decision Making Repeatedly: 1. A patient comes to a doctor with symptoms, medical history, test results 2. The doctor chooses a treatment 3. The patient responds to it The doctor wants a policy for choosing targeted treatments for individual patients. Context x Actions a ∈ {0, ..., K} Rewards r ~ P(r|x, a, w) Online training of the agent's model P(r|x, a, w)
  16. Thompson Sampling Thompson sampling ◦ A popular means of picking an action that trades off exploitation against exploration. ◦ Necessitates a Bayesian treatment of the model parameters 1. Sample a new set of parameters for the model. 2. Pick the action with the highest expected reward according to the sampled parameters. 3. Update the model. Go to 1.
  17. Thompson Sampling for NN Thompson sampling is easily adapted to neural networks using the variational posterior. 1. Sample weights from the variational posterior: w ~ q(w|θ). 2. Receive the context x. 3. Pick the action a that maximises E_{P(r|x,a,w)}[r]. 4. Receive reward r. 5. Update variational parameters θ. Go to 1.
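The loop above can be sketched with a much simpler model standing in for the network: a Gaussian posterior over each arm's mean reward, with conjugate updates playing the role of the variational-posterior update in step 5. The arm means and noise level are assumed values for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
true_means = np.array([0.2, 0.8])           # arm 1 is better (assumed setup)
post_mean = np.zeros(2)                     # per-arm posterior mean
post_prec = np.ones(2)                      # per-arm precision; prior N(0, 1)
pulls = np.zeros(2)

for _ in range(2000):
    # 1. Sample a full set of model parameters from the posterior.
    theta = rng.normal(post_mean, 1.0 / np.sqrt(post_prec))
    # 2-3. Act greedily w.r.t. the *sample*: uncertain arms still get tried.
    a = int(np.argmax(theta))
    # 4. Observe a noisy reward.
    r = true_means[a] + 0.1 * rng.normal()
    # 5. Conjugate posterior update (noise variance 0.1^2 assumed known).
    post_prec[a] += 1 / 0.1**2
    post_mean[a] += (r - post_mean[a]) / (0.1**2 * post_prec[a])
    pulls[a] += 1
```

Because the greedy choice is made against a posterior sample rather than the posterior mean, arms with wide posteriors are occasionally selected, which is exactly the exploration-from-uncertainty behaviour the slide describes.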
  18. 18. Experiment 1. Classification on MNIST 2. Regression curves 3. Bandits on Mushroom Task
  19. Classification on MNIST Model: ◦ 28×28 → (400, 800, or 1200 ReLU units) → (400, 800, or 1200 ReLU units) → (Softmax) → 10 Table 1. Classification error rates on MNIST (* indicates the result used an ensemble of 5 networks):

      Method                                          #Units/Layer  #Weights  Test Error
      SGD, no regularisation (Simard et al., 2003)    800           1.3m      1.6%
      SGD, dropout (Hinton et al., 2012)                                      ≈1.3%
      SGD, dropconnect (Wan et al., 2013)             800           1.3m      1.2%*
      SGD                                             400           500k      1.83%
                                                      800           1.3m      1.84%
                                                      1200          2.4m      1.88%
      SGD, dropout                                    400           500k      1.51%
                                                      800           1.3m      1.33%
                                                      1200          2.4m      1.36%
      Bayes by Backprop, Gaussian                     400           500k      1.82%
                                                      800           1.3m      1.99%
                                                      1200          2.4m      2.04%
      Bayes by Backprop, Scale mixture                400           500k      1.36%
                                                      800           1.3m      1.34%
                                                      1200          2.4m      1.32%

It is known that variational methods under-estimate uncertainty (Minka, 2001; 2005; Bishop, 2006). [Figure 2: test error on MNIST as training progresses. Figure 3: histogram of trained weights for dropout, plain SGD, and samples from Bayes by Backprop.]
  20. Classification on MNIST Test error on MNIST as training progresses ◦ Bayes by Backprop and dropout converge at similar rates.
  21. Classification on MNIST Density estimates of the weights ◦ Bayes by Backprop: sampled from the variational posterior ◦ Dropout: the weights used at test time ◦ Bayes by Backprop uses the greatest range of weights
  22. Classification on MNIST The level of redundancy in a Bayes-by-Backprop network ◦ Order the weights by their signal-to-noise ratio (|μ_i|/σ_i) and replace those with the lowest ratio by a constant zero. ◦ Even when 95% of the weights are removed the network still performs well, with a significant drop in performance once 98% of the weights have been removed. [Figure 4: density and CDF of the signal-to-noise ratio over all weights in the network; the red line denotes the 75% cut-off.] ◦ The density shows two modes of signal-to-noise ratio, and the 75% cut-off separates the two peaks; this coincides with a drop in test error from 1.24% to 1.29%, suggesting the signal-to-noise heuristic is in fact related to test performance. ◦ The scale-mixture prior encourages many near-zero weights, which can be removed without significantly impacting performance.
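The signal-to-noise pruning heuristic can be sketched as follows. The μ's and σ's here are synthetic stand-ins for a trained Bayes-by-Backprop posterior, and the 75% cut-off mirrors the red line in Figure 4:

```python
import numpy as np

rng = np.random.default_rng(5)
mu = rng.normal(scale=0.3, size=1000)                 # posterior means (synthetic)
sigma = np.abs(rng.normal(scale=0.2, size=1000)) + 1e-3  # posterior std devs
snr = np.abs(mu) / sigma                              # per-weight signal-to-noise

cutoff = np.quantile(snr, 0.75)                       # remove the lowest 75%
mu_pruned = np.where(snr >= cutoff, mu, 0.0)          # zero out low-SNR weights
frac_zero = (mu_pruned == 0.0).mean()                 # fraction of weights removed
```

Ranking by |μ|/σ removes weights that are both small and uncertain first, which is why such a large fraction can be zeroed before test error degrades.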
  23. Regression curves Training data: y = x + 0.3 sin(2π(x + ε)) + 0.3 sin(4π(x + ε)) + ε where ε ~ N(0, 0.02) [Figure 5: regression of noisy data with interquartile ranges. Black crosses are training samples, red lines are median predictions, the blue/purple region is the interquartile range. Left: Bayes by Backprop neural network; right: standard neural network.] ◦ In the region where there are no data, the standard network reduces its variance to zero and chooses a particular function, even though many extrapolations of the training data are possible. ◦ Bayes by Backprop prefers to be uncertain where there are no data: its confidence intervals diverge there, reflecting the many possible extrapolations.
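The training curve from the slide can be generated directly. Two details are assumptions of this sketch: the input range, and reading the 0.02 in N(0, 0.02) as a standard deviation (the paper's notation is ambiguous on this point):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = rng.uniform(0.0, 0.5, size=n)          # input range assumed for the sketch
eps = rng.normal(0.0, 0.02, size=n)        # treating 0.02 as the std deviation
y = (x
     + 0.3 * np.sin(2 * np.pi * (x + eps))
     + 0.3 * np.sin(4 * np.pi * (x + eps))
     + eps)
```

Fitting a network to (x, y) and querying it outside the training range is what produces the diverging-versus-collapsed uncertainty contrast of Figure 5.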
  24. 24. Bandits on Mushroom Task UCI Mushrooms dataset ◦ Each mushroom has a set of features and is labelled as edible or poisonous. ◦ An edible mushroom : a reward of 5 ◦ A poisonous mushroom : a reward of −35 or 5 (with probability 0.5) ◦ Not to eat a mushroom : a reward of 0 Model : ◦ Input (context and action)→(ReLU)→100→(ReLU)→100→ the expected reward Comparison approach : ◦ 𝜀-greedy policy : with probability 𝜀 propose a uniformly random action, otherwise pick the best action according to the neural network.
  25. Bandits on Mushroom Task Comparison of cumulative regret of various agents. ◦ Regret: the difference between the reward achievable by an oracle and the reward received by an agent. ◦ Oracle: an edible mushroom → 5, a poisonous mushroom → 0 ◦ Lower is better. [Figure 6: cumulative regret of the 5% greedy, 1% greedy, purely greedy, and Bayes by Backprop agents on the mushroom bandit.] ◦ The purely greedy agent does poorly at first, initially electing to eat nothing. ◦ An ε of 5% appears to over-explore. ◦ Bayes by Backprop quickly converges to an almost perfect rate of reward.
  26. 26. Conclusion They introduced a new algorithm for learning neural networks with uncertainty on the weights called Bayes by Backprop. The algorithm achieves good results in several domains. ◦ Classifying MNIST digits : Performance from Bayes by Backprop is comparable to that of dropout. ◦ Non-linear regression problem : Bayes by Backprop allows the network to make more reasonable predictions about unseen data. ◦ Contextual bandits : Bayes by Backprop can automatically learn how to trade-off exploration and exploitation.
