BAYESIAN NEURAL NETWORKS
Introduction
Neural networks
Bayesian neural networks
Malina Kirn, Scientific Computing, November 11, 2008
Approximating the probability
• We wish to develop an algorithm that approximates the probability that a given event is signal.
• The Monte Carlo (MC) sample provides the inputs $(x^{(1)}, \ldots, x^{(n)})$ and the truth/target values $(y^{(1)}, \ldots, y^{(n)})$, each of which takes the value 0 (for background) or 1 (for signal).
• We want $P(y^{(n+1)}=1 \mid x^{(n+1)}, (x^{(1)},y^{(1)}), \ldots, (x^{(n)},y^{(n)}))$.
• An intelligent algorithm will capitalize on correlations (possibly non-linear) in x.
The Frequentist’s view of MVA discrimination
Measure the frequency of MC signal and background events in a particular bin in physical space and assign the probability that a real event is signal accordingly.
$P(y^{(n+1)}=1 \mid x^{(n+1)}, (x^{(1)},y^{(1)}), \ldots, (x^{(n)},y^{(n)}))$
[Figure: physical space, axes x1 and x2.]
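As a rough illustration of the binning idea on this slide, here is a minimal Python sketch. It assumes equal-width bins in every input variable; the names `fit_binned_counts`, `signal_probability`, and `bin_width` are hypothetical and do not come from the slides.

```python
from collections import defaultdict


def bin_key(x, bin_width):
    """Map a point in physical space to the integer index of its bin."""
    return tuple(int(xi // bin_width) for xi in x)


def fit_binned_counts(xs, ys, bin_width):
    """Count MC signal events and all MC events in each bin."""
    signal = defaultdict(int)
    total = defaultdict(int)
    for x, y in zip(xs, ys):
        key = bin_key(x, bin_width)
        total[key] += 1
        signal[key] += y  # y is 1 for signal, 0 for background
    return signal, total


def signal_probability(signal, total, x, bin_width):
    """Fraction of MC events in the bin of x that are signal (None if the bin is empty)."""
    key = bin_key(x, bin_width)
    n = total.get(key, 0)
    if n == 0:
        return None
    return signal.get(key, 0) / n


# Toy MC sample (made-up numbers): three events in one bin, one in another.
xs = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (1.5, 1.5)]
ys = [1, 0, 1, 0]
signal, total = fit_binned_counts(xs, ys, bin_width=1.0)
```

Empty bins return `None` here, which hints at why plain binning becomes impractical as the number of input variables grows.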
The Computationalist’s view of MVA discrimination
NN: Approximate the shape of the physical space with a sum of weighted and shifted non-linear functions. Iterate to find the appropriate parameters.
$$\Delta y = \sum_{i=1}^{n} \left| y^{(i)} - f(x^{(i)}, \theta) \right|^2$$
[Figure: parameter space, axes θ1 and θ2.]
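Neural networks
[Figure: network diagram. Inputs x1, x2, x3 feed the hidden nodes h1(x), h2(x), h3(x), h4(x) through weights u11 … u34 with biases a1 … a4; the hidden nodes feed the output f(x) through weights v1 … v4 with bias b.]
$$h_j(x) = \tanh\left(a_j + \sum_i u_{ij} x_i\right)$$
$$f(x) = \frac{1}{1 + \exp\left(-\left(b + \sum_j v_j h_j(x)\right)\right)}$$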
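The network output can be evaluated directly from the hidden-node and output formulas on this slide (a tanh hidden layer feeding a logistic output). The sketch below is a plain-Python illustration, assuming `u[i][j]` holds the weight from input i to hidden node j; the function names are made up for this example.

```python
import math


def hidden_nodes(x, u, a):
    """h_j(x) = tanh(a_j + sum_i u_ij * x_i) for every hidden node j."""
    return [
        math.tanh(a[j] + sum(u[i][j] * x[i] for i in range(len(x))))
        for j in range(len(a))
    ]


def network_output(x, u, a, v, b):
    """f(x) = 1 / (1 + exp(-(b + sum_j v_j * h_j(x))))."""
    h = hidden_nodes(x, u, a)
    z = b + sum(vj * hj for vj, hj in zip(v, h))
    return 1.0 / (1.0 + math.exp(-z))


# Three inputs and four hidden nodes, as in the diagram; all values are arbitrary.
u = [[0.5, -0.2, 0.1, 0.0],
     [0.3, 0.8, -0.5, 0.2],
     [-0.7, 0.1, 0.4, 0.6]]
a = [0.0, 0.1, -0.1, 0.2]
v = [1.0, -1.0, 0.5, 0.25]
b = 0.05
out = network_output([1.0, 2.0, -1.0], u, a, v, b)
```

Because of the logistic output, `out` always lies strictly between 0 and 1, which is what lets it be read as a probability later in the talk.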
Steepest Descent
• Label all network parameters, $u_{ij}$ and $v_j$ (weights) and $a_j$ and $b$ (biases), as θ; the goal is then to iterate through θ space, minimizing the error.
• This is typically done via Steepest Descent (SD), though there are many alternatives.
• $\theta^{(k+1)} = \theta^{(k)} - c\,\nabla \Delta y(\theta^{(k)})$, where c is the step size.
• When to stop evolving?
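To make the update rule concrete, here is a small Python sketch of steepest descent. It is only an illustration: the gradient is estimated by central finite differences (an assumption made so the example works for any error function), and a simple quadratic stands in for the network error Δy. The function names are hypothetical.

```python
def numerical_gradient(error, theta, eps=1e-6):
    """Central-difference estimate of the gradient of error at theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((error(up) - error(down)) / (2.0 * eps))
    return grad


def steepest_descent(error, theta0, c=0.1, n_steps=200):
    """Repeat theta <- theta - c * grad(error)(theta) for a fixed number of steps."""
    theta = list(theta0)
    for _ in range(n_steps):
        grad = numerical_gradient(error, theta)
        theta = [t - c * g for t, g in zip(theta, grad)]
    return theta


def toy_error(theta):
    """Stand-in for the network error; its minimum is at theta = (3, -1)."""
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2


best = steepest_descent(toy_error, [0.0, 0.0])
```

A fixed number of steps is used here only to keep the sketch short; the question of when to stop is exactly what the next slide addresses.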
Train & Validation samples
[Figure: error Δy versus k (NN epoch/iteration) for the train and validation samples, with the stopping condition marked.]
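The slide does not spell out how the stopping condition is chosen. One common choice, assumed here, is to stop at the epoch with the lowest validation error and to give up once it has failed to improve for a few epochs. The names `stopping_epoch` and `patience` are hypothetical.

```python
def stopping_epoch(validation_errors, patience=3):
    """Return the epoch with the lowest validation error seen before patience runs out."""
    best_epoch = 0
    best_error = float("inf")
    for k, err in enumerate(validation_errors):
        if err < best_error:
            best_error = err
            best_epoch = k
        elif k - best_epoch >= patience:
            break  # no improvement for `patience` epochs in a row
    return best_epoch


# Made-up validation errors: they fall, then rise as the network starts to overtrain.
errors = [5.0, 4.0, 3.0, 2.5, 2.6, 2.7, 2.8, 2.9]
chosen = stopping_epoch(errors, patience=3)
```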
Number of hidden nodes and layers?
• A NN with an infinite number of hidden nodes can approximate an arbitrarily complicated function.
• Too many hidden nodes/layers and you introduce instability.
• It is common to judge a network topology as good based on how well it performs on the validation sample.
• Do this too often and you can no longer estimate the error of the NN from the validation sample. Use a third sample, called the test sample, to do so.
The Bayesian’s view of MVA discrimination
Calculate the probability that a point in the weight and bias space, θ, represents the MC data. Perform a probability-weighted average of the network output for …
$P(\theta \mid (x^{(1)},y^{(1)}), \ldots, (x^{(n)},y^{(n)}))$
[Figure: weight and bias space, axes θ1 and θ2.]
Bayesian Neural Network
$$\mathrm{BNN}(x^{(n+1)}) = \frac{1}{K} \sum_{k=1}^{K} f(x^{(n+1)}, \theta_k)$$
• where θ space is sampled K times from the distribution given by $P(\theta \mid (x^{(1)},y^{(1)}), \ldots, (x^{(n)},y^{(n)}))$
• f is the output value calculated by a ‘normal’ NN defined by $\theta_k$ and given input $x^{(n+1)}$
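Bayes’ Theorem
$$P(\theta \mid (x^{(1)},y^{(1)}), \ldots, (x^{(n)},y^{(n)})) \equiv P(\theta \mid x, y)$$
$$P(\theta \mid x, y) = \frac{P(x, y \mid \theta)\,P(\theta)}{P(x, y)} = \frac{P(y \mid x, \theta)\,P(x \mid \theta)\,P(\theta)}{P(y \mid x)\,P(x)} \propto P(y \mid x, \theta)\,P(\theta)$$
The last step uses $P(x \mid \theta) = P(x)$, since the inputs do not depend on the network parameters, and drops $P(y \mid x)$, which does not depend on θ.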
Calculating the posterior
$$P(\theta \mid x, y) \propto P(y \mid x, \theta)\,P(\theta)$$
$$P(y \mid x, \theta) = \prod_{i=1}^{n} P(y^{(i)} \mid x^{(i)}, \theta)$$
This is exactly what the NN is doing! $f(x^{(i)},\theta)$ is the probability that $y^{(i)}=1$ given $x^{(i)}$ and θ, and $1-f(x^{(i)},\theta)$ is the probability that $y^{(i)}=0$.
$$P(y \mid x, \theta) = \prod_{i=1}^{n} f(x^{(i)},\theta)^{y^{(i)}} \left(1 - f(x^{(i)},\theta)\right)^{1-y^{(i)}}$$
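In practice the product over events is usually evaluated as a sum of logarithms, because multiplying many numbers below 1 underflows quickly. The sketch below computes ln P(y|x,θ) from network outputs that are assumed to be precomputed; the small `eps` clip is an addition for numerical safety, not something from the slides.

```python
import math


def log_likelihood(f_values, ys, eps=1e-12):
    """ln P(y|x,theta) = sum_i [ y_i ln f_i + (1 - y_i) ln(1 - f_i) ]."""
    total = 0.0
    for f, y in zip(f_values, ys):
        f = min(max(f, eps), 1.0 - eps)  # keep the logarithm finite
        total += math.log(f) if y == 1 else math.log(1.0 - f)
    return total


# Made-up network outputs for one signal event and one background event.
value = log_likelihood([0.9, 0.2], [1, 0])
```

Here `value` equals ln(0.9 × 0.8), the logarithm of the product form written on the slide.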
Calculating the posterior
$$P(\theta \mid x, y) \propto P(y \mid x, \theta)\,P(\theta)$$
P(θ) is the prior probability and represents our first guess, without any data, of what P(θ|x,y) will be. Since θ are the weights and biases in our functional approximation, it’s reasonable to say that positive θ are as likely as negative θ. It’s also unlikely that a single |θ| will be much larger than another, though we do want to allow for large |θ|.
Priors & hyperpriors
• Model the prior probability for each θ as a Gaussian centered about zero with width σ.
• Typically the parameters are grouped into classes, the u’s and v’s (weights) and the a’s and b (biases), with one σ for each class ($\sigma_u$, $\sigma_v$, $\sigma_a$, $\sigma_b$).
• What should σ be? Model P(σ), the hyperprior, as a function that allows large values and has a few chosen parameters, the hyperparameters.
$$P(\theta) = \int P(\theta \mid \sigma)\,P(\sigma)\,d\sigma = \int \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\theta^2}{2\sigma^2}\right) P(\sigma)\,d\sigma$$
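For fixed widths, the Gaussian prior is easy to evaluate in log form. The sketch below covers only P(θ|σ) with one σ per parameter class; the integral over the hyperprior P(σ) is left out because the slides do not specify its functional form. The dictionary layout and the function name are assumptions made for this example.

```python
import math


def log_gaussian_prior(theta_by_class, sigma_by_class):
    """ln P(theta|sigma): zero-mean Gaussian for each parameter, one width per class."""
    total = 0.0
    for name, values in theta_by_class.items():
        sigma = sigma_by_class[name]
        for t in values:
            total += -0.5 * math.log(2.0 * math.pi * sigma ** 2) - t ** 2 / (2.0 * sigma ** 2)
    return total


# Made-up parameters grouped into the four classes named on the slide.
theta = {"u": [0.3, -1.2], "v": [0.8], "a": [0.0], "b": [2.5]}
sigma = {"u": 1.0, "v": 1.0, "a": 0.5, "b": 2.0}
lp = log_gaussian_prior(theta, sigma)
```

Flipping the sign of every parameter leaves `lp` unchanged, matching the earlier statement that positive θ are as likely as negative θ.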
Sampling θ space
• Now that we can calculate P(θ|x,y), how do we traverse θ space in a representative fashion?
• Rejection sampling
    • Draw θ randomly from the prior, P(θ). Include the point in the sum with probability given by its posterior, P(θ|x,y).
    • Inefficient: the cost grows exponentially with n.
• Markov Chain Monte Carlo
    • Set ‘potential’ = -ln(P(θ|x,y)), add a kinetic energy term, and calculate the ‘motion’ using the Hamiltonian (Metropolis algorithm, simulated annealing).
    • The θ points visited are highly correlated, so keep only every L-th step. Discard the first part of the chain, as it is not representative of P(θ|x,y).
$$\mathrm{BNN}(x^{(n+1)}) = \frac{1}{K} \sum_{k=1}^{K} f(x^{(n+1)}, \theta_k)$$
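The sketch below illustrates the burn-in and thinning ideas with a plain random-walk Metropolis sampler. This is a deliberate simplification: it is not the Hamiltonian variant described on the slide (there is no kinetic energy term or simulated dynamics), and the parameter names `step`, `burn_in`, and `thin` are hypothetical. The target is any function returning ln P(θ|x,y) up to a constant.

```python
import math
import random


def metropolis(log_post, theta0, n_keep, step=1.0, burn_in=500, thin=5, seed=0):
    """Random-walk Metropolis: discard burn_in steps, then keep every thin-th point."""
    rng = random.Random(seed)
    theta = list(theta0)
    current = log_post(theta)
    kept = []
    for t in range(burn_in + n_keep * thin):
        proposal = [th + rng.gauss(0.0, step) for th in theta]
        candidate = log_post(proposal)
        # Accept with probability min(1, P(proposal) / P(current)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if t >= burn_in and (t - burn_in + 1) % thin == 0:
            kept.append(list(theta))
    return kept


def toy_log_posterior(theta):
    """Stand-in for ln P(theta|x,y): a standard normal in one dimension."""
    return -0.5 * theta[0] ** 2


samples = metropolis(toy_log_posterior, [3.0], n_keep=2000)
mean = sum(s[0] for s in samples) / len(samples)
```

For a real network, `toy_log_posterior` would be replaced by the sum of the log-likelihood and the log-prior, and the kept points would play the role of the θk in the BNN average.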
Stopping condition
• When is K big enough?
• As before, have a validation sample, let’s say of size m, and calculate the error:
$$\Delta y = \sum_{i=1}^{m} \left| y^{(i)} - \mathrm{BNN}(x^{(i)}) \right|^2$$
• Stop when Δy is ‘small enough.’
$$\mathrm{BNN}(x^{(n+1)}) = \frac{1}{K} \sum_{k=1}^{K} f(x^{(n+1)}, \theta_k)$$
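One way to watch the validation error as K grows is to keep a running sum of the individual network outputs. The sketch below assumes the outputs f(x(i), θk) on the validation sample have already been computed; the function name and the input layout are hypothetical.

```python
def validation_error_vs_k(per_network_outputs, ys):
    """Return the validation error of the BNN average for K = 1, 2, ..., len(outputs).

    per_network_outputs[k][i] is the output of network k on validation event i.
    """
    running_sum = [0.0] * len(ys)
    errors = []
    for k, outputs in enumerate(per_network_outputs, start=1):
        running_sum = [s + o for s, o in zip(running_sum, outputs)]
        bnn = [s / k for s in running_sum]  # BNN output using the first k networks
        errors.append(sum((y - b) ** 2 for y, b in zip(ys, bnn)))
    return errors


# Two made-up networks evaluated on two validation events.
errors = validation_error_vs_k([[1.0, 0.0], [0.0, 0.0]], [1, 0])
```

What counts as ‘small enough’ is left open on the slide, so the sketch only returns the error curve and leaves the threshold to the user.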
Why use BNN?
• In theory, averaging multiple NNs together produces a more stable and accurate final result.
• In practice, single NNs generated via the MCMC technique have a smaller integrated area under the ROC curve than the corresponding BNN.
[Figure: ROC curve, signal efficiency versus background efficiency, with ‘good’ and ‘bad’ regions labeled.]
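The comparison on this slide rests on the integrated area under the ROC curve. A minimal Python sketch of that calculation follows, assuming the classifier output is a score per event and that the curve is traced by sweeping a threshold from high to low; the function name is hypothetical and the trapezoid rule is one of several reasonable ways to integrate.

```python
def roc_area(scores, labels):
    """Area under the curve of signal efficiency versus background efficiency."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_sig = sum(labels)
    n_bkg = len(labels) - n_sig
    passed_sig = 0
    passed_bkg = 0
    points = [(0.0, 0.0)]  # (background efficiency, signal efficiency)
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Events with identical scores pass the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                passed_sig += 1
            else:
                passed_bkg += 1
            i += 1
        points.append((passed_bkg / n_bkg, passed_sig / n_sig))
    # Trapezoid rule over consecutive points.
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up scores: a perfect separation gives area 1, a useless one gives 0.5.
perfect = roc_area([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
useless = roc_area([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

Applying the same function to the output of a single sampled NN and to the BNN average would give the two areas being compared.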