Motivated by a real-world financial dataset, we propose a distributed variational message passing scheme for learning conjugate exponential models. We show that the method can be seen as a projected natural gradient ascent algorithm and therefore has good convergence properties. This is supported experimentally: we show that the approach is robust with respect to common problems like imbalanced data, heavy-tailed empirical distributions, and a high degree of missing values. The scheme is based on map-reduce operations and utilizes the memory management of modern big data frameworks like Apache Flink to obtain a time-efficient and scalable implementation. The proposed algorithm compares favourably to stochastic variational inference, both in terms of speed and the quality of the learned models. For the scalability analysis, we evaluate our approach on a network with more than one billion nodes (approx. 75% of which are latent variables) using a computer cluster with 128 processing units.
Full paper link: http://www.jmlr.org/proceedings/papers/v52/masegosa16.pdf
1. d-VMP: Distributed Variational Message Passing
Andrés R. Masegosa1, Ana M. Martínez2,
Helge Langseth1, Thomas D. Nielsen2, Antonio Salmerón3,
Darío Ramos-López3, Anders L. Madsen2,4
1Department of Computer Science, Aalborg University, Denmark
2 Department of Computer and Information Science,
The Norwegian University of Science and Technology, Norway
3Department of Mathematics, University of Almería, Spain
4 Hugin Expert A/S, Aalborg, Denmark
Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016
6. Motivation
Goal: learn a generative model for a financial dataset to monitor customers and make predictions for individual customers.
[Plate model: observed variables Xi with local hidden variables Hi, replicated for i = 1, …, N; global parameters θ with hyperparameters α.]
7. Popular existing approach: SVI
Stochastic Variational Inference: iteratively updates the model parameters based on subsampled data batches.
No estimation of all the local hidden variables of the model.
No lower bound over the full data set is computed.
Poor fit if a data batch is not representative of the full data set.
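To make the contrast concrete, here is a minimal sketch of an SVI-style natural-parameter update for a toy Beta-Bernoulli model; the model, step-size schedule, and all names are illustrative, not the paper's implementation:

```python
import random

# SVI sketch: the global variational parameters are Beta natural parameters,
# updated from a subsampled batch whose sufficient statistics are rescaled
# by N / |batch| (as if the batch represented the full data set).

def svi_step(nat_params, batch, n_total, prior, step_size):
    """One stochastic natural-gradient step on the global parameters."""
    scale = n_total / len(batch)
    # Rescaled batch sufficient statistics: (#successes, #failures).
    stats = (scale * sum(batch), scale * (len(batch) - sum(batch)))
    # Intermediate estimate from prior + rescaled batch statistics.
    intermediate = (prior[0] + stats[0], prior[1] + stats[1])
    # Convex combination = natural-gradient step of size `step_size`.
    return tuple((1 - step_size) * p + step_size * q
                 for p, q in zip(nat_params, intermediate))

random.seed(0)
data = [1 if random.random() < 0.7 else 0 for _ in range(1000)]
params = (1.0, 1.0)  # Beta(1, 1) prior
for t in range(200):
    batch = random.sample(data, 50)
    params = svi_step(params, batch, len(data), (1.0, 1.0), 1.0 / (t + 2))
print(params[0] / (params[0] + params[1]))  # posterior mean, close to 0.7
```

Note that no posterior over local hidden variables is maintained here, which is exactly the limitation the slide points out.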
8. Our contribution:
d-VMP: a distributed message passing scheme.
Defined for a broader class of models than SVI.
Better and faster convergence compared to SVI.
Provides the posterior over all local latent variables and the lower bound.
10. Models:
Bayesian learning on i.i.d. data using conjugate exponential BN models:
ln p(X | η) = ln h(X) + s(X) · η − A(η)
[Plate model: observed variables Xi with local hidden variables Hi, replicated for i = 1, …, N; global parameters θ with hyperparameters α.]
We want to calculate p(θ, H | D).
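The conjugate exponential form above can be checked on a concrete distribution; a small sketch instantiating it for a Bernoulli variable (a standard textbook example, not taken from the paper):

```python
import math

# The exponential-family form ln p(X|η) = ln h(X) + s(X)·η − A(η),
# instantiated for a Bernoulli(p) variable: h(x) = 1, s(x) = x,
# natural parameter η = log(p / (1 − p)), log-partition A(η) = log(1 + e^η).

def bernoulli_logpdf_expfam(x, p):
    eta = math.log(p / (1 - p))          # natural parameter
    log_a = math.log(1 + math.exp(eta))  # log-partition function A(η)
    return math.log(1.0) + x * eta - log_a  # ln h(x) + s(x)·η − A(η)

def bernoulli_logpdf_direct(x, p):
    return x * math.log(p) + (1 - x) * math.log(1 - p)

# The two parameterizations agree for every value of x and p.
for p in (0.1, 0.5, 0.9):
    for x in (0, 1):
        assert abs(bernoulli_logpdf_expfam(x, p) - bernoulli_logpdf_direct(x, p)) < 1e-12
```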
11. Variational Inference:
Approximate p(θ, H | D) (often intractable) by finding a tractable posterior distribution q ∈ Q, minimizing:
min_{q(θ,H)∈Q} KL(q(θ, H) ‖ p(θ, H | D))
12. Variational Inference:
Approximate p(θ, H|D) (often intractable) by finding tractable
posterior distributions q ∈ Q by minimizing:
min_{q(θ,H) ∈ Q} KL(q(θ, H) || p(θ, H|D)),
In the mean field variational approach, Q is assumed to fully
factorize:
q(θ, H) = ∏_{k=1}^{M} q(θk) ∏_{i=1}^{N} ∏_{j=1}^{J} q(Hi,j),
13. Variational Inference:
Variational Inference exploits:
ln P(D) = L(q(θ, H)) + KL(q(θ, H) || p(θ, H|D)),

where ln P(D) is constant, so maximizing the lower bound L(q(θ, H)) is equivalent to minimizing the KL term.
Iterative coordinate ascent of the variational distributions.
Updating the variational distribution of a variable only
involves variables in its Markov blanket.
Coordinate ascent algorithm formulated as a message passing
scheme.
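The coordinate-ascent loop can be illustrated on a small conjugate model (a hypothetical Normal-Gamma example, not the paper's financial model): infer the mean μ and precision τ of a Gaussian under a mean-field q(μ)q(τ), where each factor's update only uses expectations from the other factor, i.e. its Markov blanket:

```python
import numpy as np

# Toy model: x_i ~ N(mu, 1/tau), mu ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0).
# Mean-field approximation q(mu, tau) = q(mu) q(tau), fitted by coordinate ascent.
rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=1000)
N, xbar = len(x), x.mean()
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = a0 / b0                      # initial guess for E_q[tau]
a_N = a0 + (N + 1) / 2.0             # this shape parameter never changes
for _ in range(50):
    # Update q(mu) = N(mu_N, 1/lam_N), given E_q[tau].
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lam_N
    # Update q(tau) = Gamma(a_N, b_N), given E_q[mu] and E_q[mu^2].
    b_N = b0 + 0.5 * (np.sum(x**2 - 2.0 * x * E_mu + E_mu2)
                      + lam0 * (E_mu2 - 2.0 * mu0 * E_mu + mu0**2))
    E_tau = a_N / b_N
```

After convergence, the posterior mean of μ is close to the sample mean and E[τ] is close to the inverse sample variance, as expected for a conjugate model with vague priors.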
14. Variational Message Passing (VMP):
Message from parent to child: moment parameters
(expectation of the sufficient statistics).
Message from child to parent: natural parameters
(based on the messages received from the co-parents).
16. Distributed optimization of the lower bound:
[Figure: the master node holds the global posterior q(t)(θ) over parameters θ with hyperparameters α; slave nodes 1, 2 and 3 each hold a data shard (plates over X, H, i = 1, …, N) and the local posteriors q(t)(H1), q(t)(H2), q(t)(H3).]

q(t)(θ) is broadcast to all the slave nodes.

Objective: max_{q ∈ Q} L(q(θ, H))
17. Distributed optimization of the lower bound:
[Same master/slaves figure as slide 16.]

q(t+1)(H) = argmax_{q(H)} L(q(H), q(t)(θ))

Objective: max_{q ∈ Q} L(q(θ, H))
18. Distributed optimization of the lower bound:
[Same master/slaves figure as slide 16.]

q(t+1)(Hn) = argmax_{q(Hn)} Ln(q(Hn), q(t)(θ)), computed in parallel on each slave n.

Objective: max_{q ∈ Q} L(q(θ, H))
19. Distributed optimization of the lower bound:
[Same master/slaves figure as slide 16.]

q(t+1)(θ) = argmax_{q(θ)} L(q(t)(H), q(θ))

The global parameters may be coupled.

Objective: max_{q ∈ Q} L(q(θ, H))
20. Candidate solutions:
Resort to a generalized mean-field approximation, as SVI does: it does
not factorize over the global parameters.
Prohibitive for models with a large number of global (coupled)
parameters, e.g. linear regression.
Our proposal: VMP as a distributed projected natural
gradient ascent (PNGA) algorithm.
21. d-VMP as a projected natural gradient ascent
Insight 1: VMP can be expressed as a projected natural
gradient ascent algorithm.
η_X^{(t+1)} = η_X^{(t)} + ρ_{X,t} [∇̂_η L(η^{(t)})]^+_X    (1)

[·]^+ is the projection operator.
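A toy numeric check of this view (our own sketch, under an assumed model: a Gaussian mean with known precision): for conjugate-exponential nodes the natural gradient of L at η equals η* − η, where η* is the VMP coordinate-update target, so a step of size ρ = 1 reproduces the VMP update exactly:

```python
import numpy as np

# Toy conjugate node: q(mu) for a Gaussian mean with known precision tau,
# prior mu ~ N(0, 1). Natural parameterisation eta = (mean/var, -1/(2*var)).
tau = 2.0
prior_eta = np.array([0.0, -0.5])
x = np.array([1.0, 2.0, 3.0])

# VMP fixed point for eta: prior natural parameters plus the children's
# messages sum_i (tau * x_i, -tau / 2).
eta_star = prior_eta + np.array([tau * x.sum(), -0.5 * tau * len(x)])

# One natural-gradient step from the prior; for conjugate models the
# natural gradient of the lower bound is simply eta_star - eta.
eta = prior_eta.copy()
rho = 1.0
eta = eta + rho * (eta_star - eta)

# Recover the moment parameters of the updated q(mu).
post_var = -1.0 / (2.0 * eta[1])
post_mean = eta[0] * post_var
```

With ρ = 1 the step lands on η*, and the recovered mean 12/7 and variance 1/7 match the exact conjugate posterior N(12/7, 1/7) for this toy setting.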
22. d-VMP as a projected natural gradient ascent
Insight 2: The natural gradient of the lower bound can be
expressed as follows:
∇̂_{η_θ} L = m_{Pa(θ)→θ} + Σ_i m_{H_i→θ}
The gradient can be computed in parallel.
23. d-VMP as a projected natural gradient ascent
Insight 3: Global parameters are “coupled” only if they belong
to each other’s Markov blanket.
Define a disjoint partition of the global parameters: R = {J1, …, JS}
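A sketch of how such a partition could be computed (illustrative only; the union-find approach and all names here are our own, not taken from the paper):

```python
def partition_parameters(params, markov_blanket):
    """Partition global parameters into disjoint blocks R = {J_1, ..., J_S}:
    parameters end up in the same block iff they are linked, possibly
    transitively, through each other's Markov blankets."""
    parent = {p: p for p in params}

    def find(p):
        # Union-find root lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for p in params:
        for q in markov_blanket.get(p, ()):
            if q in parent:               # only couple global parameters
                parent[find(p)] = find(q)

    blocks = {}
    for p in params:
        blocks.setdefault(find(p), set()).add(p)
    return list(blocks.values())

# Example: mu/sigma are coupled, w0/w1 are coupled, pi stands alone.
blocks = partition_parameters(
    ['mu', 'sigma', 'pi', 'w0', 'w1'],
    {'mu': ['sigma'], 'sigma': ['mu'], 'pi': [], 'w0': ['w1'], 'w1': ['w0']})
```

Singleton blocks can then be updated with ρ = 1 (an exact coordinate step), while coupled blocks need a learning rate.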
24. d-VMP as a projected natural gradient ascent
d-VMP is based on performing independent global updates
over the global parameters of each partition:
η_{Jr}^{(t+1)} = η_{Jr}^{(t)} + ρ_{r,t} [∇̂_η L(η^{(t)})]^+_{Jr}

ρ_{r,t} is the learning rate. If |Jr| = 1 then ρ_{r,t} = 1.
25. d-VMP as a distributed PNGA algorithm:
[Figure: the master node performs the global updates η_{Jr}^{(t+1)}; slave nodes 1, 2 and 3 compute their local contributions η_1^{(t+1)}, η_2^{(t+1)}, η_3^{(t+1)} over their data shards.]

η_{Jr}^{(t+1)} = η_{Jr}^{(t)} + ρ_{r,t} [∇̂_η L(η^{(t)})]^+_{Jr}, for all Jr ∈ R

Objective: max_{q ∈ Q} L(q(θ, H))
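The map-reduce structure of this figure can be sketched in a few lines (a single-process toy for a Gaussian mean with known precision, our own assumed example; in the paper this pattern runs distributed on Apache Flink):

```python
import numpy as np
from functools import reduce

# Toy node: Gaussian mean with known precision tau, prior N(0, 1),
# natural parameterisation eta = (mean/var, -1/(2*var)).
# The data is split into shards, one per "slave".
tau = 2.0
prior_eta = np.array([0.0, -0.5])
shards = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0])]

def map_step(shard):
    # Each slave computes its shard's natural-parameter message to theta:
    # sum_i (tau * x_i, -tau / 2).
    return np.array([tau * shard.sum(), -0.5 * tau * len(shard)])

def reduce_step(m1, m2):
    # Messages are additive, so the reduce is a plain sum.
    return m1 + m2

eta_posterior = prior_eta + reduce(reduce_step, map(map_step, shards))

# Sanity check: identical to processing all the data on a single node.
all_data = np.concatenate(shards)
eta_single = prior_eta + np.array([tau * all_data.sum(),
                                   -0.5 * tau * len(all_data)])
```

Because the per-datapoint messages only need to be summed, the reduce is associative and commutative, which is exactly what map-reduce frameworks like Flink require.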
27. Model fit to the data
[Plate model: observed Xij with local hidden variables Hij and Yi, per-client hidden Hi, global parameters θ with hyperparameters α; plates j = 1, …, J and i = 1, …, N.]

[Figure: density over the attribute range, comparing SVI 1%, SVI 5%, SVI 10% and VMP 1%.]
Representative sample of 55K clients (N) and 33 attributes (J).
“Unrolled” model of more than 3.5M nodes (75% latent variables).
29. Model fit to the data
[Figure: global lower bound vs. time (seconds). Legend (Alg. BS(data%)/LR): SVI with batch sizes 1%, 5%, 10% crossed with learning rates 0.55, 0.75, 0.99, plus d-VMP.]
31. Mixtures of learnt posteriors for one attribute
[Figure: mixtures of the learnt posteriors for one attribute, comparing SVI 1%, SVI 5%, SVI 10% and VMP 1%.]
32. Scalability settings
Generated data set of 42 million samples over 12 variables.
“Unrolled” model of more than 1 billion (10^9) nodes
(75% latent variables).
AMIDST Toolbox with Apache Flink.
Amazon Web Services (AWS) as distributed computing
environment.
35. Conclusions
Variational methods can be scaled using distributed
computation instead of sampling techniques.
Bayesian learning in a model with more than 1 billion nodes
(75% of them hidden).
36. Thank you for your attention
Questions?
You can download our open source Java toolbox:
amidsttoolbox.com
Acknowledgments: This project has received funding from the European Union’s
Seventh Framework Programme for research, technological development and
demonstration under grant agreement no. 619209.