PhD Research Proposal
Title: Bayesian Inference for Big Data with Stochastic MCMC and Variational Bayes
Author: Komlan ATITEY
Abstract—This research proposal describes the project that the author will pursue for his PhD
dissertation. We consider the problem of big data in multi-target tracking with an unknown
number of targets. The research develops a Bayesian framework for big data inference, based on
conceptualized transformation, sampling and censoring processes applied to the big data
measurements. Proper inference requires modeling all of these processes, which can be very
complex, if feasible at all. However, when certain sampling and censoring ignorability conditions
are fulfilled, inference can be drawn from the big data measurements as if they were acquired
from a random sample.
BACKGROUND
MULTITARGET tracking has a long history spanning over 50 years; it refers to the problem of
jointly estimating the number of targets and their states from sensor data. Today, multitarget
tracking has found applications in diverse disciplines, including air traffic control; intelligence,
surveillance, and reconnaissance (ISR); space applications; oceanography; autonomous vehicles
and robotics; remote sensing; computer vision; and biomedical research. During the last decade,
advances in multitarget tracking techniques, along with sensing and computing technologies,
have opened up numerous research avenues as well as application areas. As the statistical models
used to understand complex systems grow, the strategies used to fit these models must scale
accordingly. While advanced computational methods are being developed to fit these complex
models, their speed and memory requirements often demand enormous computational power in
the form of large clusters. This reliance on big data and high-dimensional systems is rapidly
becoming unsustainable, especially for practitioners without access to such resources.
Accordingly, there is a substantial and growing need for statistically efficient methods that scale
in both speed and memory while remaining straightforward to implement and communicate.
PROBLEM STATEMENT
In the field of multiple-target tracking, advances in sensor technology make it possible to collect
large amounts of real-time observation data from real systems during simulations. Inaccurate
simulation results are often inevitable due to imperfect models and inaccurate inputs. Bayesian
analysis is among the most effective families of methods for analyzing data, and is now widely
adopted in the statistical sciences as well as in artificial intelligence (AI) technologies such as
machine learning. The Bayesian approach offers several attractive advantages over other
techniques: flexibility in constructing complex models from simple parts; fully coherent
inferences from data; natural incorporation of prior knowledge; explicit modeling assumptions;
exact reasoning about uncertainty over model order and parameters; and protection against
overfitting.
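To make the "natural incorporation of prior knowledge" concrete, the following minimal sketch (an illustration, not part of the proposal itself) shows a conjugate Beta-Binomial update in Python; the prior parameters and observation counts are hypothetical.

```python
def beta_binomial_posterior(alpha, beta, successes, trials):
    """Conjugate update: a Beta(alpha, beta) prior combined with a
    Binomial likelihood yields a Beta posterior in closed form,
    so no sampling is needed for this simple model."""
    return alpha + successes, beta + (trials - successes)

# Hypothetical prior belief: detection probability around 0.5, i.e. Beta(2, 2).
# After observing 70 detections in 100 trials:
a, b = beta_binomial_posterior(2.0, 2.0, successes=70, trials=100)
posterior_mean = a / (a + b)  # (2 + 70) / (2 + 2 + 100)
```

The posterior mean (about 0.69) sits between the prior mean (0.5) and the raw data frequency (0.7), with the prior's influence shrinking as more data arrive.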
On the other hand, there is a general perception that the Bayesian approach can be too slow to be
practically useful on big data sets. This is because exact Bayesian computations are typically
intractable, so a range of more practical approximate algorithms is needed, including variational
approximations, sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC).
Unfortunately, MCMC methods do not scale well to big data sets, since they require many
iterations to reduce Monte Carlo noise, and each iteration already involves an expensive sweep
through the whole data set.
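The per-iteration cost described above can be seen in a minimal Metropolis-Hastings sketch for the mean of a Gaussian: every accept/reject decision requires a full-data likelihood evaluation. The model, synthetic data and constants are illustrative assumptions, not taken from the proposal.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=100_000)  # synthetic "big" data set

def log_likelihood(theta):
    # One full sweep over all N observations -- the per-iteration bottleneck.
    return -0.5 * np.sum((data - theta) ** 2)

def metropolis_hastings(n_iters, step=0.05):
    theta, ll = 0.0, log_likelihood(0.0)  # cache current log-likelihood
    samples = []
    for _ in range(n_iters):
        proposal = theta + step * rng.normal()
        ll_prop = log_likelihood(proposal)  # full-data evaluation, every iteration
        if np.log(rng.uniform()) < ll_prop - ll:
            theta, ll = proposal, ll_prop
        samples.append(theta)
    return samples

samples = metropolis_hastings(200)  # the chain drifts toward the data mean
```

Even this toy chain touches all 100,000 observations per step; with billions of items and many thousands of iterations, the cost becomes prohibitive, which motivates the subsampling strategies proposed below.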
PREVIOUS RESEARCH
For such big data problems, Scott et al. [1] argue that communication between large numbers of
machines is expensive (regardless of the amount of data being communicated), so there is a need
for algorithms that perform distributed approximate Bayesian analyses with minimal
communication. The paper by Mihaylova et al. [2] presents the various aspects of the problems of
group and extended object tracking, the underlying difficulties, and the key factors facilitating
their solution in the context of Bayesian estimation. They present methods for small groups and
for large groups, including MCMC methods, the random-matrices approach and random finite set
statistics methods. MCMC methods arguably form the most popular class of Bayesian
computational techniques, owing to their flexibility, general applicability and asymptotic
exactness. The work by Korattikara et al. [3] examined MCMC methods and showed the need to
develop an approximation of the Metropolis-Hastings algorithm for Bayesian posterior sampling.
Next, the paper by Gelman et al. [4] considered expectation propagation (EP) as a prototype for
scalable algorithms that partition big data sets into many parts and analyze each part in parallel to
perform inference about shared parameters. EP iteratively approximates the moments of the
tilted distributions and incorporates those approximations into a global posterior approximation.
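As one concrete instance of the distributed, low-communication approach advocated in [1], the following simplified sketch implements consensus Monte Carlo for the mean of a Gaussian with known unit variance: each data shard is sampled independently and the draws are combined with precision weights. The model, shard count and flat-prior treatment are simplifying assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=10_000)   # synthetic data set
shards = np.array_split(data, 10)          # distribute across 10 "machines"

def shard_draws(shard, n_draws=2000):
    """Sample the shard posterior of a Gaussian mean with known unit
    variance. Under a flat prior (whose fractional power is still flat),
    conjugacy gives the shard posterior in closed form: N(mean, 1/n)."""
    n = len(shard)
    return rng.normal(shard.mean(), np.sqrt(1.0 / n), size=n_draws)

draws = np.stack([shard_draws(s) for s in shards])   # shape (10, 2000)
weights = 1.0 / draws.var(axis=1, keepdims=True)     # per-shard precision weights
consensus = (weights * draws).sum(axis=0) / weights.sum(axis=0)
# consensus now approximates draws from the full-data posterior
```

The only communication is the final exchange of per-shard draws; for this Gaussian case the precision-weighted average is exact, while for general models it is an approximation.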
APPROACH TO PROBLEM
Usually, taking more data into account and considering high-dimensional systems improves a
model's performance. In this project we propose to develop the theoretical foundations for a new
class of MCMC inference strategies that can scale to billions of data items, thereby unlocking the
strengths of Bayesian methods for big data. The essential idea is to use only a small subset of the
data during each parameter-update iteration of the algorithm, so that many iterations can be
performed cheaply.
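The subset-per-iteration idea is exemplified by stochastic gradient Langevin dynamics (Welling and Teh, 2011), sketched below for the mean of a unit-variance Gaussian under a flat prior. The step size, batch size and model are illustrative assumptions; a full treatment would, among other things, decrease the step size over iterations.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100_000
data = rng.normal(1.0, 1.0, size=N)  # synthetic "big" data set

def sgld(n_iters=2000, batch_size=100, eps=1e-5):
    """Stochastic gradient Langevin dynamics: each update touches only a
    minibatch, with the gradient rescaled by N / batch_size so that it is
    an unbiased estimate of the full-data log-posterior gradient."""
    theta = 0.0
    trace = np.empty(n_iters)
    for t in range(n_iters):
        batch = data[rng.integers(0, N, size=batch_size)]
        grad = (N / batch_size) * np.sum(batch - theta)  # noisy gradient estimate
        # Langevin update: half-step along the gradient plus injected Gaussian noise.
        theta += 0.5 * eps * grad + np.sqrt(eps) * rng.normal()
        trace[t] = theta
    return trace

trace = sgld()  # the trace concentrates near the true mean
```

Each iteration costs O(batch_size) instead of O(N), which is precisely the saving the proposed research aims to analyze: the minibatch gradient noise perturbs the stationary distribution, and understanding that trade-off is part of the convergence theory to be developed.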
Our proposal is to lay the mathematical foundations for understanding the theoretical
properties of such stochastic MCMC algorithms, and to build on these foundations to develop
more sophisticated algorithms. We aim to understand the conditions under which the algorithm
is guaranteed to converge, and the type and speed of convergence. Using this understanding, we
intend to develop algorithmic extensions and generalizations with better convergence properties,
including preconditioning, sequential Monte Carlo methods, online Bayesian learning methods,
and approximate methods such as variational Bayes with large step sizes. These algorithms will
be empirically validated on real-world problems, including large-scale data analysis problems in
text processing and collaborative filtering.
RESEARCH PLAN
The plan for this research project is the following:
1) Review the extant literature, including previous work done on big data problems.
2) Develop theoretical foundations for a new class of MCMC inference strategies that can
scale to big data, thereby unlocking the strengths of Bayesian methods for big data.
3) Analyze the conditions under which the algorithms converge, and build on these
foundations to develop more sophisticated algorithms.
4) Evaluate the performance of the algorithms using sequential Monte Carlo methods,
online Bayesian learning methods, and approximate methods such as variational Bayes.
5) If possible, validate the algorithms in real-world experiments to test and verify each of
the methods and parameters used.
6) Write up the results of the study in the form of a PhD dissertation.
7) Write research papers and publish them in peer-reviewed journals.
This research will be conducted during the first three semesters of the research period.
AUTHOR’S PREVIOUS RESEARCH
In previous relevant research, the author studied and experimented with the Probability
Hypothesis Density (PHD) filter and its applications in the target tracking process. Furthermore,
the author implemented the closed-form solution of the PHD recursion: the Gaussian Mixture
Probability Hypothesis Density (GM-PHD) filter. This research identified the main drawbacks of
the GM-PHD filter: it loses performance when the number of targets grows and when the
trajectories of targets come close together. To address these drawbacks, the author developed a
novel prediction algorithm for the GM-PHD filter, called the Gamma Gaussian Mixture
Probability Hypothesis Density (GaGM-PHD) filter. Comparisons between implementations of
the new algorithm and the existing GM-PHD filter demonstrated the improvement achieved.
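For context, the prediction step of the standard GM-PHD filter that this work builds on can be sketched as follows. This is an illustrative textbook version with a linear-Gaussian motion model and no birth components; it is not the author's GaGM-PHD algorithm.

```python
import numpy as np

def gmphd_predict(weights, means, covs, F, Q, p_survival=0.99):
    """Standard GM-PHD prediction for surviving targets: each Gaussian
    component is propagated through the linear motion model x' = F x + w,
    with w ~ N(0, Q), and its weight is scaled by the survival probability.
    (A full filter would also append birth components here.)"""
    new_weights = [p_survival * w for w in weights]
    new_means = [F @ m for m in means]
    new_covs = [F @ P @ F.T + Q for P in covs]
    return new_weights, new_means, new_covs

# Constant-velocity model in 1D: state = [position, velocity].
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])
Q = 0.01 * np.eye(2)
w, m, P = gmphd_predict([1.0], [np.array([0.0, 1.0])], [np.eye(2)], F, Q)
```

Because the PHD intensity stays a Gaussian mixture under these linear-Gaussian assumptions, both prediction and update admit closed forms, which is what makes the GM-PHD recursion tractable.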
The author’s algorithm was original, effective and impactful, and the result was presented at
the 8th International Conference on Image and Graphics (ICIG 2015), organized by the China
Society of Image and Graphics and Microsoft Research Asia (MSRA) and hosted in Tianjin,
China. The author’s paper was published by Springer and indexed by Engineering Village (EI)
with accession number 20154201380467.
REFERENCES
[1] Scott et al., “Bayes and big data: The consensus Monte Carlo algorithm,” in EFaB Bayes
250 Conf., vol. 16, 2013.
[2] Mihaylova et al., “Overview of Bayesian sequential Monte Carlo methods for group and
extended object tracking,” Digital Signal Processing, vol. 25, pp. 1–16, 2014.
[3] Korattikara et al., “Austerity in MCMC land: Cutting the Metropolis-Hastings budget,” in
Proc. Int. Conf. on Machine Learning, 2014.
[4] Gelman et al., “Expectation propagation as a way of life,” preprint,
http://arxiv.org/abs/1412.4869, 2014.