Random Feature Selection for Online Gaussian
Process-Based Learning
Erick Lin, Dr. Byron Boots
School of Interactive Computing
Background
Gaussian processes (GP) are a powerful class of models that are increasingly being used
in applications as diverse as control systems engineering, geostatistics, and medical image
analysis. A Gaussian process is defined as a collection of random variables, any finite number
of which have a joint Gaussian (also called normal) distribution. Informally, GP can be
thought of as the infinite-dimensional version of a multivariate Gaussian distribution, where
every point x in the domain is associated with a Gaussian random variable f(x) ∼ N(µ, σ²).
This means that a Gaussian process may be specified by a mean function µ(x) and a kernel
function k(x, x′), which gives the covariance between the random variables at any two points
x and x′, and overall can be represented as

    f(x) ∼ GP(µ(x), k(x, x′)).    (1)
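For concreteness (this is an illustration, not part of the proposed method), the following Octave/MATLAB sketch draws sample functions from a zero-mean GP prior with a squared-exponential kernel; the lengthscale and the evaluation grid are arbitrary choices made for the example.

    % Illustrative sketch (not part of the proposal): draw three sample functions
    % from a zero-mean GP prior with a squared-exponential kernel
    % k(x, x') = exp(-(x - x')^2 / (2*ell^2)). The lengthscale ell and the grid
    % are arbitrary.
    x   = linspace(-5, 5, 200)';          % evaluation points
    ell = 1.0;                            % assumed lengthscale hyperparameter
    K   = exp(-(x - x').^2 / (2*ell^2));  % covariance matrix over the grid
    K   = K + 1e-8*eye(numel(x));         % small jitter for numerical stability
    L   = chol(K, 'lower');               % Cholesky factor, K = L*L'
    f   = L * randn(numel(x), 3);         % three samples of f(x) ~ GP(0, k)
    plot(x, f);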
Gaussian processes can be used to perform supervised learning, or the general task
of inferring unknown outputs for test inputs based on previously observed training data
(input-output pairs). We focus primarily on models for regression, which are supervised
learning models that produce real-valued outputs. From a statistical perspective, Gaussian
process regression (GPR) is closely related to Bayesian linear regression (BLR) in that the
distribution of hypotheses is conditioned on the training data using Bayesian inference.
Their connection is further illustrated through the derivation of GPR from BLR in the
following paragraph.
The standard linear regression model takes the form y = f(x) + ε with f(x) = xᵀw,
where w is the parameter vector, f is the hypothesis, and y is the observed output. If we
assume that w ∼ N(0, Σₚ) and ε ∼ N(0, σₙ²), then it can be shown that

    p(w | y, X) ∼ N((1/σₙ²) A⁻¹ X y, A⁻¹)    (2)

where A = σₙ⁻² X Xᵀ + Σₚ⁻¹, and as a result, the predictive distribution of f∗ at a test input
x∗ for BLR is given by

    p(f∗ | x∗, y, X) = ∫ p(f∗ | x∗, w) p(w | y, X) dw    (3)
                     = N((1/σₙ²) x∗ᵀ A⁻¹ X y, x∗ᵀ A⁻¹ x∗).    (4)
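The sketch below works through (2)–(4) numerically on synthetic data. It assumes the convention of [2] that X is a d × N matrix whose columns are the training inputs; the values of σₙ² and Σₚ are arbitrary illustrative choices.

    % Minimal sketch of the BLR posterior and predictive in (2)-(4). X is d x N
    % (one training input per column), y is N x 1, and the hyperparameters sn2
    % (noise variance) and Sigma_p (prior covariance) are assumed given.
    d = 3; N = 50;
    X = randn(d, N);                       % synthetic training inputs
    w_true = randn(d, 1);
    sn2 = 0.1;                             % noise variance sigma_n^2
    y = X' * w_true + sqrt(sn2)*randn(N, 1);
    Sigma_p = eye(d);                      % prior covariance of w

    A = X*X'/sn2 + inv(Sigma_p);           % eq. (2): posterior precision
    w_mean = (A \ (X*y)) / sn2;            % posterior mean (1/sn2) A^{-1} X y

    xs = randn(d, 1);                      % a test input x*
    f_mean = xs' * w_mean;                 % eq. (4): predictive mean
    f_var  = xs' * (A \ xs);               % eq. (4): predictive variance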
If x is now replaced with a function φ(x) that maps x into a higher-dimensional feature
space, the model f(x) = φ(x)ᵀw becomes capable of fitting nonlinear functions, and the
predictive distribution becomes

    p(f∗ | x∗, y, X) = N((1/σₙ²) φ(x∗)ᵀ A⁻¹ Φ(X) y, φ(x∗)ᵀ A⁻¹ φ(x∗))    (5)

where Φ(X) is the matrix whose columns are the feature vectors φ(x) of the training inputs,
and A = σₙ⁻² Φ(X) Φ(X)ᵀ + Σₚ⁻¹ accordingly. If the image of φ is infinite-dimensional,
then (5) expresses Gaussian process regression in a specific form that is useful for our
purposes.
Gaussian processes can be considered nonparametric – that is, the effective number of
parameters grows with the number of data points, allowing great flexibility in fitting
large-scale data. In practice, however, traditional GPR cannot be applied in such scenarios
due to its heavy computational requirements: O(N³) time complexity and O(N²) space
complexity in the number of training points N, where the respective bottlenecks are the
inversion of the covariance matrix K over all pairs of training points and the storage of
K [2].
One proposed solution is that, instead of an infinite- or high-dimensional feature mapping
φ (in fact, k(x, x′) = ⟨φ(x), φ(x′)⟩), a randomized low-dimensional feature mapping z can
be used whose inner product z(x)ᵀz(x′) approximates k(x, x′) to high accuracy, enabling the
use of the much faster and computationally less expensive BLR to approximate GPR in the
random feature space [1].
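As one concrete realization (a sketch under the assumption of a squared-exponential kernel), the random Fourier features of Rahimi and Recht [1] give such a mapping z; the feature dimension, input dimension, and lengthscale below are illustrative.

    % Sketch of the random Fourier feature map of [1] for the squared-exponential
    % kernel: z(x)' z(x') approximates k(x, x') in expectation. D, ell, and the
    % input dimension d are illustrative choices.
    d = 3;                                 % input dimension
    D = 200;                               % number of random features
    ell = 1.0;                             % kernel lengthscale
    W = randn(D, d) / ell;                 % frequencies from the kernel's spectral density
    b = 2*pi*rand(D, 1);                   % random phase offsets
    z = @(x) sqrt(2/D) * cos(W*x + b);     % feature map z: R^d -> R^D

    % check against the exact kernel at two random points
    x1 = randn(d,1); x2 = randn(d,1);
    k_exact  = exp(-sum((x1 - x2).^2) / (2*ell^2));
    k_approx = z(x1)' * z(x2);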
Objectives
Our goal is to combine the approaches outlined in the previous section with original ap-
proaches of our own to produce a novel variant of GPR able to predict and make decisions
based on data that is both potentially massive in scale and streaming in real time.
For handling the latter characteristic in particular, we may take a hint from Bayesian
linear regression, which has a recursive solution [3] that allows it to be performed online,
i.e., updated with new training data without having to retrain on the previous data. In
outline, online Bayesian linear regression begins with a prior p(w), where the prior
distribution is given by w ∼ N(0, Σₚ), and recursively applies the Bayes filter

    p(w | y₁, …, yₖ, Xₖ) = [p(yₖ | w, Xₖ) / (p(y₁ | X₁) ⋯ p(yₖ | Xₖ))] p(w | y₁, …, yₖ₋₁, Xₖ₋₁)    (6)

once for each incoming training output yₖ, where Xₖ is the matrix consisting of the first
k training input vectors. It can be shown that the posterior itself is Gaussian, which in
turn yields a Gaussian distribution for the predicted output p(f∗ | x∗, y₁, …, yₖ, Xₖ) at
any test input x∗. We propose that the same procedure can be repeated with xₖ replaced by
φ(xₖ) and Xₖ by Φ(Xₖ) to obtain the corresponding distribution for Gaussian process
regression, which then has the desired ability to perform online learning.
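A minimal sketch of this idea, reusing the feature map z and the synthetic X, y, xs, and sn2 from the earlier sketches: the posterior over w in the random feature space is maintained through rank-one updates to its precision matrix, one training pair at a time. The prior Σₚ = I is an assumed choice for illustration.

    % Sketch of the proposed online update: recursive BLR in the random feature
    % space, processing one training pair (x_k, y_k) at a time. Reuses z, X, y,
    % xs, D, and sn2 from the sketches above; Sigma_p = I is an assumed prior.
    P = eye(D);                            % posterior precision, initially Sigma_p^{-1}
    r = zeros(D, 1);                       % accumulates Phi(X_k) * y_{1:k} / sn2
    for k = 1:numel(y)
      phi_k = z(X(:, k));                  % lift x_k into the random feature space
      P = P + phi_k*phi_k' / sn2;          % rank-one precision update
      r = r + phi_k*y(k) / sn2;
    end
    w_mean = P \ r;                        % posterior mean after all updates
    phi_s  = z(xs);                        % prediction at the test input x*
    f_mean = phi_s' * w_mean;              % predictive mean, cf. (5)
    f_var  = phi_s' * (P \ phi_s);         % predictive variance, cf. (5)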
My plan is to first work out the mathematical details of our proposed learning model,
which synthesizes random feature selection with the recursive solution of BLR to create a
fast online approximation of GPR, and to verify its correctness, making modifications or
adopting further ideas as necessary.
I will then implement the GPR model and begin testing it on an inverse dynamics problem,
in the form of motion planning for a 7-DOF robotic arm, comparing its performance against
established benchmarks.
If time permits, I will also work toward automated selection of the hyperparameters (free
parameters) using maximum-likelihood and/or cross-validation techniques.
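As one possible starting point (an illustration only, not a commitment to a particular method), a candidate lengthscale could be scored by the exact GP negative log marginal likelihood on a small batch, again reusing X, y, and sn2 from the sketches above.

    % Illustrative sketch: grid search over the lengthscale ell, scored by the
    % exact GP negative log marginal likelihood on a small batch (X, y, sn2 as above).
    sqd = @(A) sum(A.^2,1)' + sum(A.^2,1) - 2*(A'*A);   % pairwise squared distances
    ells = logspace(-1, 1, 20);
    nlml = zeros(size(ells));
    for i = 1:numel(ells)
      K = exp(-sqd(X) / (2*ells(i)^2)) + sn2*eye(size(X,2));
      L = chol(K, 'lower');
      alpha = L' \ (L \ y);                              % K^{-1} y via Cholesky
      nlml(i) = 0.5*(y'*alpha) + sum(log(diag(L))) + 0.5*size(X,2)*log(2*pi);
    end
    [~, ibest] = min(nlml);
    best_ell = ells(ibest);                              % selected lengthscale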
Materials and Methods
Octave/MATLAB will be used to implement our GPR algorithm. Personal desktop computers
running Ubuntu and/or Windows will be used to run the learning algorithm on the training
and test datasets, and we will make use of a Barrett Technology® Inc. WAM™ Arm in our lab.
References
[1] Rahimi, A., and Recht, B. Random features for large-scale kernel machines. In
Neural Information Processing Systems (2007).
[2] Rasmussen, C. E., and Williams, C. K. Gaussian Processes for Machine Learning.
The MIT Press, Cambridge, Massachusetts, 2006.
[3] Särkkä, S. Lecture 2: From linear regression to Kalman filter and beyond, January
2012.