My name is David Thompson, and this presentation is Local Methods for Pattern Recognition.
It's the first of several modules that will talk about dimensionality reduction in the context of this Caltech Big Data Summer School.
And so, the first of these lectures is intended to provide a common reference, some basic skills that we'll refer to later.
I'm going to review some basic pattern recognition strategies, such as classification and regression, that you will have been exposed to previously, but I think it's worth talking about them again, because it really is foundational for some of the things that we'll describe later as far as dimensionality reduction is concerned. So it's important background to have.
Okay, so this first talk will, as I said, review basic pattern recognition problems, classification and regression, and in particular in the context of local pattern recognition strategies. These are non-parametric pattern recognition strategies, as opposed to, say, the parametric methods that we have seen earlier.
A lot of these fall into the general
category of nearest neighbor methods.
So I'll describe the nearest-neighbor methodology, and why that might be a good thing, and then a couple of variants of that, including local linear regression, which is a flexible regression strategy based on a local non-parametric approach, and then kernel density estimation, which is a probabilistic method for pattern recognition.
Okay, so let's start off with a simple
pattern recognition task.
Here I've got a data set with a couple of attributes on it and, as pattern recognition folks, there are several things we might want to do. I've just plotted these, giving each attribute its own coordinate axis.
So here, this is a two-dimensional attribute space where every data point may have associated with it some class value, or some real-valued ordinate that we're regressing against, some value that we'd want to predict.
[INAUDIBLE] we're just interested in intrinsic properties of the data set, how the data is distributed, to try to infer something about the process that generated that data. So that is more akin to, say, a density estimation task, where we are estimating probability density in this two-dimensional attribute space.
So there are lots of different kinds of
pattern recognition questions that
we can ask of this data set even this very
simple one here.
So, I've plotted here on this slide a red question mark to indicate some query point where we'd like to describe or predict the behavior of this process.
Again, this could be a prediction of the real-valued function at that location in the input space, or it could be that we have a new point there whose class we're trying to infer.
Regardless, we've got a query point and we want to be able to infer something about what the process is doing at that location.
All right, a little bit of notation
background.
So I'm going to treat these data points as column vectors, so we're going to say x sub i is our data point, and it has attributes one through d. Again, this simple example only has two, but you can imagine having arbitrarily many.
And because this is dimensionality
reduction, we're eventually going to be
talking
about data points that have hundreds or
even thousands of attributes.
And we can represent the entire data set as a matrix where we stack these column vectors together as rows, so we have an N by d matrix, where N is the number of data points and d is the number of attributes.
For now without much loss of generality,
I'm just going
to assume that all of these attributes are
real valued.
These are continuous attributes, so I'm not going to deal much with, for example, categorical attributes, but many of the same principles that I'll discuss apply to categorical attributes as well. And you may be familiar with encoding strategies that you can use to turn categorical attributes into continuous ones. So really we'll just focus on the continuous case here.
Okay, so one reasonable way to do pattern recognition on this data set is to just look at the local behavior in the vicinity of the query point whose value we're interested in inferring.
This goes under the assumption that whatever process generated the data is locally smooth in the input space.
So maybe we can disregard a bunch of the
points that are really far away from our
query, right.
We don't have to look at the function values at the extreme lower left side of the plot; we can just look at the values on the upper right, near the red question mark. Right?
So maybe we look at its closest neighbors to see what the function is doing in that vicinity.
And you can do this for each
of the pattern recognition strategies that
I mentioned.
All right.
So the canonical example is, of course,
nearest neighbor classification.
To infer the class of the query, just look
to its nearest neighbor in our
data set, and assume that it has the same
class as that one neighbor point.
So, in this case I've indicated, with a red arrow, the nearest neighbor of the query, and we just assume that that is the class of our query.
Now this amounts to a Voronoi partitioning of the input space, so here in two dimensions I've partitioned the space, drawing the area for which every input point is responsible, right?
So you can see that the nearest neighbor
points
are actually responsible for a polygon,
within this input space.
And any query within that polygon is going to get its class. Now this has a kind of nice property, because where we have denser data, right in the center, where we have more information, the Voronoi partitions are smaller, right? And we permit the function to vary a little bit more. Far from the data cloud these partitions get a lot larger and, again, we don't have as much to say about that corner of the input space.
So our inference there is going to be smoother. Right, so this is a basic example of nearest neighbor classification, and this is the resulting decision boundary.
I've applied here some labels to the data set, red and blue.
And you can see that the decision boundary
here drawn in black is sort
of a wiggly shape through this input space
that bends around all of these inputs.
Now the query's actually been paired with a blue point. This is kind of interesting, though, because if you look carefully you'll note that the query is actually also fairly close to a bunch of red points as well. Right? To its immediate left there are clusters of red points, and it's almost as close to those, but it's just sort of happenstance that it got paired with a blue point.
So another way of saying this is that the one-nearest-neighbor approach to classification is rather sensitive to noise. It has high variance, in the language of pattern recognition. In terms of the bias-variance tradeoff, it's definitely more variance than bias in this case.
So this also manifests as a fairly wiggly
decision boundary.
So if you look in the center of the cloud there, maybe our decision boundary wiggles a little bit more than is appropriate, given the amount of data that we have.
So what this really needs is some way to
smooth that decision
boundary or regularize it so we're less
sensitive to those single point outliers.
And the way that's typically done is by
introducing more nearest neighbors.
So here I'm looking at the five nearest neighbors and taking a majority vote of which class to call the query.
And you can see the decision boundary straighten out quite a bit, as with the regularization that we saw before, both with the increase in the number of [INAUDIBLE]. We can do exactly the same thing here; note that the blue might even be a bit over-smoothed. So we're doing a pretty good job of modeling the underlying function, but we've started to truncate its peak a little bit, right?
So maybe a width of 0.3 is a little bit too much.
One thing to note about this expression: it's normalized to provide a true probability density estimate, so it's a valid PDF. The normalization factor Z normalizes our kernel function so that all the kernel evaluations have volume one, or area one. Cross-validation is the typical way that one would estimate these parameters.
But fortunately there's really only one parameter to estimate, which is, in this case, the kernel width.
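As a sketch of such a kernel density estimate, assuming a Gaussian kernel of width h (the lecture doesn't fix the kernel here, so that choice and the function names are my own):

```python
import numpy as np

def kde(X, query, h=0.3):
    """Gaussian kernel density estimate at a query point.

    The normalization factor Z makes each of the N kernels integrate
    to one, so their average is a valid probability density."""
    N, d = X.shape
    Z = N * (2 * np.pi * h**2) ** (d / 2)           # normalizer over N Gaussian kernels
    sq_dists = np.sum((X - query) ** 2, axis=1)     # squared distance to every data point
    return np.sum(np.exp(-sq_dists / (2 * h**2))) / Z

# Density of a two-point data set evaluated midway between the points.
X_demo = np.array([[0.0], [1.0]])
print(kde(X_demo, np.array([0.5]), h=0.3))
```

A narrow h gives a spiky estimate; a wide h over-smooths and truncates peaks, which is exactly the tradeoff the kernel width controls.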
So we get a wide range of very
flexible functions out of very few input
parameters.
We're letting the data speak for itself
unlike a parametric method.
And this is the, the real power of these
local methods for pattern recognition.
Okay.
So, in summary I've described some
local nonparametric methods for pattern
recognition.
This is in contrast to parametric methods, which impose some global functional form on the process that generates the data. Here we're just looking at local patches of data to infer the values of the underlying function or process at some query point.
And there were three different examples of this local pattern recognition that I demonstrated:
K-nearest neighbor for classification.
Kernel smoothing and local linear
regression for regression problems
and then kernel density estimation for
probability density estimation tasks.
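The regression variant can be sketched as a weighted least-squares fit around the query; the Gaussian weighting and function names below are my own assumptions, since the lecture leaves the kernel unspecified:

```python
import numpy as np

def local_linear_regression(X, y, query, h=0.3):
    """Fit a line by weighted least squares, with Gaussian kernel weights
    emphasizing data points near the query; return the fitted value there."""
    sq_dists = np.sum((X - query) ** 2, axis=1)
    w = np.exp(-sq_dists / (2 * h**2))              # kernel weights
    A = np.hstack([np.ones((X.shape[0], 1)), X])    # design matrix with intercept column
    Aw = A * w[:, None]                             # apply weights row by row
    beta = np.linalg.solve(A.T @ Aw, Aw.T @ y)      # weighted normal equations
    return np.concatenate(([1.0], query)) @ beta

# On exactly linear data (y = 2x + 1) the local fit recovers the line.
X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0
print(local_linear_regression(X, y, np.array([0.5])))  # close to 2.0
```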
In each case, the main parameter controls regularization, and you can set it using a cross-validation strategy.
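One common way to carry out that cross-validation for the kernel-density width is a leave-one-out log-likelihood score (the specific score, data, and names below are my own illustration):

```python
import numpy as np

def loo_log_likelihood(X, h):
    """Leave-one-out log-likelihood of a Gaussian KDE: score each point
    under the density estimated from all the *other* points."""
    N, d = X.shape
    Z = (N - 1) * (2 * np.pi * h**2) ** (d / 2)
    total = 0.0
    for i in range(N):
        rest = np.delete(X, i, axis=0)              # hold out point i
        sq = np.sum((rest - X[i]) ** 2, axis=1)
        total += np.log(np.sum(np.exp(-sq / (2 * h**2))) / Z)
    return total

# Pick the width with the best held-out likelihood.
X = np.random.default_rng(0).normal(size=(50, 2))
widths = [0.05, 0.3, 3.0]
best = max(widths, key=lambda h: loo_log_likelihood(X, h))
print(best)  # the intermediate width should win here
```

A too-narrow width scores held-out points terribly (they fall between the spikes), and a too-wide one flattens the density everywhere, so the held-out likelihood peaks at an intermediate width.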