The document discusses prototype-based models in machine learning. It provides an overview of unsupervised learning techniques, including vector quantization and self-organizing maps, which group similar data points into clusters or reduce dimensionality. It also covers supervised methods such as learning vector quantization, which learns prototype vectors and classifies new examples by their distances to these prototypes. Examples such as clustering the Iris flower data with a self-organizing map illustrate the prototype-based modeling approach.
1. Michael Biehl
Johann Bernoulli Institute for Mathematics and Computer Science
University of Groningen
www.cs.rug.nl/biehl
Prototype-based models in machine learning
Brain Inspired Computing - BrainComp, Cetraro, June 2017
3. overview
1. Introduction / Motivation: prototypes, exemplars; neural activation / learning
2. Unsupervised Learning: Vector Quantization (VQ); Competitive Learning in VQ and Neural Gas; Kohonen's Self-Organizing Map (SOM)
3. Supervised Learning: Learning Vector Quantization (LVQ); adaptive distances and Relevance Learning (examples: three bio-medical applications)
4. Summary
4. 1. Introduction
prototypes, exemplars: representation of information in terms of typical representatives (e.g. of a class of objects); a much debated concept in cognitive psychology.
neural activation / learning: an external stimulus reaches a network of neurons; the response is determined by the weights (the expected inputs) of the best matching unit (and its neighbors); learning leads to an even stronger response to the same stimulus in the future; the weights represent different expected stimuli (prototypes).
5. even independent of the above, prototypes provide an attractive framework for machine learning based data analysis:
- the trained system is parameterized in the feature space (the data)
- facilitates discussions with domain experts
- transparent (white box) and provides insight into the applied criteria (classification, regression, clustering, etc.)
- easy to implement, efficient computation
- versatile, successfully applied in many different application areas
6. 2. Unsupervised Learning
Some potential aims:
dimension reduction: compression; visualization for human insight; principal {independent} component analysis
exploration / structure detection: clustering; similarities / dissimilarities; source identification; density estimation; neighborhood relation, topology
pre-processing for further analysis: supervised learning, e.g. classification, regression, prediction
7. Vector Quantization (VQ)
Vector Quantization: identify (few) typical representatives of the data which capture its essential features.
data: set of feature vectors {x^μ}; VQ system: set of prototypes {w^k}.
assignment to prototypes is based on a dis-similarity/distance measure: given a vector x^μ, determine the winner w* = argmin_k d(w^k, x^μ) and assign x^μ to prototype w*.
one popular example: the (squared) Euclidean distance d(w, x) = (w − x)² = Σ_j (w_j − x_j)²
8. competitive learning
initially: randomized w^k, e.g. in randomly selected data points.
random sequential (repeated) presentation of the data ... the winner takes it all:
w* ← w* + η (x^μ − w*), with learning rate η (< 1), the step size of the update.
comparison:
K-means: updates all prototypes, considers all data at a time; EM for Gaussian mixtures in the limit of zero width
competitive VQ: updates only the winner, random sequential presentation of single examples (stochastic gradient descent)
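A minimal winner-takes-all VQ training sketch in Python/NumPy (illustrative; all parameter values are assumptions, not from the slides):

import numpy as np

def competitive_vq(X, K, eta=0.05, epochs=50, seed=0):
    """Winner-takes-all vector quantization with stochastic updates."""
    rng = np.random.default_rng(seed)
    # initialize prototypes in randomly selected data points
    W = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(epochs):
        for mu in rng.permutation(len(X)):          # random sequential presentation
            d = np.sum((W - X[mu]) ** 2, axis=1)    # squared Euclidean distances
            k = np.argmin(d)                        # determine the winner
            W[k] += eta * (X[mu] - W[k])            # update only the winner
    return W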
9. quantization error
competitive VQ (and K-means) aim at optimizing a cost function:
- assign each data point to the closest prototype
- measure the corresponding (squared) distance
quantization error (sum over all data points), here with the Euclidean distance:
H_VQ = Σ_μ Σ_k d(w^k, x^μ) Π_{j≠k} Θ( d(w^j, x^μ) − d(w^k, x^μ) ) = Σ_μ min_k d(w^k, x^μ)
with the Heaviside function Θ(x) = 1 for x ≥ 0, Θ(x) = 0 else.
H_VQ measures the quality of the representation and defines a (one) criterion to evaluate / compare the quality of different prototype configurations.
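The quantization error of a given prototype configuration can be evaluated directly (a small helper, same conventions as the sketch above):

import numpy as np

def quantization_error(X, W):
    """H_VQ: sum over all data points of the squared distance to the closest prototype."""
    d = np.sum((X[:, None, :] - W[None, :, :]) ** 2, axis=2)  # (P, K) distance table
    return d.min(axis=1).sum()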
10. VQ and clustering
Remark 1: VQ ≠ clustering
[figure: prototype configurations with minimal quantization error in different data sets]
ideal clustering scenario: well-separated, spherical clusters; here the prototypes with minimal quantization error mark the clusters.
in general: the representation of the observations in feature space is sensitive to cluster shape and to coordinate transformations (even linear ones); small clusters are irrelevant with respect to the quantization error.
11. VQ and clustering
Remark 2: clustering is an ill-defined problem
[figure: the same data set, seen as "obviously three clusters" or "well, maybe only two?"]
our criterion would prefer the configuration with the lower H_VQ over the one with the higher H_VQ → "better clustering"???
12. VQ and clustering
→ "the best clustering"? H_VQ = 0 is reached with one prototype per data point (here K = 60), while K = 1 gives the simplest clustering ...
H_VQ (and similar criteria) only allow to compare VQ configurations with the same K!
more generally, one needs a heuristic compromise between "error" and "simplicity".
13. competitive learning
practical issues of VQ training: dead units, i.e. prototypes that never win and remain unchanged during training [figure: data, initial prototypes, situation after training]; more generally: local minima of the quantization error, initialization-dependent outcome of training.
solution: rank-based updates (winner, second, third, ...)
14. Neural Gas (NG)
idea: many prototypes (a "gas") represent the density of the observed data.
introduce rank-based neighborhood cooperativeness: upon presentation of x^μ,
• determine the rank k_j of each prototype w^j with respect to its distance from x^μ
• update all prototypes: w^j ← w^j + η h_λ(k_j) (x^μ − w^j), with neighborhood function h_λ(k) = exp(−k/λ) and rank-based range λ
• potential annealing of λ from large to smaller values
[Martinetz, Berkovich, Schulten, IEEE Trans. Neural Netw. 1993]
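A minimal NG training sketch (illustrative; the exponential annealing schedule and all parameter values are assumptions):

import numpy as np

def neural_gas(X, K, eta=0.05, lam0=10.0, lam_end=0.5, epochs=50, seed=0):
    """Neural Gas: rank-based neighborhood, annealed range lambda."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), size=K, replace=False)].copy()
    steps, t = epochs * len(X), 0
    for _ in range(epochs):
        for mu in rng.permutation(len(X)):
            lam = lam0 * (lam_end / lam0) ** (t / steps); t += 1  # anneal lambda
            d = np.sum((W - X[mu]) ** 2, axis=1)
            ranks = np.argsort(np.argsort(d))       # rank 0 = winner
            h = np.exp(-ranks / lam)                # neighborhood function h_lambda
            W += eta * h[:, None] * (X[mu] - W)     # update all prototypes
    return W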
15. Self-Organizing Map (SOM)
neighborhood cooperativeness on a predefined low-dimensional lattice A of neurons, i.e. prototypes.
upon presentation of x^μ:
- determine the winner (best matching unit) at position s in the lattice
- update the winner and its neighborhood: w^r ← w^r + η h_ρ(r, s) (x^μ − w^r), where h_ρ(r, s) = exp(−d_A(r, s)² / (2ρ²)) and the range ρ is taken w.r.t. the distances d_A(r, s) in the lattice A
T. Kohonen. Self-Organizing Maps. Springer (2nd edition 1997)
17. Self-Organizing Map
illustration: the Iris flower data set [Fisher, 1936]: 4 numerical features representing Iris flowers from 3 different species.
SOM: 4x6 prototypes in a 2-dim. grid, trained on the 150 samples (without class label information).
component planes: 4 arrays representing the prototype values, one per feature.
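A compact SOM training sketch on the Iris data, following the 4x6 grid of the slide (illustrative; the Gaussian neighborhood form and the annealing schedule are assumptions):

import numpy as np
from sklearn.datasets import load_iris

def train_som(X, rows=4, cols=6, eta=0.1, rho0=2.0, epochs=50, seed=0):
    """SOM: prototypes attached to a rows x cols lattice, Gaussian neighborhood."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), size=rows * cols, replace=False)].copy()
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    steps, t = epochs * len(X), 0
    for _ in range(epochs):
        for mu in rng.permutation(len(X)):
            rho = rho0 * (0.5 / rho0) ** (t / steps); t += 1   # anneal the lattice range
            s = np.argmin(np.sum((W - X[mu]) ** 2, axis=1))    # best matching unit
            dA2 = np.sum((grid - grid[s]) ** 2, axis=1)        # squared lattice distances
            h = np.exp(-dA2 / (2 * rho ** 2))                  # neighborhood function
            W += eta * h[:, None] * (X[mu] - W)                # update winner and neighbors
    return W

X = load_iris().data                     # 150 samples, 4 features; labels unused
W = train_som(X)                         # 4x6 = 24 prototypes
planes = W.reshape(4, 6, -1).transpose(2, 0, 1)   # component planes: one 4x6 array per feature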
18. Self-Organizing Map
U-Matrix: element U_r is the average distance d(w_r, w_s) of prototype w_r from its nearest-neighbor sites s in the lattice; the U-Matrix reflects the cluster structure, with larger U at cluster borders.
post-labelling: assign each prototype to the majority class of the data it wins.
[figure: U-Matrix and post-labelled map with classes Versicolor, Setosa, Virginica, (undefined)]
here: Setosa is well separated from Virginica/Versicolor.
19. Vector Quantization: remarks
- the presentation of the approaches is not in historical order
- many extensions of the basic concept exist, e.g. the Generative Topographic Map (GTM), a probabilistic formulation of the mapping to a low-dim. lattice [Bishop, Svensen, Williams, 1998]
- SOM and NG variants exist for specific types of data: time series, "non-vectorial" relational data, graphs and trees
20. 3. Supervised Learning
Potential aims:
- classification: assign observations (data) to categories or classes, as inferred from labeled training data
- regression: assign a continuous target value to an observation, dto.
- prediction: predict the evolution of a time series (sequence), as inferred from observations of the history
21. distance based classification
assignment of data (objects, observations, ...) to one or several classes (categories, labels), crisp or soft, based on the comparison with reference data (samples, prototypes) in terms of a distance measure (dis-similarity, metric).
representation of the data (a key step!):
- collection of qualitative/quantitative descriptors
- vectors of numerical features
- sequences, graphs, functional data
- relational data, e.g. in terms of pairwise (dis-)similarities
22. K-NN classifier
a simple distance-based classifier:
- store a set of labeled examples
- classify a query according to the label of the Nearest Neighbor (or the majority of the K NN)
- local decision boundaries acc. to (e.g.) Euclidean distances; piece-wise linear class borders parameterized by all examples
+ conceptually simple, no training required, one parameter (K)
- expensive storage and computation, sensitivity to "outliers", can result in overly complex decision boundaries
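A minimal K-NN sketch with squared Euclidean distances (illustrative):

import numpy as np

def knn_classify(X_train, y_train, x, K=3):
    """Classify query x by majority vote among its K nearest neighbors."""
    d = np.sum((X_train - x) ** 2, axis=1)          # distances to all stored examples
    nn = np.argsort(d)[:K]                          # indices of the K nearest
    labels, counts = np.unique(y_train[nn], return_counts=True)
    return labels[np.argmax(counts)]                # majority label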
23. prototype based classification
a prototype based classifier [Kohonen 1990, 1997]:
- represent the data by one or several prototypes per class
- classify a query according to the label of the nearest prototype (or alternative schemes)
- local decision boundaries according to (e.g.) Euclidean distances; piece-wise linear class borders parameterized by the prototypes
+ less sensitive to outliers, lower storage needs, little computational effort in the working phase
- a training phase is required in order to place the prototypes; model selection problem: number of prototypes per class, etc.
24. Nearest Prototype Classifier
set of prototypes {w^k} carrying class labels c(w^k), based on a dissimilarity/distance measure d(w, x); reasonable requirements: e.g. d(w, x) ≥ 0 and d(w, w) = 0.
nearest prototype classifier (NPC): given x, determine the winner w* = argmin_k d(w^k, x) and assign x to the class c(w*).
most prominent example: the (squared) Euclidean distance d(w, x) = (w − x)².
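The NPC decision is a one-liner over the prototype set (a sketch; W holds the prototypes row-wise, c their class labels):

import numpy as np

def npc_classify(W, c, x):
    """Nearest prototype classifier: return the class label of the winner."""
    d = np.sum((W - x) ** 2, axis=1)   # squared Euclidean distances to all prototypes
    return c[np.argmin(d)]             # label of the closest prototype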
25. Learning Vector Quantization (LVQ)
∙ identification of prototype vectors from labeled example data (N-dimensional feature vectors)
∙ distance based classification (e.g. Euclidean)
heuristic scheme: LVQ1 [Kohonen, 1990, 1997]
• initialize prototype vectors for the different classes
• present a single example
• identify the winner (the closest prototype)
• move the winner closer towards the data if it carries the same class, away from the data if the classes differ
26. Learning Vector Quantization
∙ identification of prototype vectors from labeled example data
∙ distance based classification [here: Euclidean distances]; tesselation of the feature space [piece-wise linear]
∙ aim: discrimination of classes ( ≠ vector quantization or density estimation )
∙ generalization ability: correct classification of new data
27. LVQ1
iterative training procedure: randomized initial w^k, e.g. close to the class-conditional means; sequential presentation of labelled examples ... the winner takes it all.
LVQ1 update step: w* ← w* + η ψ (x^μ − w*), with ψ = +1 if the class labels of winner and example coincide, ψ = −1 otherwise; η is the learning rate.
many heuristic variants/modifications exist [Kohonen, 1990, 1997]: learning rate schedules η_w(t); updates of more than one prototype per step.
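A minimal LVQ1 sketch (illustrative; prototype initialization, e.g. near the class-conditional means, is left to the caller):

import numpy as np

def lvq1(X, y, W, c, eta=0.05, epochs=50, seed=0):
    """LVQ1: attract the winner for same-class examples, repel it otherwise."""
    rng = np.random.default_rng(seed)
    W = W.copy()
    for _ in range(epochs):
        for mu in rng.permutation(len(X)):          # sequential presentation
            d = np.sum((W - X[mu]) ** 2, axis=1)
            k = np.argmin(d)                        # winner
            psi = 1.0 if c[k] == y[mu] else -1.0    # same class: +1, else -1
            W[k] += eta * psi * (X[mu] - W[k])
    return W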
28. LVQ1
LVQ1-like update for a generalized distance measure: w* ← w* ∓ η ∂d(w*, x^μ)/∂w*.
requirement: the update decreases the distance if the classes coincide and increases it if they are different.
29. Generalized LVQ (GLVQ)
one example of cost function based training: GLVQ [Sato & Yamada, 1995].
for each example x^μ, determine the two winning prototypes: w^J, the closest prototype with the correct class, and w^K, the closest prototype with a wrong class; minimize
E = Σ_μ Φ( (d(w^J, x^μ) − d(w^K, x^μ)) / (d(w^J, x^μ) + d(w^K, x^μ)) )
- for sigmoidal Φ (linear for small arguments), E approximates the number of misclassifications
- for linear Φ, E favors large margin separation of the classes
30. GLVQ
training = optimization of E with respect to the prototype positions, e.g. by single example presentation in a stochastic sequence of examples, with an update of the two prototypes w^J, w^K per step; based on a non-negative, differentiable distance.
32. GLVQ
for the (squared) Euclidean distance, the update moves the two prototypes towards / away from the sample: w^J is attracted by x^μ and w^K is repelled, with prefactors proportional to Φ' · d_K / (d_J + d_K)² and Φ' · d_J / (d_J + d_K)², respectively.
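A sketch of a single GLVQ update for squared Euclidean distance and linear Φ (assumes W contains at least one prototype of the correct class and one of another class; constant factors are absorbed into eta):

import numpy as np

def glvq_step(W, c, x, y, eta=0.05):
    """One GLVQ step: attract closest correct prototype, repel closest wrong one."""
    d = np.sum((W - x) ** 2, axis=1)
    same = (c == y)
    J = np.where(same)[0][np.argmin(d[same])]     # closest prototype, correct class
    K = np.where(~same)[0][np.argmin(d[~same])]   # closest prototype, wrong class
    dJ, dK = d[J], d[K]
    gJ = 2.0 * dK / (dJ + dK) ** 2                # prefactor from dE/d(dJ), linear Phi
    gK = 2.0 * dJ / (dJ + dK) ** 2                # prefactor from dE/d(dK)
    W[J] += eta * gJ * (x - W[J])                 # towards the sample
    W[K] -= eta * gK * (x - W[K])                 # away from the sample
    return W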
33. prototype/distance based classifiers
+ intuitive interpretation: prototypes are defined in the feature space
+ natural for multi-class problems
+ flexible, easy to implement
+ frequently applied in a variety of practical problems
- often based on purely heuristic arguments ... or ... cost functions with an unclear relation to the classification error
- model/parameter selection (# of prototypes, learning rate, ...)
Important issue: which is the 'right' distance measure? features may scale differently, be of completely different nature, or be highly correlated / dependent ... is the simple Euclidean distance appropriate?
34. distance measures
fixed distance measures:
- select the distance measure according to prior knowledge
- or make a data driven choice in a pre-processing step
- determine prototypes for the given distance
- compare the performance of various measures
example: divergence based LVQ
35. Relevance Matrix LVQ
generalized quadratic distance in LVQ: d_Λ(w, x) = (x − w)ᵀ Λ (x − w), with the relevance matrix Λ = Ωᵀ Ω (positive semi-definite by construction); normalization: Σ_i Λ_ii = 1.
diagonal matrices realize single feature weights [Bojer et al., 2001] [Hammer et al., 2002]; full matrices: [Schneider et al., 2009]
variants: one global, several local, or class-wise relevance matrices → piecewise quadratic decision boundaries; rectangular Ω → a discriminative low-dim. representation, e.g. for visualization [Bunte et al., 2012]; possible constraints: rank control, sparsity, ...
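The generalized distance and the trace normalization in a short sketch (illustrative):

import numpy as np

def gmlvq_distance(w, x, Omega):
    """d_Lambda(w, x) = (x - w)^T Omega^T Omega (x - w)."""
    z = Omega @ (x - w)        # Euclidean distance in the Omega-transformed space
    return z @ z

def normalize_omega(Omega):
    """Rescale Omega so that Lambda = Omega^T Omega has unit trace."""
    return Omega / np.sqrt(np.trace(Omega.T @ Omega))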
36. Generalized Relevance Matrix LVQ
Generalized Matrix LVQ (GMLVQ): optimization of the prototypes and of the distance measure in one training process; the gradients of the GLVQ cost function are taken with respect to both the prototype positions and the matrix Ω.
37. heuristic interpretation
d_Λ is the standard Euclidean distance for the linearly transformed features Ωx.
the diagonal element Λ_jj summarizes the contribution of the original dimension j, i.e. the relevance of the original feature j for the classification.
the interpretation implicitly assumes that the features are of equal order of magnitude, e.g. after a z-score transformation (zero mean and unit variance, with averages taken over the data set).
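As a small sketch, the z-score transformation and the reading of per-feature relevances from Λ (illustrative):

import numpy as np

def zscore(X):
    """z-score transformation: zero mean, unit variance per feature."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def feature_relevances(Omega):
    """Diagonal of Lambda = Omega^T Omega: relevance of each original feature."""
    return np.diag(Omega.T @ Omega)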
38. Relevance Matrix LVQ
the Iris flower data revisited (supervised analysis by GMLVQ).
[figure: GMLVQ prototypes and the resulting relevance matrix]
39. Relevance Matrix LVQ
empirical observation / theory: the relevance matrix becomes singular, dominated by very few eigenvectors; this prevents over-fitting in high-dim. feature spaces and facilitates the discriminative visualization of data sets.
confirms: Setosa is well separated from Virginica / Versicolor.
40. Relevance Matrix LVQ: a multi-class example
classification of coffee samples based on hyperspectral data (256-dim. feature vectors) [U. Seiffert et al., IFF Magdeburg]
[figure: prototypes and data projected on the first and second eigenvectors of the relevance matrix]
41. Relevance Matrix LVQ
optimization of the prototype positions and the distance measure(s) in one training process (≠ pre-processing).
motivation:
- improved performance: weighting of features and pairs of features
- simplified classification schemes: elimination of non-informative, noisy features; discriminative low-dimensional representation
- insight into the data / classification problem: identification of the most discriminative features; intrinsic low-dim. representation, visualization
42. related schemes
Relevance LVQ variants: local, rectangular, structured, restricted, ... relevance matrices for visualization, functional data, texture recognition, etc.; relevance learning in Robust Soft LVQ, Supervised NG, etc.; combinations of distances for mixed data ...
Relevance Learning related schemes in supervised learning: RBF Networks [Backhaus et al., 2012]; Neighborhood Component Analysis [Goldberger et al., 2005]; Large Margin Nearest Neighbor [Weinberger et al., 2006, 2010]; and many more!
Linear Discriminant Analysis (LDA): one prototype per class + a global matrix, but a different objective function!
43. links
Matlab code, Relevance and Matrix adaptation in Learning Vector Quantization (GRLVQ, GMLVQ and LiRaM LVQ): http://matlabserver.cs.rug.nl/gmlvqweb/web/
Pre- and re-prints etc.: http://www.cs.rug.nl/~biehl/
A no-nonsense beginners' tool for GMLVQ: http://www.cs.rug.nl/~biehl/gmlvq
(see also: Tutorial, Thursday 9:30)