Buenos Aires, March 2016
Eduardo Poggi
www.umiacs.umd.edu/~mrastega/
Instance Based Learning
 Distances
 Introduction
 k-nearest neighbor
 Locally weighted regression
 Radial Basis Functions
 Case-Based Reasoning
 Instance reduction
Distances
What if it holds for some instead of for all?
Example similarity matrix between product categories (upper triangle; 1 = identical):

             Cars  Motorcycles  Electronics  Toys  Candy  Wheat  Chickens
Cars          1       0.8          0.5        0.2   0.1     0       0
Motorcycles           1            0.5        0.2   0.1     0       0
Electronics                        1          0.2   0.1     0       0
Toys                                          1     0.1     0       0
Candy                                               1      0.5     0.5
Wheat                                                      1       0.7
Chickens                                                           1
Distances
The Levenshtein distance (also called edit distance) between two strings is the
minimum number of operations required to transform one string of characters into
the other, where an operation is the insertion, deletion, or substitution of a
single character.
https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python
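The wikibooks link above collects reference implementations; as a minimal sketch (the function name is ours, not from the slides), the classic dynamic-programming algorithm keeping only the previous row of the table:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    required to transform string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]
```

For example, levenshtein("kitten", "sitting") is 3: two substitutions and one insertion.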
Distances
www.sc.ehu.es/ccwgrrom/transparencias/pdf-vision-1-transparencias/capitulo-1.pdf
Distances
http://www.nidokidos.org/threads/29243-Animals-humans-face-similarity-funny-pics!!
Distances
http://lear.inrialpes.fr/people/nowak/similarity/
Distances
Product
  Food    Cleaning    Clothing
    Animal    Vegetable    Mineral
      Dairy    Meat
        Liquid milk    Fermented milk    Cheeses    Butter
          Whole yogurt    Skim yogurt
            Plain yogurt    Flavored yogurt
IBL?
 The idea is simple:
 The class of an instance should resemble the classes assigned to similar
examples.
 Store all the examples.
 When an instance to classify arrives, find the "most similar" examples and
examine their assigned classes.
 But:
 Classification can be expensive.
 Are all attributes equally relevant?
 How many examples count as similar?
 What if the similar examples have dissimilar classes?
 Do all similar examples "weigh" the same?
 How similar must the similar ones be?
K-nearest neighbor
 To define how similar two examples are we need a metric.
 We assume all examples are points in an n-dimensional space R^n and use the
Euclidean distance:
 Let Xi and Xj be two examples. Their distance d(Xi, Xj) is defined as:
d(Xi, Xj) = sqrt( Σk (xik − xjk)² )
where xik is the value of attribute k on example Xi.
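In code this is one line; a minimal sketch (the helper name is ours):

```python
import math

def euclidean(xi, xj):
    """d(Xi, Xj) = sqrt( sum over k of (xik - xjk)^2 ) for two
    equal-length sequences of numeric attribute values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))
```

For example, euclidean((0, 0), (3, 4)) gives 5.0.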
K-nearest neighbor for discrete classes
K = 4
New example
Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric
Euclidean
2. How many nearby neighbors to look at?
One
3. A weighting function (optional)
Unused
4. How to fit with the local points?
Just predict the same output as the nearest neighbor.
Voronoi Diagram
Decision surface induced by a 1-nearest neighbor. The decision
surface is a combination of convex polyhedra surrounding each
training example.
The Zen of Voronoi Diagrams
0 Nearest Neighbor
1 Nearest Neighbor
3 Nearest Neighbor
5 Nearest Neighbor
7 Nearest Neighbor
k-Nearest Neighbour Classification Method
 Key idea: keep all the training instances
 Given query example, take vote amongst its k neighbours
 Neighbours are determined by using a distance function
k-Nearest Neighbour Classification Method
(k=1)
(k=4)
Probability interpretation: estimate p(y|x) as
( ){ }, | , ( )
( | ) , ( ) is the neighborhood around
| ( ) |
i i i ix y y y x N x
p y x N x x
N x
= ∈
=
Sample adapted from Rong Jin’s slides
k-Nearest Neighbour Classification Method
 Advantages:
 Training is really fast
 Can learn complex target functions
 Disadvantages
 Slow at query time: Efficient data structures are needed to speed
up the query
How to choose k?
 Use validation with the leave-one-out method:
For k = 1, 2, …, K
Err(k) = 0
1. Select a training data point and hide its class label
2. Use the remaining data and the given k to predict the class label of the
held-out point
3. Err(k) = Err(k) + 1 if the predicted label differs from the true label
Repeat the procedure until all training examples have been tested
Choose the k whose Err(k) is minimal
Example: Err(1) = 3, Err(2) = 2, Err(3) = 6 ⇒ choose k = 2
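The loop above can be sketched as a toy implementation (the dataset layout, the knn_predict helper, and the function names are our assumptions, not part of the slides):

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training examples nearest to query.
    train is a list of (attribute-tuple, label) pairs."""
    neighbors = sorted(
        train,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], query))
    )[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def choose_k(train, candidate_ks):
    """Leave-one-out validation: for each k, classify every training
    point using all the others; keep the k with the fewest errors."""
    best_k, best_err = None, float("inf")
    for k in candidate_ks:
        err = sum(
            1 for i, (x, y) in enumerate(train)
            if knn_predict(train[:i] + train[i + 1:], x, k) != y
        )
        if err < best_err:
            best_k, best_err = k, err
    return best_k
```

Each candidate k is scored by hiding one point at a time, exactly as in the procedure above.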
K-nearest neighbor for discrete classes
 Algorithm (parameter k):
 For each training example (X, C(X)), add the example to our training list.
 When a new example Xq arrives, assign the class:
 C(Xq) = majority vote among the k nearest neighbors of Xq
 C(Xq) = argmax_v Σi δ(v, C(Xi))
 where δ(a, b) = 1 if a = b and 0 otherwise
K-nearest neighbor for real-valued functions
 Algorithm (parameter k):
 For each training example (X, C(X)), add the example to our training list.
 When a new example Xq arrives, assign the value:
 C(Xq) = average value among the k nearest neighbors of Xq
 C(Xq) = Σi C(Xi) / k
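A sketch of this real-valued variant (names are ours):

```python
def knn_regress(train, query, k):
    """Predict the average target value of the k nearest training
    examples, ranking neighbors by squared Euclidean distance.
    train is a list of (attribute-tuple, value) pairs."""
    neighbors = sorted(
        train,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], query))
    )[:k]
    return sum(value for _, value in neighbors) / k
```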
Distance Weighted Nearest Neighbor
 It makes sense to weight the contribution of each
example according to the distance to the new query
example.
 C(Xq) = argmax_v Σi wi δ(v, C(Xi))
For example, wi = 1 / d(Xq,Xi)
Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric
Euclidean
2. How many nearby neighbors to look at?
k
3. A weighting function (optional)
1 / d(Xq,Xi)
4. How to fit with the local points?
Predict the distance-weighted vote among the k nearest neighbors.
Distance Weighted Nearest Neighbor for Real-Valued
Functions
 For real valued functions we average based on the weight
function and normalize using the sum of all weights.
 C(Xq) = Σi wi C(Xi) / Σi wi
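A sketch of the normalized weighted average, with a small epsilon so an exact match does not divide by zero (the names and the epsilon guard are ours):

```python
import math

def weighted_knn_regress(train, query, k, eps=1e-12):
    """Average the targets of the k nearest examples weighted by
    inverse distance, normalized by the sum of the weights.
    train is a list of (attribute-tuple, value) pairs."""
    by_dist = sorted((math.dist(x, query), y) for x, y in train)[:k]
    weights = [1.0 / (d + eps) for d, _ in by_dist]
    return sum(w * y for w, (_, y) in zip(weights, by_dist)) / sum(weights)
```

math.dist requires Python 3.8 or later; a query that coincides with a training point is dominated by that point's huge weight, so the prediction collapses to its target.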
Problems with k-nearest Neighbor
 The distance between examples is based on all attributes. What if some
attributes are irrelevant?
 Consider the curse of dimensionality.
 The larger the number of irrelevant attributes, the higher the effect on the
nearest-neighbor rule.
 One solution is to use weights on the attributes. This is like stretching or
contracting the dimensions on the input space.
 Ideally we would like to eliminate all irrelevant attributes.
Locally Weighted Regression
 Let’s remember some terminology:
 Regression: Is a problem similar to classification but the value to predict
is a real number.
 Residual: The difference between the true target value f and our
approximation f’: f(X) – f’(X)
 Kernel Function: The distance function that provides a weight to each
example. The kernel function K is a function of the distance between
examples: K = f(d(Xi,Xq))
Locally Weighted Regression
 The method is called locally weighted regression for the following
reasons:
 “Locally” because the predicted value for an example Xq is based only on
the vicinity or neighborhood around Xq.
 “Weighted” because the contribution of each neighbor of Xq will depend
on the distance between the neighbor example and Xq.
 “Regression” because the value to predict will be a real number.
Locally Weighted Regression
 Consider the problem of approximating a target function using a
linear combination of attribute values:
 f’(X) = w0 + w1x1 + w2x2 + … + wnxn
 where X = (x1, x2, …, xn)
 We want to find those coefficients that minimize the error: E = ½ Σk [f(X) − f'(X)]²
Locally Weighted Regression
 If we do this in the vicinity of an example Xq and we wish to use a
kernel function, we get a form of locally weighted regression:
 E(Xq) = ½ Σk [f(X) − f'(X)]² K(d(Xq, X))
 where the sum now goes over the neighbors of Xq.
Locally Weighted Regression
 Using gradient descent search, the update rule is defined as:
 Δwj = η Σk [f(X) − f'(X)] K(d(Xq, X)) xj
 where η is the learning rate and xj is the jth attribute of example X.
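One batch gradient step of this rule can be sketched as follows (the function name, data layout, and the Gaussian kernel choice are our assumptions):

```python
import math

def lwr_gradient_step(train, w, xq, lr=0.01, kw=1.0):
    """One batch gradient-descent update of the local linear weights w
    (w[0] is the bias term) around the query point xq, using Gaussian
    kernel weights K = exp(-d^2 / kw^2).
    train is a list of (attribute-tuple, target) pairs."""
    new_w = list(w)
    for x, fx in train:
        xb = (1.0,) + tuple(x)  # prepend the constant bias input
        pred = sum(wj * xj for wj, xj in zip(w, xb))
        k = math.exp(-sum((a - b) ** 2 for a, b in zip(x, xq)) / kw ** 2)
        for j, xj in enumerate(xb):
            new_w[j] += lr * (fx - pred) * k * xj  # Δwj = η [f - f'] K xj
    return new_w
```

Iterating the step drives the weighted residuals toward zero near xq.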
Locally Weighted Regression
Then here are some
commonly used
weighting functions…
(we use a Gaussian)
Nearest Neighbor
1. A distance metric
Scaled Euclidean
2. How many nearby neighbors to look at?
All of them
3. A weighting function (optional)
w_k = exp( −D(x_k, x_query)² / Kw² )
4. How to fit with the local points?
First form a local linear model. Find the β that minimizes the locally
weighted sum of squared residuals: β = argmin_β Σk w_k (y_k − x_kᵀ β)²
Locally Weighted Regression
 Remarks:
 The literature also contains local models that are nonlinear.
 There are many variations of locally weighted regression that use
different kernel functions.
 Normally a linear model is good enough to approximate the local
neighborhood of an example.
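Instead of iterating gradient steps, the local linear fit can also be solved directly by weighted least squares; a sketch assuming numpy is available (the function name and the kw bandwidth parameter are ours):

```python
import numpy as np

def lwr_predict(X, y, x_query, kw=1.0):
    """Locally weighted linear regression: fit w0 + w . x by weighted
    least squares with Gaussian weights K = exp(-d^2 / kw^2), then
    evaluate the local model at x_query."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xb = np.hstack([np.ones((len(X), 1)), X])           # intercept column
    d2 = ((X - np.asarray(x_query, dtype=float)) ** 2).sum(axis=1)
    w = np.exp(-d2 / kw ** 2)                           # kernel weights
    sw = np.sqrt(w)
    # weighted least squares via sqrt-weight rescaling
    beta, *_ = np.linalg.lstsq(sw[:, None] * Xb, sw * y, rcond=None)
    return float(beta @ np.concatenate(([1.0], np.atleast_1d(x_query))))
```

On data that is exactly linear, the fit recovers the line regardless of the bandwidth; the kernel only matters when the target is locally but not globally linear.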
Instance reduction
eduardopoggi@yahoo.com.ar
eduardo-poggi
http://ar.linkedin.com/in/eduardoapoggi
https://www.facebook.com/eduardo.poggi
@eduardoapoggi
Bibliography
20240425_ TJ Communications Credentials_compressed.pdf
tjcomstrang
 
Maksym Vyshnivetskyi: PMO Quality Management (UA)
Maksym Vyshnivetskyi: PMO Quality Management (UA)Maksym Vyshnivetskyi: PMO Quality Management (UA)
Maksym Vyshnivetskyi: PMO Quality Management (UA)
Lviv Startup Club
 
falcon-invoice-discounting-a-premier-platform-for-investors-in-india
falcon-invoice-discounting-a-premier-platform-for-investors-in-indiafalcon-invoice-discounting-a-premier-platform-for-investors-in-india
falcon-invoice-discounting-a-premier-platform-for-investors-in-india
Falcon Invoice Discounting
 
What is the TDS Return Filing Due Date for FY 2024-25.pdf
What is the TDS Return Filing Due Date for FY 2024-25.pdfWhat is the TDS Return Filing Due Date for FY 2024-25.pdf
What is the TDS Return Filing Due Date for FY 2024-25.pdf
seoforlegalpillers
 
Filing Your Delaware Franchise Tax A Detailed Guide
Filing Your Delaware Franchise Tax A Detailed GuideFiling Your Delaware Franchise Tax A Detailed Guide
Filing Your Delaware Franchise Tax A Detailed Guide
YourLegal Accounting
 
Sustainability: Balancing the Environment, Equity & Economy
Sustainability: Balancing the Environment, Equity & EconomySustainability: Balancing the Environment, Equity & Economy
Sustainability: Balancing the Environment, Equity & Economy
Operational Excellence Consulting
 
Introduction to Amazon company 111111111111
Introduction to Amazon company 111111111111Introduction to Amazon company 111111111111
Introduction to Amazon company 111111111111
zoyaansari11365
 
Premium MEAN Stack Development Solutions for Modern Businesses
Premium MEAN Stack Development Solutions for Modern BusinessesPremium MEAN Stack Development Solutions for Modern Businesses
Premium MEAN Stack Development Solutions for Modern Businesses
SynapseIndia
 
Cracking the Workplace Discipline Code Main.pptx
Cracking the Workplace Discipline Code Main.pptxCracking the Workplace Discipline Code Main.pptx
Cracking the Workplace Discipline Code Main.pptx
Workforce Group
 
3.0 Project 2_ Developing My Brand Identity Kit.pptx
3.0 Project 2_ Developing My Brand Identity Kit.pptx3.0 Project 2_ Developing My Brand Identity Kit.pptx
3.0 Project 2_ Developing My Brand Identity Kit.pptx
tanyjahb
 
What are the main advantages of using HR recruiter services.pdf
What are the main advantages of using HR recruiter services.pdfWhat are the main advantages of using HR recruiter services.pdf
What are the main advantages of using HR recruiter services.pdf
HumanResourceDimensi1
 
Putting the SPARK into Virtual Training.pptx
Putting the SPARK into Virtual Training.pptxPutting the SPARK into Virtual Training.pptx
Putting the SPARK into Virtual Training.pptx
Cynthia Clay
 
Enterprise Excellence is Inclusive Excellence.pdf
Enterprise Excellence is Inclusive Excellence.pdfEnterprise Excellence is Inclusive Excellence.pdf
Enterprise Excellence is Inclusive Excellence.pdf
KaiNexus
 
Memorandum Of Association Constitution of Company.ppt
Memorandum Of Association Constitution of Company.pptMemorandum Of Association Constitution of Company.ppt
Memorandum Of Association Constitution of Company.ppt
seri bangash
 
Accpac to QuickBooks Conversion Navigating the Transition with Online Account...
Accpac to QuickBooks Conversion Navigating the Transition with Online Account...Accpac to QuickBooks Conversion Navigating the Transition with Online Account...
Accpac to QuickBooks Conversion Navigating the Transition with Online Account...
PaulBryant58
 

Recently uploaded (20)

RMD24 | Retail media: hoe zet je dit in als je geen AH of Unilever bent? Heid...
RMD24 | Retail media: hoe zet je dit in als je geen AH of Unilever bent? Heid...RMD24 | Retail media: hoe zet je dit in als je geen AH of Unilever bent? Heid...
RMD24 | Retail media: hoe zet je dit in als je geen AH of Unilever bent? Heid...
 
5 Things You Need To Know Before Hiring a Videographer
5 Things You Need To Know Before Hiring a Videographer5 Things You Need To Know Before Hiring a Videographer
5 Things You Need To Know Before Hiring a Videographer
 
BeMetals Presentation_May_22_2024 .pdf
BeMetals Presentation_May_22_2024   .pdfBeMetals Presentation_May_22_2024   .pdf
BeMetals Presentation_May_22_2024 .pdf
 
Attending a job Interview for B1 and B2 Englsih learners
Attending a job Interview for B1 and B2 Englsih learnersAttending a job Interview for B1 and B2 Englsih learners
Attending a job Interview for B1 and B2 Englsih learners
 
Skye Residences | Extended Stay Residences Near Toronto Airport
Skye Residences | Extended Stay Residences Near Toronto AirportSkye Residences | Extended Stay Residences Near Toronto Airport
Skye Residences | Extended Stay Residences Near Toronto Airport
 
20240425_ TJ Communications Credentials_compressed.pdf
20240425_ TJ Communications Credentials_compressed.pdf20240425_ TJ Communications Credentials_compressed.pdf
20240425_ TJ Communications Credentials_compressed.pdf
 
Maksym Vyshnivetskyi: PMO Quality Management (UA)
Maksym Vyshnivetskyi: PMO Quality Management (UA)Maksym Vyshnivetskyi: PMO Quality Management (UA)
Maksym Vyshnivetskyi: PMO Quality Management (UA)
 
falcon-invoice-discounting-a-premier-platform-for-investors-in-india
falcon-invoice-discounting-a-premier-platform-for-investors-in-indiafalcon-invoice-discounting-a-premier-platform-for-investors-in-india
falcon-invoice-discounting-a-premier-platform-for-investors-in-india
 
What is the TDS Return Filing Due Date for FY 2024-25.pdf
What is the TDS Return Filing Due Date for FY 2024-25.pdfWhat is the TDS Return Filing Due Date for FY 2024-25.pdf
What is the TDS Return Filing Due Date for FY 2024-25.pdf
 
Filing Your Delaware Franchise Tax A Detailed Guide
Filing Your Delaware Franchise Tax A Detailed GuideFiling Your Delaware Franchise Tax A Detailed Guide
Filing Your Delaware Franchise Tax A Detailed Guide
 
Sustainability: Balancing the Environment, Equity & Economy
Sustainability: Balancing the Environment, Equity & EconomySustainability: Balancing the Environment, Equity & Economy
Sustainability: Balancing the Environment, Equity & Economy
 
Introduction to Amazon company 111111111111
Introduction to Amazon company 111111111111Introduction to Amazon company 111111111111
Introduction to Amazon company 111111111111
 
Premium MEAN Stack Development Solutions for Modern Businesses
Premium MEAN Stack Development Solutions for Modern BusinessesPremium MEAN Stack Development Solutions for Modern Businesses
Premium MEAN Stack Development Solutions for Modern Businesses
 
Cracking the Workplace Discipline Code Main.pptx
Cracking the Workplace Discipline Code Main.pptxCracking the Workplace Discipline Code Main.pptx
Cracking the Workplace Discipline Code Main.pptx
 
3.0 Project 2_ Developing My Brand Identity Kit.pptx
3.0 Project 2_ Developing My Brand Identity Kit.pptx3.0 Project 2_ Developing My Brand Identity Kit.pptx
3.0 Project 2_ Developing My Brand Identity Kit.pptx
 
What are the main advantages of using HR recruiter services.pdf
What are the main advantages of using HR recruiter services.pdfWhat are the main advantages of using HR recruiter services.pdf
What are the main advantages of using HR recruiter services.pdf
 
Putting the SPARK into Virtual Training.pptx
Putting the SPARK into Virtual Training.pptxPutting the SPARK into Virtual Training.pptx
Putting the SPARK into Virtual Training.pptx
 
Enterprise Excellence is Inclusive Excellence.pdf
Enterprise Excellence is Inclusive Excellence.pdfEnterprise Excellence is Inclusive Excellence.pdf
Enterprise Excellence is Inclusive Excellence.pdf
 
Memorandum Of Association Constitution of Company.ppt
Memorandum Of Association Constitution of Company.pptMemorandum Of Association Constitution of Company.ppt
Memorandum Of Association Constitution of Company.ppt
 
Accpac to QuickBooks Conversion Navigating the Transition with Online Account...
Accpac to QuickBooks Conversion Navigating the Transition with Online Account...Accpac to QuickBooks Conversion Navigating the Transition with Online Account...
Accpac to QuickBooks Conversion Navigating the Transition with Online Account...
 

Poggi analytics - distance - 1a

  • 1. Buenos Aires, March 2016 Eduardo Poggi www.umiacs.umd.edu/~mrastega/
  • 2. Instance Based Learning  Distances  Introduction  k-nearest neighbor  Locally weighted regression  Radial Basis Functions  Case-Based Reasoning  Instance reduction
  • 3. Distances. What if it applies to some rather than to all?
  • 7. Distances — a similarity matrix between product categories (upper triangle):
              Cars  Motorc.  Electr.  Toys  Candy  Wheat  Chicken
     Cars       1     0.8      0.5     0.2    0.1     0      0
     Motorc.           1       0.5     0.2    0.1     0      0
     Electr.                    1      0.2    0.1     0      0
     Toys                               1     0.1     0      0
     Candy                                     1      0.5    0.5
     Wheat                                            1      0.7
     Chicken                                                  1
  • 8. Distances. The Levenshtein distance (also edit distance, or distance between words) is the minimum number of operations required to transform one string of characters into another, where an operation is the insertion, deletion, or substitution of a single character. https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python
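The minimum-edit-count definition on the slide can be implemented directly with dynamic programming; a minimal sketch (the function name `levenshtein` is ours, not from the deck):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b (dynamic programming)."""
    # prev[j] holds the edit distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

Each cell of the implicit table only depends on the previous row, so keeping two rows is enough (O(|b|) memory instead of O(|a|·|b|)).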
  • 12. Distances — a product taxonomy (similarity as proximity in the tree):
     Product
       Food
         Animal
           Dairy: liquid milk, fermented milk, cheeses, butter
             Fermented milk: whole yogurt, skim yogurt
               Yogurt: plain, flavored
           Meat
         Vegetable
         Mineral
       Cleaning
       Clothing
  • 13. IBL?  The idea is simple:  The class of an instance should be similar to the class assigned to similar examples.  Store all the examples.  When an instance arrives to be classified, look up the “most similar” examples and examine their assigned classes.  But:  Classification can be expensive.  Are all attributes equally relevant?  How many examples count as similar?  What if the similar examples have dissimilar classes?  Do all similar examples “weigh” the same?  How similar must the similar ones be?
  • 14. K-nearest neighbor  To define how similar two examples are we need a metric.  We assume all examples are points in an n-dimensional space Rn and use the Euclidean distance:  Let Xi and Xj be two examples. Their distance d(Xi, Xj) is defined as: d(Xi, Xj) = √( Σk (xik − xjk)² ), where xik is the value of attribute k on example Xi.
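The Euclidean distance formula translates directly to code; a minimal sketch (function name is ours):

```python
import math

def euclidean(xi, xj):
    """d(Xi, Xj) = sqrt(sum_k (xik - xjk)^2) over the n attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

print(euclidean((0, 0), (3, 4)))  # → 5.0
```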
  • 15. K-nearest neighbor for discrete classes K = 4 New example
  • 16. Nearest Neighbor Four things make a memory based learner: 1. A distance metric Euclidean 2. How many nearby neighbors to look at? One 3. A weighting function (optional) Unused 4. How to fit with the local points? Just predict the same output as the nearest neighbor.
  • 17. Voronoi Diagram Decision surface induced by a 1-nearest neighbor. The decision surface is a combination of convex polyhedra surrounding each training example.
  • 18. The Zen of Voronoi Diagrams
  • 24. k-Nearest Neighbour Classification Method  Key idea: keep all the training instances  Given query example, take vote amongst its k neighbours  Neighbours are determined by using a distance function
  • 25. k-Nearest Neighbour Classification Method (k=1) (k=4) Probability interpretation: estimate p(y|x) as p(y|x) = |{(xi, yi) : yi = y, xi ∈ N(x)}| / |N(x)|, where N(x) is the neighborhood around x. Sample adapted from Rong Jin’s slides
  • 26. k-Nearest Neighbour Classification Method  Advantages:  Training is really fast  Can learn complex target functions  Disadvantages  Slow at query time: Efficient data structures are needed to speed up the query
  • 27. How to choose k?  Use validation with leave-one-out method For k = 1, 2, …, K Err(k) = 0; 1. Randomly select a training data point and hide its class label 2. Using the remaining data and given k to predict the class label for the left data point 3. Err(k) = Err(k) + 1 if the predicted label is different from the true label Repeat the procedure until all training examples are tested Choose the k whose Err(k) is minimal
  • 29. How to choose k?  Use validation with leave-one-out method For k = 1, 2, …, K Err(k) = 0; 1. Randomly select a training data point and hide its class label 2. Using the remaining data and given k to predict the class label for the left data point 3. Err(k) = Err(k) + 1 if the predicted label is different from the true label Repeat the procedure until all training examples are tested Choose the k whose Err(k) is minimal (k=1)
  • 30. How to choose k?  Use validation with leave-one-out method For k = 1, 2, …, K Err(k) = 0; 1. Randomly select a training data point and hide its class label 2. Using the remaining data and given k to predict the class label for the left data point 3. Err(k) = Err(k) + 1 if the predicted label is different from the true label Repeat the procedure until all training examples are tested Choose the k whose Err(k) is minimal Err(1) = 1
  • 32. How to choose k?  Use validation with leave-one-out method For k = 1, 2, …, K Err(k) = 0; 1. Randomly select a training data point and hide its class label 2. Using the remaining data and given k to predict the class label for the left data point 3. Err(k) = Err(k) + 1 if the predicted label is different from the true label Repeat the procedure until all training examples are tested Choose the k whose Err(k) is minimal Err(1) = 3 Err(2) = 2 Err(3) = 6 k = 2
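The leave-one-out procedure above can be sketched in a few lines of Python (the names `knn_predict` and `choose_k` are ours, not from the slides; ties between candidate k values fall back to the first one tried):

```python
import math
from collections import Counter

def _dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_predict(train, query, k):
    """Majority vote among the k nearest neighbors of query.
    train is a list of (attribute_tuple, label) pairs."""
    nearest = sorted(train, key=lambda xy: _dist(xy[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def choose_k(train, candidates):
    """Leave-one-out error Err(k) for each candidate k; return the best k."""
    errors = {}
    for k in candidates:
        err = 0
        for i, (x, y) in enumerate(train):
            rest = train[:i] + train[i + 1:]   # hide one example's label
            if knn_predict(rest, x, k) != y:   # predict it from the rest
                err += 1
        errors[k] = err
    return min(errors, key=errors.get)
```

Note the slide's "randomly select" phrasing: since every training point is tested exactly once, iterating in order gives the same Err(k) as random selection without replacement.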
  • 33. K-nearest neighbor for discrete classes  Algorithm (parameter k)  For each training example (X,C(X)) add the example to our training list.  When a new example Xq arrives, assign class:  C(Xq) = majority voting on the k nearest neighbors of Xq  C(Xq) = argmax v Σi δ(v, C(Xi))  where δ(a,b) = 1 if a = b and 0 otherwise
  • 34. K-nearest neighbor for real-valued functions  Algorithm (parameter k)  For each training example (X,C(X))  add the example to our training list.  When a new example Xq arrives, assign class:  C(Xq) = average value among k nearest neighbors of Xq  C(Xq) = Σ C(Xi) / k
  • 35. Distance Weighted Nearest Neighbor  It makes sense to weight the contribution of each example according to the distance to the new query example.  C(Xq) = argmax v Σi wi δ(v, C(Xi)) For example, wi = 1 / d(Xq,Xi)
  • 36. Nearest Neighbor Four things make a memory based learner: 1. A distance metric Euclidean 2. How many nearby neighbors to look at? k 3. A weighting function (optional) 1 / d(Xq,Xi) 4. How to fit with the local points? Predict the distance-weighted vote (or average) of the k neighbors.
  • 37. Distance Weighted Nearest Neighbor for Real-Valued Functions  For real valued functions we average based on the weight function and normalize using the sum of all weights.  C(Xq) = Σi wi C(Xi) / Σi wi
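The weighted-average rule C(Xq) = Σi wi C(Xi) / Σi wi with wi = 1/d(Xq, Xi) can be sketched as follows (the function name is ours; an exact match at distance zero is handled by returning that example's target directly, since 1/d would blow up):

```python
import math

def weighted_knn_regress(train, query, k):
    """Distance-weighted average of the k nearest real-valued targets.
    train is a list of (attribute_tuple, target_value) pairs."""
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    nearest = sorted(train, key=lambda xy: dist(xy[0], query))[:k]
    num = den = 0.0
    for x, y in nearest:
        d = dist(x, query)
        if d == 0:            # exact match: its weight would be infinite
            return y
        w = 1.0 / d           # w_i = 1 / d(Xq, X_i)
        num += w * y
        den += w
    return num / den
```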
  • 38. Problems with k-nearest Neighbor  The distance between examples is based on all attributes. What if some attributes are irrelevant?  Consider the curse of dimensionality.  The larger the number of irrelevant attributes, the higher the effect on the nearest-neighbor rule.  One solution is to use weights on the attributes. This is like stretching or contracting the dimensions on the input space.  Ideally we would like to eliminate all irrelevant attributes.
  • 39. Locally Weighted Regression  Let’s remember some terminology:  Regression: Is a problem similar to classification but the value to predict is a real number.  Residual: The difference between the true target value f and our approximation f’: f(X) – f’(X)  Kernel Function: The distance function that provides a weight to each example. The kernel function K is a function of the distance between examples: K = f(d(Xi,Xq))
  • 40. Locally Weighted Regression  The method is called locally weighted regression for the following reasons:  “Locally” because the predicted value for an example Xq is based only on the vicinity or neighborhood around Xq.  “Weighted” because the contribution of each neighbor of Xq will depend on the distance between the neighbor example and Xq.  “Regression” because the value to predict will be a real number.
  • 41. Locally Weighted Regression  Consider the problem of approximating a target function using a linear combination of attribute values:  f’(X) = w0 + w1x1 + w2x2 + … + wnxn  where X = (x1, x2, …, xn)  We want to find those coefficients that minimize the error: E = ½ Σk [f(X) – f’(X)]2
  • 42. Locally Weighted Regression  If we do this in the vicinity of an example Xq and we wish to use a kernel function, we get a form of locally weighted regression:  E(Xq) = ½ Σk [f(X) – f’(X)]² K(d(Xq, X)), where the sum now goes over the neighbors of Xq.
  • 43. Locally Weighted Regression  Using gradient descent search, the update rule is defined as:  Δwj = η Σk [f(X) – f’(X)] K(d(Xq, X)) xj, where η is the learning rate and xj is the jth attribute of example X.
  • 44. Locally Weighted Regression Then here are some commonly used weighting functions… (we use a Gaussian)
  • 45. Nearest Neighbor 1. A distance metric Scaled Euclidean 2. How many nearby neighbors to look at? All of them 3. A weighting function (optional) w_k = exp(−D(x_k, x_query)² / Kw²) 4. How to fit with the local points? First form a local linear model. Find the β that minimizes the locally weighted sum of squared residuals: β = argminβ Σk wk (yk − βᵀXk)²
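The two pieces above — a Gaussian kernel over all points and a local linear fit — can be combined in a one-dimensional sketch (the function name, and the fallback to a weighted mean when the fit is degenerate, are ours; the weighted least-squares coefficients use the standard closed form for a line):

```python
import math

def lwr_predict(xs, ys, xq, kw=1.0):
    """Locally weighted linear regression at query point xq (1-D).
    Gaussian kernel w_k = exp(-(x_k - xq)^2 / kw^2); fit y = b0 + b1*x
    by weighted least squares, then evaluate the line at xq."""
    w = [math.exp(-((x - xq) ** 2) / kw ** 2) for x in xs]
    sw = sum(w)
    sx = sum(wi * x for wi, x in zip(w, xs))
    sy = sum(wi * y for wi, y in zip(w, ys))
    sxx = sum(wi * x * x for wi, x in zip(w, xs))
    sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    denom = sw * sxx - sx * sx
    if abs(denom) < 1e-12:        # all weight on one x: weighted mean
        return sy / sw
    b1 = (sw * sxy - sx * sy) / denom   # weighted least-squares slope
    b0 = (sy - b1 * sx) / sw            # weighted least-squares intercept
    return b0 + b1 * xq
```

On exactly linear data the weighted fit recovers the true line regardless of the kernel width, so `lwr_predict([0, 1, 2, 3], [0, 2, 4, 6], 1.5)` returns 3.0; on curved data, shrinking `kw` makes the prediction more local.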
  • 46. Locally Weighted Regression  Remarks:  The literature contains other functions that are non linear.  There are many variations to locally weighted regression that use different kernel functions.  Normally a linear model is sufficiently good to approximate the local neighborhood of an example.

Editor's Notes

  1. What is machine learning?