Lecture 7 - IBk
1. Introduction to Machine Learning
Lecture 7
Instance Based Learning
Albert Orriols i Puig
aorriols@salle.url.edu
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
2. Recap of Lecture 6
LET’S START WITH DATA CLASSIFICATION
3. Recap of Lecture 6
[Diagram: Data Set → Classification Model. How?]
We are going to deal with:
• Data described by nominal and continuous attributes
• Data that may have instances with missing values
4. Recap of Lecture 6
We want to build decision trees
How can I automatically generate these types of trees?
Decide which attribute we should put in each node
Decide a split point
Rely on information theory
We also saw many other improvements
5. Today’s Agenda
Classification without building a model
K-Nearest Neighbor (kNN)
Effect of k
Distance functions
Variants of kNN
Strengths and weaknesses
6. Classification without Building a Model
Forget about a global model!
Simply store all the training examples
Build a local model for each new test instance
Referred to as lazy learners
Some approaches to instance-based learning (IBL):
Nearest neighbors
Locally weighted regression
Case-based reasoning
7. k-Nearest Neighbors
Algorithm:
Store all the training data
Given a new test instance:
Recover the k nearest neighbors of the test instance
Predict the majority class among the neighbors
Voronoi cells: the feature space is decomposed into several cells (e.g., for k=1)
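The procedure above fits in a few lines of code. A minimal sketch in Python with NumPy, assuming continuous attributes and Euclidean distance; the function name knn_predict is ours, not from the lecture:

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Predict the class of x_test by majority vote among its k nearest
    training instances, using Euclidean distance."""
    # Distance from the test instance to every stored training instance
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k closest training instances
    nearest = np.argsort(dists)[:k]
    # Majority class among the neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Tiny illustration: the test point lies next to the two class-0 instances
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 9.0]])
y = np.array([0, 0, 1])
print(knn_predict(X, y, np.array([1.1, 0.9]), k=1))  # -> 0
```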
8. k-Nearest Neighbors
But, where is the learning process?
Is selecting the k neighbors and returning the majority class learning?
No, that’s just retrieving
But still, some important issues
Which k should I use?
Which distance functions should I use?
Should I maintain all instances of the training data set?
9. Which k Should I Use?
The effect of k
[Figure: decision boundaries with 15-NN vs. 1-NN]
Do you remember the discussion about overfitting in C4.5?
Apply the same concepts here!
10. Which k Should I Use?
Some experimental results on the use of different k
[Figure: error vs. number of neighbors, with 7-NN marked]
Notice that the test error decreases as k increases, but at k ≈ 5-7 it starts increasing again
Rule of thumb: k=3, k=5, and k=7 seem to work well in the majority of problems
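One way to operationalize this rule of thumb is to choose k by error on a held-out validation set, mirroring the error curve discussed above. A sketch reusing the knn_predict function from the slide 7 example; the name select_k and the candidate values are illustrative:

```python
import numpy as np

def select_k(X_train, y_train, X_val, y_val, candidates=(1, 3, 5, 7, 9, 15)):
    """Return the k with the lowest error on a held-out validation set."""
    def val_error(k):
        preds = np.array([knn_predict(X_train, y_train, x, k) for x in X_val])
        return np.mean(preds != y_val)
    return min(candidates, key=val_error)
```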
11. Distance Functions
Distance functions must be able to handle:
Nominal attributes
Continuous attributes
Missing values
The key:
They must return a low value for similar objects and a high value for different objects
Seems obvious, right? But still, it is domain dependent
There are many of them. Let’s see some of the most used
12. Distance Functions
Distance between two points in the same space: d(x, y)
Some properties expected to be satisfied in general:
d(x, y) ≥ 0 and d(x, x) = 0 (non-negativity)
d(x, y) = d(y, x) (symmetry)
d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality)
13. Distances for Continuous Variables
Given x = (x_1, …, x_n)' and y = (y_1, …, y_n)':
Euclidean: $d_E(x, y) = \left[ \sum_{i=1}^{n} (x_i - y_i)^2 \right]^{1/2}$
Minkowski: $d_M(x, y) = \left[ \sum_{i=1}^{n} |x_i - y_i|^q \right]^{1/q}$
Absolute value (Manhattan): $d_{ABS}(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
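These three definitions translate directly into code. A minimal NumPy sketch (function names are ours):

```python
import numpy as np

def euclidean(x, y):
    # Minkowski with q = 2
    return np.sum((x - y) ** 2) ** 0.5

def minkowski(x, y, q):
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def absolute(x, y):
    # a.k.a. Manhattan distance; Minkowski with q = 1
    return np.sum(np.abs(x - y))
```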
14. Distances for Continuous Variables
What if attributes are measured over different scales?
Attribute 1 ranging in [0,1]
Attribute 2 ranging in [0,1000]
Can you detect any potential problem in the aforementioned distance functions?
[Figure: scatter plots with x in [0,1] vs. y in [0,1000], and x in [0,1000] vs. y in [0,1000]]
15. Distances for Continuous Variables
The larger the scale, the larger the influence of the attribute in the distance function
Solution: normalize each attribute
How:
Normalization by means of the range: $d_a^{norm}(ex_1^a, ex_2^a) = \frac{d(ex_1^a, ex_2^a)}{\max_a - \min_a}$
Normalization by means of the standard deviation: $d_a^{norm}(ex_1^a, ex_2^a) = \frac{d(ex_1^a, ex_2^a)}{4\sigma_a}$
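Equivalently, one can rescale the attributes once before computing distances. A sketch of both normalizations, assuming each attribute is a column of X and is non-constant:

```python
import numpy as np

def normalize_range(X):
    """Rescale each attribute (column) by its range, so every attribute
    contributes on a comparable [0, 1] scale."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

def normalize_std(X):
    """Rescale each attribute by 4 standard deviations, matching the
    4*sigma_a denominator on the slide."""
    return (X - X.mean(axis=0)) / (4.0 * X.std(axis=0))
```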
16. Distances for Nominal Attributes
Several metrics to deal with nominal attributes
Overlap distance function
Idea: two nominal attribute values are equal only if they are the same value; the distance is 0 if they match and 1 otherwise
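In code, the overlap distance for one attribute is a single comparison (a sketch; the function name is ours):

```python
def overlap(a, b):
    """Overlap distance for one nominal attribute:
    0 if the two values match, 1 otherwise."""
    return 0.0 if a == b else 1.0
```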
17. Distances for Nominal Attributes
Several metrics to deal with nominal attributes
Value difference metric (VDM): $vdm_a(x, y) = \sum_{c=1}^{C} | P(c \mid a = x) - P(c \mid a = y) |^q$
where C is the number of classes and $P(c \mid a = ex_i^a)$ is the conditional probability that the output class is c given that attribute a has the value $ex_i^a$
Idea: two nominal values are similar if they have similar correlations with the output classes
See Wilson & Martinez (1997) for more distance functions
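A sketch of VDM for a single attribute, estimating the conditional probabilities by counting over the training data; the function name, argument layout, and the default q = 2 are our choices:

```python
from collections import Counter

def vdm(attr_values, class_labels, x, y, q=2):
    """Value Difference Metric for one nominal attribute.

    attr_values:  this attribute's value in each training instance
    class_labels: the class of each training instance
    x, y:         the two attribute values being compared
    Assumes both x and y occur at least once in attr_values.
    """
    classes = set(class_labels)

    def cond_probs(v):
        # Estimate P(c | attribute = v) by counting over the training data
        labels = [c for av, c in zip(attr_values, class_labels) if av == v]
        counts = Counter(labels)
        return {c: counts[c] / len(labels) for c in classes}

    px, py = cond_probs(x), cond_probs(y)
    return sum(abs(px[c] - py[c]) ** q for c in classes)
```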
18. Distances for Heterogeneous Attributes
What if my data set is described by both nominal and continuous attributes?
Combine both within the same distance function:
Use nominal distance functions for the nominal attributes
Use continuous distance functions for the continuous attributes
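A sketch in this spirit, close to the heterogeneous metrics of Wilson & Martinez: overlap for nominal attributes and a range-normalized absolute difference for continuous ones. The nominal_mask and ranges arguments are our own way of passing in attribute types and scales:

```python
def heterogeneous_distance(x, y, nominal_mask, ranges):
    """Sum per-attribute distances: overlap for nominal attributes,
    range-normalized absolute difference for continuous ones."""
    total = 0.0
    for xi, yi, is_nominal, r in zip(x, y, nominal_mask, ranges):
        if is_nominal:
            total += 0.0 if xi == yi else 1.0   # overlap distance
        else:
            total += abs(xi - yi) / r           # range-normalized difference
    return total
```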
19. Variants of kNN
Different variants of kNN
Distance-weighted kNN
Attribute-weighted kNN
20. Distance-Weighted kNN
Inference in the original kNN:
The k nearest neighbors vote for the class
Shouldn’t the closest examples have a higher influence in the decision process?
Weight the contribution of each of the k neighbors with respect to their distance, where $w_i = \frac{1}{d(x_q, x_i)^2}$
For classification: $\hat{f}(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i))$
For regression: $\hat{f}(x_q) = \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$
More robust to noisy instances and outliers
E.g.: Shepard’s method (Shepard, 1968)
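A sketch of the distance-weighted vote in the setting of the earlier knn_predict example; if a neighbor coincides exactly with the query, its class is returned directly to avoid dividing by zero:

```python
import numpy as np

def dw_knn_predict(X_train, y_train, x_test, k=5):
    """Distance-weighted kNN: each neighbor votes with weight 1 / d^2."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        if dists[i] == 0.0:            # exact match: return its class directly
            return y_train[i]
        w = 1.0 / dists[i] ** 2        # closer neighbors weigh more
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)
```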
21. Attribute-weighted kNN
What if some attributes are irrelevant or misleading?
If irrelevant: cost increases, but accuracy is not affected
If misleading: cost increases and accuracy may decrease
Weight attributes:
$d_w(x, y) = \sum_{i=1}^{n} w_i (x_i - y_i)^2$
How to determine the weights?
Option 1: The expert provides us with the weights
Option 2: Use a machine learning approach
More will be said in the next lecture!
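The weighted distance is a one-line change to the plain Euclidean computation (a sketch; a weight of zero removes the attribute entirely):

```python
import numpy as np

def weighted_distance(x, y, w):
    """Attribute-weighted (squared) Euclidean distance; a weight near zero
    suppresses an irrelevant or misleading attribute."""
    return np.sum(w * (x - y) ** 2)
```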
22. Strengths and Weaknesses
Strengths of kNN
Builds a new local model for each test instance
Learning (training) has no cost
Empirical results show that the method is highly accurate compared with other machine learning techniques
Weaknesses
It retrieves rather than learns
No global model; the knowledge is not legible
Test cost increases linearly with the number of training instances
No generalization
Curse of dimensionality: what happens if we have many attributes?
Noise and outliers may have a very negative effect
23. Next Class
From instance-based to case-based reasoning
A little bit more on learning
Distance functions
Prototype selection
24. Introduction to Machine Learning
Lecture 7
Instance Based Learning
Albert Orriols i Puig
aorriols@salle.url.edu
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull