Lecture 2:
Decision Trees – Nearest Neighbors
Last Updated: 09 Sept 2013
Uppsala University
Department of Linguistics and Philology, September 2013
Machine Learning for Language Technology
Most slides from previous courses (i.e., from E. Alpaydin and J. Nivre), with adaptations
Supervised Classification
• Divide instances into (two or more) classes
• Instance (feature vector): $x = (x_1, \ldots, x_m)$
  • Features may be categorical or numerical
• Class (label): $y$
• Training data: $X = \{(x^t, y^t)\}_{t=1}^{N}$
• Classification in Language Technology:
  • Spam filtering (spam vs. non-spam)
  • Spelling error detection (error vs. no error)
  • Text categorization (news, economy, culture, sport, ...)
  • Named entity classification (person, location, organization, ...)
Concept = the thing to be learned (here: how to decide the genre of a new webpage)
Instances = examples (here: a webpage)
Features = attributes (here: morphological features, normalized counts)
Classes = labels (here: webgenres)
Format for the software you chose (here: arff)
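To make the format concrete, here is a minimal, invented ARFF file for the webgenre example above (the attribute names and values are hypothetical, not from the lecture); the snippet simply writes it to disk with plain Python:

```python
# A tiny, invented ARFF example for the webgenre task sketched above.
arff_example = """\
@relation webgenres

@attribute avg_sentence_length numeric
@attribute num_imperatives numeric
@attribute genre {news,blog,shop}

@data
18.2,0,news
11.5,3,blog
7.9,12,shop
"""

# Write it out so it can be opened in Weka or any other tool that reads ARFF.
with open("webgenres.arff", "w", encoding="utf-8") as f:
    f.write(arff_example)
```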
Models for Classification
• Generative probabilistic models:
  • Model of P(x, y)
  • Naive Bayes
• Conditional probabilistic models:
  • Model of P(y | x)
  • Logistic regression (next time)
• Discriminative models:
  • No explicit probability model
  • Decision trees, nearest neighbor classification (today)
  • Perceptron, support vector machines, MIRA (next time)
Repeating…
• Noise
  • Data cleaning is expensive and time consuming
• Margin
• Inductive bias
Types of inductive biases:
• Minimum cross-validation error
• Maximum margin
• Minimum description length
• [...]
Decision Trees
• Hierarchical tree structure for classification
  • Each internal node specifies a test of some feature
  • Each branch corresponds to a value for the tested feature
  • Each leaf node provides a classification for the instance
• Represents a disjunction of conjunctions of constraints:
  • Each path from root to leaf specifies a conjunction of tests
  • The tree itself represents the disjunction of all paths
Decision Tree
Divide and Conquer
• Internal decision nodes
  • Univariate: uses a single attribute, $x_i$
    • Numeric $x_i$: binary split, $x_i > w_m$
    • Discrete $x_i$: n-way split for the n possible values
  • Multivariate: uses all attributes, $x$
• Leaves
  • Classification: class labels (or proportions)
  • Regression: r average (or local fit)
• Learning:
  • Greedy recursive algorithm
  • Find the best split $X = (X_1, ..., X_p)$, then induce the tree for each branch (a minimal sketch follows below)
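The sketch below makes the greedy recursion concrete in Python. It assumes categorical features, n-way splits, and entropy as the impurity measure (introduced on the next slides); the function and variable names are illustrative, not lecture code.

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a list of class labels: -sum_i p_i log2 p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def induce_tree(instances, labels, features):
    """Greedy recursive tree induction (ID3-style sketch).

    instances: list of dicts mapping feature name -> categorical value
    labels:    class labels, parallel to instances
    features:  feature names still available for splitting
    """
    # Pure node, or nothing left to split on: return a leaf (majority class).
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]

    # Choose the split that minimizes the weighted impurity of its branches.
    def split_impurity(feature):
        total = 0.0
        for value in set(x[feature] for x in instances):
            branch = [y for x, y in zip(instances, labels) if x[feature] == value]
            total += len(branch) / len(labels) * entropy(branch)
        return total

    best = min(features, key=split_impurity)

    # Recurse: one subtree per observed value of the chosen feature.
    remaining = [f for f in features if f != best]
    branches = {}
    for value in set(x[best] for x in instances):
        idx = [i for i, x in enumerate(instances) if x[best] == value]
        branches[value] = induce_tree([instances[i] for i in idx],
                                      [labels[i] for i in idx], remaining)
    return {"feature": best, "branches": branches}
```

Internal nodes are represented here as dicts holding the tested feature and one branch per value; leaves are plain class labels.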
Classification Trees
(ID3, CART, C4.5)
• For node m, $N_m$ instances reach m, and $N_m^i$ of them belong to class $C_i$:
  $\hat{P}(C_i \mid x, m) \equiv p_m^i = \dfrac{N_m^i}{N_m}$
• Node m is pure if $p_m^i$ is 0 or 1
• Measure of impurity is entropy:
  $I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$
Example: Entropy
• Assume two classes (C1, C2) and four instances (x1, x2, x3, x4)
• Case 1:
  • C1 = {x1, x2, x3, x4}, C2 = { }
  • I_m = – (1 log2 1 + 0 log2 0) = 0
• Case 2:
  • C1 = {x1, x2, x3}, C2 = {x4}
  • I_m = – (0.75 log2 0.75 + 0.25 log2 0.25) ≈ 0.81
• Case 3:
  • C1 = {x1, x2}, C2 = {x3, x4}
  • I_m = – (0.5 log2 0.5 + 0.5 log2 0.5) = 1
(By convention, 0 log2 0 is taken to be 0.)
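The three cases can be checked with a few lines of Python (the entropy helper from the sketch above is repeated so the snippet runs on its own):

```python
import math
from collections import Counter

def entropy(labels):
    """I_m = -sum_i p_i log2 p_i, with 0 log2 0 treated as 0."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["C1", "C1", "C1", "C1"]))  # Case 1: 0.0 (pure node)
print(entropy(["C1", "C1", "C1", "C2"]))  # Case 2: ~0.811
print(entropy(["C1", "C1", "C2", "C2"]))  # Case 3: 1.0
```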
Best Split
• If node m is pure, generate a leaf and stop; otherwise split with test t and continue recursively
• Find the test that minimizes impurity
• Impurity after split with test t, where $N_{mj}$ of the $N_m$ instances take branch j and $N_{mj}^i$ of them belong to $C_i$:
  $\hat{P}(C_i \mid x, m, j) \equiv p_{mj}^i = \dfrac{N_{mj}^i}{N_{mj}}$
  $I_m(t) = -\sum_{j=1}^{n} \dfrac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i$
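The same computation in isolation, as a hedged Python sketch for categorical splits, with a small numeric example:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_impurity(values, labels):
    """I_m(t): entropy of each branch, weighted by the fraction of instances in it.

    values: the tested feature's value for each instance (defines the branch j)
    labels: the class label of each instance
    """
    n = len(labels)
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    return sum(len(b) / n * entropy(b) for b in branches.values())

# Four instances split on a binary feature: one pure branch, one maximally impure.
values = ["yes", "yes", "no", "no"]
labels = ["C1", "C1", "C1", "C2"]
print(split_impurity(values, labels))  # 0.5 * 0 + 0.5 * 1 = 0.5
```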
Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned
• Information gain tells us how important a given attribute of the feature vectors is
• We will use it to decide the ordering of attributes in the nodes of a decision tree
Information Gain and Gain Ratio
• Choosing the test that minimizes impurity maximizes the information gain (IG):
  $IG_m(t) = I_m - I_m(t)$
• Information gain prefers features with many values
• The normalized version is called gain ratio (GR):
  $V_m(t) = -\sum_{j=1}^{n} \dfrac{N_{mj}}{N_m} \log_2 \dfrac{N_{mj}}{N_m}$
  $GR_m(t) = \dfrac{IG_m(t)}{V_m(t)}$
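A minimal Python sketch of both quantities (entropy and the branch bookkeeping are restated so the snippet is self-contained; names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """IG_m(t) = I_m - I_m(t) for the split defined by `values`."""
    n = len(labels)
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    impurity_after = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - impurity_after

def gain_ratio(values, labels):
    """GR_m(t) = IG_m(t) / V_m(t), where V_m(t) is the split information."""
    n = len(labels)
    split_info = -sum((s / n) * math.log2(s / n) for s in Counter(values).values())
    return information_gain(values, labels) / split_info  # undefined for a trivial split

values = ["yes", "yes", "no", "no"]
labels = ["C1", "C1", "C1", "C2"]
print(information_gain(values, labels))  # 0.811 - 0.5 = 0.311
print(gain_ratio(values, labels))        # 0.311 / 1.0 = 0.311
```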
Pruning Trees
• Decision trees are susceptible to overfitting
• Remove subtrees for better generalization:
  • Prepruning: early stopping (e.g., with an entropy threshold)
  • Postpruning: grow the whole tree, then prune subtrees
• Prepruning is faster; postpruning is more accurate (but requires a separate validation set)
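As one concrete form of postpruning, the sketch below uses scikit-learn's minimal cost-complexity pruning (an assumption of this example, not necessarily the pruning method meant in the lecture): grow the full tree, compute the pruning path, and keep the pruned tree that scores best on held-out validation data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning strengths (ccp_alpha = 0 corresponds to the unpruned tree).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Fit one tree per alpha and keep the one that generalizes best to the validation set.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
print(best.get_n_leaves(), best.score(X_val, y_val))
```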
Rule Extraction from Trees
C4.5Rules
(Quinlan, 1993)
Learning Rules
• Rule induction is similar to tree induction, but:
  • tree induction is breadth-first
  • rule induction is depth-first (one rule at a time)
• Rule learning:
  • A rule is a conjunction of terms (cf. a tree path)
  • A rule covers an example if all terms of the rule evaluate to true for the example (cf. a sequence of tests)
  • Sequential covering: generate rules one at a time until all positive examples are covered
  • IREP (Fürnkranz and Widmer, 1994), Ripper (Cohen, 1995)
Properties of Decision Trees
• Decision trees are appropriate for classification when:
  • Features can be both categorical and numeric
  • Disjunctive descriptions may be required
  • Training data may be noisy (missing values, incorrect labels)
  • Interpretation of the learned model is important (rules)
• Inductive bias of (most) decision tree learners:
  1. Prefers trees with informative attributes close to the root
  2. Prefers smaller trees over bigger ones (with pruning)
  3. Preference bias (incomplete search of a complete hypothesis space)
k-Nearest Neighbors
Nearest Neighbor Classification
• An old idea
• Key components:
  • Storage of old instances
  • Similarity-based reasoning to new instances

"This 'rule of nearest neighbor' has considerable elementary intuitive appeal and probably corresponds to practice in many situations. For example, it is possible that much medical diagnosis is influenced by the doctor's recollection of the subsequent history of an earlier patient whose symptoms resemble in some way those of the current patient." (Fix and Hodges, 1952)
k-Nearest Neighbour
• Learning:
  • Store training instances in memory
• Classification:
  • Given a new test instance x:
    • Compare it to all stored instances
    • Compute a distance between x and each stored instance $x^t$
    • Keep track of the k closest instances (the nearest neighbors)
    • Assign to x the majority class of the k nearest neighbours
• A geometric view of learning:
  • Proximity in (feature) space → same class
  • The smoothness assumption
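A minimal Python sketch of this procedure, assuming categorical features and the overlap distance defined a few slides below (names are illustrative):

```python
from collections import Counter

def overlap(x, z):
    """Overlap distance: the number of mismatching features."""
    return sum(xi != zi for xi, zi in zip(x, z))

def knn_classify(x, training, k, distance=overlap):
    """Assign to x the majority class among its k nearest training instances.

    training: list of (instance, label) pairs kept in memory
    """
    neighbors = sorted(training, key=lambda pair: distance(x, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```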
Eager and Lazy Learning
• Eager learning (e.g., decision trees)
  • Learning – induce an abstract model from data
  • Classification – apply the model to new data
• Lazy learning (a.k.a. memory-based learning)
  • Learning – store data in memory
  • Classification – compare new data to the data in memory
• Properties:
  • Retains all the information in the training set – no abstraction
  • Complex hypothesis space – suitable for natural language?
  • Main drawback – classification can be very inefficient
Dimensions of a k-NN Classifier
• Distance metric
  • How do we measure distance between instances?
  • Determines the layout of the instance space
• The k parameter
  • How large a neighborhood should we consider?
  • Determines the complexity of the hypothesis space
Distance Metric 1
• Overlap = count of mismatching features
  $\Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)$
  $\delta(x_i, z_i) = \begin{cases} \dfrac{|x_i - z_i|}{\max_i - \min_i} & \text{if numeric} \\ 0 & \text{if } x_i = z_i \\ 1 & \text{if } x_i \ne z_i \end{cases}$
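A sketch of this metric in Python; the numeric case assumes the per-feature (min, max) ranges are estimated from the training data and passed in (helper names are illustrative):

```python
def overlap_distance(x, z, ranges=None):
    """Delta(x, z) = sum_i delta(x_i, z_i).

    Categorical features contribute 0 (match) or 1 (mismatch); numeric features
    contribute |x_i - z_i| scaled by the feature's range, supplied in `ranges`
    as {feature_index: (min, max)}.
    """
    ranges = ranges or {}
    total = 0.0
    for i, (xi, zi) in enumerate(zip(x, z)):
        if i in ranges:                      # numeric feature
            lo, hi = ranges[i]
            total += abs(xi - zi) / (hi - lo) if hi > lo else 0.0
        else:                                # categorical feature
            total += 0 if xi == zi else 1
    return total

# Categorical-only call: identical to the plain mismatch count.
print(overlap_distance(("a", "b", "a", "c"), ("a", "b", "b", "a")))  # 2.0
```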
Distance Metric 2
• MVDM = Modified Value Difference Metric
  $\Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)$
  $\delta(x_i, z_i) = \sum_{j=1}^{K} \left| P(C_j \mid x_i) - P(C_j \mid z_i) \right|$
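A sketch of MVDM in Python, estimating the conditional class probabilities P(C_j | x_i = v) from the training set by simple relative frequencies (names and the toy data are illustrative):

```python
from collections import Counter, defaultdict

def value_class_probs(instances, labels):
    """Estimate P(C_j | feature i has value v) for every (i, v) seen in training."""
    counts = defaultdict(Counter)              # (i, v) -> Counter over class labels
    for x, y in zip(instances, labels):
        for i, v in enumerate(x):
            counts[(i, v)][y] += 1
    return {key: {c: n / sum(ctr.values()) for c, n in ctr.items()}
            for key, ctr in counts.items()}

def mvdm(x, z, probs, classes):
    """Delta(x, z) = sum_i sum_j |P(C_j | x_i) - P(C_j | z_i)|."""
    total = 0.0
    for i, (xi, zi) in enumerate(zip(x, z)):
        px = probs.get((i, xi), {})
        pz = probs.get((i, zi), {})
        total += sum(abs(px.get(c, 0.0) - pz.get(c, 0.0)) for c in classes)
    return total

# Toy usage with the training instances from the worked example two slides below.
train = [("a","b","a","c"), ("a","b","c","a"), ("b","a","c","c"), ("c","a","b","c")]
y = ["A", "B", "C", "A"]
probs = value_class_probs(train, y)
print(mvdm(("a","b","b","a"), ("b","a","c","c"), probs, {"A", "B", "C"}))
```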
The k parameter
• Tunes the complexity of the hypothesis space
• If k = 1, every instance has its own neighborhood
• If k = N, the whole feature space is one neighborhood
• Error on a validation set V of size M (used to tune k):
  $\hat{E} = E(h \mid V) = \sum_{t=1}^{M} 1\left(h(x^t) \ne r^t\right)$
(Figure: panels for k = 1 and k = 15.)
A Simple Example
Training set:
1. (a, b, a, c) → A
2. (a, b, c, a) → B
3. (b, a, c, c) → C
4. (c, a, b, c) → A

New instance:
5. (a, b, b, a)

Distances (overlap):
Δ(1, 5) = 2
Δ(2, 5) = 1
Δ(3, 5) = 4
Δ(4, 5) = 3

k-NN classification:
1-NN(5) = B
2-NN(5) = A/B (tie)
3-NN(5) = A
4-NN(5) = A
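These numbers can be reproduced with the k-NN sketch shown earlier; here it is repeated in a self-contained form (a hedged sketch, not lecture code):

```python
from collections import Counter

def overlap(x, z):
    return sum(xi != zi for xi, zi in zip(x, z))

def knn_classify(x, training, k):
    neighbors = sorted(training, key=lambda pair: overlap(x, pair[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

training = [(("a", "b", "a", "c"), "A"), (("a", "b", "c", "a"), "B"),
            (("b", "a", "c", "c"), "C"), (("c", "a", "b", "c"), "A")]
new = ("a", "b", "b", "a")

print([overlap(new, x) for x, _ in training])   # [2, 1, 4, 3]
print([knn_classify(new, training, k) for k in (1, 2, 3, 4)])
# 1-NN: B; 3-NN: A; 4-NN: A. For k = 2 the vote is a tie between A and B,
# which this sketch breaks arbitrarily (the slide reports it as A/B).
```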
Further Variations on k-NN
• Feature weights:
  • The overlap metric gives all features equal weight
  • Features can be weighted by IG or GR
• Weighted voting:
  • The normal decision rule gives all neighbors equal weight
  • Instances can be weighted by (inverse) distance
(Both variations are sketched below.)
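A hedged sketch of both variations in Python; the feature weights below are invented for illustration (in practice they would be IG or GR values computed on the training set):

```python
from collections import defaultdict

def weighted_overlap(x, z, weights):
    """Overlap distance with a weight per feature (e.g., its information gain)."""
    return sum(w * (xi != zi) for xi, zi, w in zip(x, z, weights))

def weighted_vote(x, training, k, weights):
    """k-NN where each of the k nearest neighbors votes with weight 1 / distance."""
    neighbors = sorted(training, key=lambda p: weighted_overlap(x, p[0], weights))[:k]
    votes = defaultdict(float)
    for inst, label in neighbors:
        votes[label] += 1.0 / (weighted_overlap(x, inst, weights) + 1e-9)
    return max(votes, key=votes.get)

training = [(("a", "b", "a", "c"), "A"), (("a", "b", "c", "a"), "B"),
            (("b", "a", "c", "c"), "C"), (("c", "a", "b", "c"), "A")]
weights = [1.0, 1.0, 0.5, 2.0]   # hypothetical per-feature weights
print(weighted_vote(("a", "b", "b", "a"), training, k=3, weights=weights))
# 'B': the single very close neighbor outweighs the two more distant A neighbors,
# unlike the unweighted 3-NN vote on the previous slide, which gave A.
```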
Properties of k-NN
• Nearest neighbor classification is appropriate when:
  • Features can be both categorical and numeric
  • Disjunctive descriptions may be required
  • Training data may be noisy (missing values, incorrect labels)
  • Fast classification is not crucial
• Inductive bias of k-NN:
  1. Nearby instances should have the same label (smoothness)
  2. All features are equally important (without feature weights)
  3. Complexity is tuned by the k parameter
Reading
• Alpaydin (2010): Ch. 3, 9
• Daumé (2012): Ch. 1–2
• Witten and Frank (2005): Ch. 2
• Mitchell (1997): Ch. 3, Decision Tree Learning
End of Lecture 2
Thanks for your attention