Your SlideShare is downloading.
×

- 1. Machine Learning for Language Technology Lecture 8: Decision Trees and k-‐Nearest Neighbors Marina San:ni Department of Linguis:cs and Philology Uppsala University, Uppsala, Sweden Autumn 2014 Acknowledgement: Thanks to Prof. Joakim Nivre for course design and materials 1
- 2. Supervised Classifica:on • Divide instances into (two or more) classes – Instance (feature vector): • Features may be categorical or numerical – Class (label): – Training data: • Classifica:on in Language Technology – Spam filtering (spam vs. non-‐spam) – Spelling error detec:on (error vs. no error) – Text categoriza:on (news, economy, culture, sport, ...) – Named en:ty classifica:on (person, loca:on, organiza:on, ...) 2 € X = {xt,y t }N t=1 € x = x1, …, xm € y
- 3. Models for Classifica:on • Genera:ve probabilis:c models: – Model of P(x, y) – Naive Bayes • Condi:onal probabilis:c models: – Model of P(y | x) – Logis:c regression • Discrimina:ve model: – No explicit probability model – Decision trees, nearest neighbor classifica:on – Perceptron, support vector machines, MIRA 3
- 4. Repea:ng… • Noise – Data cleaning is expensive and :me consuming • Margin • Induc:ve bias Types of inductive biases • Minimum cross-‐‑validation error • Maximum margin • Minimum description length • [...] 4
- 5. DECISION TREES 5
- 6. Decision Trees • Hierarchical tree structure for classifica:on – Each internal node specifies a test of some feature – Each branch corresponds to a value for the tested feature – Each leaf node provides a classifica:on for the instance • Represents a disjunc:on of conjunc:ons of constraints – Each path from root to leaf specifies a conjunc:on of tests – The tree itself represents the disjunc:on of all paths 6
- 7. Decision Tree 7
- 8. Divide and Conquer • Internal decision nodes – Univariate: Uses a single a]ribute, xi • Numeric xi : Binary split : xi > wm • Discrete xi : n-‐way split for n possible values – Mul:variate: Uses all a]ributes, x • Leaves – Classifica:on: class labels (or propor:ons) – Regression: r average (or local fit) • Learning: – Greedy recursive algorithm – Find best split X = (X1, ..., Xp), then induce tree for each Xi 8
- 9. Classifica:on Trees (ID3, CART, C4.5) 9 • For node m, Nm instances reach m, Ni m belong to Ci • Node m is pure if pi m i Nm is 0 or 1 • Measure of impurity is entropy KΣ Im = − pm i logpi 2m i=1 ˆP (Ci |x,m) ≡ pm i = Nm
- 10. Example: Entropy • Assume two classes (C1, C2) and four instances (x1, x2, x3, x4) • Case 1: – C1 = {x1, x2, x3, x4}, C2 = { } – Im = – (1 log 1 + 0 log 0) = 0 • Case 2: – C1 = {x1, x2, x3}, C2 = {x4} – Im = – (0.75 log 0.75 + 0.25 log 0.25) = 0.81 • Case 3: – C1 = {x1, x2}, C2 = {x3, x4} – Im = – (0.5 log 0.5 + 0.5 log 0.5) = 1 10
- 11. Best Split ˆP (Ci |x,m, j) ≡ pmj i = i Nmj Nmj i log2pmj 11 • If node m is pure, generate a leaf and stop, otherwise split with test t and con:nue recursively • Find the test that minimizes impurity • Impurity ager split with test t: – Nmj of Nm take branch j – Ni mj belong to Ci Im nΣ t = − Nmj Nj=1 m KΣ pmj i KΣ i=1 € Im = − pm i logpi 2m i=1 ˆP Ci ( |x,m) ≡ pm i = Ni m Nm
- 12. 12
- 13. Informa:on Gain • We want to determine which a]ribute in a given set of training feature vectors is most useful for discrimina:ng between the classes to be learned. • •Informa:on gain tells us how important a given a]ribute of the feature vectors is. • •We will use it to decide the ordering of a]ributes in the nodes of a decision tree. 13
- 14. Informa:on Gain and Gain Ra:o € KΣ Im = − pm i logpi 2m i=1 i log2pmj Nmj Nj=1 m 14 • Choosing the test that minimizes impurity maximizes the informa:on gain (IG): • Informa:on gain prefers features with many values • The normalized version is called gain ra:o (GR): € Im nΣ t = − Nmj j=1 Nm pmj i KΣ i=1 € IGm t = I− It m m Vt = − m Nmj Nm log2 nΣ € GRm t = IGt m Vm t
- 15. Pruning Trees • Decision trees are suscep:ble to overfijng • Remove subtrees for be]er generaliza:on: – Prepruning: Early stopping (e.g., with entropy threshold) – Postpruning: Grow whole tree, then prune subtrees • Prepruning is faster, postpruning is more accurate (requires a separate valida:on set) 15
- 16. Rule Extrac:on from Trees 16 C4.5Rules (Quinlan, 1993)
- 17. Learning Rules • Rule induc:on is similar to tree induc:on but – tree induc:on is breadth-‐first – rule induc:on is depth-‐first (one rule at a :me) • Rule learning: – A rule is a conjunc:on of terms (cf. tree path) – A rule covers an example if all terms of the rule evaluate to true for the example (cf. sequence of tests) – Sequen:al covering: Generate rules one at a :me un:l all posi:ve examples are covered – IREP (Fürnkrantz and Widmer, 1994), Ripper (Cohen, 1995) 17
- 18. Proper:es of Decision Trees • Decision trees are appropriate for classifica:on when: – Features can be both categorical and numeric – Disjunc:ve descrip:ons may be required – Training data may be noisy (missing values, incorrect labels) – Interpreta:on of learned model is important (rules) • Induc0ve bias of (most) decision tree learners: 1. Prefers trees with informa0ve aEributes close to the root 2. Prefers smaller trees over bigger ones (with pruning) 3. Preference bias (incomplete search of complete space) 18
- 19. K-‐NEAREST NEIGHBORS 19
- 20. Nearest Neighbor Classifica:on • An old idea • Key components: – Storage of old instances – Similarity-‐based reasoning to new instances 20 This “rule of nearest neighbor” has considerable elementary intuitive appeal and probably corresponds to practice in many situations. For example, it is possible that much medical diagnosis is influenced by the doctor's recollection of the subsequent history of an earlier patient whose symptoms resemble in some way those of the current patient. (Fix and Hodges, 1952)
- 21. k-‐Nearest Neighbour • Learning: – Store training instances in memory • Classifica:on: – Given new test instance x, • Compare it to all stored instances • Compute a distance between x and each stored instance xt • Keep track of the k closest instances (nearest neighbors) – Assign to x the majority class of the k nearest neighbours • A geometric view of learning – Proximity in (feature) space à same class – The smoothness assump:on 21
- 22. Eager and Lazy Learning • Eager learning (e.g., decision trees) – Learning – induce an abstract model from data – Classifica:on – apply model to new data • Lazy learning (a.k.a. memory-‐based learning) – Learning – store data in memory – Classifica:on – compare new data to data in memory – Proper:es: • Retains all the informa:on in the training set – no abstrac:on • Complex hypothesis space – suitable for natural language? • Main drawback – classifica:on can be very inefficient 22
- 23. Dimensions of a k-‐NN Classifier • Distance metric – How do we measure distance between instances? – Determines the layout of the instance space • The k parameter – How large neighborhood should we consider? – Determines the complexity of the hypothesis space 23
- 24. Distance Metric 1 • Overlap = count of mismatching features 24 € mΣ Δ(x,z) = δ (xi,zi ) i=1 ⎧ ⎪⎪⎪ 0 ⎨ ⎪⎪⎪ ⎩ x z if numeric else if x = z i i ≠ − i i − = i i i i i i if x z x z 1 , max min δ ( , )
- 25. Distance Metric 2 • MVDM = Modified Value Difference Metric 25 Δ(mΣ x,z) = δ (xi,zi ) i=1 δ KΣ (xi,zi ) = P(Cj | xi ) − P(Cj | zi) j=1
- 26. The k parameter • Tunes the complexity of the hypothesis space – If k = 1, every instance has its own neighborhood – If k = N, all the feature space is one neighborhood 26 k = 1 k = 15 ˆE MΣ = E(h|V) = 1 h xt ( ) ≠ rt ( ) t=1
- 27. A Simple Example Δ(mΣ x,z) = δ (xi,zi ) i=1 27 Training set: 1. (a, b, a, c) à A 2. (a, b, c, a) à B 3. (b, a, c, c) à C 4. (c, a, b, c) à A New instance: 5. (a, b, b, a) Distances (overlap): Δ(1, 5) = 2 Δ(2, 5) = 1 Δ(3, 5) = 4 Δ(4, 5) = 3 k-‐NN classifica:on: 1-‐NN(5) = B 2-‐NN(5) = A/B 3-‐NN(5) = A 4-‐NN(5) = A ⎧ ⎪⎪⎪ 0 ⎨ ⎪⎪⎪ ⎩ x z if numeric else if x = z i i ≠ − i i − = i i i i i i if x z x z 1 , max min δ ( , )
- 28. Further Varia:ons on k-‐NN • Feature weights: – The overlap metric gives all features equal weight – Features can be weighted by IG or GR • Weighted vo:ng: – The normal decision rule gives all neighbors equal weight – Instances can be weighted by (inverse) distance 28
- 29. Proper:es of k-‐NN • Nearest neighbor classifica:on is appropriate when: – Features can be both categorical and numeric – Disjunc:ve descrip:ons may be required – Training data may be noisy (missing values, incorrect labels) – Fast classifica:on is not crucial • Induc0ve bias of k-‐NN: 1. Nearby instances should have the same label (smoothness assump0on) 2. All features are equally important (without feature weights) 3. Complexity tuned by the k parameter 29
- 30. End of Lecture 8 30