
- 1. Machine Learning for Language Technology, Uppsala University, Department of Linguistics and Philology. Structured Prediction, October 2013. Slides borrowed from previous courses; thanks to Ryan McDonald (Google Research) and Prof. Joakim Nivre.
- 2. Outline. Last time: preliminaries (input/output, features, etc.), linear classifiers, the perceptron, large-margin classifiers (SVMs, MIRA), logistic regression. Today: structured prediction with linear classifiers: the structured perceptron, structured large-margin classifiers (SVMs, MIRA), conditional random fields, and a case study on dependency parsing.
- 3. Structured Prediction (i). Sometimes our output space Y does not consist of simple atomic classes. Examples: parsing (for a sentence x, Y is the set of possible parse trees); sequence tagging (for a sentence x, Y is the set of possible tag sequences, e.g., part-of-speech tags or named-entity tags); machine translation (for a source sentence x, Y is the set of possible target-language sentences).
- 4. Hidden Markov Models. A generative model that maximizes the joint likelihood $P(x, y)$. We will look at discriminative versions of this idea, and not just for sequences, though sequences will be the running example.
- 5. Structured Prediction (ii). Can't we just use our multiclass learning algorithms? No: in all these cases the size of the set Y is exponential in the length of the input x, so it is non-trivial to apply our learning algorithms directly.
- 6. Perceptron. Training data: $T = \{(x_t, y_t)\}_{t=1}^{|T|}$.
  1. $w^{(0)} = 0$; $i = 0$
  2. for $n : 1..N$
  3. for $t : 1..|T|$
  4. let $y' = \arg\max_{y} w^{(i)} \cdot f(x_t, y)$ (**)
  5. if $y' \neq y_t$
  6. $w^{(i+1)} = w^{(i)} + f(x_t, y_t) - f(x_t, y')$
  7. $i = i + 1$
  8. return $w^{(i)}$
  (**) Solving the argmax requires a search over an exponential space of outputs!
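The loop on the slide can be sketched directly, assuming a small, explicitly enumerable output set Y and a hypothetical feature function feat(x, y) returning a vector. In the structured setting the argmax on line 4 becomes a search instead of this brute-force max.

```python
# A minimal sketch of the perceptron from the slide, for a small,
# enumerable output space Y. feat(x, y) is a hypothetical feature
# function, not part of the slides.
import numpy as np

def perceptron(train, Y, feat, dim, n_epochs=10):
    """train: list of (x, y_gold) pairs; Y: list of candidate outputs."""
    w = np.zeros(dim)
    for _ in range(n_epochs):
        for x, y_gold in train:
            # slide line 4: argmax_y w . f(x, y)
            y_hat = max(Y, key=lambda y: w @ feat(x, y))
            if y_hat != y_gold:                        # line 5: mistake
                w += feat(x, y_gold) - feat(x, y_hat)  # line 6: update
    return w
```

With factored features, this max over Y is replaced by Viterbi or a similar dynamic program, as the later slides show.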
- 7. Large-Margin Classifiers.
  Batch (SVMs): $\min \frac{1}{2}\|w\|^2$ such that $w \cdot f(x_t, y_t) - w \cdot f(x_t, y') \geq 1$ for all $(x_t, y_t) \in T$ and $y' \in \bar{Y}_t$ (**).
  Online (MIRA): training data $T = \{(x_t, y_t)\}_{t=1}^{|T|}$.
  1. $w^{(0)} = 0$; $i = 0$
  2. for $n : 1..N$
  3. for $t : 1..|T|$
  4. $w^{(i+1)} = \arg\min_{w^*} \|w^* - w^{(i)}\|$ such that $w^* \cdot f(x_t, y_t) - w^* \cdot f(x_t, y') \geq 1$ for all $y' \in \bar{Y}_t$ (**)
  5. $i = i + 1$
  6. return $w^{(i)}$
  (**) There are exponentially many constraints for each input!
- 8. Factor the Feature Representations. We can assume that our feature representations factor relative to the output. Examples: context-free parsing, $f(x, y) = \sum_{A \to BC \in y} f(x, A \to BC)$; sequence analysis with Markov assumptions, $f(x, y) = \sum_{i=1}^{|y|} f(x, y_{i-1}, y_i)$. These kinds of factorizations allow us to run algorithms like CKY and Viterbi to compute the argmax.
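The sequence factorization can be sketched in a few lines: the global feature vector is the sum of local feature vectors over adjacent label pairs. Here feat_local is a hypothetical local extractor, not something defined on the slide.

```python
# Sketch of the first-order Markov factorization from the slide:
# f(x, y) = sum over i of f(x, y[i-1], y[i]).
import numpy as np

def global_features(x, y, feat_local, dim):
    """Sum hypothetical local feature vectors over adjacent label pairs."""
    f = np.zeros(dim)
    for i in range(1, len(y)):
        f += feat_local(x, i, y[i - 1], y[i])
    return f
```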
- 9. Example – Sequence Labeling. Many NLP problems can be cast in this light: part-of-speech tagging, named-entity extraction, semantic role labeling, ... Input: $x = x_0 x_1 \ldots x_n$. Output: $y = y_0 y_1 \ldots y_n$. Each $y_i \in Y_{atom}$, which is small; each $y \in Y = Y_{atom}^n$, which is large. Example: part-of-speech tagging, where $Y_{atom}$ is the set of tags. x = John saw Mary with the telescope; y = noun verb noun preposition article noun.
- 10. Sequence Labeling – Output Interaction. x = John saw Mary with the telescope; y = noun verb noun preposition article noun. Why not just break the sequence up into a set of multiclass predictions? Because there are interactions between neighbouring tags. What tag does "saw" have? What if I told you the previous tag was article? What if it was noun?
- 11. Sequence Labeling – Markov Factorization. x = John saw Mary with the telescope; y = noun verb noun preposition article noun. Markov factorization: factor by adjacent labels. First-order (like HMMs): $f(x, y) = \sum_{i=1}^{|y|} f(x, y_{i-1}, y_i)$. $k$th-order: $f(x, y) = \sum_{i=k}^{|y|} f(x, y_{i-k}, \ldots, y_{i-1}, y_i)$.
- 12. Sequence Labeling – Features. x = John saw Mary with the telescope; y = noun verb noun preposition article noun. First-order: $f(x, y) = \sum_{i=1}^{|y|} f(x, y_{i-1}, y_i)$, where $f(x, y_{i-1}, y_i)$ is any feature of the input and two adjacent labels. For example:
  $f_j(x, y_{i-1}, y_i) = 1$ if $x_i$ = "saw" and $y_{i-1}$ = noun and $y_i$ = verb; 0 otherwise.
  $f_{j'}(x, y_{i-1}, y_i) = 1$ if $x_i$ = "saw" and $y_{i-1}$ = article and $y_i$ = verb; 0 otherwise.
  $w_j$ should get high weight and $w_{j'}$ should get low weight.
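The two indicator features on this slide can be written out directly; the function names f_good and f_bad are illustrative, not from the slides.

```python
# The two indicator features from the slide, for a word list x and
# string-valued tags. Each returns 1 or 0.
def f_good(x, i, y_prev, y_cur):
    # Fires on "saw" tagged verb after a noun; should get high weight.
    return 1 if x[i] == "saw" and y_prev == "noun" and y_cur == "verb" else 0

def f_bad(x, i, y_prev, y_cur):
    # Fires on "saw" tagged verb after an article; should get low weight.
    return 1 if x[i] == "saw" and y_prev == "article" and y_cur == "verb" else 0
```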
- 13. Sequence Labeling – Inference. How does factorization affect inference?
  $y' = \arg\max_y w \cdot f(x, y) = \arg\max_y w \cdot \sum_{i=1}^{|y|} f(x, y_{i-1}, y_i) = \arg\max_y \sum_{i=1}^{|y|} w \cdot f(x, y_{i-1}, y_i) = \arg\max_y \sum_{i=1}^{|y|} \sum_{j=1}^{m} w_j \cdot f_j(x, y_{i-1}, y_i)$
  We can use the Viterbi algorithm.
- 14. Sequence Labeling – Viterbi Algorithm. Let $\alpha_{y,i}$ be the score of the best labeling of the sequence $x_0 x_1 \ldots x_i$ in which $y_i = y$. If we know $\alpha$, then $\max_y \alpha_{y,n}$ is the score of the best labeling of the whole sequence. $\alpha_{y,i}$ can be calculated with the recursion: $\alpha_{y,0} = 0.0$ for all $y \in Y_{atom}$; $\alpha_{y,i} = \max_{y^*} \alpha_{y^*,i-1} + w \cdot f(x, y^*, y)$.
- 15. Sequence Labeling – Back-Pointers. But that only tells us what the best score is. Let $\beta_{y,i}$ be the $(i{-}1)$th label in the best labeling of the sequence $x_0 x_1 \ldots x_i$ with $y_i = y$. $\beta_{y,i}$ can be calculated with the recursion: $\beta_{y,0} = \text{nil}$ for all $y \in Y_{atom}$; $\beta_{y,i} = \arg\max_{y^*} \alpha_{y^*,i-1} + w \cdot f(x, y^*, y)$. Thus: the last label in the best sequence is $y_n = \arg\max_y \alpha_{y,n}$; the second-to-last label is $y_{n-1} = \beta_{y_n,n}$; ...; the first label is $y_0 = \beta_{y_1,1}$.
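The two recursions (scores plus back-pointers) fit in a short function. Here score(x, i, y_prev, y) stands in for $w \cdot f(x, y_{i-1}, y_i)$ and tags for $Y_{atom}$; both names are assumptions of this sketch.

```python
# Viterbi with back-pointers, following the two slides above.
def viterbi(x, tags, score):
    n = len(x)
    # alpha[i][y]: best score of a labeling of x[0..i] ending in tag y
    alpha = [{y: 0.0 for y in tags}]
    back = [{y: None for y in tags}]
    for i in range(1, n):
        alpha.append({})
        back.append({})
        for y in tags:
            best_prev = max(tags,
                            key=lambda yp: alpha[i - 1][yp] + score(x, i, yp, y))
            alpha[i][y] = alpha[i - 1][best_prev] + score(x, i, best_prev, y)
            back[i][y] = best_prev
    # follow back-pointers from the best final tag
    y_last = max(tags, key=lambda y: alpha[n - 1][y])
    path = [y_last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path)), alpha[n - 1][y_last]
```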
- 16. Structured Learning. We know we can solve the inference problem, at least for sequence labeling, but also for many other problems where one can factor the features appropriately (context-free parsing, dependency parsing, semantic role labeling, ...). How does this change learning for the perceptron algorithm? For SVMs? For logistic regression?
- 17. Structured Perceptron. Exactly like the original perceptron, except that the argmax now uses factored features, which we can solve with algorithms like Viterbi. All of the original analysis carries over!
  1. $w^{(0)} = 0$; $i = 0$
  2. for $n : 1..N$
  3. for $t : 1..|T|$
  4. let $y' = \arg\max_{y} w^{(i)} \cdot f(x_t, y)$ (**)
  5. if $y' \neq y_t$
  6. $w^{(i+1)} = w^{(i)} + f(x_t, y_t) - f(x_t, y')$
  7. $i = i + 1$
  8. return $w^{(i)}$
  (**) Solve the argmax with Viterbi for sequence problems!
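A minimal sketch of the training loop, assuming a decode(x, w) function that solves the argmax (e.g. Viterbi for sequences) and a global feature function gf(x, y); both names are illustrative.

```python
# Structured perceptron: the perceptron loop from the slide, with the
# argmax delegated to a caller-supplied decoder.
import numpy as np

def structured_perceptron(train, decode, gf, dim, n_epochs=5):
    w = np.zeros(dim)
    for _ in range(n_epochs):
        for x, y_gold in train:
            y_hat = decode(x, w)            # (**) Viterbi for sequences
            if y_hat != y_gold:
                w += gf(x, y_gold) - gf(x, y_hat)
    return w
```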
- 18. Online Structured SVMs (or Online MIRA).
  1. $w^{(0)} = 0$; $i = 0$
  2. for $n : 1..N$
  3. for $t : 1..|T|$
  4. $w^{(i+1)} = \arg\min_{w^*} \|w^* - w^{(i)}\|$ such that $w^* \cdot f(x_t, y_t) - w^* \cdot f(x_t, y') \geq L(y_t, y')$ for all $y' \in \bar{Y}_t$ with $y' \in$ k-best$(x_t, w^{(i)})$
  5. $i = i + 1$
  6. return $w^{(i)}$
  k-best$(x_t, w^{(i)})$ is the set of $k$ outputs with the highest scores under $w^{(i)}$. A simple solution: only consider the single highest-scoring output $y' \in \bar{Y}_t$. Note: the old fixed margin of 1 is now a loss $L(y_t, y')$ between two structured outputs.
- 19. Structured SVMs. $\min \frac{1}{2}\|w\|^2$ such that $w \cdot f(x_t, y_t) - w \cdot f(x_t, y') \geq L(y_t, y')$ for all $(x_t, y_t) \in T$ and $y' \in \bar{Y}_t$. We still have an exponential number of constraints, but feature factorizations permit solutions (max-margin Markov networks, structured SVMs). Note: the old fixed margin of 1 is now a loss $L(y_t, y')$ between two structured outputs.
- 20. Conditional Random Fields (i). What about structured logistic regression? Such a thing exists: conditional random fields (CRFs). Consider again the sequential case with a first-order factorization. Inference is identical to the structured perceptron: use Viterbi.
  $\arg\max_y P(y|x) = \arg\max_y \frac{e^{w \cdot f(x,y)}}{Z_x} = \arg\max_y e^{w \cdot f(x,y)} = \arg\max_y w \cdot f(x, y) = \arg\max_y \sum_{i=1}^{|y|} w \cdot f(x, y_{i-1}, y_i)$
- 21. Conditional Random Fields (ii). However, learning does change. Reminder: pick $w$ to maximize the log-likelihood of the training data: $w = \arg\max_w \sum_t \log P(y_t|x_t)$. Take the gradient and use gradient ascent:
  $\frac{\partial}{\partial w_i} F(w) = \sum_t f_i(x_t, y_t) - \sum_t \sum_{y' \in Y} P(y'|x_t) f_i(x_t, y')$
  and the gradient is $\nabla F(w) = (\frac{\partial}{\partial w_0} F(w), \frac{\partial}{\partial w_1} F(w), \ldots, \frac{\partial}{\partial w_m} F(w))$.
- 22. Conditional Random Fields (iii). Problem: a sum over the output space $Y$:
  $\frac{\partial}{\partial w_i} F(w) = \sum_t f_i(x_t, y_t) - \sum_t \sum_{y' \in Y} P(y'|x_t) f_i(x_t, y') = \sum_t \sum_j f_i(x_t, y_{t,j-1}, y_{t,j}) - \sum_t \sum_{y' \in Y} \sum_j P(y'|x_t) f_i(x_t, y'_{j-1}, y'_j)$
  We can easily calculate the first term from empirical counts. What about the second term?
- 23. Conditional Random Fields (iv). Problem: the sum over the output space $Y$: $\sum_t \sum_{y' \in Y} \sum_j P(y'|x_t) f_i(x_t, y'_{j-1}, y'_j)$. We need to show we can compute $\sum_{y' \in Y} \sum_j P(y'|x_t) f_i(x_t, y'_{j-1}, y'_j)$ for an arbitrary $x_t$. Solution: the forward-backward algorithm.
- 24. Forward Algorithm (i). Let $\alpha^m_u$ be the forward scores, and let $|x_t| = n$. $\alpha^m_u$ is the sum over all labelings of $x_0 \ldots x_m$ such that $y_m = u$:
  $\alpha^m_u = \sum_{|y'|=m,\ y'_m=u} e^{w \cdot f(x_t, y')} = \sum_{|y'|=m,\ y'_m=u} e^{\sum_j w \cdot f(x_t, y'_{j-1}, y'_j)}$
  i.e., the sum over all labelings of length $m$ ending at position $m$ with label $u$. Note then that $Z_{x_t} = \sum_{y'} e^{w \cdot f(x_t, y')} = \sum_u \alpha^n_u$.
- 25. Forward Algorithm (ii). We can fill in $\alpha$ as follows: $\alpha^0_u = 1.0$ for all $u$; $\alpha^m_u = \sum_v \alpha^{m-1}_v \times e^{w \cdot f(x_t, v, u)}$.
- 26. Backward Algorithm. Let $\beta^m_u$ be the symmetric backward scores, i.e., the sum over all labelings of $x_m \ldots x_n$ such that $y_m = u$. We can fill in $\beta$ as follows: $\beta^n_u = 1.0$ for all $u$; $\beta^m_u = \sum_v \beta^{m+1}_v \times e^{w \cdot f(x_t, u, v)}$. Note: $\beta$ is overloaded here and is different from the back-pointers.
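The two recursions can be sketched together. This sketch works in probability space for readability; real implementations work in log space to avoid underflow. psi(x, i, u, v) stands in for $e^{w \cdot f(x_t, u, v)}$ and, like tags, is an assumption of the sketch.

```python
# Forward and backward recursions from the two slides above, plus the
# partition function Z = sum_u alpha^n_u.
def forward_backward(x, tags, psi):
    n = len(x)
    alpha = [{u: 1.0 for u in tags}]                  # alpha^0_u = 1.0
    for m in range(1, n):
        alpha.append({u: sum(alpha[m - 1][v] * psi(x, m, v, u) for v in tags)
                      for u in tags})
    beta = [None] * n
    beta[n - 1] = {u: 1.0 for u in tags}              # beta^n_u = 1.0
    for m in range(n - 2, -1, -1):
        beta[m] = {u: sum(beta[m + 1][v] * psi(x, m + 1, u, v) for v in tags)
                   for u in tags}
    Z = sum(alpha[n - 1][u] for u in tags)            # partition function
    return alpha, beta, Z
```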
- 27. Conditional Random Fields – Final. Let's show we can compute $\sum_{y' \in Y} \sum_j P(y'|x_t) f_i(x_t, y'_{j-1}, y'_j)$ for an arbitrary $x_t$. Summing over positions and over the label pair at each position, we can re-write it as:
  $\sum_j \sum_{u,v} \frac{\alpha^{j-1}_u \times e^{w \cdot f(x_t, u, v)} \times \beta^j_v}{Z_{x_t}} f_i(x_t, u, v)$
  Forward-backward can calculate the partial derivatives efficiently.
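The quantity inside the double sum is an edge marginal, $P(y_{j-1}=u, y_j=v \mid x_t)$. A sketch, given alpha, beta, and Z from a forward-backward pass, with psi again standing in for $e^{w \cdot f(x_t, u, v)}$ (illustrative names):

```python
# Edge marginal used in the CRF gradient:
# P(y_{j-1}=u, y_j=v | x) = alpha^{j-1}_u * psi * beta^j_v / Z.
def edge_marginal(alpha, beta, Z, psi, x, j, u, v):
    return alpha[j - 1][u] * psi(x, j, u, v) * beta[j][v] / Z
```

Summed over all label pairs (u, v) at a fixed position j, these marginals add up to 1, which is a handy sanity check in practice.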
- 28. Conditional Random Fields Summary. Inference: Viterbi. Learning: use the forward-backward algorithm. What about non-sequential problems? Context-free parsing can use the inside-outside algorithm; general problems use message passing and belief propagation.
- 29. Case Study: Dependency Parsing. Given an input sentence x, predict the syntactic dependencies y.
- 30. Model 1: Arc-Factored Graph-Based Parsing.
  $y' = \arg\max_y w \cdot f(x, y) = \arg\max_y \sum_{(i,j) \in y} w \cdot f(i, j)$
  where $(i, j) \in y$ means $x_i \to x_j$, i.e., a dependency from $x_i$ to $x_j$. Solving the argmax: $w \cdot f(i, j)$ is the weight of an arc, and a dependency tree is a spanning tree of a dense graph over $x$, so we can use maximum-spanning-tree algorithms for inference.
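Real parsers solve this argmax with a maximum-spanning-tree algorithm such as Chu-Liu/Edmonds. To keep the sketch short, the following brute force just enumerates head assignments for a tiny sentence and checks acyclicity; arc_score(i, j) stands in for $w \cdot f(i, j)$, and the single-root constraint is ignored for brevity.

```python
# Brute-force arc-factored decoding for a tiny sentence: the tree score
# is the sum of its arc scores, as on the slide.
from itertools import product

def best_tree(n, arc_score):
    """n words (1..n), 0 is the root; returns (heads, score) where
    heads[j-1] is the head of word j."""
    best, best_s = None, float("-inf")
    for heads in product(range(n + 1), repeat=n):
        if any(h == j + 1 for j, h in enumerate(heads)):
            continue                                  # no self-loops
        # reject cycles: following heads from each word must reach root 0
        ok = True
        for j in range(1, n + 1):
            seen, h = set(), j
            while h != 0:
                if h in seen:
                    ok = False
                    break
                seen.add(h)
                h = heads[h - 1]
            if not ok:
                break
        if not ok:
            continue
        s = sum(arc_score(h, j + 1) for j, h in enumerate(heads))
        if s > best_s:
            best, best_s = heads, s
    return best, best_s
```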
- 31. Defining f(i, j). It can contain any feature over the arc or the input sentence. Some example features: the identities of $x_i$ and $x_j$; their part-of-speech tags; the part-of-speech tags of surrounding words; the distance between $x_i$ and $x_j$; ...
- 32. Empirical Results. Spanning-tree dependency parsing results (McDonald 2006), trained using MIRA (online SVMs):
    Language   Accuracy   Complete
    English    90.7       36.7
    Czech      84.1       32.2
    Chinese    79.7       27.2
  A simple structured linear classifier gives near state-of-the-art performance for many languages; higher-order models give higher accuracy.
- 33. Model 2: Transition-Based Parsing.
  $y' = \arg\max_y w \cdot f(x, y) = \arg\max_y \sum_{t(s) \in T(y)} w \cdot f(s, t)$
  where $t(s) \in T(y)$ means that the derivation of y includes applying transition t in state s. Solving the argmax: $w \cdot f(s, t)$ is the score of transition t in state s; use beam search to find the best derivation from the start state $s_0$.
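Beam search over derivations can be sketched generically. The transitions, step, score, and is_final functions are all assumptions of this sketch; a real parser would instantiate them with a transition system such as arc-eager or arc-standard.

```python
# Generic beam search over transition sequences: keep only the
# beam_size highest-scoring partial derivations at each step.
def beam_search(s0, transitions, step, score, is_final, beam_size=4):
    beam = [(0.0, s0)]
    while not all(is_final(s) for _, s in beam):
        cands = []
        for sc, s in beam:
            if is_final(s):
                cands.append((sc, s))     # completed derivations survive
                continue
            for t in transitions(s):
                cands.append((sc + score(s, t), step(s, t)))
        beam = sorted(cands, key=lambda p: p[0], reverse=True)[:beam_size]
    return max(beam, key=lambda p: p[0])
```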
- 34. Defining f(s, t). It can contain any feature over parser states. Some example features: the identities of words in s (e.g., the top of the stack, the head of the queue); their part-of-speech tags; their heads and dependents (and their part-of-speech tags); the number of dependents of words in s; ...
- 35. Empirical Results. Transition-based dependency parsing with beam search (**) (Zhang and Nivre 2011), trained using the perceptron:
    Language   Accuracy   Complete
    English    92.9       48.0
    Chinese    86.0       36.9
  A simple structured linear classifier gives state-of-the-art performance with rich non-local features.
  (**) Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes within a limited set. It is an optimization of best-first search that reduces memory requirements, but it only finds an approximate solution.
- 36. Structured Prediction Summary. We can't use multiclass algorithms directly: the search space is too large. Solution: factor the representations, which allows efficient inference and learning. We showed this for sequence learning (Viterbi + forward-backward), but it also holds for other structures: CFG parsing (CKY + inside-outside), dependency parsing (spanning-tree algorithms or beam search), and general graphs (message passing and belief propagation).
