Lecture 5: Structured Prediction

Structured prediction or structured learning refers to supervised machine learning techniques that involve predicting structured objects, rather than single labels or real values. For example, the problem of translating a natural language sentence into a syntactic representation such as a parse tree can be seen as a structured prediction problem in which the structured output domain is the set of all possible parse trees.

  1. Machine Learning for Language Technology
     Uppsala University, Department of Linguistics and Philology
     Structured Prediction
     October 2013
     Slides borrowed from previous courses. Thanks to Ryan McDonald (Google Research) and Prof. Joakim Nivre.
  2. Outline
     Last time:
     - Preliminaries: input/output, features, etc.
     - Linear classifiers
     - Perceptron
     - Large-margin classifiers (SVMs, MIRA)
     - Logistic regression
     Today:
     - Structured prediction with linear classifiers
     - Structured perceptron
     - Structured large-margin classifiers (SVMs, MIRA)
     - Conditional random fields
     - Case study: dependency parsing
  3. Structured Prediction (i)
     Sometimes our output space Y does not consist of simple atomic classes.
     Examples:
     - Parsing: for a sentence x, Y is the set of possible parse trees
     - Sequence tagging: for a sentence x, Y is the set of possible tag sequences, e.g., part-of-speech tags or named-entity tags
     - Machine translation: for a source sentence x, Y is the set of possible target-language sentences
  4. Hidden Markov Models
     - Generative model: maximizes the likelihood P(x, y)
     - We are looking at discriminative versions of this
     - Not just sequences, though that will be the running example
  5. Structured Prediction (ii)
     - Can't we just use our multiclass learning algorithms?
     - In all these cases, the size of the set Y is exponential in the length of the input x
     - It is non-trivial to apply our learning algorithms in such cases
  6. Perceptron
     Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
     1. w^(0) = 0; i = 0
     2. for n : 1..N
     3.   for t : 1..T
     4.     let y' = argmax_y w^(i) · f(x_t, y)   (**)
     5.     if y' ≠ y_t
     6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')
     7.       i = i + 1
     8. return w^(i)
     (**) Solving the argmax requires a search over an exponential space of outputs!
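A minimal Python sketch of this loop, assuming a toy feature function and a small explicit label set (both illustrative, not from the slides); the brute-force argmax over Y is exactly what becomes infeasible in the structured case.

```python
# Minimal multiclass perceptron sketch (illustrative; feats() and the
# candidate label set Y are assumptions for the example).
from collections import defaultdict

def feats(x, y):
    """Sparse feature vector for input x paired with label y."""
    return {(y, tok): 1.0 for tok in x}          # toy bag-of-words features

def score(w, fv):
    return sum(w[k] * v for k, v in fv.items())

def train_perceptron(T, Y, epochs=5):
    w = defaultdict(float)
    for _ in range(epochs):                      # 2. for n : 1..N
        for x_t, y_t in T:                       # 3. for t : 1..T
            y_hat = max(Y, key=lambda y: score(w, feats(x_t, y)))  # 4. argmax
            if y_hat != y_t:                     # 5. update on mistakes only
                for k, v in feats(x_t, y_t).items():
                    w[k] += v                    # 6. + f(x_t, y_t)
                for k, v in feats(x_t, y_hat).items():
                    w[k] -= v                    #    - f(x_t, y_hat)
    return w
```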
  7. Large-Margin Classifiers
     Batch (SVMs):
       min (1/2)||w||^2
       such that: w · f(x_t, y_t) − w · f(x_t, y') ≥ 1  ∀(x_t, y_t) ∈ T and y' ∈ Ȳ_t   (**)
     Online (MIRA):
       Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
       1. w^(0) = 0; i = 0
       2. for n : 1..N
       3.   for t : 1..T
       4.     w^(i+1) = argmin_{w*} ||w* − w^(i)||
                such that: w* · f(x_t, y_t) − w* · f(x_t, y') ≥ 1  ∀y' ∈ Ȳ_t   (**)
       5.     i = i + 1
       6. return w^(i)
     (**) There is an exponential number of constraints in the size of each input!
  8. Factor the Feature Representations
     We can assume that our feature representations factor relative to the output.
     Examples:
     - Context-free parsing: f(x, y) = Σ_{A→BC ∈ y} f(x, A → BC)
     - Sequence labeling (Markov assumption): f(x, y) = Σ_{i=1}^{|y|} f(x, y_{i−1}, y_i)
     These factorizations allow us to run algorithms like CKY and Viterbi to compute the argmax.
  9. Example – Sequence Labeling
     Many NLP problems can be cast in this light:
     - Part-of-speech tagging
     - Named-entity extraction
     - Semantic role labeling
     - ...
     Input: x = x_0 x_1 . . . x_n
     Output: y = y_0 y_1 . . . y_n
     Each y_i ∈ Y_atom, which is small
     Each y ∈ Y = Y_atom^n, which is large
     Example: part-of-speech tagging, where Y_atom is the set of tags
       x = John saw Mary with the telescope
       y = noun verb noun preposition article noun
  10. Sequence Labeling – Output Interaction
      x = John saw Mary with the telescope
      y = noun verb noun preposition article noun
      Why not just break the sequence up into a set of multiclass predictions?
      Because there are interactions between neighbouring tags.
      What tag does "saw" have? What if I told you the previous tag was article? What if it was noun?
  11. Sequence Labeling – Markov Factorization
      x = John saw Mary with the telescope
      y = noun verb noun preposition article noun
      Markov factorization: factor by adjacent labels
      First-order (like HMMs): f(x, y) = Σ_{i=1}^{|y|} f(x, y_{i−1}, y_i)
      kth-order: f(x, y) = Σ_{i=k}^{|y|} f(x, y_{i−k}, . . . , y_{i−1}, y_i)
  12. Sequence Labeling – Features
      x = John saw Mary with the telescope
      y = noun verb noun preposition article noun
      First-order: f(x, y) = Σ_{i=1}^{|y|} f(x, y_{i−1}, y_i)
      f(x, y_{i−1}, y_i) is any feature of the input and two adjacent labels, e.g.:
        f_j(x, y_{i−1}, y_i)  = 1 if x_i = "saw" and y_{i−1} = noun and y_i = verb; 0 otherwise
        f_j'(x, y_{i−1}, y_i) = 1 if x_i = "saw" and y_{i−1} = article and y_i = verb; 0 otherwise
      w_j should get high weight and w_j' should get low weight
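A minimal sketch of such indicator features in Python; the templates and the helper names (`local_feats`, `global_feats`) are assumptions introduced here and reused in the sketches below.

```python
# Illustrative first-order feature extraction: f(x, y_{i-1}, y_i) as a sparse
# dict of indicator features. The templates are assumptions for the example,
# not the exact templates from the slides.
def local_feats(x, i, prev_tag, tag):
    word = x[i]
    return {
        ("word+tag", word, tag): 1.0,             # fires e.g. for ("saw", "verb")
        ("tag bigram", prev_tag, tag): 1.0,       # adjacent-label interaction
        ("word+tag bigram", word, prev_tag, tag): 1.0,
    }

def global_feats(x, y):
    """f(x, y) = sum over i >= 1 of f(x, y_{i-1}, y_i), as on the slide."""
    fv = {}
    for i in range(1, len(y)):
        for k, v in local_feats(x, i, y[i - 1], y[i]).items():
            fv[k] = fv.get(k, 0.0) + v
    return fv
```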
  13. Sequence Labeling – Inference
      How does factorization affect inference?
        y* = argmax_y w · f(x, y)
           = argmax_y w · Σ_{i=1}^{|y|} f(x, y_{i−1}, y_i)
           = argmax_y Σ_{i=1}^{|y|} w · f(x, y_{i−1}, y_i)
           = argmax_y Σ_{i=1}^{|y|} Σ_{j=1}^{m} w_j · f_j(x, y_{i−1}, y_i)
      We can use the Viterbi algorithm.
  14. Sequence Labeling – Viterbi Algorithm
      Let α_{y,i} be the score of the best labeling
        of the sequence x_0 x_1 . . . x_i
        such that y_i = y
      If we know α, then max_y α_{y,n} is the score of the best labeling of the whole sequence.
      α_{y,i} can be calculated with the following recursion:
        α_{y,0} = 0.0  ∀y ∈ Y_atom
        α_{y,i} = max_{y*} α_{y*,i−1} + w · f(x, y*, y)
  15. Sequence Labeling – Back-Pointers
      But that only tells us what the best score is.
      Let β_{y,i} be the (i−1)th label in the best labeling
        of the sequence x_0 x_1 . . . x_i
        such that y_i = y
      β_{y,i} can be calculated with the following recursion:
        β_{y,0} = nil  ∀y ∈ Y_atom
        β_{y,i} = argmax_{y*} α_{y*,i−1} + w · f(x, y*, y)
      Thus:
        The last label in the best sequence is y_n = argmax_y α_{y,n}
        The second-to-last label is y_{n−1} = β_{y_n,n}
        ...
        The first label is y_0 = β_{y_1,1}
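A compact sketch of this recursion with back-pointers; `score(x, i, prev, cur)` stands in for w · f(x, y_{i−1}, y_i) and is an assumed callable (e.g., a dot product of a weight dict with the `local_feats` sketch above).

```python
# Viterbi with back-pointers for first-order sequence labeling, following the
# recursions on the slides (alpha[0][y] = 0.0, back[0][y] = None).
def viterbi(x, labels, score):
    n = len(x) - 1                               # positions 0..n, as on the slide
    alpha = [{y: 0.0 for y in labels}]           # alpha_{y,0} = 0.0 for all y
    back = [{y: None for y in labels}]           # beta_{y,0} = nil
    for i in range(1, n + 1):
        alpha.append({})
        back.append({})
        for y in labels:
            best_prev = max(labels, key=lambda p: alpha[i - 1][p] + score(x, i, p, y))
            back[i][y] = best_prev
            alpha[i][y] = alpha[i - 1][best_prev] + score(x, i, best_prev, y)
    # Decode: last label maximizes alpha[n]; then follow back-pointers.
    y_n = max(labels, key=lambda y: alpha[n][y])
    seq = [y_n]
    for i in range(n, 0, -1):
        seq.append(back[i][seq[-1]])
    seq.reverse()
    return seq, alpha[n][y_n]
```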
  16. Structured Learning
      We know we can solve the inference problem,
      - at least for sequence labeling,
      - but also for many other problems where one can factor features appropriately (context-free parsing, dependency parsing, semantic role labeling, ...)
      How does this change learning?
      - for the perceptron algorithm?
      - for SVMs?
      - for logistic regression?
  17. Structured Perceptron
      Exactly like the original perceptron,
      except that the argmax now uses factored features,
      which we can solve with algorithms like Viterbi.
      All of the original analysis carries over!
      1. w^(0) = 0; i = 0
      2. for n : 1..N
      3.   for t : 1..T
      4.     let y' = argmax_y w^(i) · f(x_t, y)   (**)
      5.     if y' ≠ y_t
      6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')
      7.       i = i + 1
      8. return w^(i)
      (**) Solve the argmax with Viterbi for sequence problems!
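Putting the pieces together, a minimal structured-perceptron training loop that plugs the Viterbi sketch into the perceptron update; it reuses the illustrative `local_feats`, `global_feats`, and `viterbi` helpers sketched above.

```python
# Structured perceptron sketch: same update as the plain perceptron, but the
# argmax is computed by Viterbi over factored features.
from collections import defaultdict

def train_structured_perceptron(T, labels, epochs=5):
    w = defaultdict(float)

    def score(x, i, prev, cur):                  # w · f(x, y_{i-1}, y_i)
        return sum(w[k] * v for k, v in local_feats(x, i, prev, cur).items())

    for _ in range(epochs):
        for x_t, y_t in T:                       # y_t is the gold tag sequence
            y_hat, _ = viterbi(x_t, labels, score)
            if y_hat != list(y_t):               # mistake-driven update
                for k, v in global_feats(x_t, y_t).items():
                    w[k] += v
                for k, v in global_feats(x_t, y_hat).items():
                    w[k] -= v
    return w
```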
  18. Online Structured SVMs (or Online MIRA)
      1. w^(0) = 0; i = 0
      2. for n : 1..N
      3.   for t : 1..T
      4.     w^(i+1) = argmin_{w*} ||w* − w^(i)||
               such that: w* · f(x_t, y_t) − w* · f(x_t, y') ≥ L(y_t, y')
               ∀y' ∈ Ȳ_t with y' ∈ k-best(x_t, w^(i))
      5.     i = i + 1
      6. return w^(i)
      k-best(x_t, w^(i)) is the set of k outputs with the highest scores under w^(i)
      Simple solution: only consider the single highest-scoring output y' ∈ Ȳ_t
      Note: the old fixed margin of 1 is now a loss L(y_t, y') between two structured outputs
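For the "simple solution" with a single constraint, the quadratic program has a closed-form solution. A hedged sketch, assuming the `global_feats` helper from above and a Hamming loss for L(y_t, y'):

```python
# 1-best online MIRA sketch: enforce the margin only against the current
# highest-scoring output. With one constraint the QP has a closed form:
#   tau = max(0, (L(y_t, y') - w·df) / ||df||^2),  w <- w + tau * df
def hamming_loss(y_gold, y_pred):
    return sum(a != b for a, b in zip(y_gold, y_pred))

def mira_update(w, x_t, y_t, y_hat):
    if list(y_hat) == list(y_t):
        return                                    # constraint already satisfied
    df = dict(global_feats(x_t, y_t))             # df = f(x_t, y_t) - f(x_t, y_hat)
    for k, v in global_feats(x_t, y_hat).items():
        df[k] = df.get(k, 0.0) - v
    margin = sum(w.get(k, 0.0) * v for k, v in df.items())
    norm_sq = sum(v * v for v in df.values())
    if norm_sq == 0.0:
        return
    tau = max(0.0, (hamming_loss(y_t, y_hat) - margin) / norm_sq)
    for k, v in df.items():                       # smallest change to w that
        w[k] = w.get(k, 0.0) + tau * v            # satisfies the loss-scaled margin
```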
  19. Structured SVMs
      min (1/2)||w||^2
      such that: w · f(x_t, y_t) − w · f(x_t, y') ≥ L(y_t, y')  ∀(x_t, y_t) ∈ T and y' ∈ Ȳ_t
      We still have an exponential number of constraints.
      Feature factorizations permit solutions (max-margin Markov networks, structured SVMs).
      Note: the old fixed margin of 1 is now a loss L(y_t, y') between two structured outputs.
  20. Conditional Random Fields (i)
      What about structured logistic regression?
      Such a thing exists: Conditional Random Fields (CRFs).
      Consider again the sequential case with first-order factorization.
      Inference is identical to the structured perceptron: use Viterbi.
        argmax_y P(y|x) = argmax_y e^{w·f(x,y)} / Z_x
                        = argmax_y e^{w·f(x,y)}
                        = argmax_y w · f(x, y)
                        = argmax_y Σ_{i=1}^{|y|} w · f(x, y_{i−1}, y_i)
  21. Conditional Random Fields (ii)
      However, learning does change.
      Reminder: pick w to maximize the log-likelihood of the training data:
        w = argmax_w Σ_t log P(y_t|x_t)
      Take the gradient and use gradient ascent:
        ∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y'∈Y} P(y'|x_t) f_i(x_t, y')
      And the gradient is:
        ∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, . . . , ∂F(w)/∂w_m)
  22. Conditional Random Fields (iii)
      Problem: sum over the output space Y
        ∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y'∈Y} P(y'|x_t) f_i(x_t, y')
                   = Σ_t Σ_j f_i(x_t, y_{t,j−1}, y_{t,j}) − Σ_t Σ_{y'∈Y} Σ_j P(y'|x_t) f_i(x_t, y'_{j−1}, y'_j)
      The first term is easy to calculate – just empirical counts.
      What about the second term?
  23. Conditional Random Fields (iv)
      Problem: sum over the output space Y
        Σ_t Σ_{y'∈Y} Σ_j P(y'|x_t) f_i(x_t, y'_{j−1}, y'_j)
      We need to show that we can compute this for an arbitrary x_t:
        Σ_{y'∈Y} Σ_j P(y'|x_t) f_i(x_t, y'_{j−1}, y'_j)
      Solution: the forward–backward algorithm
  24. Forward Algorithm (i)
      Let α^m_u be the forward scores, and let |x_t| = n.
      α^m_u is the sum over all labelings of x_0 . . . x_m such that y_m = u:
        α^m_u = Σ_{|y'|=m, y'_m=u} e^{w·f(x_t, y')} = Σ_{|y'|=m, y'_m=u} e^{Σ_j w·f(x_t, y'_{j−1}, y'_j)}
      i.e., the sum over all labelings of length m, ending at position m with label u.
      Note then that Z_{x_t} = Σ_{y'} e^{w·f(x_t, y')} = Σ_u α^n_u
  25. Forward Algorithm (ii)
      We can fill in α as follows:
        α^0_u = 1.0  ∀u
        α^m_u = Σ_v α^{m−1}_v × e^{w·f(x_t, v, u)}
  26. Backward Algorithm
      Let β^m_u be the symmetric backward scores,
      i.e., the sum over all labelings of x_m . . . x_n such that y_m = u.
      We can fill in β as follows:
        β^n_u = 1.0  ∀u
        β^m_u = Σ_v β^{m+1}_v × e^{w·f(x_t, u, v)}
      Note: β is overloaded here – this is different from the back-pointers above.
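A minimal sketch of these recursions, reusing the assumed `score(x, m, prev, cur)` = w · f(x, y_{m−1}, y_m) callable from the Viterbi sketch; a real implementation would work in log space (log-sum-exp) to avoid overflow.

```python
import math

# Forward-backward sketch for a first-order chain, mirroring the slide
# recursions. Returns alpha, beta, the partition function Z, and the edge
# marginals P(y_{m-1}=u, y_m=v | x) needed for the CRF gradient.
def forward_backward(x, labels, score):
    n = len(x) - 1                                        # positions 0..n
    alpha = [{u: 1.0 for u in labels}]                    # alpha^0_u = 1.0
    for m in range(1, n + 1):
        alpha.append({u: sum(alpha[m - 1][v] * math.exp(score(x, m, v, u))
                             for v in labels) for u in labels})
    beta = [None] * (n + 1)
    beta[n] = {u: 1.0 for u in labels}                    # beta^n_u = 1.0
    for m in range(n - 1, -1, -1):
        beta[m] = {u: sum(beta[m + 1][v] * math.exp(score(x, m + 1, u, v))
                          for v in labels) for u in labels}
    Z = sum(alpha[n][u] for u in labels)                  # partition function

    def edge_marginal(m, u, v):
        # alpha^{m-1}_u * e^{w·f(x, u, v)} * beta^m_v / Z, as on the next slide
        return alpha[m - 1][u] * math.exp(score(x, m, u, v)) * beta[m][v] / Z

    return alpha, beta, Z, edge_marginal
```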
  27. Conditional Random Fields – Final
      Let's show we can compute it for an arbitrary x_t:
        Σ_{y'∈Y} Σ_j P(y'|x_t) f_i(x_t, y'_{j−1}, y'_j)
      We can rewrite this as a sum over positions and adjacent label pairs:
        Σ_j Σ_{y_{j−1}, y_j} [α^{j−1}_{y_{j−1}} × e^{w·f(x_t, y_{j−1}, y_j)} × β^j_{y_j} / Z_{x_t}] · f_i(x_t, y_{j−1}, y_j)
      Forward–backward can therefore calculate the partial derivatives efficiently.
  28. Conditional Random Fields – Summary
      Inference: Viterbi
      Learning: use the forward–backward algorithm
      What about non-sequential problems?
      - Context-free parsing: can use the inside–outside algorithm
      - General problems: message passing and belief propagation
  29. Case Study: Dependency Parsing
      Given an input sentence x, predict the syntactic dependencies y.
  30. Model 1: Arc-Factored Graph-Based Parsing
      y* = argmax_y w · f(x, y) = argmax_y Σ_{(i,j)∈y} w · f(i, j)
      (i, j) ∈ y means x_i → x_j, i.e., a dependency from x_i to x_j.
      Solving the argmax:
      - w · f(i, j) is the weight of the arc (i, j)
      - A dependency tree is a spanning tree of a dense graph over x
      - Use maximum spanning tree algorithms for inference
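A sketch of the inference step under the stated assumption that networkx is available; `arc_score(i, j)` stands in for w · f(i, j), and node 0 plays the role of the artificial root.

```python
# Arc-factored inference as a maximum spanning arborescence (Chu-Liu/Edmonds),
# using networkx. This is only an illustrative sketch of the argmax step.
import networkx as nx

def parse(n_words, arc_score):
    G = nx.DiGraph()
    for head in range(n_words + 1):              # 0 = artificial root, 1..n = words
        for dep in range(1, n_words + 1):
            if head != dep:
                G.add_edge(head, dep, weight=arc_score(head, dep))
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    # Return a dependent -> head map, i.e., the predicted dependency tree y.
    return {dep: head for head, dep in tree.edges()}
```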
  31. Defining f(i, j)
      Can contain any feature over the arc or the input sentence.
      Some example features:
      - Identities of x_i and x_j
      - Their part-of-speech tags
      - The part-of-speech tags of surrounding words
      - The distance between x_i and x_j
      - ...
  32. Empirical Results
      Spanning tree dependency parsing results (McDonald 2006), trained using MIRA (online SVMs):

        Language   Accuracy   Complete
        English    90.7       36.7
        Czech      84.1       32.2
        Chinese    79.7       27.2

      A simple structured linear classifier gives near state-of-the-art performance for many languages.
      Higher-order models give higher accuracy.
  33. Model 2: Transition-Based Parsing
      y* = argmax_y w · f(x, y) = argmax_y Σ_{t(s)∈T(y)} w · f(s, t)
      t(s) ∈ T(y) means that the derivation of y includes the application of transition t to state s.
      Solving the argmax:
      - w · f(s, t) is the score of transition t in state s
      - Use beam search to find the best derivation from the start state s_0
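A generic beam-search sketch over transition sequences; the parser interface (`legal_transitions`, `apply`, `is_final`, `trans_score` = w · f(s, t)) is an assumed abstraction for illustration, not the actual transition system from the slides.

```python
# Beam search over derivations: keep the beam_size highest-scoring partial
# transition sequences at each step and return the best complete one.
def beam_search(start_state, legal_transitions, apply, is_final,
                trans_score, beam_size=8):
    beam = [(0.0, start_state, [])]              # (score, state, transitions so far)
    while not all(is_final(s) for _, s, _ in beam):
        candidates = []
        for score, state, hist in beam:
            if is_final(state):
                candidates.append((score, state, hist))
                continue
            for t in legal_transitions(state):
                candidates.append((score + trans_score(state, t),
                                   apply(state, t), hist + [t]))
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return max(beam, key=lambda c: c[0])         # best (approximate) derivation
```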
  34. Defining f(s, t)
      Can contain any feature over parser states.
      Some example features:
      - Identities of words in s (e.g., top of the stack, head of the queue)
      - Their part-of-speech tags
      - Their heads and dependents (and their part-of-speech tags)
      - The number of dependents of words in s
      - ...
  35. Empirical Results
      Transition-based dependency parsing with beam search (**) (Zhang and Nivre 2011), trained using the perceptron:

        Language   Accuracy   Complete
        English    92.9       48.0
        Chinese    86.0       36.9

      A simple structured linear classifier gives state-of-the-art performance with rich non-local features.
      (**) Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. It is an optimization of best-first search that reduces memory requirements, and it only finds an approximate solution.
  36. Structured Prediction Summary
      We can't use multiclass algorithms directly – the search space is too large.
      Solution: factor the representations.
      This can allow for efficient inference and learning.
      We showed this for sequence learning: Viterbi + forward–backward.
      But it is also true for other structures:
      - CFG parsing: CKY + inside–outside
      - Dependency parsing: spanning tree algorithms or beam search
      - General graphs: message passing and belief propagation
