1. Structured Prediction: A Large Margin Approach
Ben Taskar, University of Pennsylvania

2. Acknowledgments
- Drago Anguelov
- Vassil Chatalbashev
- Carlos Guestrin
- Michael Jordan
- Dan Klein
- Daphne Koller
- Simon Lacoste-Julien
- Paul Vernaza

3. Structured Prediction
- Prediction of complex outputs
  - Structured outputs: multivariate, correlated, constrained
- Novel, general way to solve many learning problems

4. Handwriting Recognition
[Figure: input x = letter images, output y = "brace"; sequential structure]

5. Object Segmentation
[Figure: input x = 3D scan, output y = segment labels; spatial structure]

6. Natural Language Parsing
[Figure: input x = "The screen was a sea of red", output y = parse tree; recursive structure]

7. Bilingual Word Alignment
[Figure: input x = sentence pair, output y = word alignment; combinatorial structure]
English: "What is the anticipated cost of collecting fees under the new proposal?"
French: "En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?"

8. Protein Structure and Disulfide Bridges
Protein 1IMT: AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRMHHTCPCAPNLACVQTSPKKFKCLSK
(the cysteines C are highlighted in the slide; disulfide bridges pair them up)

9. Local Prediction
- Classify using local information
- Ignores correlations & constraints!
[Figure: per-letter OCR predictions "b r e a c" for the word "brace"]

10. Local Prediction
[Figure: 3D scan labeled point-by-point with classes building, tree, shrub, ground]

11. Structured Prediction
- Use local information
- Exploit correlations
[Figure: the same letter images, now decoded jointly as "brace"]

12. Structured Prediction
[Figure: the same 3D scan, labeled jointly with classes building, tree, shrub, ground]

13. Outline
- Structured prediction models
  - Sequences (CRFs)
  - Trees (CFGs)
  - Associative Markov networks (special MRFs)
  - Matchings
- Structured large margin estimation
  - Margins and structure
  - Min-max formulation
  - Linear programming inference
  - Certificate formulation

14. Structured Models
- Mild assumption: the scoring function is a linear combination of features, s(x, y) = w · f(x, y)
- Prediction: y* = argmax of s(x, y) over the space of feasible outputs Y(x)

15-16. Chain Markov Net (aka CRF*)
[Figure, shown on two consecutive slides: a chain of label variables y, each ranging over a-z, with observed inputs x]
*Lafferty et al. 01

17. Associative Markov Nets
- Point features: spin-images, point height (node potentials φ_j(y_j))
- Edge features: length of edge, edge orientation (edge potentials φ_jk(y_j, y_k))
- "Associative" restriction: edge potentials reward agreement between neighboring labels y_j, y_k

18. CFG Parsing
Features count rule occurrences: #(NP → DT NN), ..., #(PP → IN NP), ..., #(NN → 'sea')

19. Bilingual Word Alignment
Edge (j, k) features:
- position
- orthography
- association
[Figure: alignment edges between the tokenized English sentence "What is the anticipated cost of collecting fees under the new proposal ?" and the tokenized French sentence "En vertu de les nouvelles propositions , quel est le coût prévu de perception de le droits ?"]

20. Disulfide Bonds: Non-bipartite Matching
[Figure: cysteines 1-6 in the sequence RSCCPCYWGGCPWGQNCYPEGCSGPKV, matched in pairs]
Fariselli & Casadio '01, Baldi et al. '04

21. Scoring Function
Features for a candidate bond (j, k):
- amino acid identities
- phys/chem properties
[Figure: a candidate bond between two cysteine windows in RSCCPCYWGGCPWGQNCYPEGCSGPKV]

22. Structured Models
- Mild assumptions: the scoring function is a linear combination of features, and it decomposes as a sum of part scores (see the sketch below):
  s(x, y) = w · f(x, y) = Σ_p w · f_p(x, y_p)
- Prediction: argmax of s(x, y) over the space of feasible outputs Y(x)

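The decomposition above is easiest to see in code. Below is a minimal sketch (my illustration, not code from the talk) of a part-factored linear score for a label sequence; the emission/transition feature layout is an assumption chosen for concreteness.

```python
# Part-factored linear scoring: s(x, y) = sum over parts p of w . f_p(x, y_p).
import numpy as np

def part_features(x, y, p, num_labels, num_obs):
    """Features for part p: emission and transition indicator vector."""
    f = np.zeros(num_labels * num_obs + num_labels * num_labels)
    f[y[p] * num_obs + x[p]] = 1.0                   # emission y_p -> x_p
    if p > 0:                                        # transition y_{p-1} -> y_p
        f[num_labels * num_obs + y[p - 1] * num_labels + y[p]] = 1.0
    return f

def score(w, x, y, num_labels, num_obs):
    """s(x, y) = w . f(x, y), with f summed over per-position parts."""
    return sum(w @ part_features(x, y, p, num_labels, num_obs)
               for p in range(len(x)))

# toy usage: 3 labels, 4 observation symbols
rng = np.random.default_rng(0)
w = rng.normal(size=3 * 4 + 3 * 3)
print(score(w, x=[0, 2, 3], y=[1, 1, 2], num_labels=3, num_obs=4))
```
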
23. Supervised Structured Prediction
- Learning: from data {(x_i, y_i)}, estimate w
  - Likelihood (can be intractable)
  - Margin
  - Local (ignores structure)
- Prediction: example, weighted matching; generally, combinatorial optimization

24. Local Estimation
- Treat edges as independent decisions
- Estimate w locally, use globally
  - E.g., naive Bayes, SVM, logistic regression
  - Cf. [Matusov+al 03] for matchings
  - Simple and cheap
  - Not well-calibrated for the matching model
  - Ignores correlations & constraints

25. Conditional Likelihood Estimation
- Estimate w jointly by maximizing the conditional likelihood of y_i given x_i
- The denominator (partition function) is #P-complete [Valiant 79, Jerrum & Sinclair 93]
- Tractable model, intractable learning
- Need a tractable learning method ⇒ margin-based estimation

26. Outline (revisited; same outline as slide 13)

27. OCR Example
- We want: score(x, "brace") to beat the score of every other letter sequence
- Equivalently: score(x, "brace") > score(x, "aaaaa"), score(x, "brace") > score(x, "aaaab"), ..., score(x, "brace") > score(x, "zzzzz") (a lot of constraints!)

28. Parsing Example
- We want: the score of the correct parse of 'It was red' to beat every alternative parse
- Equivalently: one constraint per alternative tree (S A B C D vs. S A B D F vs. S E F G H, ...), a lot of constraints!

29. Alignment Example
- We want: the score of the correct alignment of 'What is the' / 'Quel est le' to beat every alternative alignment
- Equivalently: one constraint per alternative alignment of positions 1 2 3, a lot of constraints!

30. Structured Loss
Loss counts the number of wrong parts (Hamming distance to the truth); see the sketch below. For the truth "brace":
- "bcare": 2,  "brore": 2,  "broce": 1,  "brace": 0
Analogous per-edge losses for alignments ('What is the' / 'Quel est le') and per-span losses for parses ('It was red').

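As a small concrete companion to the slide (my illustration, not the talk's code), the Hamming loss over parts:

```python
# Structured Hamming loss: number of parts where y_pred disagrees with y_true.
def hamming_loss(y_true, y_pred):
    """Count mismatched positions between two equal-length label sequences."""
    assert len(y_true) == len(y_pred)
    return sum(a != b for a, b in zip(y_true, y_pred))

# matches the slide: losses 2, 2, 1, 0 against the truth "brace"
for guess in ["bcare", "brore", "broce", "brace"]:
    print(guess, hamming_loss("brace", guess))
```
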
31. Large margin estimation
- Given training examples {(x_i, y_i)}, we want: w·f(x_i, y_i) > w·f(x_i, y) for all y ≠ y_i
- Maximize the margin γ: w·f(x_i, y_i) ≥ w·f(x_i, y) + γ
- Mistake-weighted margin: w·f(x_i, y_i) ≥ w·f(x_i, y) + γ ℓ(y_i, y), where ℓ(y_i, y) = # of mistakes in y
*Collins 02, Altun et al 03, Taskar 03

32. Large margin estimation
- Eliminate the margin variable by fixing the scale of w
- Add slacks ξ_i for the inseparable case (hinge loss):
  min ||w||² + C Σ_i ξ_i  s.t.  w·f(x_i, y_i) ≥ w·f(x_i, y) + ℓ(y_i, y) − ξ_i for all y

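A brute-force sketch of the resulting hinge loss may help make the constraints concrete. Everything here is illustrative: the toy `feat` featurizer is an assumption, and the exhaustive enumeration over outputs stands in for the combinatorial inference the talk develops; it is only feasible for tiny outputs.

```python
# Structured hinge: max_y [ w.f(x,y) + loss(y_i,y) ] - w.f(x_i,y_i).
import itertools
import numpy as np

def feat(x, y, num_labels=3, num_obs=4):
    """Toy emission-count features (illustrative only)."""
    f = np.zeros(num_labels * num_obs)
    for xi, yi in zip(x, y):
        f[yi * num_obs + xi] += 1.0
    return f

def structured_hinge(w, x, y_true, num_labels=3):
    """Brute-force loss-augmented hinge; feasible only for tiny outputs."""
    best = -np.inf
    for y in itertools.product(range(num_labels), repeat=len(x)):
        val = w @ feat(x, y) + sum(a != b for a, b in zip(y, y_true))
        best = max(best, val)
    return best - w @ feat(x, y_true)

w = np.zeros(12)
print(structured_hinge(w, x=[0, 1, 2], y_true=(0, 0, 1)))  # 3.0 at w = 0
```
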
33. Large margin estimation
- Brute force enumeration of constraints
- Min-max formulation
  - 'Plug in' a linear program for inference

34. Min-max formulation
- Fold the exponentially many constraints into one per example:
  w·f(x_i, y_i) + ξ_i ≥ max_y [ w·f(x_i, y) + ℓ(y_i, y) ]
- Structured loss ℓ: Hamming
- Inference is discrete optimization; key step: replace it with an equivalent LP (continuous optimization)

35. Alternatives: Perceptron
- Simple iterative method: predict ŷ, then update w by f(x_i, y_i) − f(x_i, ŷ) (see the sketch below)
- Unstable for structured output: fewer instances, big updates
  - May not converge if non-separable
  - Noisy
- Voted / averaged perceptron [Freund & Schapire 99, Collins 02]
  - Regularize / reduce variance by aggregating over iterations

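A minimal sketch of the averaged structured perceptron in the style of Collins 02; `decode` is an assumed argmax-inference routine supplied by the model, and `feat` an assumed feature map.

```python
# Averaged structured perceptron: mistake-driven updates plus weight averaging.
import numpy as np

def averaged_perceptron(examples, feat, decode, dim, epochs=5):
    """examples: list of (x, y); feat(x, y) -> np.ndarray; decode(w, x) -> y."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for _ in range(epochs):
        for x, y in examples:
            y_hat = decode(w, x)             # current best prediction
            if y_hat != y:                   # mistake-driven update
                w += feat(x, y) - feat(x, y_hat)
            w_sum += w                       # accumulate for averaging
    return w_sum / (epochs * len(examples))  # averaged weights
```
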
36. Alternatives: Constraint Generation [Collins 02; Altun et al 03]
- Repeatedly add the most violated constraint and re-solve the QP (sketch below)
- Handles several more general loss functions
- Need to re-solve the QP many times
- Theorem: only a polynomial # of constraints is needed to achieve ε-error [Tsochantaridis et al 04]
- Worst-case # of constraints larger than the factored formulation

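The loop structure, schematically. All callables here (`solve_qp`, `loss_augmented_decode`, `loss`, `feat`) are assumed helpers, not any specific library's API:

```python
# Cutting-plane / constraint generation for the structured SVM QP.
def constraint_generation(examples, feat, loss, loss_augmented_decode,
                          solve_qp, epsilon=1e-3, max_iters=100):
    working_sets = [[] for _ in examples]   # active constraint set per example
    w = None
    for _ in range(max_iters):
        # re-solve the QP restricted to the current working sets
        w, xi = solve_qp(examples, feat, working_sets)
        added = False
        for i, (x, y) in enumerate(examples):
            y_hat = loss_augmented_decode(w, x, y)   # most violated output
            violation = (loss(y, y_hat) + w @ feat(x, y_hat)
                         - w @ feat(x, y) - xi[i])
            if violation > epsilon:
                working_sets[i].append(y_hat)
                added = True
        if not added:                                # epsilon-feasible: done
            break
    return w
```
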
37. Outline (revisited; same outline as slide 13)

38. Matching Inference LP
- Maximize Σ_jk s_jk z_jk subject to degree constraints and 0 ≤ z ≤ 1
- Has integral solutions z (the constraint matrix A is totally unimodular) [Nemhauser+Wolsey 88]; see the sketch below
- Need a Hamming-like loss so the loss-augmented problem stays an LP of the same form
[Figure: bipartite alignment graph between the tokenized English and French sentences, edges (j, k)]

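Since the LP is integral, the matching can equally be computed combinatorially. A minimal sketch (my example, not the talk's code) using SciPy's Hungarian-algorithm routine, which returns the same integral solution the LP relaxation would:

```python
# Max-score bipartite matching; integral by total unimodularity of the LP.
import numpy as np
from scipy.optimize import linear_sum_assignment

scores = np.array([[0.9, 0.1, 0.0],   # s_jk: score for aligning word j to k
                   [0.2, 0.8, 0.3],
                   [0.0, 0.4, 0.7]])
rows, cols = linear_sum_assignment(scores, maximize=True)
print(list(zip(rows, cols)))          # optimal alignment pairs (j, k)
```
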
39. y → z Map for Markov Nets
Encode a labeling y as indicator variables: z_j(a) = 1 iff y_j = a, and z_jk(a, b) = 1 iff (y_j, y_k) = (a, b).
[Figure: the indicator vectors for an example labeling]

40. Markov Net Inference LP
- Maximize Σ s·z subject to normalization (Σ_a z_j(a) = 1) and agreement (Σ_b z_jk(a, b) = z_j(a)) constraints
- Has integral solutions z for chains and (hyper)trees; Viterbi sketch below
- Can be fractional for untriangulated networks [Chekuri+al 01, Wainwright+al 02]

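For chains, the integral LP solution coincides with ordinary Viterbi MAP decoding; a minimal sketch with illustrative random scores:

```python
# Viterbi MAP decoding for a chain Markov net (exact, matches the integral LP).
import numpy as np

def viterbi(node_scores, edge_scores):
    """node_scores: (T, K); edge_scores: (K, K). Returns the argmax labeling."""
    T, K = node_scores.shape
    best = node_scores[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[k_prev, k] = best path ending in k_prev, extended to k
        cand = best[:, None] + edge_scores + node_scores[t][None, :]
        back[t] = cand.argmax(axis=0)
        best = cand.max(axis=0)
    y = [int(best.argmax())]
    for t in range(T - 1, 0, -1):         # backtrack pointers
        y.append(int(back[t][y[-1]]))
    return y[::-1]

rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(5, 3)), rng.normal(size=(3, 3))))
```
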
41. Associative MN Inference LP
- "Associative" restriction: edge potentials reward agreement
- For K=2 labels, LP solutions are always integral (optimal)
- For K>2, within a factor of 2 of optimal (similar results for larger cliques)
[Greig+al 89, Boykov+al 99, Kolmogorov & Zabih 02, Taskar+al 04]

42. CFG Chart
- CNF tree = set of two types of parts:
  - Constituents (A, s, e)
  - CF-rules (A → B C, s, m, e)

43. CFG Inference LP
- Variables z over constituents and rule applications; constraints require a root constituent and inside/outside consistency between each constituent and the rules above and below it
- Has integral solutions z (CKY sketch below)

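For concreteness, the same argmax the integral CFG LP computes can be found by CKY dynamic programming. A minimal sketch; the tiny grammar and scores below are invented for illustration:

```python
# CKY over a CNF grammar: max-scoring parse = the CFG inference LP's argmax.
from collections import defaultdict

def cky(words, unary, binary):
    """unary: {(A, word): score}; binary: {(A, B, C): score}. Best parse score."""
    n = len(words)
    chart = defaultdict(lambda: float("-inf"))     # (A, s, e) -> best score
    for i, w in enumerate(words):                  # width-1 constituents
        for (A, word), sc in unary.items():
            if word == w:
                chart[A, i, i + 1] = max(chart[A, i, i + 1], sc)
    for span in range(2, n + 1):                   # wider spans, bottom-up
        for s in range(0, n - span + 1):
            e = s + span
            for m in range(s + 1, e):
                for (A, B, C), sc in binary.items():
                    val = chart[B, s, m] + chart[C, m, e] + sc
                    chart[A, s, e] = max(chart[A, s, e], val)
    return chart["S", 0, n]

unary = {("NP", "screen"): 1.0, ("V", "glowed"): 1.0}
binary = {("S", "NP", "V"): 0.5}
print(cky(["screen", "glowed"], unary, binary))    # 2.5
```
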
44. LP Duality
- Linear programming duality
  - Variables ↔ constraints
  - Constraints ↔ variables
- Optimal values are the same
  - When both feasible regions are bounded

45. Min-max Formulation via LP Duality
Replace the inner max over y by its inference LP, then take the LP dual: the inner max becomes a min, and the whole problem collapses into a single compact QP over w and the dual variables.

46. Min-max formulation summary
- Formulation produces a concise QP for:
  - Low-treewidth Markov networks
  - Associative MNs (K=2)
  - Context-free grammars
  - Bipartite matchings
- Approximate for untriangulated MNs and AMNs with K>2
*Taskar et al 04

47. Unfactored Primal/Dual
By QP duality, the unfactored primal (one margin constraint per competing output y) has a dual with exponentially many variables, one per (example, output) pair.

48. Factored Primal/Dual
- By QP duality, the dual inherits structure from the problem-specific inference LP
- Dual variables correspond to a decomposition of the variables of the flat (unfactored) case

49. The Connection
[Figure: the flat dual puts weights (.2, .15, .25, .4) on the whole outputs "bcare", "brore", "broce", "brace", with losses (2, 2, 1, 0); these weights marginalize into the factored dual's per-letter variables]

50. Duals and Kernels
- Kernel trick works:
  - Factored dual
  - Local functions (log-potentials) can use kernels

51. 3D Mapping
- Sensors: laser range finder, GPS, IMU (data provided by Michael Montemerlo & Sebastian Thrun)
- Labels: ground, building, tree, shrub
- Training: 30 thousand points; testing: 3 million points

56. Segmentation results (hand-labeled 180K test points)
Model  | Accuracy
SVM    | 68%
V-SVM  | 73%
M³N    | 93%

57. Fly-through
[Video: fly-through of the labeled 3D map]

58. LAGRbot: Real-time Navigation
- LAGRbot: Paul Vernaza & Dan Lee
- Range of stereo vision limited to approximately 15 m or less

59. LAGRbot: Real-time Navigation
- 160x120 images: real-time prediction/learning (~100 ms)
- Current work with Paul Vernaza, Dan Lee

Model      | Error
Local      | 17%
Structured | 8%

60. Hypertext Classification
- WebKB dataset
  - Four CS department websites: 1300 pages / 3500 links
  - Classify each page: faculty, course, student, project, other
  - Train on three universities, test on the fourth
- 53% error reduction over SVMs
- 38% error reduction over RMNs [*Taskar et al 02], with relaxed LP inference doing better than loopy belief propagation

61. Word Alignment Results
- Data: [Hansards, Canadian Parliament]; features induced on ~1 mil unsupervised sentences
- Trained on 100 sentences (10,000 edges); tested on 350 sentences (35,000 edges) [Taskar+al 05, Lacoste-Julien+Taskar+al 06]
- *Error: weighted combination of precision/recall

Model                    | *Error
GIZA/IBM4 [Och & Ney 03] | 6.5
+Local learning+matching | 5.4
+Our approach            | 4.9
+Our approach+QAP        | 4.5

62. Modeling First-Order Effects
- Features: monotonicity, local inversion, local fertility
- QAP is NP-complete
- Sentences (≤ 30 words, ~1k vars) solve in a few seconds (Mosek)
- Learning: use the LP relaxation
- Testing: using the LP, 83.5% of sentences and 99.85% of edges are integral

63. Outline (revisited; same outline as slide 13)

64. Certificate formulation
- Non-bipartite matchings:
  - O(n³) combinatorial algorithm (see the sketch below)
  - No polynomial-size LP known
- Spanning trees:
  - No polynomial-size LP known
  - Simple certificate of optimality
- Intuition: verifying optimality is easier than optimizing
- Use a compact optimality condition of y_i with respect to w

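For reference, the combinatorial route the slide mentions: a minimal sketch (my example) of max-weight non-bipartite matching via the blossom algorithm in networkx (an assumed dependency; the edge scores are invented):

```python
# Max-weight non-bipartite matching (Edmonds-style blossom algorithm).
import networkx as nx

G = nx.Graph()
# candidate disulfide bonds among 6 cysteines, with illustrative scores
edges = [(1, 2, 0.5), (1, 6, 0.9), (2, 5, 0.8), (3, 4, 0.7),
         (2, 3, 0.2), (4, 5, 0.1)]
G.add_weighted_edges_from(edges)

matching = nx.max_weight_matching(G, maxcardinality=True)
print(sorted(tuple(sorted(e)) for e in matching))  # [(1, 6), (2, 5), (3, 4)]
```
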
65. Certificate for non-bipartite matching
- Alternating cycle: every other edge is in the matching
- Augmenting alternating cycle: the score of edges not in the matching exceeds that of edges in the matching
- Negate the score of edges not in the matching:
  augmenting alternating cycle = negative-length alternating cycle
- Matching is optimal ⇔ no negative alternating cycles (Edmonds '65)

66. Certificate for non-bipartite matching
- Pick any node r as root
- d_j = length of the shortest alternating path from r to j
- Triangle inequality: d_k ≤ d_j + length(j, k)
- Theorem: no negative-length cycle ⇔ such a distance function d exists
- Can be expressed as linear constraints: O(n) distance variables, O(n²) constraints

67. Certificate formulation
- Formulation produces a compact QP for:
  - Spanning trees
  - Non-bipartite matchings
  - Any problem with a compact optimality condition
*Taskar et al '05

68. Disulfide Bonding Prediction
- Data [Swiss-Prot 39]: 450 sequences (4-10 cysteines)
- Features: windows around each C-C pair; physical/chemical properties [Taskar+al 05]
- *Accuracy: % of proteins with all bonds correct

Model                              | *Acc
Local learning+matching            | 41%
Recursive Neural Net [Baldi+al 04] | 52%
Our approach (certificate)         | 55%

69. Formulation summary
- Brute force enumeration
- Min-max formulation: 'plug in' a convex program for inference
- Certificate formulation: directly guarantee optimality of y_i

70. Scalable Algorithms
- Convex quadratic program: # variables and constraints linear in # parameters and edges
- Can solve using off-the-shelf software
  - Matlab, CPLEX, Mosek, etc.
  - Superlinear convergence
- Problem: even linear size is too large
  - Second-order methods run out of memory (quadratic)
- Need scalable, memory-efficient methods
  - Space/time tradeoff
  - Structured SMO [Taskar+al 04]
  - Structured exponentiated gradient [Bartlett+al 04, Collins+al 07]
  - These don't work for matchings, min-cuts

71. Structured Extragradient
- Extragradient method [Korpelevich 76, Nesterov 03]
  - Linear convergence
  - Alternates a gradient step with a projection step [Taskar+al 06]

72. Saddle-point Problem
The factored QP can be written as a saddle point: a min over w (and slacks) of a max over the relaxed inference variables z of an objective bilinear in w and z.

73. Extragradient Method [Korpelevich 76]
- Prediction step: u' = Π(u − η ∇F(u))
- Correction step: u⁺ = Π(u − η ∇F(u'))
- Π = Euclidean projection onto the feasible set, η = step size
- Theorem: extragradient converges linearly
- Key computation is the Euclidean projection: usually easy for w, harder for z (sketch below)

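A minimal sketch (my illustration, not the talk's code) of the two-step extragradient update on a toy regularized bilinear saddle point, with the z-projection onto a box standing in for the structured projections discussed next:

```python
# Extragradient on min_w max_z  0.5||w||^2 + w^T A z,  z constrained to a box.
import numpy as np

def project_box(z, lo=0.0, hi=1.0):
    return np.clip(z, lo, hi)            # Euclidean projection onto [lo, hi]^n

def extragradient(A, steps=300, eta=0.1):
    w = np.zeros(A.shape[0])
    z = np.full(A.shape[1], 0.5)
    for _ in range(steps):
        # prediction: gradient steps from (w, z)
        w_p = w - eta * (w + A @ z)
        z_p = project_box(z + eta * (A.T @ w))
        # correction: re-take the steps using gradients at (w_p, z_p)
        w = w - eta * (w_p + A @ z_p)
        z = project_box(z + eta * (A.T @ w_p))
    return w, z

A = np.array([[1.0, -1.0], [0.5, 2.0]])
print(extragradient(A))                  # converges toward the saddle point
```
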
74. Projection for Bipartite Matchings: Min Cost Flow
- Min-cost quadratic flow computes the projection
  - O(N^1.5) complexity for fixed precision (N = # edges)
- Reduction to flow for min-cuts also possible [Taskar+al 06]
[Figure: flow network from source s through English words j ("What is the anticipated cost") and French words k ("quel est le coût prévu") to sink t; all capacities = 1]

75. Structured Extragradient
- Extragradient method [Korpelevich 76, Nesterov 03]
  - Linear convergence
  - Key computation: projection (min-cost quadratic flow for matchings & cuts)
- Extensions (using Bregman divergences) ⇒ dynamic programming for decomposable models
- "Online envy": want memory proportional to # parameters, independent of # examples
  - Solves problems with on the order of a million edges [Taskar+al 06]

76. Other approaches
- Online methods: online updates with respect to most violated constraints [Crammer+al 05, 06]
- Regression-based methods: regression from input to a transformed output space [Cortes+al 07]
- Learning to search: learn a classifier to guide local search for a structured solution [Daume+al 05]
- Many others

77. Generalization Bounds
[Cartoon caption: "If the past is any indication of the future, he'll have a cruller."]

78. Generalization Bounds

79. Several Pointers
- Perceptron bound [Collins 01]
  - Assumes separability with margin γ
  - Bound on 0-1 loss
- Covering-number bound [Taskar+al 03]
  - Bound on Hamming loss
  - Logarithmic dependence on # variables in each y
- Regret bounds [Crammer+al 06]
  - Online-style guarantees for more general losses
- PAC-Bayes bound [McAllester 07]
  - Tighter analysis, consistency
- Bounds for learning with approximate inference [Kulesza & Pereira, today]

80. Open Questions for Large-Margin Estimation
- Statistical consistency
  - Hinge loss is not consistent for non-binary output [see Tewari & Bartlett 05, McAllester 07]
- Semi-supervised learning
  - Laplacian regularization [Altun+McAllester 05]
  - Co-regularization [Brefeld+al 05]
- Latent variables
  - Machine translation [Liang+al 06]
  - CCG parsing to logical form [Zettlemoyer+Collins 07]
- Learning with approximate inference

81. Learning with LP relaxations
- Does constant-factor approximate inference guarantee anything a priori about learning?
- No [see Kulesza & Pereira, tonight]
  - Simple 3-node counterexample
  - Separable with exact inference, not separable with approximate inference
- Question: what other (stronger?) approximate-inference guarantees translate into learning guarantees?

82. References
- Edited collection: G. Bakir et al. (2007), Predicting Structured Data, MIT Press
- Code: SVMstruct by Thorsten Joachims
- Slides and more papers: http://www.cis.upenn.edu/~taskar

83. Thanks!

84. Segmentation Model → Min-Cut
- Computing the MAP labeling is hard in general, but
- if edge potentials are attractive ⇒ min-cut algorithm (sketch below)
- Multiway cut for the multiclass case ⇒ use LP relaxation
[Figure: binary (0/1) labels with local-evidence and spatial-smoothness terms]
[Greig+al 89, Boykov+al 99, Kolmogorov & Zabih 02, Taskar+al 04]
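
A minimal sketch (my illustration) of the binary case: MAP labeling with attractive pairwise terms reduced to an s-t min-cut in the style of the Greig+al 89 construction, using networkx as an assumed dependency. The unary costs and penalties below are invented for the example.

```python
# Binary MAP segmentation via s-t min-cut: cut value = energy of the labeling.
import networkx as nx

def binary_map_by_mincut(unary, edges):
    """unary: {node: (cost0, cost1)}; edges: {(i, j): disagreement penalty}."""
    G = nx.DiGraph()
    for i, (c0, c1) in unary.items():
        G.add_edge("s", i, capacity=c1)   # cut iff i labeled 1: pay cost1
        G.add_edge(i, "t", capacity=c0)   # cut iff i labeled 0: pay cost0
    for (i, j), lam in edges.items():     # attractive: pay lam on disagreement
        G.add_edge(i, j, capacity=lam)
        G.add_edge(j, i, capacity=lam)
    value, (src_side, sink_side) = nx.minimum_cut(G, "s", "t")
    # label i as 1 exactly when it ends on the sink side of the cut
    return value, {i: int(i in sink_side) for i in unary}

unary = {"a": (0.0, 2.0), "b": (1.5, 0.5), "c": (2.0, 0.0)}
edges = {("a", "b"): 1.0, ("b", "c"): 1.0}
print(binary_map_by_mincut(unary, edges))  # energy 1.5, labels a=0, b=1, c=1
```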