# Efficient Decomposed Learning for Structured Prediction #icml2012



1. Efficient Decomposed Learning for Structured Prediction
   Rajhans Samdani, Dan Roth (Illinois)
   Presenter: Yoh Okuno
2. Abstract
   - Structured learning is important for NLP and CV
   - The enormous output space is often intractable
   - Proposal: DecL, decomposed learning
   - DecL restricts the output space to a limited part
   - Efficient and accurate, both in experiments and in theory
3. Introduction
   - What is structured learning?
     - Predicting output variables that mutually depend on each other
     - Problem: an enormous (exponential) output space
   - Applications: NLP, CV, and bioinformatics
     - Multi-label document classification (binary) [Crammer+ 02]
     - Information extraction (sequence) [Lafferty+ 01]
     - Dependency parsing (tree) [Koo+ 10]
4. Example: Conditional Random Fields [Lafferty+ 01]
   (figure: output space)
5. Example: Markov Random Fields [Boykov+ 98]
   (figure: output space)
6. Related Work
   - There are two major approaches:
     1. Global Learning (GL): exact but slow [Tsochantaridis+ 04]
        - Searches the entire output space in the learning phase
        - Often implemented with ILP (Integer Linear Programming)
     2. Local Learning (LL): inaccurate but fast
        - Ignores the structure of the output for fast search
   - DecL is exact under some assumptions, yet faster than LL
7. Problem Setting
   - Given training data: D = {(x^1, y^1), ..., (x^m, y^m)}
   - The output y is represented as binary variables: y = (y_1, ..., y_n) ∈ {0, 1}^n
   - The model is a linear combination of features
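The linear model on this slide can be sketched as a dot product between a weight vector and a joint feature map over the input and the structured output. The feature map `phi` below is a hypothetical unary one, chosen only to make the sketch runnable; the paper does not specify this particular map.

```python
import numpy as np

def score(w, phi, x, y):
    """Linear model: f(x, y; w) = w . phi(x, y), where phi is a
    caller-supplied joint feature map over input x and output y."""
    return float(np.dot(w, phi(x, y)))

# Hypothetical joint feature map: for each binary output variable,
# emit the input vector scaled by that variable (a unary feature).
def phi(x, y):
    return np.concatenate([x * yi for yi in y])

x = np.array([1.0, 2.0])            # 2-dimensional input
y = (1, 0)                          # n = 2 binary output variables
w = np.array([0.5, 0.5, 1.0, 1.0])
print(score(w, phi, x, y))          # 0.5*1 + 0.5*2 = 1.5
```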
8. Structured SVM [Tsochantaridis+ 04]
   - Minimize the loss function below:
     l(w) = Σ_{j=1}^{m} [ max_{y ∈ Y} ( f(x^j, y; w) + Δ(y^j, y) ) − f(x^j, y^j; w) ]
     (the Δ term rewards incorrect outputs inside the max)
   - A generalization of the hinge loss to multiple output dimensions
   - The regularization term is omitted for space reasons
   - See [Tokunaga 2011] for more information
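A minimal brute-force sketch of this loss, assuming a Hamming-distance Δ and a caller-supplied joint feature map (both illustrative choices, not taken from the slides). The exhaustive max over all 2^n binary outputs is exactly the global search that DecL later restricts.

```python
import itertools
import numpy as np

def structured_hinge_loss(w, phi, data):
    """Structured SVM loss (regularizer omitted, as on the slide):
    sum_j [ max_y ( f(x^j, y; w) + Delta(y^j, y) ) - f(x^j, y^j; w) ],
    with Delta taken to be Hamming distance and the max computed by
    brute force over all 2^n binary outputs."""
    total = 0.0
    for x, y_gold in data:
        n = len(y_gold)
        best = max(
            np.dot(w, phi(x, y)) + sum(a != b for a, b in zip(y_gold, y))
            for y in itertools.product((0, 1), repeat=n)
        )
        total += best - np.dot(w, phi(x, y_gold))
    return total

# Hypothetical unary feature map, for illustration only.
phi = lambda x, y: np.concatenate([x * yi for yi in y])
data = [(np.array([1.0]), (1,))]    # one example, n = 1
w = np.array([2.0])
# Candidates: y=(1,): score 2 + Delta 0 = 2;  y=(0,): score 0 + Delta 1 = 1.
# Loss = max(2, 1) - 2 = 0.
print(structured_hinge_loss(w, phi, data))   # 0.0
```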
9. Figure 1: GL and DecL
   - Search a neighborhood around the gold output rather than the entire output space
10. DecL: Decomposed Learning
    - Define a neighborhood nbr(y^j) around the gold output:
      l(w) = Σ_{j=1}^{m} [ max_{y ∈ nbr(y^j)} ( f(x^j, y; w) + Δ(y^j, y) ) − f(x^j, y^j; w) ]
    - Note: the prediction phase still needs a global search
    - How can we define the neighborhood for learning?
11. Subgradient Descent for DecL
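The slide's algorithm itself is not reproduced in this transcript, but a standard subgradient step for a structured hinge loss restricted to a neighborhood might look like the sketch below; the learning rate `eta` and the Hamming Δ are assumptions, not details taken from the paper.

```python
import numpy as np

def decl_subgradient_step(w, phi, x, y_gold, neighborhood, eta=0.1):
    """One subgradient step for the DecL loss (sketch): take the
    loss-augmented argmax over the restricted neighborhood nbr(y_gold)
    instead of the full output space, then update
    w <- w - eta * (phi(x, y_hat) - phi(x, y_gold)).
    'neighborhood' is an iterable of candidate outputs, assumed to
    contain y_gold itself."""
    y_hat = max(
        neighborhood,
        key=lambda y: np.dot(w, phi(x, y))
                      + sum(a != b for a, b in zip(y_gold, y)),
    )
    return w - eta * (phi(x, y_hat) - phi(x, y_gold))

# Hypothetical unary feature map, as in the earlier sketches.
phi = lambda x, y: np.concatenate([x * yi for yi in y])
w0 = np.zeros(2)
w1 = decl_subgradient_step(w0, phi, np.array([1.0]), (1, 0),
                           [(1, 0), (0, 0), (1, 1)])
print(w1)   # weight on the gold-active coordinate increases: [0.1 0. ]
```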
12. DecL-k: A Special Case of DecL
    - Restrict the output space to k dimensions
      - Take all subsets of size k from the indices of y
      - The other dimensions are fixed to the gold output
    - In general, domain knowledge can be used
      - Group coupled variables into the same decomposition sets
      - Complexity depends on the size of the decomposition
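The DecL-k neighborhood described above can be enumerated directly: for every size-k subset of output indices, search all assignments to those coordinates while the remaining coordinates stay fixed to the gold values. A minimal sketch:

```python
import itertools

def decl_k_neighborhood(y_gold, k):
    """Enumerate the DecL-k neighborhood of a gold binary output:
    all outputs that differ from y_gold only within some size-k
    subset of coordinates (the rest stay fixed to the gold values)."""
    n = len(y_gold)
    nbr = set()
    for idx in itertools.combinations(range(n), k):
        # all 2^k assignments to the chosen k coordinates
        for vals in itertools.product((0, 1), repeat=k):
            y = list(y_gold)
            for i, v in zip(idx, vals):
                y[i] = v
            nbr.add(tuple(y))
    return nbr

print(sorted(decl_k_neighborhood((1, 0, 1), 1)))
# k=1: the gold output plus every single-bit flip -- 4 outputs in total
```

Note the neighborhood size is at most C(n, k) * 2^k, polynomial in n for fixed k, versus 2^n for the global search.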
13. Experiments on Synthetic Data
    - Compared DecL, LL, and GL (oracle)
    - Synthetic training data:
      - 10 binary outputs with random linear constraints
      - 20-dimensional inputs, 320 training examples
    - Running time in seconds: (table on slide)
14. Multi-Label Document Classification
    - Dataset: Reuters corpus
    - Size: 6,000 documents and 30 labels
    - DecL performs as well as GL and is 6x faster
15. Information Extraction: Sequence Tagging
    - Data 1: citation recognition
      - Recognize author, title, etc. from citation text
    - Data 2: real-estate advertisements
      - Recognize facilities, roommates, etc. from ads
16. Conclusion
    - Structured learning has a tradeoff between speed and accuracy
    - Decomposed learning (DecL) splits the output space into small spaces for fast inference during learning
    - Fast and accurate on real-world datasets
    - Theoretical guarantee of exactness under some assumptions (skipped)
17. References
    - [Collins+ 02] Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms.
    - [Lafferty+ 01] Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
    - [Koo+ 10] Dual decomposition for parsing with non-projective head automata.
    - [Boykov+ 98] Markov Random Fields with Efficient Approximations.
    - [Tsochantaridis+ 04] Support vector machine learning for interdependent and structured output spaces.
    - [Crammer+ 02] On the algorithmic implementation of multiclass kernel-based vector machines.