Bayesian word alignment for statistical machine translation

These are our reading-group slides for the ACL 2011 poster paper "Bayesian Word Alignment for Statistical Machine Translation" by Coskun Mermer and Murat Saraclar.



  1. Bayesian Word Alignment for Statistical Machine Translation
     Authors: Coskun Mermer, Murat Saraclar
     Presented by Jun Lang, 2011-10-13, I2R SMT Reading Group
  2. Paper info
     • Bayesian Word Alignment for Statistical Machine Translation
     • ACL 2011 short paper
     • Comes with source code: 379 lines of Perl
     • Authors: Coskun Mermer and Murat Saraclar
  3. Core idea
     • Proposes a Gibbs sampler for fully Bayesian inference in IBM Model 1
     • Results:
       – Outperforms classical EM by up to 2.99 BLEU points
       – Effectively addresses the rare-word problem
       – Produces a much smaller phrase table than EM
  4. Notation
     • (E, F): parallel corpus of sentence pairs (e, f)
     • e_i (f_j): the i-th (j-th) word of source sentence e (target sentence f), which has I (J) words
     • e_0: the "null" word prepended to every source sentence
     • V_E (V_F): size of the source (target) vocabulary
     • a (A): word alignment of a sentence (of the whole corpus)
     • a_j: target word f_j is aligned to source word e_{a_j}
     • T: table of translation parameters, of size V_E × V_F
     • t_{e,f} = P(f | e): word translation probability
  5. IBM Model 1
     • Treat the translation table T as a random variable rather than a fixed parameter (see the likelihood below)
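For reference, this is the standard IBM Model 1 likelihood of a target sentence f and alignment a given a source sentence e, written in the notation of slide 4 (ε is the usual sentence-length constant); the Bayesian treatment keeps this likelihood but no longer regards T as a point estimate:

```latex
P(\mathbf{f}, \mathbf{a} \mid \mathbf{e}, T)
  = \frac{\epsilon}{(I+1)^{J}} \prod_{j=1}^{J} t_{e_{a_j},\, f_j}
```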
  6. Dirichlet distribution
     • Each row t_e of T = {t_{e,f}} is a multinomial distribution, a member of the exponential family
     • We therefore choose its conjugate prior, the Dirichlet distribution, for computational convenience (see the density below)
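A minimal statement of the prior, one Dirichlet per source word type; the symmetric form with a single hyperparameter θ is an assumption made here for readability (the paper's formulation permits per-entry hyperparameters θ_{e,f}):

```latex
t_e = (t_{e,1}, \ldots, t_{e,V_F}) \sim \mathrm{Dirichlet}(\theta),
\qquad
p(t_e) \propto \prod_{f=1}^{V_F} t_{e,f}^{\,\theta - 1}
```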
  7. Dirichlet distribution (cont.)
     • Each source word type e gets a distribution t_e over the target vocabulary, with a Dirichlet prior
     • A sparse prior (θ < 1) keeps rare source words from acting as "garbage collectors" that soak up alignments to many unrelated target words
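Conjugacy is what makes inference tractable: if N_{e,f} counts how often target word f is aligned to source word e under the current alignment A, the posterior of each row of T is again a Dirichlet (still assuming the symmetric θ from above):

```latex
t_e \mid \mathbf{E}, \mathbf{F}, \mathbf{A}
  \;\sim\; \mathrm{Dirichlet}\bigl(\theta + N_{e,1},\; \ldots,\; \theta + N_{e,V_F}\bigr)
```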
  8. Gibbs sampling
     • Sample the unknowns A and T in turn from their conditional posteriors
     • ¬j denotes exclusion of the current value of a_j (see the sampling distribution below)
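Because the prior is conjugate, T can also be integrated out, which is one common way to realize this sampler and appears to be what the ¬j counts on the slide refer to; each alignment link is then resampled from the distribution below (reconstructed here with the symmetric θ, so the exact form in the paper may differ):

```latex
P(a_j = i \mid \mathbf{E}, \mathbf{F}, \mathbf{A}^{\neg j})
  \;\propto\;
  \frac{N^{\neg j}_{e_i, f_j} + \theta}
       {\sum_{f=1}^{V_F} N^{\neg j}_{e_i, f} + V_F\,\theta}
```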
  9. Algorithm
     • The initial alignment A can be arbitrary, but initializing from the usual EM output works better (see the sketch below)
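A minimal Python sketch of the sampling loop, using the integrated-out form above. This is an illustration, not the paper's bayesalign.pl: the corpus layout (lists of integer word IDs with position 0 of each source sentence reserved for the null word), the function name gibbs_align, and the symmetric hyperparameter theta are all assumptions.

```python
import numpy as np

def gibbs_align(corpus, V_E, V_F, theta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling of word alignments for IBM Model 1.

    corpus: list of (e, f) pairs of integer word-ID sequences, e[0] = null word.
    V_E, V_F: source / target vocabulary sizes.
    theta: symmetric Dirichlet hyperparameter (theta < 1 favors sparse tables).
    """
    rng = np.random.default_rng(seed)
    corpus = [(np.asarray(e), np.asarray(f)) for e, f in corpus]
    # Random initial alignments (the slides note that EM output is a better start).
    A = [rng.integers(0, len(e), size=len(f)) for e, f in corpus]

    # Link counts N[e, f] under the current alignment state.
    N = np.zeros((V_E, V_F))
    for (e, f), a in zip(corpus, A):
        for j, i in enumerate(a):
            N[e[i], f[j]] += 1

    for _ in range(iters):
        for (e, f), a in zip(corpus, A):
            for j in range(len(f)):
                # Remove the current link: these are the "¬j" counts.
                N[e[a[j]], f[j]] -= 1
                # P(a_j = i | A^¬j) ∝ (N¬j[e_i, f_j] + θ) / (Σ_f' N¬j[e_i, f'] + V_F·θ)
                p = (N[e, f[j]] + theta) / (N[e].sum(axis=1) + V_F * theta)
                a[j] = rng.choice(len(e), p=p / p.sum())
                # Add the newly sampled link back into the counts.
                N[e[a[j]], f[j]] += 1
    return A
```

After the final iteration, a point estimate of the translation table can be read off the counts as t(f|e) ≈ (N[e,f] + θ) / (Σ_f' N[e,f'] + V_F·θ), and the sampled alignments A feed phrase extraction as usual.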
  10. Results
  11. Code
      • View bayesalign.pl
  12. Conclusions
      • Outperforms classical EM by up to 2.99 BLEU points
      • Effectively addresses the rare-word problem
      • Produces a much smaller phrase table than EM
      • Shortcomings:
        – Too slow: 100 sentence pairs take 18 minutes
        – Could perhaps be sped up with parallel computing