Latent Domain Word Alignment for Heterogeneous Corpora
Hoang Cuong
Joint work with Khalil Sima’an, appearing at NAACL 2015
ILLC, University of Amsterdam

1. An Introduction

Bitext word alignment
The alignment task: identifying the translation relationships among the words in parallel sentences.
Proposed by [Brown et al., 1993], it has turned out to be one of the most important tasks in Natural Language Processing.

Bitext word alignment
Figure: the Statistical Alignment Framework (a), cf. [Brown et al., 1993]: Bilingual Data → Alignment Model → Viterbi Decoding.

SMT with a Mix-of-Domains Haystack
We have big data to train SMT systems, thanks to Europarl, UN, Common Crawl, ...
The data come from very different domains. How does this affect alignment accuracy?
Does bigger data produce better alignment quality? Not necessarily, and this is in fact not so surprising: in domain adaptation, [Moore and Lewis, 2010; Axelrod et al., 2011; Cuong and Sima’an, 2014] show that bigger data does not mean better translation!

Word Alignment with a Mix-of-Domains Haystack
Why? A haystack means too many competing translations:
maestra → master (computer); maestra → teacher (education); maestra → dean (education); maestra → crack (other); maestra → ...
Suboptimal alignment quality has been observed repeatedly [Gao et al., 2011; Bach et al., 2008; Banerjee et al., 2012].

How can we overcome this problem?

Disentangling the Subdomains
Figure: the Statistical Alignment Framework (a), Bilingual Data → Model → Viterbi Decoding, vs. the Statistical Latent Domain Alignment Framework (b), Bilingual Data → Model_1 ... Model_i ... Model_K (one per Domain_1 ... Domain_i ... Domain_K) → Viterbi Decoding.

Disentangling the Subdomains
Technical contributions:
“Splitting” the alignment statistics P(f, a | e) into domain-sensitive alignment statistics P(f, a | e, D), with a latent variable D.
Combining the domain-sensitive alignment statistics.

“Splitting” alignment statistics
Figure: the HMM alignment model (a), with an observed layer (source words f_{j-1}, f_j, f_{j+1}) and a latent alignment layer (alignment variables a_{j-1}, a_j, a_{j+1} pointing to target words).

“Splitting” alignment statistics
Figure: the latent domain HMM alignment model. An additional latent layer, the domain variable D, is conditioned on by both of the other two layers (the observed source words and the latent alignments).
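The latent-domain HMM factorization can be sketched in a few lines. This is not the authors' implementation; the table layout (`trans`, `jump`) and smoothing constant are assumptions, chosen only to show how every parameter becomes additionally indexed by the domain D:

```python
# Minimal sketch of the latent-domain HMM factorization
#   P(f, a | e, D) = prod_j P(a_j | a_{j-1}, D) * P(f_j | e_{a_j}, D),
# where both the jump (transition) and translation (emission) tables
# are indexed by the latent domain D.

def score_alignment(f, e, a, D, trans, jump):
    """Score source sentence f aligned to target e via alignment a, given domain D.

    trans[D][(f_word, e_word)] : domain-sensitive translation probability
    jump[D][delta]             : domain-sensitive jump probability
    """
    p = 1.0
    prev = 0
    for j, fj in enumerate(f):
        p *= jump[D].get(a[j] - prev, 1e-12)      # transition depends on D
        p *= trans[D].get((fj, e[a[j]]), 1e-12)   # emission depends on D
        prev = a[j]
    return p
```

With domain-split tables, the same sentence pair can score very differently under different domains, which is exactly the disambiguation effect the model is after.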

Likelihood
L ∝ ∏_{(f, e)} Σ_D P(D) [ P(f | e, D) P(e | D) + P(e | f, D) P(f | D) ]
A joint model combining language models and translation models.
Unfortunately, it is too complex to train (we cannot learn it from scratch for now!).
Deep Neural Networks might help (as suggested during the talk)!
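The per-pair mixture term above can be sketched directly. The component models are passed in as placeholder callables (an assumption for illustration, not the paper's parameterization):

```python
# Sketch of the mixture likelihood for a single sentence pair (f, e):
#   L(f, e) = sum_D P(D) * [ P(f|e,D) P(e|D) + P(e|f,D) P(f|D) ]
# tm_fe / tm_ef are translation models, lm_e / lm_f are language models,
# all domain-conditioned and supplied by the caller.

def pair_likelihood(f, e, domains, prior, tm_fe, lm_e, tm_ef, lm_f):
    total = 0.0
    for D in domains:
        total += prior[D] * (tm_fe(f, e, D) * lm_e(e, D) +
                             tm_ef(e, f, D) * lm_f(f, D))
    return total
```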

Learning
Our temporary solution: EM with partial supervision.
Number of domains: D takes values in {1, ..., N+1}, where N is the number of available seed samples whose domain we know in advance, plus one so-called “out-domain”.
Parameter constraints: we keep the domain prior parameters fixed for all sentence pairs that belong to the seed samples.
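The partially supervised E-step can be sketched as follows. The function names and the `seed_domain` argument are assumptions for illustration; the point is that domain responsibilities are inferred for unlabeled pairs but clamped to the known label for seed samples:

```python
# Sketch of a partially supervised E-step over domains: posterior
# responsibilities P(D | f, e) are computed for unlabeled sentence pairs,
# but kept fixed (one-hot) for seed samples with a known domain.

def domain_posteriors(pair, domains, prior, likelihood, seed_domain=None):
    if seed_domain is not None:
        # Seed sample: the domain assignment stays fixed at its label.
        return {D: (1.0 if D == seed_domain else 0.0) for D in domains}
    scores = {D: prior[D] * likelihood(pair, D) for D in domains}
    Z = sum(scores.values())
    return {D: s / Z for D, s in scores.items()}
```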

Combining domain-sensitive alignment statistics
â = argmax_a Σ_D P(f, a, D | e)
  = argmax_a Σ_D P(f, a | e, D) P(D | e)
  = argmax_a Σ_D P(f, a | e, D) P(e | D) P(D).
Unfortunately, this decoding problem is NP-hard (see [DeNero and Macherey, 2011; Chang et al., 2014]).

Combining domain-sensitive alignment statistics
â = argmax_a Σ_D P(f, a | e, D) P(e | D) P(D).
Two potential solutions:
A Lagrangian relaxation-based decoder (which we would really rather not implement!).
Defining an approximate objective function, e.g., a lower bound of the sum (this work!):
â = argmax_a max_D P(f, a | e, D) P(e | D) P(D).

Data Preparation
Figure: the mixed corpus C_mix: Legal, Pharmacy, and Hardware subsets, plus the rest (3.7M).
We train the latent domain alignment model with prior knowledge derived from the domain information of the three subsets, and compare alignment accuracy to the baseline.

Alignment results (1 Million)
Model     Prior             Prec.↑  Rec.↑  AER↓
Baseline  -                 66.95   61.29  36.00
Latent    Pharmacy (100K)   67.85   61.72  35.36
Latent    Legal (100K)      67.57   62.29  35.17
Latent    Hardware (100K)   69.41   63.58  33.63
Latent    ALL (300K)        69.64   63.30  33.68

Alignment results (2 Million)
Model     Prior             Prec.↑  Rec.↑  AER↓
Baseline  -                 68.34   61.58  35.22
Latent    Pharmacy (100K)   68.85   62.58  34.43
Latent    Legal (100K)      69.98   64.01  33.13
Latent    Hardware (100K)   69.45   63.23  33.81
Latent    ALL (300K)        71.51   63.87  32.53

Alignment results (4 Million)
Model     Prior             Prec.↑  Rec.↑  AER↓
Baseline  -                 69.37   64.30  33.26
Latent    Pharmacy (100K)   69.69   62.80  33.94
Latent    Legal (100K)      70.51   63.94  32.93
Latent    Hardware (100K)   71.75   64.44  32.10
Latent    ALL (300K)        72.16   64.30  31.99

Discussion
Word alignment should involve latent concepts representing the domains of the data.
We presented the benefits: with latent domains, the more we know about the data, the more we can improve performance.
We strongly believe this should apply to any statistical model, not just alignment models.
Challenge: can we learn the latent domain (alignment) models from scratch?

Bibliography
Amittai Axelrod, Xiaodong He, and Jianfeng Gao.
Domain adaptation via pseudo in-domain data selection.
In Proceedings of EMNLP, 2011.
Nguyen Bach, Qin Gao, and Stephan Vogel.
Improving word alignment with language model based confidence scores.
In Proceedings of the Third Workshop on Statistical Machine Translation, 2008.
Pratyush Banerjee, Sudip Kumar Naskar, Johann Roturier, Andy Way, and Josef van Genabith.
Translation quality-based supplementary data selection by incremental update of translation models.
In Proceedings of COLING, pages 149–166, 2012.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer.
The mathematics of statistical machine translation: parameter estimation.
Computational Linguistics, 1993.
Yin-Wen Chang, Alexander M. Rush, John DeNero, and Michael Collins.
A constrained viterbi relaxation for bidirectional word alignment.
In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Association
for Computational Linguistics, 2014.
URL http://www.aclweb.org/anthology/P/P14/P14-1139.
Hoang Cuong and Khalil Sima’an.
Latent domain translation models in mix-of-domains haystack.
In Proceedings of COLING, 2014.
John DeNero and Klaus Macherey.
Model-based aligner combination using dual decomposition.
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human
Language Technologies - Volume 1. Association for Computational Linguistics, 2011.
URL http://dl.acm.org/citation.cfm?id=2002472.2002526.
Qin Gao, Will Lewis, Chris Quirk, and Mei-Yuh Hwang.
Incremental training and intentional over-fitting of word alignment.
In Proceedings of MT Summit XIII. Asia-Pacific Association for Machine Translation, September 2011.
URL http://research.microsoft.com/apps/pubs/default.aspx?id=153368.
Robert C. Moore and William Lewis.
Intelligent selection of language model training data.
In Proceedings of the ACL 2010 Conference Short Papers, ACLShort ’10, pages 220–224, Stroudsburg, PA,
USA, 2010. Association for Computational Linguistics.
URL http://dl.acm.org/citation.cfm?id=1858842.1858883.