Institut für Anthropomatik, 24.06.13, Simon Hummel – Lehrstuhl Prof. Waibel
Grammatical Agreement in SMT
Seminar Sprach-zu-Sprach-Übersetzung (Speech-to-Speech Translation)
Summer semester (SS) 2013
Inflection and Agreement
Inflection
– Modification of a word
– Signals grammatical variants (tense, gender, case, …)
– e.g. walk vs. walked
Agreement
– The inflections of related words in a sentence have to agree
– e.g. das Haus (correct) vs. *die Haus (article gender does not match the noun)
Some languages are weakly inflected (e.g. English)
Some are highly inflected (e.g. German, Arabic, …)
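As a minimal sketch of what "has to agree" means, the check below compares the gender marked by a German definite article with the gender of the noun. The two small dictionaries are hypothetical toy data (nominative singular only), not a real lexicon.

```python
# Toy data (hypothetical): grammatical gender of a few German nouns.
NOUN_GENDER = {"Haus": "neut", "Katze": "fem", "Hund": "masc"}
# Nominative singular definite articles and the gender each one marks.
ARTICLE_GENDER = {"der": "masc", "die": "fem", "das": "neut"}


def agrees(article: str, noun: str) -> bool:
    """Return True if the article's gender matches the noun's gender."""
    return ARTICLE_GENDER[article] == NOUN_GENDER[noun]


print(agrees("das", "Haus"))  # True
print(agrees("die", "Haus"))  # False
```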
Agreement Errors
Local Agreement Errors (F = feminine, M = masculine)
Ref: the-car_F go_F with-speed
Hypo: the-car_F go_M with-speed
Long-distance Agreement Errors
Ref: celle qui parle , c’est ma femme
     one_F who speak , is my wife_F
Hypo: celui qui parle est ma femme
     one_M who speak is my spouse_F
Approaches for SMT
Morphological Generation
– Reduce words to stems, then apply predicted inflections
Agreement Constraints
– Use a synchronous context-free grammar (SCFG) and add constraints on the target side
Class-based Agreement Model
– Use morphological word classes, e.g. “Noun+Def+Sg+Fem”
Morphological Generation: Idea
“Generating Complex Morphology for Machine Translation” (Minkov and Toutanova, 2007)
Convert MT output to stem sequence
Predict an inflection for every stem
Predicted inflections should reflect the meaning and comply with agreement rules
Morphological Generation: Lexicons
Lexicons provide morphological analysis and generation
Operations:
– Stemming
– Inflection
– Morphological analysis
Can be created manually or automatically from data
Here: assumed to be given
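The three lexicon operations can be pictured as lookups in a table of (surface form, stem, tags) entries. The sketch below is only an illustration under that assumption; the `Lexicon` class, its method names, and the English toy entries are hypothetical and not taken from the paper.

```python
from collections import defaultdict

# Hypothetical toy entries: (surface form, stem, morphological tags).
ENTRIES = [
    ("walk", "walk", "Verb+Pres"),
    ("walked", "walk", "Verb+Past"),
    ("walks", "walk", "Verb+Pres+3Sg"),
    ("house", "house", "Noun+Sg"),
    ("houses", "house", "Noun+Pl"),
]


class Lexicon:
    """Table-backed lexicon offering the three operations from the slide."""

    def __init__(self, entries):
        self._by_form = defaultdict(list)  # form -> [(stem, tags)]
        self._by_stem = defaultdict(list)  # stem -> [(form, tags)]
        for form, stem, tags in entries:
            self._by_form[form].append((stem, tags))
            self._by_stem[stem].append((form, tags))

    def stems(self, form):
        """Stemming: all stems a surface form can belong to."""
        return sorted({stem for stem, _ in self._by_form.get(form, [])})

    def inflections(self, stem):
        """Inflection: all surface forms that share the given stem."""
        return sorted({form for form, _ in self._by_stem.get(stem, [])})

    def analyses(self, form):
        """Morphological analysis: tag strings for a surface form."""
        return sorted({tags for _, tags in self._by_form.get(form, [])})


lexicon = Lexicon(ENTRIES)
print(lexicon.stems("walked"))       # ['walk']
print(lexicon.inflections("walk"))   # ['walk', 'walked', 'walks']
print(lexicon.analyses("houses"))    # ['Noun+Pl']
```

The set returned by `inflections` plays the role of the candidate set from which an inflection is later predicted for each stem.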
Morphological Generation: Inflection Prediction
Maximum Entropy Markov Model (2nd order)
Features:
– Monolingual
– Bilingual
– Lexical
– Morphological
– Syntactic
$$p(\bar{y} \mid \bar{x}) = \prod_{t=1}^{n} p(y_t \mid y_{t-1}, y_{t-2}, x_t), \qquad y_t \in I_t$$
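To make the factorization concrete, the sketch below searches for the label sequence that maximizes the product of local terms p(y_t | y_{t-1}, y_{t-2}, x_t), restricting each y_t to its candidate set I_t. It is a sketch under stated assumptions: `decode`, `toy_local_prob`, and the German toy data are hypothetical, and the toy scores are unnormalized stand-ins for a trained maximum-entropy distribution (they also ignore x_t).

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

START = "<s>"  # padding label used before the first position


def decode(
    stems: Sequence[str],
    candidates: Sequence[Sequence[str]],  # I_t: allowed inflections per stem
    local_prob: Callable[[str, str, str, str], float],
) -> Tuple[List[str], float]:
    """Maximize prod_t local_prob(y_t, y_{t-1}, y_{t-2}, x_t) by dynamic programming."""
    # State = (y_{t-2}, y_{t-1}); value = (log score, best sequence so far).
    best: Dict[Tuple[str, str], Tuple[float, List[str]]] = {(START, START): (0.0, [])}
    for x_t, options in zip(stems, candidates):
        nxt: Dict[Tuple[str, str], Tuple[float, List[str]]] = {}
        for (y2, y1), (logp, seq) in best.items():
            for y in options:
                p = local_prob(y, y1, y2, x_t)
                if p <= 0.0:
                    continue  # impossible continuation
                score = logp + math.log(p)
                key = (y1, y)
                if key not in nxt or score > nxt[key][0]:
                    nxt[key] = (score, seq + [y])
        if not nxt:
            raise ValueError("no labelling with non-zero score")
        best = nxt
    logp, seq = max(best.values(), key=lambda item: item[0])
    return seq, math.exp(logp)


# Hypothetical toy data: gender marked by each candidate form.
GENDER = {
    "der": "m", "die": "f", "das": "n",
    "kleiner": "m", "kleine": "f", "kleines": "n",
    "Katze": "f",
}


def toy_local_prob(y: str, y1: str, y2: str, x_t: str) -> float:
    """Unnormalized toy score: reward matching the gender of the two previous labels."""
    score = 1.0
    for prev in (y1, y2):
        if prev != START:
            score *= 0.8 if GENDER[y] == GENDER[prev] else 0.1
    return score


STEMS = ["die", "klein", "Katze"]
CANDIDATES = [["der", "die", "das"], ["kleiner", "kleine", "kleines"], ["Katze"]]
best_seq, best_score = decode(STEMS, CANDIDATES, toy_local_prob)
print(best_seq)  # ['die', 'kleine', 'Katze']
```

Because each term looks back two labels, the search state has to remember the last two predictions; that is the only difference from first-order Viterbi decoding.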
Morphological Generation: Evaluation
English-Russian and English-Arabic
Technical (software manual) domain
Input: aligned sentence pairs of reference translations (not MT system output), to reduce noise
Accuracy (%) results
Morphological Generation: Conclusion
Needed resources:
– Large corpus of aligned sentence pairs
– Lexicons (source and target) with the three operations
+ Better accuracy than simple LM (even with small training data)
+ Easy to add to existing MT system
- Expensive creation of lexicons
Constraints: Idea
“Agreement Constraints for Statistical Machine Translation into German” (Williams and Koehn, 2011)
String-to-tree model
Synchronous grammar for target language
Adding learned constraints and probabilities
Evaluation of constraints during decoding
Constraints: Feature Structure
Feature structure
Unification
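Unification merges two feature structures when their values are compatible and fails when they conflict. The sketch below assumes feature structures are nested dictionaries with atomic string values; the `unify` function and the example features are hypothetical illustrations, not the representation used in the paper.

```python
from typing import Optional


def unify(a: dict, b: dict) -> Optional[dict]:
    """Unify two feature structures; return None if any feature conflicts."""
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            result[key] = b_val  # feature only present in b: copy it over
        elif isinstance(result[key], dict) and isinstance(b_val, dict):
            sub = unify(result[key], b_val)  # recurse into nested structures
            if sub is None:
                return None
            result[key] = sub
        elif result[key] != b_val:
            return None  # atomic values clash
    return result


# Hypothetical example structures for a determiner and two nouns.
det = {"agr": {"gender": "fem", "num": "sg"}}
noun_ok = {"agr": {"gender": "fem", "num": "sg", "case": "nom"}}
noun_bad = {"agr": {"gender": "neut", "num": "sg"}}

print(unify(det, noun_ok))   # {'agr': {'gender': 'fem', 'num': 'sg', 'case': 'nom'}}
print(unify(det, noun_bad))  # None
```

A failed unification corresponds to an agreement violation; a successful one yields a structure carrying the combined information.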
Constraints: Grammar
Synchronous grammar learned from parallel corpus
Extended by constraints at target-side
Sample rule/constraint:
NP-SB → the X_1 cat | die AP_1 Katze
Constraints: Training
Propagation rules capture NP/PP agreement
The rules are applied bottom-up
Constraints: Decoding
Model:
$$\hat{t} = \arg\max_{t} p(t \mid s), \qquad p(t \mid s) = \frac{1}{Z} \sum_{i=1}^{n} \lambda_i h_i(s, t)$$
Every element of a rule/constraint has a feature structure
Constraint evaluation: each hypothesis stores the set of feature structures corresponding to its root rule element
Recombination of hypotheses is possible
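The decision rule in the model above picks the translation with the highest weighted feature sum. The sketch below shows that selection on two candidates; the feature names, weights, and values are hypothetical. The `posterior` helper additionally assumes the common log-linear reading in which the weighted sum is exponentiated before normalizing by Z, which is an assumption and not something the slide states.

```python
import math
from typing import Dict


def linear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted feature sum: sum_i lambda_i * h_i(s, t)."""
    return sum(weights[name] * value for name, value in features.items())


def best_translation(candidates: Dict[str, Dict[str, float]],
                     weights: Dict[str, float]) -> str:
    """Return the candidate translation with the highest weighted sum."""
    return max(candidates, key=lambda t: linear_score(candidates[t], weights))


def posterior(candidates: Dict[str, Dict[str, float]],
              weights: Dict[str, float]) -> Dict[str, float]:
    """Assumed log-linear reading: p(t|s) = exp(score) / Z over the candidate set."""
    scores = {t: linear_score(f, weights) for t, f in candidates.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}


# Hypothetical weights and feature values (log-domain model scores).
WEIGHTS = {"lm": 1.0, "tm": 0.5, "agreement": 2.0}
CANDIDATE_FEATURES = {
    "die Katze": {"lm": -2.0, "tm": -1.0, "agreement": 0.0},
    "das Katze": {"lm": -2.5, "tm": -1.0, "agreement": -1.0},
}

chosen = best_translation(CANDIDATE_FEATURES, WEIGHTS)
probs = posterior(CANDIDATE_FEATURES, WEIGHTS)
print(chosen)  # die Katze
```

Since exponentiation is monotonic and Z is the same for every candidate, the selected translation depends only on the weighted sum under that reading.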
Constraints: Evaluation
English-German
Europarl and News Commentary
Parsing: BitPar; Alignment: GIZA++; SCFG rules: Moses toolkit
Treebank for target
Grammar: ~140 million rules
BLEU scores and p-values for three test sets
Constraints: Conclusion
Needed resources:
– Parallel corpus
– Heuristics for constraint extraction
+ Improvement in translation accuracy
- Improvement is quite small
Class-Based: Idea
“A Class-Based Agreement Model for Generating Accurately Inflected Translations” (Green and DeNero, 2012)
Applied during decoding
Operates on the target side
Three steps:
1. Segmentation
2. Tagging
3. Scoring
Class-Based: Segmentation
Train a conditional random field (CRF) over characters
Features:
– Centered 5-character window
Runs during decoding, not as a preprocessing step
Labels:
– I: Continuation (inside)
– O: Outside (whitespace)
– B: Beginning
– F: Non-native characters
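The sketch below illustrates the two mechanical parts of this step: extracting the centered 5-character window for a position, and turning a predicted label string into segments. The CRF itself is not shown. The helper names, the padding symbol, the romanized toy string, and the rule that a run of F characters forms its own segment are all assumptions made for illustration.

```python
from typing import List

PAD = "_"  # hypothetical padding symbol for positions outside the string


def window(text: str, i: int, size: int = 5) -> str:
    """Return the `size` characters centered on position i, padded at the edges."""
    half = size // 2
    padded = PAD * half + text + PAD * half
    return padded[i:i + size]


def segments(text: str, labels: str) -> List[str]:
    """Group characters into segments according to their I/O/B/F labels."""
    out: List[str] = []
    prev = "O"
    for ch, lab in zip(text, labels):
        if lab == "O":
            pass  # whitespace: belongs to no segment
        elif lab == "B" or prev == "O" or (lab == "F") != (prev == "F"):
            out.append(ch)  # start a new segment
        else:
            out[-1] += ch   # continue the current segment
        prev = lab
    return out


# Hypothetical romanized toy input with a hand-written label string.
TEXT = "alkitab x1"
LABELS = "BIBIIIIOFF"

print(window(TEXT, 0))         # __alk
print(window(TEXT, 3))         # lkita
print(segments(TEXT, LABELS))  # ['al', 'kitab', 'x1']
```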
Class-Based: Tagging
Train CRF on full sentences with gold classes
Features:
– Current and previous words, affixes, etc.
Labels:
– Morphological classes
→ Gender, number, person, definiteness
– e.g. 89 classes for Arabic
Example: 'the car' is tagged “Noun+Def+Sg+Fem”
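A rough sketch of the kind of per-token features listed above (current word, previous word, affixes), plus a helper that splits a class label into its components. The function names, the affix length of 2, and the boundary symbol are assumptions; the actual feature templates of the tagger are not given on the slide.

```python
from typing import Dict, List

BOS = "<s>"  # hypothetical placeholder for "no previous word"


def token_features(words: List[str], i: int, affix_len: int = 2) -> Dict[str, str]:
    """Features for position i: current word, previous word, prefix, suffix."""
    word = words[i]
    return {
        "word": word,
        "prev_word": words[i - 1] if i > 0 else BOS,
        "prefix": word[:affix_len],
        "suffix": word[-affix_len:],
    }


def parse_class(label: str) -> List[str]:
    """Split a morphological class such as 'Noun+Def+Sg+Fem' into its parts."""
    return label.split("+")


print(token_features(["the", "cars"], 1))
print(parse_class("Noun+Def+Sg+Fem"))  # ['Noun', 'Def', 'Sg', 'Fem']
```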
Class-Based: Scoring
Scores of word sequences are not comparable across hypotheses
→ Score class sequences with a generative model instead
Simple bigram LM trained on gold class sequences (add-1 smoothed)
$$\tau' = \arg\max_{\tau} p(\tau \mid \hat{s})$$
$$q(e) = p(\tau') = \prod_{i=1}^{I} p(\tau'_i \mid \tau'_{i-1})$$
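The scoring step can be sketched as an add-1 smoothed bigram model over class sequences. The class below is a minimal illustration; the `ClassBigramLM` name, the start symbol, and the three training sequences are hypothetical (only “Noun+Def+Sg+Fem” appears on the slides).

```python
import math
from collections import Counter
from typing import Iterable, List

BOS = "<s>"  # hypothetical start symbol standing in for the class before position 1


class ClassBigramLM:
    """Bigram language model over morphological classes with add-1 smoothing."""

    def __init__(self, sequences: Iterable[List[str]]):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for seq in sequences:
            prev = BOS
            for tag in seq:
                self.bigrams[(prev, tag)] += 1
                self.contexts[prev] += 1
                self.vocab.add(tag)
                prev = tag

    def prob(self, tag: str, prev: str) -> float:
        """Add-1 smoothed estimate of p(tag | prev)."""
        return (self.bigrams[(prev, tag)] + 1) / (self.contexts[prev] + len(self.vocab))

    def log_score(self, seq: List[str]) -> float:
        """log q(e) = sum_i log p(tau_i | tau_{i-1})."""
        total, prev = 0.0, BOS
        for tag in seq:
            total += math.log(self.prob(tag, prev))
            prev = tag
        return total


# Hypothetical gold class sequences.
GOLD = [
    ["Noun+Def+Sg+Fem", "Verb+Sg+Fem"],
    ["Noun+Def+Sg+Fem", "Verb+Sg+Fem"],
    ["Noun+Def+Sg+Masc", "Verb+Sg+Masc"],
]
lm = ClassBigramLM(GOLD)

agreeing = ["Noun+Def+Sg+Fem", "Verb+Sg+Fem"]
disagreeing = ["Noun+Def+Sg+Fem", "Verb+Sg+Masc"]
print(lm.prob("Verb+Sg+Fem", "Noun+Def+Sg+Fem"))             # 0.5
print(lm.log_score(agreeing) > lm.log_score(disagreeing))    # True
```

Smoothing keeps unseen class bigrams, such as a feminine noun followed by a masculine verb, at a small non-zero probability, so a disagreeing hypothesis is penalized but not ruled out.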
Class-Based: Evaluation
English-Arabic
Training data: variety of sources (e.g. web)
Development and test: NIST sets (newswire and mixed genre [broadcast news, newsgroups, weblog])
Phrase-based decoder
BLEU score for newswire sets
BLEU score for mixed genre sets
Class-Based: Conclusion
Needed resources:
– Treebank for the target language (available for many languages)
– Large target corpus
+ Improves translation quality
+ Easy to integrate in existing MT system
- Increases decoding time
- Not very good for mixed genres
References
Green, S. and DeNero, J. (2012). “A Class-Based Agreement Model for Generating Accurately Inflected Translations”. In: ACL.
Williams, P. and Koehn, P. (2011). “Agreement Constraints for Statistical Machine Translation into German”. In: Sixth Workshop on Statistical Machine Translation.
Minkov, E. and Toutanova, K. (2007). “Generating Complex Morphology for Machine Translation”. In: ACL.
