TAUS USER CONFERENCE 2010, What’s on the horizon? The research agenda
Kevin Knight, Senior Research Scientist and Fellow, Information Sciences Institute, Research Associate Professor, University of Southern California

A clear long-term vision motivates research in automatic language translation. The vision is that you read, write, listen, and speak in your own language, and computer software translates whenever necessary. Reading this paragraph but don't know English? No problem, the computer will translate. Launching a new product in Eastern Europe? No problem. Boyfriend doesn't speak Korean? No problem.

This is certainly one of the most compelling visions in computer science, and it has animated a great deal of research. How do we get from here to there? This talk will look at recent improvements, noting how ideas have moved from impractical to mainstream, as well as covering current problems and future directions.

Published in: Technology, Business
Transcript

  • 1. TAUS USER CONFERENCE 2010 LANGUAGE BUSINESS INNOVATION 4 – 6 OCTOBER / PORTLAND (OR), USA MONDAY 4 OCTOBER / 16.20 WHAT’S ON THE HORIZON? THE RESEARCH AGENDA Kevin Knight, University of Southern California
  • 2. Processing Human Language • 1940 – Computers were invented to process human language – Input: SRTIF FJKEL SEIPQ QERIU LMWNI … – Output: Troops on Midway have no water … • 2010 – Processing human language by machine is still exciting – Machine translation (speech and text), summarization, sentiment analysis, creative language generation, social network analysis … – Still a challenge: parsing, morphology, semantics, generation, many languages …
  • 3. Statistical Machine Translation [Diagram. Speech bubbles: "Translate, translate, translate …" and "Wow, I can learn a lot of translation patterns!" Labels: Human-translated documents; New Source Text; Machine Translation Output.]
  • 4. Scientific Papers on Statistical MT 1990-1999 (about 3 or 4 per year): A Comparison of Head Transducers and Transfer for a Limited Domain Translation Application; A DP Based Search Algorithm for Statistical Machine Translation; A DP Based Search Using Monotone Alignments in Statistical Translation; Aligning Clauses in Parallel Text; A Maximum Entropy Approach to Natural Language Processing; An Efficient Method for Determining Bilingual Word Classes; An IR Approach for Translating New Words from Nonparallel, Comparable Texts; A Polynomial-Time Algorithm for Statistical Machine Translation; A Statistical Approach to Machine Translation; A Statistical MT Tutorial Workbook; Automatic Acquisition of Hierarchical Transduction Models for Machine Translation; Automatically Creating Bilingual Lexicons for Machine Translation from Bilingual Text; Automatic Construction of Clean Broad-Coverage Translation Lexicons; Automatic Discovery of Non-Compositional Compounds; Automatic Identification of Word Translations from Unrelated English and German Corpora; Automating Knowledge Acquisition for Machine Translation; A Word-to-Word Model of Translational Equivalence; But Dictionaries Are Data Too; Decoding Algorithm in Statistical Machine Translation; Fast Document Translation for Cross-Language Information Retrieval; French Speech Recognition in an Automatic Dictation System for Translators: the TransTalk Project; HMM-Based Word Alignment in Statistical Translation; Improved Alignment Models for Statistical Machine Translation; Improving Statistical Natural Language Translation by Categories and Rules; Learning Parse and Translation Decisions from Examples with Rich Context; Machine Translation with a Stochastic Grammatical Channel; Maximum Entropy Model Learning of the Translation Rules; Modeling with Structures in Statistical Machine Translation; Resolving Translation Ambiguity using Non-parallel Bilingual Corpora; Robust Bilingual Word Alignment for Machine Aided Translation; Statistical Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora; The Candide System for Machine Translation; The Mathematics of Statistical Machine Translation: Parameter Estimation; Translation with Finite-State Devices; Word Clustering with Parallel Spoken Language Corpora; Word-Sense Disambiguation Using Statistical Methods
  • 5. Some Recent Papers (2006-2008) About 150 per year Alignment by Agreement Alignment-Based Discriminative String Similarity An Efficient Two-Pass Approach to Synchronous-CFG Driven Statistical MT An Empirical Study in Source Word Deletion for Phrase-Based Statistical Machine Translation An Empirical Study on Computing Consensus Translations from Multiple Machine Translation (Meta-) Evaluation of Machine Translation Systems A Clustered Global Phrase Reordering Model for Statistical Machine Translation An End-to-End Discriminative Approach to Machine Translation A Comparison of Pivot Methods for Phrase-Based Statistical Machine Translation An Integrated Architecture for Speech-Input Multi-Target Machine Translation A Comparison of Syntactically Motivated Word Alignment Spaces An Iteratively-Trained Segmentation-Free Phrase Translation Model for Statistical Machine A DOM Tree Alignment Model for Mining Parallel Data from the Web Translation A Dependency Treelet String Correspondence Model for Statistical Machine Translation Analysis and System Combination of Phrase- and N-Gram-Based Statistical Machine Translation A Discriminative Framework for Bilingual Word Alignment Systems A Discriminative Global Training Algorithm for Statistical MT Analysis of Statistical and Morphological Classes to Generate Weigthed Reordering Hypotheses A Discriminative Latent Variable Model for Statistical Machine Translation on a Statistical Machine Translation System A Discriminative Matching Approach to Word Alignment Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme A Discriminative Model for Tree-to-Tree Translation Conversion A Discriminative Syntactic Word Order Model for Machine Translation Applying Morphology Generation Models to Machine Translation A Framework for Incorporating Alignment Information in Parsing Arabic Preprocessing Schemes for Statistical Machine Translation A High-Accurate Chinese-English NE Backward Translation System Combining 
Both Lexical Arabic to French Sentence Alignment: Exploration of A Cross-language Information Retrieval Information and Web Statistics Approach A Linguistically Annotated Reordering Model for BTG-based Statistical Machine Translation Are Very Large N-Best Lists Useful for SMT? A Maximum Entropy Approach to Combining Word Alignments Automatic Assessment of Student Translations for Foreign Language Tutoring A Maximum Entropy Word Aligner for Arabic-English Machine Translation Automatic Evaluation and Uniform Filter Cascades for Inducing N-Best Translation Lexicons A New Approach for English-Chinese Named Entity Alignment Automatic Evaluation of Machine Translation Based on Rate of Accomplishment of Sub-Goals A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Automatic Generation of Translation Dictionaries Using Intermediary Languages Language Model BLANC: Learning Evaluation Metrics for MT A Phrase-Based HMM Approach to Document/Abstract Alignment BRUJA: Question Classification for Spanish. Using Machine Translation A Phrase-Based, Joint Probability Model for Statistical Machine Translation Better Alignments = Better Translations? 
A Probabilistic Approach to Syntax-based Reordering for Statistical Machine Translation BiTAM: Bilingual Topic AdMixture Models for Word Alignment A Projection Extension Algorithm for Statistical Machine Translation Bilingual Cluster Based Models for Statistical Machine Translation A Re-examination of Machine Learning Approaches for Sentence-Level MT Evaluation Bilingual-LSA Based LM Adaptation for Spoken Language Translation A Re-examination on Features in Regression Based Approach to Automatic MT Evaluation Binarizing Syntax Trees to Improve Syntax-Based Machine Translation Accuracy A Rule-Driven Dynamic Programming Decoder for Statistical MT Biology Based Alignments of Paraphrases for Sentence Compression A Scalable Decoder for Parsing-Based Machine Translation with Equivalent Language Model State Boosting Statistical Machine Translation by Lemmatization and Linear Interpolation Maintenance Boosting Statistical Word Alignment Using Labeled and Unlabeled Data A Sequence Alignment Model Based on the Averaged Perceptron Bootstrapping Lexical Choice via Multiple-Sequence Alignment A Small-Vocabulary Shared Task for Medical Speech Translation Bootstrapping Word Alignment via Word Packing A Smorgasbord of Features for Automatic MT Evaluation Bridging the Inflection Morphology Gap for Arabic Statistical Machine Translation A Systematic Comparison of Training Criteria for Statistical Machine Translation Building a Statistical Machine Translation System for French Using the Europarl Corpus A Translation Aid System with a Stratified Lookup Interface CCG Supertags in Factored Statistical Machine Translation A Translation Model for Sentence Retrieval CDER: Efficient MT Evaluation Using Block Movements A Tree Sequence Alignment-based Tree-to-Tree Translation Model Can the Internet help improve Machine Translation? A Tree-to-String Phrase-based Model for Statistical Machine Translation Can we Relearn an RBMT System? 
A Walk on the Other Side: Using SMT Components in a Transfer-Based Translation System Capitalizing Machine Translation A Wearable Headset Speech-to-Speech Translation System Chinese Syntactic Reordering for Statistical Machine Translation A Web-based Demonstrator of a Multi-lingual Phrase-based Translation System Chinese-English Term Translation Mining Based on Semantic Prediction ATLAS A New Text Alignment Architecture Chunk-Level Reordering of Source Language Sentences with Automatically Learned Rules for Adaptive Language and Translation Models for Interactive Machine Translation Statistical Machine Translation Alignment Link Projection Using Transformation-Based Learning Cognate Identification and Alignment Using Practical Orthographies
  • 6. Some Recent Papers (2006-2008) Cohesive Phrase-Based Decoding for Statistical Machine Translation Evaluating Task Performance for a Unidirectional Controlled Language Medical Speech Combination of Arabic Preprocessing Schemes for Statistical Machine Translation Translation System Combination of Statistical Word Alignments Based on Multiple Preprocessing Schemes Exact Decoding for Jointly Labeling and Chunking Sequences Combining Morphosyntactic Enriched Representation with n-best Reranking in Statistical Expanding Indonesian-Japanese Small Translation Dictionary Using a Pivot Language Translation Experiments in Discriminating Phrase-Based Translations on the Basis of Syntactic Coupling Combining Multiple Resources to Improve SMT-based Paraphrasing Model Features Combining Outputs from Multiple Machine Translation Systems Experiments in Domain Adaptation for Statistical Machine Translation Combining Source and Target Language Information for Name Tagging of Machine Translation Exploiting N-best Hypotheses for SMT Self-Enhancement Output Exploiting Variant Corpora for Machine Translation Comparing Reordering Constraints for SMT Using Efficient BLEU Oracle Computation Exploring Different Representational Units in English-to-Turkish Statistical Machine Translation Computational Complexity of Statistical Machine Translation Extending MARIE: an N-gram-based SMT decoder Computing Consensus Translation for Multiple Machine Translation Systems Using Enhanced Extentions to HMM-based Statistical Word Alignment Models Hypothesis Alignment Factored Translation Models Computing Term Translation Probabilities with Generalized Latent Semantic Analysis Fast, Easy, and Cheap: Construction of Statistical Machine Translation Models with MapReduce Computing Translation Units and Quantifying Parallelism in Parallel Dependency Treebanks Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity Constraining the Phrase-Based, Joint Probability 
Statistical Translation Model Finding Terminology Translations from Non-parallel Corpora Context-aware Discriminative Phrase Selection for Statistical Machine Translation First Steps towards a General Purpose French/English Statistical Machine Translation System Context-based Arabic Morphological Analysis for Machine Translation Forest Rescoring: Faster Decoding with Integrated Language Models Contextual Bitext-Derived Paraphrases in Automatic MT Evaluation Forest-Based Translation Continuous Space Language Models for Statistical Machine Translation Forest-to-String Statistical Translation Rules Converser (TM): Highly Interactive Speech-to-Speech Translation for Healthcare Four Techniques for Online Handling of Out-of-Vocabulary Words in Arabic-English Statistical Correcting ESL Errors Using Phrasal SMT Techniques Machine Translation Deep Grammars in a Tree Labeling Approach to Syntax-based Statistical Machine Translation From Machine Translation to Computer Assisted Translation using Finite-State Models Dependencies vs. 
Constituents for Tree-Based Alignment From Words to Corpora: Recognizing Translation Dependency-Based Automatic Evaluation for Machine Translation From indexing the biomedical literature to coding clinical text: experience with MTI and machine Design of the Moses Decoder for Statistical Machine Translation learning approaches Direct Translation Model 2 Further Meta-Evaluation of Machine Translation Discriminative Alignment Training without Annotated Data for Machine Translation Generalized Graphical Abstractions for Statistical Machine Translation Discriminative Reordering Models for Statistical Machine Translation Generalized Stack Decoding Algorithms for Statistical Machine Translation Discriminative Word Alignment via Alignment Matrix Modeling Generalizing Local Translation Models Discriminative Word Alignment with Conditional Random Fields Generalizing Word Lattice Translation Distortion Models for Statistical Machine Translation Generating Case Markers in Machine Translation Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Generating Complex Morphology for Machine Translation Translation Generation by Inverting a Semantic Parser that Uses Statistical Machine Translation Do we need phrases? 
Challenging the conventional wisdom in Statistical Machine Translation Generation in Machine Translation from Deep Syntactic Trees Domain Adaptation in Statistical Machine Translation with Mixture Modelling Generation of Word Graphs in Statistical Machine Translation Dynamic Model Interpolation for Statistical Machine Translation Getting the Structure Right for Word Alignment: LEAF Effects of Morphological Analysis in Translation between German and English Getting to Know Moses: Initial Experiments on German-English Factored Translation Efficient Algorithms for Richer Formalisms: Parsing and Machine Translation Going Beyond AER: An Extensive Analysis of Word Alignments and Their Impact on MT Efficient Decoding for Statistical Machine Translation with a Fully Expanded WFST Model Grammatical Machine Translation Efficient Dynamic Programming Search Algorithms for Phrase-Based SMT Grouping Multi-word Expressions According to Part-Of-Speech in Statistical Machine Translation Efficient Handling of N-gram Language Models for Statistical Machine Translation Guiding Statistical Word Alignment Models With Prior Knowledge Efficient Multi-Pass Decoding for Synchronous Context Free Grammars HMM Word and Phrase Alignment for Statistical Machine Translation Efficient Phrase-Table Representation for Machine Translation with Applications to Online MT Handling of Prepositions in English to Bengali Machine Translation and Speech Translation Hierarchical Phrase-Based Translation with Suffix Arrays Empirical Lower Bounds on the Complexity of Translational Equivalence Hierarchical System Combination for Machine Translation English-to-Czech Factored Machine Translation How Many Bits Are Needed To Store Probabilities for Phrase-Based Translation? 
Enriching Morphologically Poor Languages for Statistical Machine Translation Human Evaluation of Machine Translation Through Binary System Comparisons Enriching Spoken Language Translation with Dialog Acts Human Judgments in Parallel Treebank Alignment European Language Translation with Weighted Finite State Transducers: The CUED MT System for the 2008 ACL Workshop on SMT
  • 7. Some Recent Papers (2006-2008) ILR-Based MT Comprehension Test with Multi-Level Questions Local Phrase Reordering Models for Statistical Machine Translation Imposing Constraints from the Source Tree on ITG Constraints for SMT MAXSIM: A Maximum Similarity Metric for Machine Translation Evaluation Improved Alignment Models for Statistical Machine Translation METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Improved Discriminative Bilingual Word Alignment Judgments Improved Lexical Alignment by Combining Multiple Reified Alignments MT Evaluation: Human-Like vs. Human Acceptable Improved Statistical Machine Translation Using Paraphrases MTTK: An Alignment Toolkit for Statistical Machine Translation Improved Statistical Machine Translation by Multiple Chinese Word Segmentation MaTrEx: The DCU MT System for WMT 2008 Improved Tree-to-String Transducer for Machine Translation Machine Translation System Combination using ITG-based Alignments Improved Word-Level System Combination for Machine Translation Machine Translation as Tree Labeling Improving English-Spanish Statistical Machine Translation: Experiments in Domain Adaptation, Machine Translation between Turkic Languages Sentence Paraphrasing, Tokenization, and Recasing Machine Translation by Triangulation: Making Effective Use of Multi-Parallel Corpora Improving Multilingual Summarization: Using Redundancy in the Input to Correct MT errors Manual and Automatic Evaluation of Machine Translation between European Languages Improving Statistical MT through Morphological Analysis Maximum Entropy Based Phrase Reordering Model for Statistical Machine Translation Improving Statistical Machine Translation Performance by Training Data Selection and Measure Word Generation for English-Chinese SMT Systems Optimization Meta-Structure Transformation Model for Statistical Machine Translation Improving Statistical Machine Translation Using Word Sense Disambiguation Meteor, M-BLEU and 
M-TER: Evaluation Metrics for High-Correlation with Human Rankings of Improving Translation Quality by Discarding Most of the Phrasetable Machine Translation Output Improving Word Alignment Models using Structured Monolingual Corpor Microsoft Research Treelet Translation System: NAACL 2006 Europarl Evaluation Improving Word Alignment Using Syntactic Dependencies Minimum Bayes Risk Decoding for BLEU Improving Word Alignment with Bridge Languages Minimum Bayes-Risk Word Alignments of Bilingual Texts Improving Word Alignment with Language Model Based Confidence Scores Mining Key Phrase Translations from Web Corpora Incremental Hypothesis Alignment for Building Confusion Networks with Application to Machine Mining Parenthetical Translations from the Web by Word Alignment Translation System Combination Mitigation of Data Sparsity in Classifier-Based Translation Individuality and Alignment in Generated Dialogues Mixture-Model Adaptation for SMT Inducing Word Alignments with Bilexical Synchronous Trees Modelling Lexical Redundancy for Machine Translation Inductive Detection of Language Features via Clustering Minimal Pairs: Toward Feature-Rich Monolingual Machine Translation for Paraphrase Generation Grammars in Machine Translation Morpho-syntactic Arabic Preprocessing for Arabic to English Statistical Machine Translation Initial Explorations in English to Turkish Statistical Machine Translation Morpho-syntactic Information for Automatic Error Analysis of Statistical Machine Translation Inner-Outer Bracket Models for Word Alignment using Hidden Blocks Output Integration of Speech to Computer-Assisted Translation Using Finite-State Automata Moses: Open Source Toolkit for Statistical Machine Translation Integration of an Arabic Transliteration Module into a Statistical Machine Translation System Multi-Engine Machine Translation with an Open-Source SMT Decoder Inversion Transduction Grammar for Joint Phrasal Translation Modeling Multi-dimensional Annotation and Alignment in 
an English-German Translation Corpus Joint Morphological-Lexical Language Modeling for Machine Translation Multilingual Search for Cultural Heritage Archives via Combining Multiple Translation Resources Kernel Regression Based Machine Translation Multiple Alignment of Citation Sentences with Conditional Random Fields and Posterior Kernel Regression Framework for Machine Translation: UCL System Description for WMT 2008 Decoding Shared Translation Task Multiple Reorderings in Phrase-Based Machine Translation Keyword Translation Accuracy and Cross-Lingual Question Answering in Multiword Units in an MT Lexicon King Alfred: A Translation Environment for Learners of Anglo-Saxon English N-Gram Posterior Probabilities for Statistical Machine Translation Knowledge Sources for Word-Level Translation Models N-gram-based SMT System Enhanced with Reordering Patterns Labelled Dependencies in Machine Translation Evaluation NICT-ATR Speech-to-Speech Translation System Language Models and Reranking for Machine Translation NRC's PORTAGE System for WMT 2007 Large Language Models in Machine Translation NTT System Description for the WMT2006 Shared Task Latent Features in Automatic Tense Translation between Chinese and English Name Translation in Statistical Machine Translation - Learning When to Transliterate Learning Alignments and Leveraging Natural Logic Named Entities Translation Based on Comparable Corpora Learning Performance of a Machine Translation System: a Statistical and Computational Analysis NeurAlign: Combining Word Alignments Using Neural Networks Learning for Semantic Parsing with Statistical Machine Translation Ngram-Based Statistical Machine Translation Enhanced with Multiple Weighted Reordering Left-to-Right Target Generation for Hierarchical Phrase-Based Translation Hypotheses Leveraging Reusability: Cost-Effective Lexical Acquisition for Large-Scale Ontology Translation Online Large-Margin Training for Statistical Machine Translation Lexical-Functional 
Correspondences and Their Use in the System of Machine Translation ETAP-3 Optimal Constituent Alignment with Edge Covers for Semantic Projection Limsi's Statistical Translation Systems for WMT-08 Linguistic Features for Automatic Evaluation of Heterogenous MT Systems
  • 8. Some Recent Papers (2006-2008) Optimizing Chinese Word Segmentation for Machine Translation Performance Soft Syntactic Constraints for Hierarchical Phrased-Based Translation POSSLT: A Korean to English Spoken Language Translation System Soft Syntactic Constraints for Word Alignment through Discriminative Training Parallel Implementations of Word Alignment Tool Source-Language Features and Maximum Correlation Training for Machine Translation Part-of-Speech Tagging for Middle English through Alignment and Projection of Parallel Evaluation Diachronic Texts Spectral Clustering for Example Based Machine Translation Partial Matching Strategy for Phrase-based Statistical Machine Translation Speech Translation for Triage of Emergency Phonecalls in Minority Languages Phrasal Cohesion and Statistical Machine Translation Speech Translation with Grammatical Framework Phrase Reordering Model Integrating Syntactic Knowledge for SMT Speech to Speech Translation for Medical Triage in Korean Phrase-Based Backoff Models for Machine Translation of Highly Inflected Languages Speech to Speech Translation for Nurse Patient Interaction Phrase-Based SMT with Shallow Tree-Phrases Speech-Input Multi-Target Machine Translation Phrase-Based and Deep Syntactic English-to-Czech Statistical Machine Translation Statistical Machine Translation Using Coercive Two-Level Syntactic Transduction Phrasetable Smoothing for Statistical Machine Translation Statistical Machine Translation for Query Expansion in Answer Retrieval Pivot Language Approach for Phrase-Based Statistical Machine Translation Statistical Machine Translation through Global Lexical Selection and Sentence Reconstruction Prior Derivation Models For Formally Syntax-Based Translation Using Linguistically Syntactic Statistical Phrase-Based Models for Interactive Computer-Assisted Translation Parsing and Tree Kernels Statistical Post-Editing on SYSTRAN’s Rule-Based Translation System Probabilistic Synchronous Tree-Adjoining Grammars 
for Machine Translation: The Argument Statistical Significance Tests for Machine Translation Evaluation from Bilingual Dictionaries Statistical Transfer Systems for French-English and German-English Machine Translation Quasi-Synchronous Grammars: Alignment by Soft Projection of Syntactic Dependencies Stochastic Inversion Transduction Grammars for Obtaining Word Phrases for Phrase-based Query Translation in Chinese-English Cross-Language Information Retrieval Statistical Machine Translation Randomised Language Modelling for Statistical Machine Translation Stochastic Iterative Alignment for Machine Translation Evaluation Ranking vs. Regression in Machine Translation Evaluation Stochastic Language Generation Using WIDL-Expressions and its Application in Machine Rapid Portability among Domains in an Interactive Spoken Language Translation System Translation and Summarization Re-Usable Tools for Precision Machine Translation Sub-Sentential Alignment Using Substring Co-Occurrence Counts Re-evaluating Machine Translation Results with Paraphrase Support Supertagged Phrase-Based Statistical Machine Translation Re-evaluation the Role of Bleu in Machine Translation Research Synchronous Binarization for Machine Translation Realization of the Chinese BA-construction in an English-Chinese Machine Translation System Syntactic Re-Alignment Models for Machine Translation Recent Improvements in the CMU Large Scale Chinese-English SMT System Syntactic Reordering Integrated with Phrase-Based SMT Regression for Sentence-Level MT Evaluation with Pseudo References Syntax Augmented Machine Translation via Chart Parsing Relabeling Syntax Trees to Improve Syntax-Based Machine Translation Quality Syntax-Driven Learning of Sub-Sentential Translation Equivalents and Translation Rules from Reranking Translation Hypotheses Using Structural Properties Parsed Parallel Corpora Rich Source-Side Context for Statistical Machine Translation Tailoring Word Alignments to Syntactic Machine Translation 
Robust Bilingual Word Alignment for Machine Aided Translation TectoMT: Highly Modular MT System with Tectogrammatics Used as Transfer Layer Robust Word Sense Translation by EM Learning of Frame Semantics Text-Translation Alignment: Three Languages Are Better Than Two Rule-Based Translation with Statistical Phrase-Based Post-Editing The "Noisier Channel": Translation from Morphologically Complex Languages S-MINDS 2-Way Speech-to-Speech Translation System The Complexity of Phrase Alignment Problems SPMT: Statistical Machine Translation with Syntactified Target Language Phrases The Effect of Machine Translation on the Performance of Arabic-English Scalable Inference and Training of Context-Rich Syntactic Translation Models The Effect of Translation Quality in MT-Based Cross-Language Information Retrieval Searching for alignments in SMT. A novel approach based on an Estimation of Distribution The Hiero Machine Translation System: Extensions, Evaluation, and Analysis Algorithm The ISL Phrase-Based MT System for the 2007 ACL Workshop on Statistical Machine Translation Segment Choice Models: Feature-Rich Models for Global Distortion in Statistical Machine The LDV-COMBO system for SMT Translation The MetaMorpho Translation System Segmentation for English-to-Arabic Statistical Machine Translation The Role of Pseudo References in MT Evaluation Selective Phrase Pair Extraction for Improved Statistical Machine Translation The Syntax Augmented MT (SAMT) System at the Shared Task for the 2007 ACL Workshop on Semi-Supervised Training for Statistical Word Alignment Statistical Machine Translation Sentence Alignment for Monolingual Comparable Corpora The TALP-UPC Ngram-Based Statistical Machine Translation System for ACL-WMT 2008 Sentence Level Machine Translation Evaluation as a Ranking The University of Washington Machine Translation System for ACL WMT 2008 Simple Preposition Correspondence: A Problem in English to Indian Language Machine Towards Robust Context-Sensitive Sentence 
Alignment for Monolingual Corpora Translation Towards better Machine Translation Quality for the German-English Language Pairs Simultaneous English-Japanese Spoken Language Translation Based on Incremental Dependency Training Non-Parametric Features for Statistical Machine Translation Parsing and Transfer Translation Model Pruning via Usage Statistics for Statistical Machine Translation Smooth Bilingual $N$-Gram Translation Tree-to-String Alignment Template for Statistical Machine Translation
  • 9. Some Recent Papers (2006-2008) TwicPen: Hand-held Scanner and Translation Software for non-Native Readers; Two Tools for Creating and Visualizing Sub-sentential Alignments of Parallel Text; UCB System Description for the WMT 2007 Shared Task; Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora; Usability Issues in an Interactive Speech-to-Speech Translation System for Healthcare; Using Dependency Order Templates to Improve Generality in Translation; Using Information about Multi-word Expressions for the Word-Alignment Task; Using Moses to Integrate Multiple Rule-Based Machine Translation Engines into a Hybrid System; Using Paraphrases for Parameter Tuning in Statistical Machine Translation; Using RBMT Systems to Produce Bilingual Corpus for SMT; Using Shallow Syntax Information to Improve Word Alignment and Reordering for SMT; Using Syntactic Coupling Features for Discriminating Phrase-Based Translations (WMT-08 Shared Translation Task); Using Syntax to Improve Word Alignment Precision for Syntax-Based Machine Translation; Using Word-Dependent Transition Models in HMM-Based Word Alignment for Statistical Machine Translation; Using a Probabilistic Translation Model for Cross-Language Information Retrieval; Viterbi Based Alignment between Text Images and their Transcripts; What Can Syntax-Based MT Learn from Phrase-Based MT?; Word Alignment for Languages with Scarce Resources Using Bilingual Corpora of Other Language Pairs; Word Alignment in English-Hindi Parallel Corpus Using Recency-Vector Approach: Some Studies; Word Alignment of English-Chinese Bilingual Corpus Based on Chucks; Word Alignment via Quadratic Assignment; Word Sense Disambiguation Improves Statistical Machine Translation; Word-Level Confidence Estimation for Machine Translation using Phrase-Based Translation Models; Word-Sense Disambiguation for Machine Translation; XML-based Phrase Alignment in Parallel Treebanks; Yawat: Yet Another Word Alignment Tool. Whew … So there’s a great deal of research to talk about …
  • 10. What Do Users Care About? • MT has improved a lot since 1940!
  • 11. What Do Users Care About? • MT has improved a lot since 1940! • Blank stare ...
  • 12. What Do Users Care About? • MT has improved a lot since 1940! • Blank stare ... • Tough audience.
  • 13. My First User
  • 14. My First User
  • 15. My First User
  • 16. My First User Can I have it in pink?
  • 17. My First User Can I have it in pink? ??????? argh!!
  • 18. Listening to Users • Users want new languages. • Users want to know how good the MT is. • Users want some sort of guarantees. • Users want it in pink. • Users want to translate: – “I love you” (#1 most popular web-translation input) – “I hate you” (#6 most popular)
  • 19. Of Course, for Researchers, Rockets are Just Cool • 1960 – A: Why do we need rockets? – B: Huh? – A: What are rockets for? – B: Um, sending things into space. – A: Like what? – B: I don’t know. Sending people to the moon, to pick up rocks. – A: And do what with the rocks? – B: Hmm … can I get back to work … ? • 2010 – Space tourism? • 2100 – Mars colonies • 2400 – Humans on thousands of planets throughout the galaxy
  • 20. Of Course, for Researchers, Rockets are Just Cool • 1960 – A: Why do we need rockets? – B: Huh? – A: What are rockets for? – B: Um, sending things into space. – A: Like what? – B: I don’t know. Sending people to the moon, to pick up rocks. – A: And do what with the rocks? – B: Hmm … can I get back to work … ? • 2010 – Space tourism? • 2100 – Mars colonies • 2400 – Humans on thousands of planets throughout the galaxy • Animating research vision!
  • 21. Kinds of Research Vision • Straightforward • Esoteric • Epiphany-style • Naïve
  • 22. Straightforward Research Vision • Speak • Hear • Read • Write – in your own language • Computer translates when appropriate – screen, earbuds, goggles
  • 23. Esoteric Research Vision • Software agents, robots, and appliances will communicate with each other using English instead of specialized APIs. – Because English is a de facto standard – Because they have to talk to us anyway • Chinese software agents will speak Chinese • So MT software will be needed – Even if there are no humans in the picture
  • 24. Epiphany-Style Research Vision • In the future, being able to speak a foreign language will be like being able to multiply five-digit numbers in your head. some talent + lots of practice = a weird skill
  • 25. Naïve Research Vision • My MT technology will enhance world peace • And I can use it to talk to Russian girls
  • 26. Back to Reality! Let’s look at some translations …
  • 27. Translations 1. China seized 186 tons of American beef. 2. China seizes 180 tons of US beef. 3. China has detained 186 tons of American beef. 4. China confiscates 186 tons of U.S. beef. 5. China has seized 186 tons of US beef.
  • 28. Translations 1. China seized 186 tons of American beef. 2. China seizes 180 tons of US beef. 3. China has detained 186 tons of American beef. 4. China confiscates 186 tons of U.S. beef. 5. China has seized 186 tons of US beef. machine translator
  • 29. Translations 1. China seized 186 tons of American beef. human translator 2. China seizes 180 tons of US beef. 3. China has detained 186 tons of American beef. 4. China confiscates 186 tons of U.S. beef. 5. China has seized 186 tons of US beef. machine translator
  • 30. Translations 印度 目前 共 有 74 种 控价 药 , 增加 后 的 控价 药品 将 占 印度 所售 药品 的 40% 以上 。 Machine translation: Currently, a total of 74 types of medicine prices increased after the price of medicines will account for more than 40 per cent of medicines sold by India.
  • 31. What Should Happen: [Syntax tree over the English output, with callouts “re-order rule” and “monotone rule”; its leaves read: currently , India has a total of 74 types of pc drugs , and pc drugs after the increase will account for more than 40 % of drugs sold in India] 印度 目前 共 有 74 种 控价 药 , 增加 后 的 控价 药品 将 占 印度 所售 药品 的 40% 以上 。 India currently total has 74 type price-contrl drug , increase after de price-contrl drug will occupy India sold by drug de 40% more than
  • 32. Translations Machine Translation: The Moroccan monarch said, “I hope that the year 2004 is coupled with the victory of peace in all areas of the world, which suffer from thorny conflicts.”
  • 33. Translations 1. Israel plans to build nine additional settlements in the Golan Heights which it seized from Syria in 1967. 2. Israel plans to establish 9 new reclamation districts in 1976 on its occupied Golan Heights from Syria. 3. Israel plans to create nine new settlements in the occupied Golan Heights in 1967.
  • 34. Translations Machine Translation: “The visit is to review the last information we collected information as a starting point … we will try to put clips portfolio.”
  • 35. Translations six unit Iraq civilian today in Iraq south part possessive protest in , suffer police and UK troops shot killed . Machine Translation: Police and British troops shot and killed six Iraqi civilians in protests in southern Iraq today.
  • 36. We’ve Come a Long Way • 1999: – Statistical MT very slow, one page per day – Quality was not good compared to established systems • 2010: – Statistical MT runs fast enough – Quality is state-of-the-art, and improving
  • 37. We also Have Lots of Languages • Western Europe – Company X: Danish, Dutch, Finnish, French, German, Greek, Italian, Norwegian, Portuguese, Spanish, Swedish – Company Y: Catalan, Danish, Dutch, French, Galician, German, Greek, Icelandic, Irish, Italian, Maltese, Norwegian, Polish, Portuguese, Spanish, Swedish, Welsh • Eastern Europe – Company X: Bulgarian, Czech, Hungarian, Polish, Romanian, Russian, Serbian, Turkish – Company Y: Albanian, Belarusian, Bulgarian, Croatian, Czech, Estonian, Finnish, Hungarian, Latvian, Lithuanian, Macedonian, Romanian, Russian, Serbian, Slovak, Slovenian, Turkish, Ukrainian, Yiddish • Middle East & Africa – Company X: Arabic, Hausa, Hebrew, Pashto, Persian, Somali, Urdu – Company Y: Afrikaans, Arabic, Hebrew, Persian, Swahili • Asia – Company X: Chinese, Hindi, Indonesian, Japanese, Korean, Thai – Company Y: Chinese, Filipino, Hindi, Indonesian, Japanese, Korean, Malay, Thai
  • 38. Accuracy Gains • Bleu score for immediate feedback on research ideas • Syntax – Synchronous binarization – Re-structuring, re-labeling, re-aligning – Dependency language models • Minimum Bayes risk • System combination – Large gains from combining 5, 10, 20 systems – Also critical to NetFlix challenge and protein folding contests • Search error reduction • More use of context – Word translation probabilities differ in different contexts • Non-contiguous phrases • Discriminative training methods • Web-scale parallel data and 1000 other good ideas that didn’t work!
  • 39. Bleu Supports Research [Chart: Translation Accuracy (Bleu) on the vertical axis, ticks from 15 to 35; time on the horizontal axis, Mar 1, Apr 1, May 1, 2005]
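To make the idea of a score that gives "immediate feedback" concrete, here is a minimal sketch of a Bleu-style sentence score in Python: clipped n-gram precision for n = 1 to 4, a geometric mean, and a brevity penalty. It is a simplification and not the exact metric behind the chart. Bleu is normally computed over a whole test corpus, and the add-one smoothing and the function names `ngrams` and `bleu` are assumptions made for illustration. The example sentences are taken from the Translations slides.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, references, max_n=4):
    """Simplified sentence-level Bleu-style score in [0, 1].

    Assumes a non-empty hypothesis and whitespace tokenization.
    """
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # Clip each hypothesis n-gram by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Add-one smoothing (an assumption) keeps the logarithm defined.
        log_precision += math.log((matched + 1) / (total + 1)) / max_n
    # Brevity penalty against the reference closest in length.
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(log_precision)


reference = "China seized 186 tons of American beef ."
good = bleu("China has seized 186 tons of US beef .", [reference])
bad = bleu("the cat sat on the mat today .", [reference])
print(good > bad)
```

Because the score is cheap to compute, a researcher can rerun it after every change to the system, which is the kind of feedback loop the slide describes.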
  • 40. Accuracy Gains • Bleu score for immediate feedback on research ideas • Syntax – Synchronous binarization – Re-structuring, re-labeling, re-aligning – Dependency language models • Minimum Bayes risk • System combination – Large gains from combining 5, 10, 20 systems – Also critical to NetFlix challenge and protein folding contests • Search error reduction • More use of context – Word translation probabilities differ in different contexts • Non-contiguous phrases • Discriminative training methods • More parallel and monolingual data and 1000 other good ideas that didn’t work!
  • 41. In 2010 “Large Scale Parallel Document Mining for MT” – Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner input: 2.5b web pages output: parallel pages, sentence aligned into 14b words of parallel training data. More accuracy, higher Bleu. Key technical point: Clever algorithms avoid having to do 2.5b x 2.5b page comparisons.
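One generic way to avoid comparing every page with every other page is an inverted index: only pages that share a rare feature ever become a candidate pair. The sketch below illustrates that idea in Python and is not a description of the paper's actual algorithm. The function `candidate_pairs`, the `max_postings` cutoff, and the toy documents are hypothetical, and the sketch assumes all pages have already been mapped into one language so that their tokens are comparable.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(docs, max_postings=3):
    """Return pairs of document ids that share at least one rare token.

    docs maps a document id to a list of tokens. Tokens found in more than
    max_postings documents are treated as too common to be informative.
    """
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in set(tokens):
            index[token].add(doc_id)

    pairs = set()
    for doc_ids in index.values():
        if 2 <= len(doc_ids) <= max_postings:
            pairs.update(combinations(sorted(doc_ids), 2))
    return pairs


# Toy data: "xx" pages stand for foreign pages already mapped to English.
docs = {
    "en1": "china seized 186 tons of american beef".split(),
    "xx1": "china has seized 186 tons of us beef".split(),
    "en2": "israel plans nine new settlements".split(),
    "xx2": "israel plans to build nine settlements".split(),
}
pairs = candidate_pairs(docs, max_postings=2)
print(sorted(pairs))
```

With four documents there are six possible pairs, but only the two pairs that share tokens are ever scored. The work grows with the size of the index entries, not with the square of the number of pages.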
  • 42. In 2010 “Discriminative Instance Weighting for Domain Adaptation in Statistical Machine Translation” – George Foster, Cyril Goutte, and Roland Kuhn In-domain parallel data only: Doesn’t take advantage of out-of-domain data. In-domain plus out-of-domain parallel data: Out-of-domain data messes up translations, often degrades accuracy. Weighting in-domain data can recover degraded accuracy. New method, key technical point: Weight individual phrase pairs with an involved formula. Result = +2.0 Bleu.
  • 43. In 2010 “Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation” – Erica Greene, Tugba Bodrumlu, and Kevin Knight Franz Och (2004): Welcome to Google Translate … we do not recommend translating poetry. [Pipeline diagram] input: Italian poem; Ita/Eng Phrase-Based SMT; Lattice of translations; Rhythm enforcement; output: English poem, fitting desired meter. Other components shown: Raw English poetry; Probabilistic dictionary with syllable/stress patterns (“prisoner” = S* S); Love poetry trigram model; Love poetry generator.
  • 44. Horizons (Now → Future) • Lots of parallel data for new domain → Learn from monolingual domain data • Lots of parallel data for new LP → Learn from monolingual language data • Hand-built word segmentation rules → Universal morphology → Morpho-syntactic translation • Syntax-based translation → Deep syntax (passives, empty elements) → Semantic translation • GIZA++ word alignment as pipeline component (17-year-old tech!) → Syntax-sensitive word alignment • Bleu diagnostic at corpus level → Bleu++ diagnostic at sentence level • Augment parallel data, inductive learning → One-shot learning from human feedback
  • 45. Morpho-Syntactic Models? • Syntax-based SMT can do this: – “Move the direct object before the verb.” • We can’t yet do this: – “Move the direct object before the verb, and mark it with accusative case” • Can we reason about letter sequences instead of just word sequences?
  • 46. Current Syntax-Based SMT RULE BASE • Very large wordform-to-wordform dictionary: q.JJ(red) <-> rojo; q.JJ(red) <-> roja; q.JJ(red) <-> rojos; q.JJ(red) <-> rojas; q.JJ(green) <-> verde; q.JJ(green) <-> verdes; q.N(cat) <-> gato; q.N(cats) <-> gatos; q.N(car) <-> coche; q.N(cars) <-> coches; q.N(moon) <-> luna; q.N(moons) <-> lunas; q.N(light) <-> luz; q.N(lights) <-> luzes; q.DT(a) <-> un; q.DT(a) <-> una • Simple syntactic combination: q.NP(x0:DT x1:JJ x2:N) <-> q.x0 q.x2 q.x1 • % echo 'NP(DT(a) JJ(red) N(car))' | tiburon -l -k 8 - sbmt.xlnts • OUTPUTS: una coche rojo # 1.0; un coche rojas # 1.0; una coche rojas # 1.0; una coche roja # 1.0; un coche rojos # 1.0; un coche rojo # 1.0; una coche rojos # 1.0; un coche roja # 1.0 • Overgeneration (rely on language model)
  • 47. Current Syntax-Based SMT Original input: NP(DT(a) JJ(red) N(car)) Transformation: q NP(DT(a) JJ(red) N(car))
  • 48. Current Syntax-Based SMT Original input: NP(DT(a) JJ(red) N(car)) Rule (0.2): q NP(x0:DT x1:JJ x2:N) → q x0 , q x2 , q x1 Transformation: q NP(DT(a) JJ(red) N(car))
  • 49. Current Syntax-Based SMT Original input: NP(DT(a) JJ(red) N(car)) Transformation: q DT(a) , q NN(car) , q JJ(red)
  • 50. Current Syntax-Based SMT Original input: NP(DT(a) JJ(red) N(car)) Rule (0.2): q DT(a) → una Transformation: q DT(a) , q NN(car) , q JJ(red)
  • 51. Current Syntax-Based SMT Original input: NP(DT(a) JJ(red) N(car)) Transformation: una , q NN(car) , q JJ(red)
  • 52. Current Syntax-Based SMT Original input: NP(DT(a) JJ(red) N(car)) Rule (0.2): q NN(car) → coche Transformation: una , q NN(car) , q JJ(red)
  • 53. Current Syntax-Based SMT Original input: NP(DT(a) JJ(red) N(car)) Transformation: una , coche , q JJ(red)
  • 54. Current Syntax-Based SMT Original input: NP(DT(a) JJ(red) N(car)) Transformation: una , coche , rojas
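The overgeneration in this walkthrough can be reproduced in a few lines of Python. The sketch below is not Tiburon and does not model rule weights. It simply enumerates every combination that the word-form dictionary and the single NP rule allow. The names `LEXICON`, `NP_ORDER`, and `translate_np` are hypothetical, and only the dictionary entries needed for "a red car" are copied from the rule base slide.

```python
from itertools import product

# Word-form entries copied from the rule base slide (subset for this input).
LEXICON = {
    ("DT", "a"): ["un", "una"],
    ("JJ", "red"): ["rojo", "roja", "rojos", "rojas"],
    ("N", "car"): ["coche"],
}

# q.NP(x0:DT x1:JJ x2:N) <-> q.x0 q.x2 q.x1, stored as the output order
# of the child indices: determiner, noun, adjective.
NP_ORDER = (0, 2, 1)


def translate_np(children):
    """Enumerate every output for an NP given as [(tag, word), ...]."""
    options = [LEXICON[children[i]] for i in NP_ORDER]
    return [" ".join(words) for words in product(*options)]


outputs = translate_np([("DT", "a"), ("JJ", "red"), ("N", "car")])
for out in outputs:
    print(out)
```

Two determiners times one noun times four adjective forms gives the eight outputs listed on the rule base slide. Nothing in the rules enforces gender or number agreement, so choosing "un coche rojo" is left to the language model.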
  • 55. Possible Morpho-Syntactic SMT RULE BASE • Compact root-to-root dictionary: qjo.red <-> r o j; qje.green <-> v e r d e; qnsmasc.car <-> c o c h e; qnsmasc.cat <-> g a t o; qnsfem.moon <-> l u n a; qnesfem.light <-> l u z; qdmasc.a <-> u n; qdfem.a <-> u n a • Morpho-syntax: qmasc.JJ(x0:) <-> qjo.x0 o; qmasc.JJ(x0:) <-> qje.x0; qmasc.N(x0:) <-> qnsmasc.x0; qmasc.N(x0:) <-> qnesmasc.x0; qplmasc.x0:JJ <-> qmasc.x0 s; qplmasc.N(x0: x1:pl) <-> qnsmasc.x0 s; qplmasc.N(x0: x1:pl) <-> qnesmasc.x0 e s; ...; q.NP(x0:DT x1:JJ x2:N) <-> qdmasc.x0 _ qmasc.x2 _ qmasc.x1 • Translation: % echo 'NP(DT(a) JJ(red) N(car))' | tiburon -l -k 1 - msmt.xlnts • OUTPUTS: u n _ c o c h e _ r o j o # 1.0 (no other outputs)
  • 56. Possible Morpho-Syntactic SMT Original input: NP(DT(a) JJ(red) N(car)) Transformation: q NP(DT(a) JJ(red) N(car))
  • 57. Possible Morpho-Syntactic SMT Original input: NP(DT(a) JJ(red) N(car)) Rule (0.2): q NP(x0:DT x1:JJ x2:N) → qdmasc x0 , _ , qmasc x2 , _ , qmasc x1 Transformation: q NP(DT(a) JJ(red) N(car))
  • 58. Possible Morpho-Syntactic SMT Original input: NP(DT(a) JJ(red) N(car)) Rule (0.2): q NP(x0:DT x1:JJ x2:N) → qdmasc x0 , _ , qmasc x2 , _ , qmasc x1 Transformation: qdmasc DT(a) , _ , qmasc N(car) , _ , qmasc JJ(red)
  • 59. Possible Morpho-Syntactic SMT Original input: NP(DT(a) JJ(red) N(car)) Rule (0.2): qdmasc DT(a) → u , n Transformation: qdmasc DT(a) , _ , qmasc N(car) , _ , qmasc JJ(red)
  • 60. Possible Morpho-Syntactic SMT Original input: NP(DT(a) JJ(red) N(car)) Rule (0.2): qdmasc DT(a) → u , n Transformation: u , n , _ , qmasc N(car) , _ , qmasc JJ(red)
  • 61. Possible Morpho-Syntactic SMT Original input: NP(DT(a) JJ(red) N(car)) Rule (0.2): qmasc JJ(x0) → qjo x0 , o Transformation: u , n , _ , qmasc N(car) , _ , qmasc JJ(red)
  • 62. Possible Morpho-Syntactic SMT Original input: NP(DT(a) JJ(red) N(car)) Rule (0.2): qmasc JJ(x0) → qjo x0 , o Transformation: u , n , _ , qmasc N(car) , _ , qjo red , o
  • 63. Possible Morpho-Syntactic SMT Original input: NP(DT(a) JJ(red) N(car)) Rule (0.2): qjo red → r , o , j Transformation: u , n , _ , qmasc N(car) , _ , qjo red , o
  • 64. Possible Morpho-Syntactic SMT Original input: NP(DT(a) JJ(red) N(car)) Rule (0.2): qjo red → r , o , j Transformation: u , n , _ , qmasc N(car) , _ , r , o , j , o
  • 65. Possible Morpho-Syntactic SMT Original input: NP(DT(a) JJ(red) N(car)) Transformation: u , n , _ , c , o , c , h , e , _ , r , o , j , o Possible to create new target words never seen in parallel data.
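The same toy input can be run through a sketch of the morpho-syntactic rule base to see why only one output survives. This is a plain Python approximation under stated assumptions, not Tiburon. States are passed as strings, rule weights are ignored, plural rules are omitted, and the determiner is looked up directly in the root dictionary. The names `ROOTS`, `MORPH`, `realize`, and `translate_np` are hypothetical.

```python
# Root dictionary from the slide: (state, English word) -> Spanish root letters.
ROOTS = {
    ("qjo", "red"): "roj",
    ("qje", "green"): "verde",
    ("qnsmasc", "car"): "coche",
    ("qnsmasc", "cat"): "gato",
    ("qnsfem", "moon"): "luna",
    ("qnesfem", "light"): "luz",
    ("qdmasc", "a"): "un",
    ("qdfem", "a"): "una",
}

# Morphological rules: (state, tag) -> alternative (root state, suffix) pairs.
MORPH = {
    ("qmasc", "JJ"): [("qjo", "o"), ("qje", "")],
    ("qmasc", "N"): [("qnsmasc", ""), ("qnesmasc", "")],
    ("qdmasc", "DT"): [("qdmasc", "")],  # simplification: direct root lookup
}


def realize(state, node):
    """Return every spelling the rules allow for one (tag, word) node."""
    tag, word = node
    results = []
    for root_state, suffix in MORPH.get((state, tag), []):
        root = ROOTS.get((root_state, word))
        if root is not None:  # alternatives without a matching root die out
            results.append(root + suffix)
    return results


def translate_np(children):
    """q.NP(x0:DT x1:JJ x2:N) <-> qdmasc.x0 _ qmasc.x2 _ qmasc.x1"""
    dt, jj, n = children
    outputs = []
    for det in realize("qdmasc", dt):
        for noun in realize("qmasc", n):
            for adj in realize("qmasc", jj):
                letters = list(det) + ["_"] + list(noun) + ["_"] + list(adj)
                outputs.append(" ".join(letters))
    return outputs


outputs = translate_np([("DT", "a"), ("JJ", "red"), ("N", "car")])
print(outputs)
```

Because the NP rule sends the same masculine state to the determiner, noun, and adjective, agreement is enforced by the rules themselves, and the dictionary only needs one root per word. The same rules also spell out "green cat" without any extra word forms in the dictionary.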
  • 66. Mixing the Strengths of RBMT and SMT [Diagram labels: The RBMT system I would write; Space of learning possibilities for SMT; Ideal Set-up]
  • 67. More Languages
  • 68. More Languages [Figure labels: CONTINENTAL UNITED STATES]
  • 69. More Languages [Figure labels: CONTINENTAL UNITED STATES]
  • 70. More Languages [Figure labels: CONTINENTAL UNITED STATES]
  • 71. More Languages [Figure labels: CONTINENTAL UNITED STATES; INDIA AND PAKISTAN]
  • 72. More Languages [Figure labels: CONTINENTAL UNITED STATES; INDIA AND PAKISTAN]
  • 73. More Languages [Figure labels: CONTINENTAL UNITED STATES; INDIA AND PAKISTAN; MEXICO]
  • 74. More Languages [Figure labels: CONTINENTAL UNITED STATES; INDIA AND PAKISTAN; JAPAN; CENT. AMERICA; MEXICO]
  • 75. More Languages [Figure labels: CONTINENTAL UNITED STATES; INDIA AND PAKISTAN; JAPAN; CENT. AMERICA; MEXICO]
  • 76. More Languages [Figure labels: CONTINENTAL UNITED STATES; INDIA AND PAKISTAN; EUROPEAN UNION; JAPAN; CENT. AMERICA; MEXICO]
  • 77. More Languages [Figure labels: CONTINENTAL UNITED STATES; INDIA AND PAKISTAN; EUROPEAN UNION; JAPAN; CENT. AMERICA; MEXICO] almost as big as the Moon
  • 78. More Languages [Figure labels: CONTINENTAL UNITED STATES; INDIA AND PAKISTAN; EUROPEAN UNION; JAPAN; CENT. AMERICA; MEXICO] almost as big as the Moon. 1000+ languages spoken, 40+ by 1m+ speakers
  • 79. One Last Research Vision • automatic electronic routing of telephone calls / automatic routing of telephone calls / call routing • statistical machine translation / machine translation / translation
  • 80. One Last Research Vision • automatic electronic routing of telephone calls / automatic routing of telephone calls / call routing • statistical machine translation / machine translation / translation
  • 81. thanks!
