Delivered at the European Patent Office's annual Patent Information Conference (EPOPIC 2014)
November 5th 2014
Warsaw, Poland.
In this talk, we give an introduction to how machine translation works and what makes certain content types and languages more difficult to translate than others.
Past, Present, and Future: Machine Translation & Natural Language Processing... - John Tinsley
This was a presentation given at the European Patent Office's annual Patent Information Conference in Madrid, Spain on November 10th, 2016.
In it, we give an overview of how machine translation works, latest advances in neural MT, and how this can be applied to patents and intellectual property content, not only for translations but also information extraction and other NLP applications.
Machine Translation: The Neural Frontier - John Tinsley
This was a pitch for Iconic's neural machine translation technology given at the TAUS Annual Conference in Portland, Oregon on October 24th, 2016.
There has been a lot of talk, and a lot of hype, about neural machine translation in the press, but not a lot of practical application. Let's change the conversation.
Delivered at the European Patent Office's Patent Information Conference.
November 11th 2015
Miami, Florida.
In this talk, we discuss recent advances in MT for patents and introduce our IPTranslator.com application for on-demand translation.
Machine translation from English to Hindi - Rajat Jain
Machine translation is a part of natural language processing. The algorithm suggested is a word-based algorithm. We have performed translation from English to Hindi.
submitted by
Garvita Sharma,10103467,B3
Rajat Jain,10103571,B6
These slides describe the basic concepts of machine translation. MT challenges are presented, and rule-based and statistical MT are described briefly. Some notes on evaluation are included as well.
These slides cover an introduction to machine translation, some techniques used in MT such as example-based MT and statistical MT, the main challenges facing us in machine translation, and some examples of applications of MT.
Machine translation is an easy tool for translating text from one language to another. You've probably used it. But do you know what machine translation really is? Or when you should or shouldn't use it? Navigate through this presentation to learn more!
This was a talk given at the annual GALA conference in Amsterdam on March 27th, 2017. The topic is Neural Machine Translation: Where are we now?
Neural Machine Translation is at the peak of a hype cycle. There is no doubt it is an emerging technology with massive potential, but it is not yet a sweeping solution to all ills. Several factors prevent NMT from being commercially ready. Expectations, therefore, need to be managed. That is the goal of this presentation.
Human Evaluation: Why do we need it? - Dr. Sheila Castilho (Sebastian Ruder)
Talk at the 8th NLP Dublin meetup (https://www.meetup.com/NLP-Dublin/events/241198412/) by Dr. Sheila Castilho, postdoc at ADAPT Centre, Dublin City University.
Subject: English 18
Translation and Editing Text
Topic: Techniques in Translation
Techniques in Translation
1. Computer assisted
2. Machine translation
3. Subtitling
4. Editing/Post-editing
1. COMPUTER-ASSISTED
Computer-assisted translation is also called 'computer-aided translation' or 'machine-aided human translation'. It is a form of translation wherein a human translator creates a target text with the assistance of a computer program. The machine supports a human translator.
What is Computer Aided Translation?
Computer aided translation (also called computer assisted translation) is a system in which a human translator uses a computer in the translation process.
Humans and computers each have their strengths and weaknesses. The idea of computer aided translation (CAT) software is to make the most of the strengths of people and computers.
Translation performed solely by computers ("machine translation") has very poor quality. Meanwhile, no human can translate as fast as a computer can. By using a CAT tool, however, you can gain some of the speed, consistency, and memory benefits of the computer, without sacrificing the high quality of human translation.
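The translation-memory lookup at the heart of most CAT tools can be sketched in a few lines. This is a toy illustration, not any particular tool's behaviour: the stored segments, the use of a simple string-similarity ratio, and the 0.7 match threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# A toy translation memory: previously translated segment pairs
# (illustrative English-Spanish examples, not from a real TM).
translation_memory = {
    "The printer is out of paper.": "La impresora no tiene papel.",
    "Turn off the printer before cleaning.": "Apague la impresora antes de limpiarla.",
}

def best_fuzzy_match(segment, memory, threshold=0.7):
    """Return the stored (source, target, score) most similar to `segment`,
    or None if nothing scores at or above the threshold."""
    best, best_score = None, threshold
    for source, target in memory.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score >= best_score:
            best, best_score = (source, target, score), score
    return best

match = best_fuzzy_match("The printer is out of ink.", translation_memory)
if match:
    source, target, score = match
    print(f"{score:.0%} match: reuse '{target}' as a draft")
```

A fuzzy match like this gives the translator a draft to post-edit, which is where the speed and consistency gains of CAT come from.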
Translation Skills: Theory and practice
The theoretical base should include general information regarding the translator's workshop and the issues one should be familiar with.
*Internet
It is worth discussing the role of the internet as a source of information. It is important to use translations which have been on the market for some time and are recognized by other people. This is where the internet becomes very useful, for it allows us to search for given information (google.com, yahoo.com, altavista.com, etc.), use online dictionaries and corpora, or compare different language versions of the same site (e.g. Wikipedia, the Free Encyclopedia, with its ability to switch between the different language versions defining a given notion: www.wikipedia.org). Google itself is a powerful tool, since it allows us not only to search for information on webpages but also indexes *.doc and *.pdf files stored on servers, allowing us to browse through their contents in search of a context.
*Software
A successful translator needs to know how to handle various computer applications in his/her work. That is why basic software used to compress and decompress files (WinZip, WinRAR) should be mentioned, along with readers for PDF and multimedia files (images, audio). Finally, word processors are usually the first application that leads people to use a computer for their work. Their features comprise spell checking, standard layouts, and the ability to render characters in bold, italics, or underline. We can save documents so they can be used again, and we can print them.
It is important to mention CAT tool, how the
Delivered at Machine Translation Summit during a special workshop on post-editing.
November 3rd 2015
Miami, Florida.
In this talk, we describe the latest advances in the world of commercial and academic machine translation development that are improving acceptance of the technology and keeping its users happy.
[PACLING2019] Improving Context-aware Neural Machine Translation with Target-... - Hayahide Yamagishi
This is the slide used in the oral presentation at PACLING2019.
(For Japanese speakers) This presentation is equivalent to my master's thesis defense, so if you read Japanese, the following slides may be easier to follow:
https://www.slideshare.net/HayahideYamagishi/ss-181147693/HayahideYamagishi/ss-181147693
The font has been changed by SlideShare from the original (Hiragino Maru Gothic Pro W4) to a different one.
Delivered at the 29th LocWorld conference.
October 16th 2015
Santa Clara, CA, USA.
In this talk, we describe how we carried out a successful large scale evaluation and deployment of machine translation at RWS.
These slides are a combination of 3 different presentations given at LocWorld 31, the TAUS Industry Leaders Forum, and the TAUS QE Summit, all held in Dublin, Ireland from June 6-10.
Delivered at the TAUS Quality Evaluation Summit.
May 28th 2015
Dublin, Ireland.
In this talk, we describe how to carry out machine translation evaluation in order to extract meaningful business intelligence.
This was a presentation given at the conference of the Association for Machine Translation in the Americas (AMTA) in Austin, Texas on October 31st, 2016. This is a predominantly academic event, and this presentation was a condensed version of our "MT Success Blog Series" on our website, where we aimed to give the community an idea of the practical considerations around commercial machine translation.
http://iconictranslation.com/2016/07/8-steps-to-mt-success-series-introduction/
Delivered at the 26th LocWorld Conference in North America.
October 31st 2014
Vancouver, Canada.
In this talk, we describe the various strands of knowledge - machine translation, language, and industry - required to develop effective MT software.
Delivered at the biennial conference of the Association for Machine Translation in the Americas (AMTA 2014)
October 24th 2014
Vancouver, Canada.
In this talk, we describe how state-of-the-art research led to the establishment of Iconic Translation Machines.
Compared with MT results for European languages, results for Asian languages were long a source of frustration: quality was poor, a little below average at best, which left them fairly far behind when it came to using MT in actual translation work. However, recent MT technology development that integrates an Asian, rather than European, point of view seems to have changed and improved translation quality. We had a very good result in EN-ZH (English-Chinese) last year, of almost the same quality as for European languages. We have also evaluated MT results for other Asian languages, such as Indonesian, Malay, and Vietnamese. In this session, we will show you the results, and show you whether the door to the next era for Asian languages has been opened or not.
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo... - Kotaro Hara
Our talk at CHI2015 in Seoul, South Korea. Find more information at www.kotarohara.com .
YouTube: https://youtu.be/isqsYLkX9gA
Makeability Lab: http://www.cs.umd.edu/~jonf/
Microsoft Research: http://research.microsoft.com/
ABSTRACT
The language barrier is the primary challenge for effective cross-lingual conversations. Spoken language translation (SLT) is perceived as a cost-effective alternative to less affordable human interpreters, but little research has been done on how people interact with such technology. Using a prototype translator application, we performed a formative evaluation to elicit how people interact with the technology and adapt their conversation style. We conducted two sets of studies with a total of 23 pairs (46 participants). Participants worked on storytelling tasks to simulate natural conversations with 3 different interface settings. Our findings show that collocutors naturally adapt their style of speech production and comprehension to compensate for inadequacies in SLT. We conclude the paper with the design guidelines that emerged from the analysis.
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St... - Apache OpenNLP
Media analysts have to deal with analyzing high volumes of real-time news feeds and social media streams, which is often a tedious process because they need to write search profiles for entities. Python tools like NLTK do not scale to large production data sets and cannot be plugged into distributed, scalable frameworks like Apache Flink. Apache Flink, being a streaming-first engine, is ideally suited for ingesting multiple streams of news feeds, social media, blogs, etc., and for doing streaming analytics on the various feeds. Natural Language Processing tools like Apache OpenNLP can be plugged into Flink streaming pipelines to perform common NLP tasks like Named Entity Recognition (NER), chunking, and text classification. In this talk, we'll build a real-time media analyzer which does Named Entity Recognition (NER) on the individual incoming streams, calculates the co-occurrences of the named entities and aggregates them across multiple streams, indexes the results into a search engine, and queries the results for actionable insights. We'll also show how to handle multilingual documents when calculating co-occurrences. NLP practitioners will come away from this talk with a better understanding of how the various Apache OpenNLP components can help in processing large streams of data feeds and can easily be plugged into a highly scalable and distributed framework like Apache Flink.
Choosing the English That's Right for You: Simplified Technical English and O... - Scott Abel
Presented at Documentation and Training East 2008 (October 29-November 1, 2008) by Brenda Huettner and Alison Huettner.
Simplified Technical English (STE) is a success story for the aerospace industry. Will a simplified English work for your industry as well? This session explores the rationale behind simplified languages, their advantages and their perennial challenges. It surveys controlled languages from their beginnings to the offerings in today’s marketplace. The session will also cover the questions you need to ask to determine what’s right for your situation. Do you need to simplify? Can you adapt an existing language or lexicon? Or should you define your own set of rules and phrases? Where should you begin? What effort would be required?
Big Data and Natural Language Processing - Michel Bruley
Natural Language Processing (NLP) is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing - Lifeng (Aaron) Han
Invited Presentation in NLP lab of Soochow University, about my NLP journey and ADAPT Centre. NLP part covers Machine Translation Evaluation, Quality Estimation, Multiword Expression Identification, Named Entity Recognition, Word Segmentation, Treebanks, Parsing.
"Machine Translation 101" and the Challenge of Patents
1. “Machine Translation 101”
And The Challenge of Patents
John Tinsley
Director / Co-Founder
EPOPIC. 5th Nov 2014, Warsaw
2. The need for translation
50% of all PCT applications in 2013 came from Asia
3. Why listen to me?
Machine Translation is what I do!
BSc in Computational Linguistics
PhD in Machine Translation
Language Technology consultant
Founder of Iconic Translation Machines
The world's first and only patent-specific machine translation system
4. Machine Translation: The Basics
Machine Translation = automatic translation
§ The use of computers to translate from one language into another
§ The use of computers to automate some, or all, of the translation process
Statistical Machine Translation (SMT)
§ An approach to Machine Translation, where translations for an input are estimated based on previously seen translation examples and associated (inferred) probabilities
§ e.g. IPTranslator, Google Translate
Other approaches
§ Rule-based (or transfer-based): based on linguistic rules
• e.g. Systran; Altavista's Babelfish
§ Example-based: based on translation examples and inferred linguistic patterns
SMT is now by far the predominant approach
5. Bilingual Corpora
A corpus (pl. corpora) is a collection of texts, in electronic format, in a single language
§ document(s)
§ book(s)
A bilingual corpus is a collection of corresponding texts, in multiple languages
§ a document & its translation
§ a book in multiple languages
§ European Parliament proceedings
Note: source language = original language or language we're translating from; target language = language we're translating into
6. Aligned Bilingual Corpora
A document-aligned bilingual corpus corresponds on a document level
For translation, we require sentence-aligned bilingual corpora
§ The sentence on line 1 in the source language text corresponds to (i.e. is a translation of) the sentence on line 1 in the target language text, etc.
§ Often referred to as parallel aligned corpora
Sentence-aligned bilingual parallel corpora are essential for statistical machine translation
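As a minimal sketch, a sentence-aligned parallel corpus can be represented simply as pairs of corresponding lines. The sentences below are toy examples; in practice the two sides would be read line-by-line from two files, one per language.

```python
# Toy sentence-aligned data; line N of the source side is assumed to be
# the translation of line N of the target side.
source_lines = ["I have a cat.", "The proceedings were adopted."]
target_lines = ["Tengo un gato.", "Se aprobaron las actas."]

# Because the corpus is sentence-aligned, pairing is a simple zip.
parallel_corpus = list(zip(source_lines, target_lines))

for src, tgt in parallel_corpus:
    print(f"{src} <-> {tgt}")
```

These (source, target) sentence pairs are exactly the training material the following slides build statistics from.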
7. Learning from Previous Translations
Suppose we already know (from a sentence-aligned bilingual corpus) that:
§ "dog" is translated as "perro"
§ "I have a cat" is translated as "Tengo un gato"
We can theoretically translate:
§ "I have a dog" -> "Tengo un perro"
§ Even though we have never seen "I have a dog" before
Statistical machine translation induces information about unseen input, based on previously known translations:
§ Primarily co-occurrence statistics
§ Takes contextual information into account
10. Statistical Machine Translation
§ From the corpus we can infer possible target (French) translations for various source (English) words
§ We can then select the most probable translations based on simple frequencies (co-occurrence statistics)
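The inference described on these slides can be sketched with raw co-occurrence counts. This is a deliberately simplistic illustration (real SMT systems use EM-trained alignment models such as the IBM models, not bare counts), and the tiny corpus below echoes the cat/dog example from the slides.

```python
from collections import Counter, defaultdict

# A tiny sentence-aligned corpus (toy data).
corpus = [
    ("i have a cat", "tengo un gato"),
    ("i have a dog", "tengo un perro"),
    ("the dog sleeps", "el perro duerme"),
]

# Count how often each source word co-occurs with each target word
# within the same sentence pair.
cooc = defaultdict(Counter)
for src, tgt in corpus:
    for s in src.split():
        for t in tgt.split():
            cooc[s][t] += 1

def translate_word(word):
    # Take the most frequent co-occurrence as the translation.
    return cooc[word].most_common(1)[0][0]

print(translate_word("dog"))  # -> perro ("perro" co-occurs with "dog" twice)
```

Even on three sentence pairs, "perro" wins for "dog" because it appears in both sentences that contain "dog", while every other target word co-occurs with it only once.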
12. Advanced MT
All modern approaches are based on building translations for complete sentences by putting together smaller pieces of translation
Previous example is very simplistic
§ In reality SMT systems calculate much more complex statistical models over millions of sentence pairs for a pair of languages
§ Upwards of 2M sentence pairs on average for large-scale systems
Other statistics calculated include
§ Word-to-word translation probabilities
§ Phrase-to-phrase translation probabilities
§ Word order probabilities
§ Linguistic information (are the words nouns, verbs?)
§ Fluency of the final output
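How such component statistics are combined can be sketched as a weighted log-linear score, which is the standard formulation in phrase-based SMT. The candidate strings, probabilities, and weights below are invented for illustration; a real decoder derives them from phrase tables and a language model, and tunes the weights automatically (e.g. with MERT).

```python
import math

# Hypothetical component scores for two candidate translations of one input.
# "translation_prob" stands in for the phrase-table score and "lm_prob"
# for language-model fluency; both values are made up for the example.
candidates = {
    "Tengo un perro": {"translation_prob": 0.6, "lm_prob": 0.05},
    "Tengo una perro": {"translation_prob": 0.5, "lm_prob": 0.001},  # disfluent
}

# Log-linear model: a weighted sum of log-probabilities.
weights = {"translation_prob": 1.0, "lm_prob": 0.8}

def score(features):
    return sum(weights[name] * math.log(p) for name, p in features.items())

best = max(candidates, key=lambda c: score(candidates[c]))
print(best)  # the fluent candidate wins
```

The point of the weighting is that no single statistic decides: a candidate with a slightly better translation probability can still lose if the language model finds it disfluent.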
13. Data is Key
For SMT, data is key
§ Information (word/phrase correspondences and associated statistics) is only based
on what we have seen before in the training data
Important that data used to train SMT systems is:
§ Of sufficient size
§ avoid sparseness/skewed statistics
§ Representative and relevant
§ contains the right type of language
§ High-quality
§ absence of misspellings, incorrect alignments etc.
§ proofed by human translators
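Typical size and quality checks on training data can be sketched as a simple filter over sentence pairs. The thresholds below are illustrative assumptions; real corpus-cleaning pipelines apply many more heuristics:

```python
# Simple heuristics commonly used to clean parallel training data (sketch)
def keep_pair(src, tgt, max_ratio=3.0, max_len=100):
    """Return True if a sentence pair looks usable for training."""
    src_toks, tgt_toks = src.split(), tgt.split()
    if not src_toks or not tgt_toks:
        return False  # drop empty segments
    if len(src_toks) > max_len or len(tgt_toks) > max_len:
        return False  # drop over-long sentences
    ratio = len(src_toks) / len(tgt_toks)
    if ratio > max_ratio or ratio < 1 / max_ratio:
        return False  # wildly different lengths suggest a misalignment
    return True
```

A pair like (“I have a dog”, “Tengo un perro”) passes, while an empty segment or a 7-word sentence “aligned” to a single word is dropped as likely noise.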
14. Why is MT Difficult?
A word or a phrase can have more than one meaning (ambiguity – lexical or
structural)
§ e.g. “bank”, “dive”, “I saw the man with the telescope”
People use language creatively
§ New words are cropping up all the time
Linguistic differences between languages
§ e.g. structure of Irish sentences vs. structure of English sentences:
§ “Tá (Is) ocras (hunger) orm (on me)” <-> “I am hungry”
There can be more than one way to express the same meaning.
§ “New York”, “The Big Apple”, “NYC”
15. Why is MT Difficult?
§ Israeli officials are responsible for airport security.
§ Israel is in charge of the security at this airport.
§ The security work for this airport is the responsibility of the Israel government.
§ Israeli side was in charge of the security of this airport.
§ Israel is responsible for the airport’s security.
§ Israel is responsible for safety work at this airport.
§ Israel presides over the security of the airport.
§ Israel took charge of the airport security.
§ The safety of this airport is taken charge of by Israel.
§ This airport’s security is the responsibility of the Israeli security officials.
16. No single solution for all languages
Number agreement: the house / the houses vs. la maison / les maisons
Gender agreement: the house / the cheese vs. la maison / le fromage
English - Spanish
English - French
17. No single solution for all languages
English - German
English - Chinese
种水果的农民
The farmer who grows fruit
[Lit: “grow fruit (particle) farmer”]
18. The Challenge of Patents
L is an organic group selected from -CH2-(OCH2CH2)n-, -CO-NR'-, with R'=H or C1-C4 alkyl group; n=0-8; Y=F, CF3 …
maximum stress of 1.2 to 3.5 N/mm<2> and a maximum elongation of 700 to 1,300% at 0[deg.] C.
Long Sentences
Technical constructions
Largest single document: 249,322 words
Longest sentence: 1,417 words
19. The Challenge of Patents
§ Very long sentences as standard
§ Grammatically incomplete, using nominal and telegraphic style (!)
§ Passive forms are frequent
§ Frequent use of subordinate clauses, participles, implicit constructs
§ Inconsistent and incorrect spelling
§ High use of neologisms
§ Instances of synonymy and polysemy
§ Spurious use of punctuation
Authoring guide for “to be translated” text
Patents break almost all of the rules!
20. Evaluating Machine Translation Quality
Judge the quality of an MT system by comparing its output against a
human-produced “reference” translation
Automatic Evaluation
§ Pros: Quick, cheap, consistent
§ Cons: Inflexible, cannot be used on ‘new’ input
Human Evaluation
§ Pros: Reliable, flexible, multi-faceted (fluency, error analyses, benchmarking)
§ Cons: Slow, expensive, subjective
Task-Based Evaluation
§ Fluency vs. Adequacy
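For the automatic side, the best-known metric is BLEU, which compares candidate and reference n-grams. A minimal sketch (single reference, no smoothing; production implementations such as sacreBLEU handle far more cases):

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Minimal BLEU sketch: modified n-gram precision with brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip candidate n-gram counts by their counts in the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # unsmoothed: any zero precision zeroes the score
        log_prec_sum += math.log(overlap / total)
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec_sum / max_n)
```

An exact match scores 1.0; a candidate sharing no words with the reference scores 0.0, which illustrates the “inflexible, cannot be used on ‘new’ input” con above: a perfectly good paraphrase with different wording is penalised just the same.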
21. Evaluating Machine Translation Quality
Task-Based Evaluation
§ Standalone (intrinsic) evaluation of MT systems is necessary to get a sense of the
overall quality of a system
§ To determine the ultimate usability of an MT system, extrinsic, task-based
evaluation is required
§ Why? Fluency vs. Adequacy
Fluency: how fluent and grammatically correct the translation output is
Adequacy: how accurately the translation conveys the meaning of the source
Source: La gran casa roja
Output 1: The big blue house (fluent, but not adequate)
Output 2: The big house red (adequate, but not fluent)
22. Practical uses of Machine Translation
Understand its limitations and you’ll understand
its capabilities!
No
§ Translate a patent for filing
§ Translate literature for
publication
§ Translate marketing materials
§ Anything mission critical
without review
Yes
§ Productivity tool for
professional translation
§ Understand foreign patents
§ Localisation processes and
“controlled” content
§ High volume, e.g. eDiscovery
26. Data Engineering
What is Linguistic Engineering?
[Diagram: Training Data → system; Input → Pre-processing → Post-processing → Output]
27. Data Engineering + Linguistic Engineering
An “ensemble” architecture
[Diagram: Input → patent input classifier → language/domain-specific pre-processing (Chinese pre-ordering rules, Spanish med-device entity recognizer, Korean pharma tokenizer, Japanese script normalisation, German compounding rules) → multiple engines (Moses, Moses, Moses, RBMT) → multi-output combination → statistical post-editing → Output; trained on Training Data plus client TM/terminology (optional)]
28. What is the value for users?
Specialist solutions deliver more useable outcomes for the user
§ Post-editing = Increased productivity
§ For information purposes = Extract more meaning
§ Multilingual search = Retrieve more relevant results
29. How this impacts translation quality
[Chart: BLEU scores (0–50), Portuguese to English: Iconic vs. Google vs. Systran]
The second point is important: MT has different uses and usability. The concept of FAHQMT (fully automatic, high-quality MT) is no more. The focus is now on HAMT (human-aided MT) and PEMT (post-edited MT).
The problem with rule-based systems is that they didn't scale: you need bilingual experts for each language pair.
SMT is the predominant approach
Starting point for all systems is data.
The most important aspect is the quality of the data…
They are essential and the quality is crucial.
The translations must be accurate and the alignment must be correct, otherwise we infer the wrong things and introduce “noise” into our systems.
How do we use these corpora? It’s all about learning and remembering things we’ve seen before, the same way you might go about translating something
OK, so the translation isn't exactly right here. It should be “Je parle à la fille”, but we haven't seen enough examples (we don't have enough data) for reliable estimates; we're just going on the counts of the words.
How likely a word is to translate to another word – as you have seen
How likely the different phrases are to translate as one another
What’s the likelihood a certain word will have a different position in the target sentence
Sometimes we take into account linguistic information about the words: if it's a verb, then it should go here; articles should precede nouns, etc.
Look at models of the target language and see if what we have produced makes sense (can these words go together in this order?)
Google Translate aims to be a general system, but what happens when you're translating a sports website? Quality issues can be caused by the fact that there's a lot of data in their models other than sports news.
Similarly, if I have a translation system for car manuals, it won’t be any good at translating sports websites.
This is reflected in our systems at IPTranslator too where all of our models are built using patents which have been filed in multiple languages to ensure we get the style correct
(patents are a bigger fish than this though)
The simple answer is that language is complex! Which is what makes it difficult to learn but also so interesting at the same time!
Who has the telescope, him or I?
New words, especially in patents. And new usage of words. The verb “to tweet” didn’t exist so long ago…
The last piece in the puzzle is understanding the languages you’re developing MT systems for. And that’s not understanding them in isolation – that’s understanding, for each language pair, what the differences are between them, e.g. many of the things we need to look out for when developing English-Spanish translation engines we don’t need to do for French-Spanish translation
With certain language pairs, things get more complex. The processes that we need to develop are harder to develop, less studied, require smarter people!
Chinese, need to identify these DE constructions so we know to move the head noun
No tense, going into English, how do we know what tense?
There’s no article! We have to generate it!
DE particle has many translations, which one!
FIRST THINGS FIRST, which ones are the words!? We need to segment the Chinese!
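The segmentation step mentioned here can be illustrated with greedy maximum matching, a classic baseline for Chinese word segmentation. The toy lexicon below is an assumption built around the slide's example phrase; real segmenters use large dictionaries and statistical models:

```python
# Toy lexicon for the example phrase "种水果的农民" ("the farmer who grows fruit")
LEXICON = {"种水果", "水果", "农民", "的"}

def max_match(text, lexicon=LEXICON, max_len=4):
    """Greedy maximum matching: repeatedly take the longest lexicon
    word starting at the cursor; fall back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + n]
            if n == 1 or word in lexicon:  # single chars always accepted
                tokens.append(word)
                i += n
                break
    return tokens
```

On the slide's phrase this yields ["种水果", "的", "农民"], i.e. "grow fruit" + particle + "farmer", which is exactly the structure the reordering rules then have to move around.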
ONLY WITH THESE SKILLS CAN YOU EXPLOIT THE TECHNOLOGY TO ITS FULLEST – AND WHAT DO WE GET IN DOING THIS? MT WITH SUBJECT MATTER EXPERTISE
But of course it’s not just that easy.
Patents for example have a range of highly complex linguistic characteristics that make this challenging, both for PROFESSIONAL translators as well as for Translation Software.
Lets look for example at this patent – what’s highlighted in blue is a SINGLE sentence, (which is an individual legal claim).
Additionally, we have to deal with complex technical constructions such as chemical formulae, alphanumeric sequences, even genomic and amino acid sequences.
And then we have patents which introduce a whole new level of complexity on top of the language issues…
Patents are hard to read, never mind translate, never mind try to teach a computer how to translate them!
Sometimes it’s hard to tell whether the translation is bad or that’s simply how the original patent was written
Commercial machine translation is plagued with misleading marketing, unrealistic claims, and promises. We need to manage expectations.
When I say NO, I mean no in a fully-automatic manner with no human intervention
Filing – not when meaning is CRUCIAL
Publication – no, there will be errors
Marketing – no, not with subtleties, idioms, etc.
If you are hiring a professional translator for a job, beyond their language skills they also need to have subject matter expertise, particularly for technical content.
The same applies to MT technology (and its providers)
High-quality data is essential for the most effective approaches to MT. Cleaning data and building MT systems from it is data engineering. But data is just an ingredient.
You still need to cook the data for the specific language, content type, and writing style. This varies from language to language, domain to domain.
We need to know how to cook it: we need to understand the language, the content, and the style, and not only take this into account, but make it integral to the development process. This is linguistic engineering.
As a developer, you cannot be dogmatic when it comes to approaches to MT. We’re not a statistical MT vendor, we don’t focus on Moses, we’re not a rule-based MT vendor. We don’t do hybrid MT.
We do all of them. We call this an “ensemble” approach. Sometimes we use them all at the same time. Sometimes we only use one. It’s completely dependent on what works best for a given content type, style, and language together.
e.g. for Chinese-English patent MT, maybe you need a statistical decoder, with some rules for automatic post-editing
Maybe for French-English abstract translation, an SMT system alone suffices. Maybe for Japanese-English titles, we can just use some rules, and maybe some machine-learning-based pre-processes.
We study. We learn what ensemble works for a particular configuration and that’s what we implement.
Existing vendors or MT providers use the following process: if a client wants a machine translation system for a certain domain, say IT, they provide the vendor with training data and this gets churned through the same generic processes for each language required. The idea is that by pumping in data in the IT domain, an IT machine translation system comes out at the end. It's true to a certain extent, but the reality is that the quality often doesn't cut the mustard. The problem with the data engineering approach is that you need A LOT of data, and many clients simply don't have it. By being completely reliant on the data, the vendor has no other lever to pull when quality falls short.
We've developed methods to manipulate the machine translation system by designing processes that are highly specific to the content being translated: often technical nuances, terminology, etc. that need to be specially accounted for.
***ALSO need to develop special processes for languages…
Let’s get rid of the concept of a central MT system – statistical, hybrid or whatever.
Yes we have training data and input, we’ll have some output, and some processes, but what is the journey?...
Combining these factors is a delicate balance. Sometimes the smallest change can affect things. Sometimes big changes have no effect. It really depends on your training data.
That presents a challenge when the training data changes for each system that’s built. I’ll come back to this later…
General advantages of this approach to MT
All of these examples are using our IPTranslator systems which have been developed for patent machine translation.
First, in terms of MT quality and BLEU scores, here are evaluation results for our Portuguese to English engines across 8 different patent technical areas. Now, while the BLEU scores don't necessarily have too much meaning by themselves, there's a clear distinction in the quality of the Iconic output compared to Google Translate and an out-of-the-box Systran engine. These engines are comparable here because we make the assumption that the client has no additional data with which to build an engine from scratch, so we need an “existing” option.
These results correlate well with human assessments of adequacy, one of which we can look at here…