Delivered at the 26th LocWorld Conference in North America.
October 31st 2014
Vancouver, Canada.
In this talk, we describe the various strands of knowledge (machine translation, language, and industry) required to develop effective MT software.
Delivered at the TAUS Machine Translation Showcase.
June 6th 2014
Dublin, Ireland.
In this talk, we explain how machine translation systems can be developed for highly technical content types.
Delivered at the Machine Translation Summit during a special workshop on MT for patent and scientific literature.
October 30th 2015
Miami, Florida.
In this talk, we describe how we adapted machine translation for patents to help a translation company improve their productivity.
These slides are a combination of 3 different presentations given at LocWorld 31, the TAUS Industry Leaders Forum, and the TAUS QE Summit, all held in Dublin, Ireland, from June 6-10.
Delivered at the 29th LocWorld conference.
October 16th 2015
Santa Clara, CA, USA.
In this talk, we describe how we carried out a successful large scale evaluation and deployment of machine translation at RWS.
The music-loving Baltic countries are a multilingual hotspot in Europe, with the majority of citizens speaking (and singing) three languages on a daily basis. At the same time, the melodious Baltic languages are famously complex and morphologically rich, containing lots of ambiguity and intricate word agreements. Taken together, these factors make the region a prime spot for driving innovation in language technologies. Tilde, a language technology company specializing in custom MT and terminology services, has leveraged its extensive linguistic experience in the Baltic region to create custom MT systems for a wide variety of languages and domains, helping EU and global companies to boost translation productivity and make their applications multilingual. Tilde recently embarked on the challenging task of building a large-scale MT service for the Latvian government, Hugo.lv. This service was adapted to create a communication tool for the 2015 EU Presidency. The presentation will introduce the audience to languages and MT in the Baltic region and highlight these two case studies, which showcase the crucial role of language technology in enabling multilingual communication in the digital age.
Tony O’Dowd (KantanMT). KantanMT enables its community to generate meaningful business intelligence that helps them identify the scope of their customised machine translation projects. More importantly, it helps them schedule and scale those projects to achieve maximum translation productivity and a positive ROI.
Predicting the quality of an MT engine without an existing target reference is one of the trickiest parts of MT technology. It plays an essential role in making MT usable in real-life scenarios. Perspective by Gábor Bessenyei (CEO of MorphoLogic Localisation Ltd.).
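As a toy illustration of what reference-free quality estimation looks like in code (a sketch with invented feature names, not MorphoLogic's method), two classic surface features, length ratio and source-copy rate, can be computed without any target reference:

```python
def qe_heuristics(source, translation):
    """Naive reference-free quality-estimation features: length ratio and
    untranslated-token overlap. Real QE systems learn from many such signals."""
    src, tgt = source.split(), translation.split()
    length_ratio = len(tgt) / max(len(src), 1)
    # Tokens copied verbatim from the source often indicate untranslated spans
    copied = len(set(src) & set(tgt)) / max(len(src), 1)
    return {"length_ratio": round(length_ratio, 2), "source_copy_rate": round(copied, 2)}

print(qe_heuristics("der Hund schläft", "the dog sleeps"))
# {'length_ratio': 1.0, 'source_copy_rate': 0.0}
```

A very short or very long output relative to the source, or a high source-copy rate, would flag a segment for human review.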
In all of our translation production activities we are producing data, lots of data. We are not talking here about the actual translations that are stored as translation memory data. Translation memory data have proven very valuable over the years, and recently again as training data for machine translation engines. In this session we are talking about the other data: data about the translation process. How much time was spent on different tasks, for different languages and content types, per project? What was the quality score for the translator, for the vendor? What was the user feedback on this machine-translated support article? How is our MT engine performing, and has it improved since last year, since we added 13 million more words to the training set? Some buyers and providers of translation are further ahead than others in the use of all this translation management data. The TAUS Dynamic Quality Framework (DQF) tracks translation management data through plug-ins that are already available for various translation tools and platforms. The vision is becoming very clear: the translation industry can have its own “Big Data”. In the past couple of months TAUS enterprise members have contributed their wishes and requirements for an industry benchmarking platform for translation quality and productivity. In this session several TAUS members will share and discuss their plans for using DQF and the Quality Dashboard. What data would you like to track?
Session host: Daniel Goldschmidt (Microsoft)
Presenters and panelists are: Annya Sedakova-Bertram (EMC), Fred Tuinstra (Lionbridge), Achim Ruopp (TAUS)
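As a rough sketch of the kind of process-data roll-up the session describes (the field names and figures are hypothetical, not DQF's actual schema), per-task records can be aggregated by language and content type:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical process records: (language, content_type, task, minutes, quality_score)
records = [
    ("de", "support", "post-editing", 42, 4.1),
    ("de", "support", "review", 15, 4.4),
    ("ja", "ui", "post-editing", 65, 3.7),
    ("ja", "ui", "review", 20, 3.9),
]

def summarize(records):
    """Aggregate total minutes and mean quality per (language, content type)."""
    buckets = defaultdict(list)
    for lang, ctype, task, minutes, score in records:
        buckets[(lang, ctype)].append((minutes, score))
    return {
        key: {
            "total_minutes": sum(m for m, _ in vals),
            "mean_quality": round(mean(s for _, s in vals), 2),
        }
        for key, vals in buckets.items()
    }

print(summarize(records)[("de", "support")])
# {'total_minutes': 57, 'mean_quality': 4.25}
```

Comparing such roll-ups across months or vendors is exactly the benchmarking the Quality Dashboard is meant to make routine.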
This was a pitch for Iconic's neural machine translation technology given at the TAUS Annual Conference in Portland, Oregon, on October 24th, 2016.
There has been a lot of talk, and a lot of hype, about neural machine translation in the press, but not a lot of practical application. Let's change the conversation.
In this session, with clear focus on Machine Translation (MT) quality, we will discuss different ways to improve MT engines. Which engine do you use and how do you measure improvement? What are the right metrics to evaluate MT quality for the specific content types? How do you interpret and act on the evaluation results? It's fine when errors are labeled and analyzed, but how can that help improve your engine? Are there best practices available? And how about Neural MT? Should we measure that differently? After some use cases shared by the speakers, these questions will be addressed in the break-out session.
This was a presentation given at the European Patent Office's annual Patent Information Conference in Madrid, Spain on November 10th, 2016.
In it, we give an overview of how machine translation works, the latest advances in neural MT, and how these can be applied to patents and intellectual property content, not only for translation but also for information extraction and other NLP applications.
Delivered at the European Patent Office's annual Patent Information Conference (EPOPIC 2014)
November 5th 2014
Warsaw, Poland.
In this talk, we give an introduction as to how machine translation works and what makes certain content types and languages more difficult than others.
Delivered at the TAUS Quality Evaluation Summit.
May 28th 2015
Dublin, Ireland.
In this talk, we describe how to carry out machine translation evaluation in order to extract meaningful business intelligence.
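One hedged example of an automatic metric that can feed such business intelligence (a simplified, shift-free variant of translation edit rate, not necessarily the metric used in the talk):

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            # prev holds the diagonal (old d[j-1]); d[j] is still the old value here
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (h != r))
    return d[len(ref)]

def ter(hypothesis, reference):
    """Translation edit rate: word edits / reference length (shift ops omitted)."""
    hyp, ref = hypothesis.split(), reference.split()
    return edit_distance(hyp, ref) / max(len(ref), 1)

score = ter("the patent claims a new method", "the patent claims a novel method")
print(round(score, 3))  # 0.167, i.e. one substitution over six reference words
```

Lower scores mean less post-editing effort, which is the quantity that translates directly into cost and turnaround figures.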
This was a presentation given at the conference of the Association for Machine Translation in the Americas (AMTA) in Austin, Texas on October 31st, 2016. This is a predominantly academic event, and this presentation was a condensed version of our "MT Success Blog Series" on our website, where we aimed to give the community an idea of the practical considerations around commercial machine translation.
http://iconictranslation.com/2016/07/8-steps-to-mt-success-series-introduction/
Delivered at the European Patent Office's Patent Information Conference.
November 11th 2015
Miami, Florida.
In this talk, we discuss recent advances in MT for patents and introduce our IPTranslator.com application for on-demand translation.
Past, Present, and Future: Machine Translation & Natural Language Processing (John Tinsley)
There are a number of current approaches to developing commercial machine translation systems, ranging from do-it-yourself platforms to fully customized development as a professional service. While these various approaches have their relative merits, they all present a number of drawbacks for the end user, be it the inability to handle complex content or a long and expensive period of development and testing.
At Iconic Translation Machines, our approach goes beyond basic engineering of data to build MT systems and overcome these drawbacks. We combine deep domain knowledge and linguistic expertise to deliver highly focused MT engines for targeted domains and languages. Our IPTranslator service, for example, has been developed using this approach to produce intelligent MT systems adapted for patent and legal content. We demonstrate how this approach has delivered significant value to end users and describe how these systems serve as an ideal launchpad for ongoing adaptation and optimization.
This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit. MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme.
For the latest updates go to http://www.statmt.org/mosescore/
or follow us on Twitter - #MosesCore
Delivered at the biannual conference of the Association for Machine Translation in the Americas (AMTA 2014)
October 24th 2014
Vancouver, Canada.
In this talk, we describe how state-of-the-art research led to the establishment of Iconic Translation Machines.
The Pangea Machine Translation platform from Pangeanic. A product presentation by Manuel Herranz, Elia Yuste, and Andi Frank showcasing automated cleaning cycles, automated engine retraining, and machine translation engine creation.
Internationalizing a Complex B2B Application (Bob Donaldson)
This is a joint presentation made at Localization World 2013 in Singapore. It developed out of a client engagement seeking an optimal solution to taking a large complex service across language borders.
Living Multiple Lives: The New Technical Communicator (Scott Abel)
Presented by Noz Urbina at the Documentation and Training West 2008 conference (www.doctrain.com), May 6-9, 2008, in Vancouver, BC.
This presentation is for team leaders, information managers, technical communicators, and product managers who care about maximizing efficiency and return on investment in the information-heavy parts of their product cycle.
We will discuss current developments in the field of technical communications and how the role of the technical communicator has been rapidly and fundamentally evolving. The world is becoming more and more tech-savvy by the picosecond, and more savvy means more demanding. Today, organizations need to juggle management of customer-generated content, maximize the use of cross-departmental contributions, and still deliver quality technical communication products to their user base; their ability to balance internal and external management of supporting technical information has gone from being a burdensome nuisance to a central, strategic must for market competitiveness.
This presentation takes a low-tech, cross-industry look at why strategies are changing and how organizations are adapting (or not!) to these challenges. Best practices for approach, organizing teams, planning for change, DITA/XML, and departmental integration will all be addressed.
Panelists: Yoshiyasu Yamakawa (Intel), JP Barraza (Systran), Konstantin Dranch (Memsource), David Koot (TAUS)
The focus of this session will be on predictions and risk management. What kinds of things can you predict, and how can you manage risks by analyzing your translation data or monitoring your productivity and quality? Tracking translation data in different cycles of the translation process (translation, post-editing, review, proofreading) offers tremendous value when it comes to predicting future trends or making informed choices. What type of data can be valuable, and what kind of predictions can we make using this data? How can we make more efficient use of already available data? How can we use this type of data to improve machine translation, automatic QA, error recognition, sampling, or quality estimation? How can academia and industry work together towards a common goal?
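A minimal sketch of the kind of prediction the session has in mind (the throughput figures are invented): fit a least-squares trend to monthly post-editing speed and extrapolate one month ahead.

```python
from statistics import mean

# Hypothetical monthly post-editing throughput (words per hour)
months = [1, 2, 3, 4, 5, 6]
throughput = [520, 535, 560, 580, 590, 610]

def linear_trend(xs, ys):
    """Least-squares slope and intercept for a simple trend forecast."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

slope, intercept = linear_trend(months, throughput)
forecast_month_7 = slope * 7 + intercept
print(round(slope, 1), round(forecast_month_7))  # trend of ~18 words/hour gained per month
```

The same fit over quality scores or error rates supports the risk-management side: a downward trend is an early warning before a service-level breach.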
Joaquin Fagundo presents on Technology Transfer Impact. He is an information technology professional with deep expertise in software development and testing.
Data and Linguistics: Delivering Machine Translation with Subject Matter Expertise
1. “Data & Linguistics”
Delivering Machine Translation with Subject Matter Expertise
John Tinsley
Director / Co-Founder
Localization World, 31st Oct 2014, Vancouver
5. The world’s first and only patent-specific MT system that’s ready to go
6. Data Engineering: What is Linguistic Engineering?
[Diagram: Input → Pre-processing → MT engine (Training Data) → Post-processing → Output]
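The slide's pipeline (input, pre-processing, translation, post-processing, output) can be sketched as composed stages. The steps below are trivial stand-ins assumed for illustration, not Iconic's actual components:

```python
def pre_process(text):
    # Example normalisation step: collapse whitespace and lowercase
    return " ".join(text.lower().split())

def translate(text):
    # Stand-in for the trained MT engine (identity here)
    return text

def post_process(text):
    # Example restoration step: capitalise the first letter
    return text[:1].upper() + text[1:]

def pipeline(text):
    """Input -> pre-processing -> MT engine -> post-processing -> Output."""
    return post_process(translate(pre_process(text)))

print(pipeline("  THE  claimed   invention "))  # The claimed invention
```

The point of the architecture is that the engine in the middle stays generic while the language- and domain-specific intelligence lives in the surrounding stages.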
7. Patents: an MT nightmare
Technical constructions: “L is an organic group selected from -CH2-(OCH2CH2)n-, -CO-NR'-, with R'=H or C1-C4 alkyl group; n=0-8; Y=F, CF3 …”; “maximum stress of 1.2 to 3.5 N/mm<2> and a maximum elongation of 700 to 1,300% at 0[deg.] C.”
Long sentences: largest single document: 249,322 words; longest sentence: 1,417 words
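One common mitigation for such extreme sentence lengths (an illustrative assumption; the talk does not claim this is Iconic's exact approach) is to split overlong sentences at clause boundaries before translation:

```python
import re

MAX_WORDS = 50  # hypothetical length threshold

def split_long_sentence(sentence, max_words=MAX_WORDS):
    """Split an overlong sentence at clause boundaries (';' or ',') so that
    each chunk stays under max_words, a common MT pre-processing step.
    Chunks are flushed at clause granularity, so a single giant clause
    still becomes one chunk."""
    if len(sentence.split()) <= max_words:
        return [sentence]
    chunks, current = [], []
    for clause in re.split(r"(?<=[;,])\s+", sentence):
        if current and len(" ".join(current + [clause]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(clause)
    if current:
        chunks.append(" ".join(current))
    return chunks

# A synthetic 100-word patent-style claim built from 20 five-word clauses
claim = "; ".join(["the device comprises a housing"] * 20)
parts = split_long_sentence(claim)
print(all(len(p.split()) <= MAX_WORDS for p in parts))  # True
```

Translating the chunks independently and rejoining them keeps the engine inside the sentence lengths it was trained on.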
8. “Most of these things are not like the other”
Many languages aren’t a dream either: Spanish–Italian, English–Spanish, Arabic–English
Arabic–English, literal word order: “And teaches the teacher her students language the Arabic”
9. Data Engineering: What is Linguistic Engineering?
[Diagram: Input → Pre-processing → MT engine (Training Data) → Post-processing → Output]
10. Data Engineering + Linguistic Engineering: an “ensemble” architecture
[Diagram: Input → patent input classifier → language/domain-specific pre-processing (Chinese pre-ordering rules; Spanish med-device entity recognizer; Korean pharma tokenizer; Japanese script normalisation; German compounding rules; client TM/terminology (optional)) → MT engines trained on Training Data (Moses, Moses, Moses, RBMT) → multi-output combination → statistical post-editing → Output]
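A hedged sketch of how such an ensemble might be wired together. All component names and behaviours here are illustrative stand-ins, not Iconic's real modules; the "engines" are trivial text transforms and the combiner is a naive scorer:

```python
def classify(text):
    # Stand-in input classifier: route patent-looking text to the patent path
    return "patent" if "claim" in text.lower() else "general"

PRE_PROCESSORS = {
    # Real systems would plug in e.g. reference-sign normalisation here
    "patent": [str.strip],
    "general": [str.strip],
}

ENGINES = {
    "moses_a": lambda t: t.upper(),   # stand-in statistical engine 1
    "moses_b": lambda t: t.title(),   # stand-in statistical engine 2
    "rbmt": lambda t: t,              # stand-in rule-based engine
}

def combine(outputs):
    """Multi-output combination: pick the candidate a scorer likes best
    (here, trivially, the longest output; ties go to the first engine)."""
    return max(outputs.values(), key=len)

def ensemble_translate(text):
    domain = classify(text)
    for step in PRE_PROCESSORS[domain]:
        text = step(text)
    outputs = {name: engine(text) for name, engine in ENGINES.items()}
    return combine(outputs)

print(ensemble_translate("  the claim covers a composition  "))
```

The structural point is the routing: classification picks the pre-processing chain, several engines run in parallel, and a combination step selects or merges their outputs before post-editing.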
11. Easier said than done: “a very particular set of skills”
• MT Knowledge (from a scientific perspective)
• Domain Knowledge (the nature of the content)
• Linguistic Knowledge (the characteristics of the language)
12. MT Knowledge
Implementation:
• Computer science!
• Programming
• Data structures
• Algorithms
Science:
• Machine learning
• Probability theory
• Bayesian statistics
• Markov Models
13. Domain Knowledge
What’s important?
• Chemical names
• References to figures
• Claim cross-references
Where do we learn?
• Commercial partners
• LSPs & Translators
• Research
Consistent across languages?
• Japanese abstract order
• Numbering / bullets
• Document layout
Document types?
• Patents: applications, reports
• Pharmaceutical: IFUs, labels
14. Linguistic Knowledge
Number agreement: the house / the houses vs. la maison / les maisons
Gender agreement: the house / the cheese vs. la maison / le fromage
English - Spanish
English - French
15. Linguistic Knowledge
English - German
English - Chinese
种水果的农民
The farmer who grows fruit
[Lit: “grow fruit (particle) farmer”]
16. If you don’t understand it, you can’t translate it
MT with Subject Matter Expertise
“Allopurinol-induced serious cutaneous adverse
reactions (SCAR), including Steven Johnson’s syndrome
(SJS) and toxic epidermal necrolysis (TEN), are
associated with a genetic marker, the HLA-B*5801
allele.”
“IPTranslator is perfect for someone who needs to search [patents]
across multiple languages and is useful in the case of both
patentability and infringement searches.”
– Aalt van de Kuilen, Global Head of Patent Information, Abbott
Machine Translation for Patents
17. What is the value for users?
Specialist solutions deliver more useable outcomes for the user
Post-editing = Increased productivity
For information purposes = Extract more meaning
Multilingual search = Retrieve more relevant results
18. De-risking the machine translation proposition
What is the value for users?
Typical Prerequisites: + Data, + Time, + €€€ = ???
New Prerequisites: + No data needed, + Systems are ready to go, + No upfront cost = Evaluate immediately
Customisation. Refinement.
» Incorporation of user feedback
» Incremental training with post-edits
» Tuning for specific input types
19. Case Studies
1. What this approach means straight up in terms of quality…
2. Productivity gains from using these systems…
3. As a foundation for client customization…
21. Case 1: Quality
[Chart: German to English adequacy ratings on a 1-5 scale. Evaluator 1: 2.83, Evaluator 2: 4.00, Evaluator 3: 3.86, Average: 3.56]
German to English Translation
22. Case 2: Productivity
Iconic had a domain-specific MT solution for that industry
Machine Translation technology for the legal industry
Business Need
23. Case 2: Productivity
Delivered immediately and initial results were positive
Translation samples required for initial evaluation
Process (1)
24. Case 2: Productivity
“The complexities and unforeseen but inevitable surprises of MT
integration in large scale production processes were handled both
competently and efficiently.”
Integrate Iconic with GlobalSight for productivity pilot
Process (2)
25. Case 2: Productivity
>20% productivity increase for translator post-editing Iconic output
“Measurable productivity gains delivered from the outset”
Performance
26. Case 2: Productivity
• Ongoing improvement through feedback from translators
• Ongoing improvement through the incorporation of post-edits
• More than 5 million words translated to date for Asian languages
• Periodic roll-out of new languages over time
Looking forward
27. Case 3: Customization
- Modify our patent machine translation engines for
“Written Opinions” on patents
- 0.25% new data, 2 new ensemble processes
[Chart: Chinese to English BLEU scores, 0-60 scale. Iconic baseline: 21, with modification: +27 (48 total); Google: 20]
Chinese to English
30. All content is not created equal
We cannot afford to be dogmatic when it
comes to MT
Know your subject matter!
Domain specific MT is about more than just
data
Take home messages…
+ Linguistics!
In this presentation, I’m going to talk about our experience of developing machine translation engines for complex content and languages. I’ll look at where we get to when we reach the limitations of existing technology and approaches, particularly focusing on WHY we reached that ceiling and WHAT it was about the content and the language that could be overcome.
From there, I’ll look at what we need to do to advance the technology and, FROM OUR PERSPECTIVE as MT technology developers and providers, tell you what we discovered we needed to know, and what skillsets and knowhow we needed in our team to achieve this. I’ll then WRAP UP with some case studies which will serve to illustrate the benefits that can be seen as a result of taking this approach.
For DEVELOPERS, I hope we can share our experiences with you, and for BUYERS OR USERS OF MT, my hope is that, from your perspective, this talk will pull back the curtain a little bit on MT development, which has been a bit of a black box.
Just a little bit by way of an overview of Iconic Translation Machines to introduce the concepts I'm going to talk about. We develop what we call “MT with Subject Matter Expertise”
The concept is that if you are hiring a professional translator for a job, beyond their language skills they also need to have subject matter expertise, particularly for technical content.
*And the same applies to MT technology*
Our philosophy
High quality data is essential for the most effective approaches to MT. Cleaning and preparing that data to build MT systems is engineering. But data is just an ingredient.
You still need to cook the data for the specific language, the specific content type, and the writing style. This varies from language to language and domain to domain.
We need to know how to cook it: we need to understand the language, the content, and the style, and not only take these into account, but make them integral to the development process. This is linguistic engineering.
How do you go about building such a concept? To answer this, I want to introduce the concept of the ensemble architecture for machine translation
As a developer, you cannot be dogmatic when it comes to approaches to MT. There are many approaches: you can be a statistical MT vendor, you can focus on Moses, you can use rule-based MT, or you might do some sort of hybrid MT.
In the “ensemble” approach, WE DO ALL OF THEM. Sometimes we use them all at the same time. Sometimes we only use one. It’s completely dependent on what works best for a given content type, style, and language together.
e.g. for Chinese-English patent MT, maybe you need a statistical decoder with some rules for automatic post-editing.
Maybe for French-English abstract translation, an SMT system alone suffices. Maybe for Japanese-English titles, we can just use some rules, and perhaps some machine learning based pre-processing.
You study. You learn what ensemble works for a particular configuration and that’s what you implement.
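To make the ensemble idea concrete, here is a minimal sketch of how such a configurable pipeline could be wired up: each language pair and domain selects its own chain of pre-processors, decoders, and post-processors. All component names and the pipeline table here are hypothetical illustrations, not Iconic’s actual implementation.

```python
# Minimal sketch of an "ensemble" MT pipeline: each (source, target, domain)
# configuration selects its own chain of processing steps.
# All components below are illustrative stand-ins, not real systems.

def chinese_preorder(text):
    # placeholder: would reorder source words toward target word order
    return text

def mask_formulae(text):
    # placeholder: would protect chemical formulae etc. with placeholders
    return text

def statistical_post_edit(text):
    # placeholder: would apply learned corrections to decoder output
    return text

def smt_decode(text):
    # placeholder for a statistical decoder such as Moses
    return f"<translated:{text}>"

PIPELINES = {
    ("zh", "en", "patent"): [chinese_preorder, mask_formulae,
                             smt_decode, statistical_post_edit],
    ("fr", "en", "abstract"): [smt_decode],  # SMT alone may suffice here
}

def translate(text, src, tgt, domain):
    # fall back to a plain decoder when no specific ensemble is configured
    steps = PIPELINES.get((src, tgt, domain), [smt_decode])
    for step in steps:
        text = step(text)
    return text
```

The point of the sketch is simply that the "system" is the configuration, not any single decoder: swapping, adding, or removing steps per language/domain is a data change, not a rewrite.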
An instance of this approach is our IPTranslator service for patent/IP/legal translation and I’ll mention patents as an example of a highly complex content type as I go through the rest of the presentation.
To understand this Linguistic Engineering approach, let’s first describe DATA ENGINEERING.
Existing approaches to MT typically use the following process: if a client wants a machine translation system for a certain domain, say IT, they provide the vendor with training data and this gets churned through the various generic processes for each language required. The idea is that by pumping in data in the IT domain, an IT machine translation system comes out at the end. It’s true to a certain extent – AND THAT’s WHY IT’S USED, BECAUSE IT CAN WORK - but the reality is that the quality often doesn’t cut the mustard. The problem with the data engineering approach is that you often need A LOT of data (and many clients simply don’t have it). But then you’re completely reliant on the data to capture all of the nuances of language and content, and this isn’t enough.
We’ve developed methods to manipulate the machine translation system through designed processes that are highly specific to the content being translated: often technical nuances, terminology, etc. that need to be specially accounted for.
***We ALSO need to develop special processes for languages…
LET’S LOOK AT WHY
But of course it’s not just that easy.
Patents for example have a range of highly complex linguistic characteristics that make this challenging, both for PROFESSIONAL translators as well as for Translation Software.
Let’s look for example at this patent: what’s highlighted in blue is a SINGLE sentence (which is an individual legal claim).
Additionally, we have to deal with complex technical constructions such as chemical formulae, alphanumeric sequences, even genomic and amino acid sequences.
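One common way to deal with constructions like these is to protect them from the decoder entirely: replace them with placeholders before translation and restore them afterwards. The sketch below illustrates the idea on the stress/alkyl examples from the slide; the regular expressions are invented for illustration and a real system would use far more robust patterns.

```python
import re

# Sketch: protect technical constructions (measurement ranges, alkyl group
# notation) by swapping them for placeholders before decoding, then
# restoring them afterwards. Patterns are illustrative only.
PATTERNS = [
    # stress ranges like "1.2 to 3.5 N/mm<2>"
    re.compile(r"\b\d+(?:\.\d+)?\s*(?:to|-)\s*\d+(?:\.\d+)?\s*N/mm<2>"),
    # alkyl group notation like "C1-C4 alkyl"
    re.compile(r"\b[A-Z]\d+-[A-Z]\d+\s+alkyl\b"),
]

def mask(text):
    slots = []
    for pat in PATTERNS:
        def repl(match):
            slots.append(match.group(0))
            return f"__TERM{len(slots) - 1}__"
        text = pat.sub(repl, text)
    return text, slots

def unmask(text, slots):
    # restore placeholders (fine for small slot counts; a production
    # version would use a single regex pass to avoid prefix collisions)
    for i, original in enumerate(slots):
        text = text.replace(f"__TERM{i}__", original)
    return text
```

The decoder then only ever sees `__TERM0__`-style tokens, which it learns to carry through unchanged, instead of mangling formulae it has no hope of translating.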
To quote Sesame Street…or to slightly modify a line from a famous Sesame street song… “Most of these things are not like the other”
AS A RULE OF THUMB, the more similar languages are to one another, the easier they are for machine translation, particularly in terms of the order of the words in a sentence, and then also grammatically.
The closer they are, the more you can get away with just using statistical MT and throwing lots of data into the system. But most of them are not like one another.
But what if the languages are SO grammatically different from all perspectives?! Like English and Arabic, where Arabic has a different word order, frequently doesn’t have a verb, and affixes pronouns, articles, and conjunctions to verbs (when they ARE there) and nouns.
Look at this example, which shows many of these phenomena together. Firstly, the words are in a totally different order if we read it out word for word… and it manages to say all that in 5 words due to all the affixes, compounding, and morphology.
Data cannot solve these problems either. Each one of these phenomena needs to be addressed. And that’s where the linguistic knowledge and linguistic engineering comes in…
Existing vendors or MT providers use the following process: if a client wants a machine translation system for a certain domain, say IT, they provide the vendor with training data and this gets churned through the various generic processes for each language required. The idea is that by pumping in data in the IT domain, an IT machine translation system comes out at the end. It’s true to a certain extent but the reality is that the quality often doesn’t cut the mustard. The problem with the data engineering approach is that you need A LOT of data and many clients simply don’t have it.
We’ve developed methods to manipulate the machine translation system through designed processes that are highly specific to the CONTENT being translated (often technical nuances, terminology, etc. that need to be specially accounted for), AS WELL AS the LANGUAGE being translated, which again cannot just be a generic process.
Let’s get rid of the concept of a central MT system – statistical, hybrid or whatever.
Yes we have training data and input, we’ll have some output, and some processes, but what is the journey?...
Combining these factors is a delicate balance. Sometimes the smallest change can affect things. Sometimes big changes have no effect. It really depends on your training data. That presents a challenge when the training data changes for each system that’s built.
LATER, I’ll come back to this and look at some examples where we have QUANTIFIED the impact and the value in taking this approach
BUT FIRST, I want to talk about WHY we took this approach and WHAT we learned over the course of the last few years…
**Good if you can develop the systems with the training data that you know you're going to use...
THAT’S WHAT’S REQUIRED AND DEVELOPMENT OF THE VARIOUS COGS IS AN ONGOING PROCESS.
However, as with most areas of natural language processing (like MT itself as the over-arching process), these things aren’t perfect. You know the way MT is improving? Well, so are syntactic parsing of German, named-entity recognition in Japanese, and Arabic morphological analysis, so it’s about constant iterative improvement. THAT’S WHY THERE ARE NO BREAKTHROUGHS, NO SILVER BULLETS IN MT DEVELOPMENT. We work hard, we improve our German parsing, we improve our German systems a bit…
But all of that is easier said than done. When building a technical team to do this, we have to look closely at what sort of skillset we need. Let me tell you, what we came across is quite the high bar. It’s a talent pool that’s thin on the ground for a number of reasons, which I’ll get to…
To quote another movie, from a compatriot of mine, Liam Neeson in the film Taken: “You need a very particular set of skills”. Now, this is not per person, but these are skills you really need to have within your team to get the most you can out of your MT systems.
**NOW START SLIDES** Over the course of our existence, we’ve identified three key areas in which you need to have expertise in order to be able to develop adequate MT engines for different languages and content types…
1…2…3
Let’s look first at MT knowledge. THIS IS NOT JUST KNOWING HOW TO RUN MOSES. You can’t treat it as a black box. I believe MT knowledge here is two-fold. You have to know the science (THEORY), and you have to know how to implement the science (PRACTICE).
They don’t always go hand in hand…we’re talking implementation from a product development perspective, not from a “let’s hack together my idea in some scripts held together by string so that I can write a paper about my results and it doesn’t really matter how efficiently it works!”
So then, if we know the theory, say how to develop a maximum-entropy classifier to identify chemical names in Korean, we then need to understand the mechanics of the MT engine in order to implement this along with all of the other components in an efficient manner.
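To give a feel for what such a classifier component looks like in practice, here is a deliberately tiny sketch: a perceptron over character trigram features that flags chemical-looking tokens. A production system would use a proper maximum-entropy (logistic regression) model trained on real annotated data; the training examples below are invented purely for illustration.

```python
# Toy stand-in for a "chemical name classifier": a perceptron over
# character trigram features. Training data is invented for illustration.

def trigrams(token):
    # pad with boundary markers so prefixes/suffixes become features
    padded = f"^{token.lower()}$"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def train(examples, epochs=20):
    weights = {}
    for _ in range(epochs):
        for token, label in examples:
            score = sum(weights.get(f, 0.0) for f in trigrams(token))
            pred = 1 if score > 0 else 0
            if pred != label:
                # perceptron update: push feature weights toward the label
                delta = 1.0 if label == 1 else -1.0
                for f in trigrams(token):
                    weights[f] = weights.get(f, 0.0) + delta
    return weights

def is_chemical(token, weights):
    return sum(weights.get(f, 0.0) for f in trigrams(token)) > 0

examples = [
    ("allopurinol", 1), ("necrolysis", 1), ("epidermal", 1), ("alkyl", 1),
    ("translation", 0), ("productivity", 0), ("document", 0), ("figure", 0),
]
weights = train(examples)
```

The implementation point from the talk applies here: this component is cheap on its own, but it has to run inside the engine’s pre-processing chain alongside everything else, which is where the engineering discipline comes in.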
Examples of machine learning methods: support vector machines, decision trees, neural networks
Examples of probability models: Bayesian, HMMs, Maximum Likelihood
Examples of programming languages/styles: Java, Python, C++, MapReduce
Examples of data structures: hashmaps, databases
Examples of algorithms: sorting/searching, parsing
OUTRO: one of the biggest challenges in this regard is finding talent with this skillset. MT graduates and postgraduates are thin on the ground, and many of them are on an academic career path. Couple that with the fact that the research groups are dotted around the world, and hiring becomes a real challenge. There was actually an interesting panel about this at the AMTA conference…
So that’s what you need to be able to develop the MT. With that, what is it that you actually need to develop?
Well, we can split this into two sets of components that need to work together: first, those for the DOMAIN, and then those for the LANGUAGE itself.
Looking at the DOMAIN KNOWLEDGE required first, what do we need to know?
1. WHAT’S IMPORTANT IN THIS DOMAIN?
2. WHAT TYPES OF DOCUMENTS ARE THERE?
3. ARE THESE CHARACTERISTICS CONSISTENT ACROSS LANGUAGES?
4. WHERE DO WE FIND THIS INFORMATION OUT?
The last piece in the puzzle is understanding the languages you’re developing MT systems for.
And that’s not understanding them in isolation; it’s understanding THE RELATIONSHIP between the languages you’re translating to and from, and what the differences are between them. e.g. many of the things we need to look out for when developing English-Spanish translation engines we don’t need to do for French-Spanish translation.
With certain language pairs, things get more complex. The processes that we need to develop are harder to build, less studied, and require smarter people!
For Chinese, we need to identify these DE constructions so we know to move the head noun.
There’s no tense, so going into English, how do we know what tense to use?
There’s no article! We have to generate it!
The DE particle has many translations; which one?!
FIRST THINGS FIRST, which ones are the words?! We need to segment the Chinese!
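That segmentation question can be illustrated with the simplest possible approach: greedy left-to-right maximum matching against a lexicon, shown here on the slide’s own example 种水果的农民 (“the farmer who grows fruit”). The toy lexicon below is an assumption covering only this example; real Chinese segmenters are statistical and far more sophisticated.

```python
# Greedy left-to-right maximum matching: at each position, take the longest
# lexicon entry that matches. Toy lexicon covers only the slide's example.
LEXICON = {"种", "水果", "的", "农民", "水", "果"}
MAX_LEN = max(len(w) for w in LEXICON)

def segment(text):
    words, i = [], 0
    while i < len(text):
        # try the longest candidate first; single characters always succeed
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in LEXICON:
                words.append(candidate)
                i += length
                break
    return words
```

Note that the correct split depends on the lexicon: without 水果 (“fruit”) as an entry, the segmenter would wrongly emit 水 (“water”) and 果 (“fruit/result”) separately, which is exactly the kind of ambiguity the talk is pointing at.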
ONLY WITH THESE SKILLS CAN YOU EXPLOIT THE TECHNOLOGY TO ITS FULLEST. AND WHAT DO WE GET IN DOING THIS? MT WITH SUBJECT MATTER EXPERTISE.
The whole motivation for this is the same as if you’re hiring a linguist for translation: they simply need to have technical subject matter expertise. Otherwise, how can they understand everything?
“If you don’t understand it, you can’t translate it”
The same applies to MT. The training and translation process needs to know what it’s dealing with so it can use the right terms, do the right preprocessing, etc.
That’s what we’ve done with our flagship offering, IPTranslator. Systems have subject matter expertise because they were developed with, and evaluated and used by patent information specialists.
General advantages of this approach to MT
ANALOGY of buying fresh fruit…
Obviously one of the issues in adopting machine translation technology is the risk that’s involved. You invest in a program, it doesn’t deliver straight away, it might start bringing you returns, but when? How long? If ever?
If we look specifically at the approach we’ve taken: Our proposition helps to derisk the adoption of MT from A QUALITY PERSPECTIVE and a DELIVERY PERSPECTIVE
Typical setup involves:
data, across all languages. How much do you have? Is that enough? Is it clean? Is it yours to give away?
Time, how long is development going to take? Will MT be good enough straight away after that? If not, when?
What’s the upfront cost for customisation or subscription to the service?
That’s the value for the users for the whole concept, but what if we get down to the nuts and bolts of it and talk about the value in terms of the returns…what does using this type of MT get you?
To give an illustration, I’ll run through 3 quick examples and case studies from our own experiences.
The first of which will look at what this does in terms of straight up quality of the MT output
After that, no pun intended, we’ll see how that translates to productivity when post-editing the output
Finally, we’ll look at what you can do when you have these systems built and ready to go in terms of customisation, with minimal effort.
All of these examples are using our IPTranslator systems which have been developed for patent machine translation.
First, in terms of MT quality and BLEU scores, here are evaluation results for our Portuguese to English engines across 8 different patent technical areas. Now, while the BLEU scores don’t necessarily have too much meaning by themselves, there’s a clear distinction in the quality of the Iconic output compared to Google Translate and an out-of-the-box Systran engine. These engines are comparable here because we assume that the client has no additional data with which to build an engine from scratch, so we need an “existing” option.
These results correlated well with human assessments of adequacy, one of which we can look at here…
For our German to English system, we had 3 evaluators look at around 400 segments each and rate them from 1-5 in terms of how adequately they carried the meaning from the source to the target. Typically, a score of 3 or higher indicates that the segments are “usable”, i.e. readable and understandable.
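The arithmetic behind the adequacy evaluation just described is straightforward; a minimal sketch, with invented ratings, looks like this:

```python
# Sketch of the adequacy evaluation: each evaluator rates segments 1-5;
# a segment counts as "usable" when its score is 3 or higher.
# The ratings below are invented for illustration.

def usable_fraction(ratings):
    return sum(1 for r in ratings if r >= 3) / len(ratings)

def average_rating(ratings):
    return sum(ratings) / len(ratings)

ratings = [4, 3, 2, 5, 3, 1, 4, 4]
```

Reporting both numbers matters: a decent average can still hide a long tail of unusable segments, which is why the per-segment usability threshold is the more operationally meaningful figure.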
So those are just a couple of brief examples to show that this approach is developing systems that can produce good quality output, without the need for additional adaptation for each individual user.
I want to now look at a case study that illustrates how these systems, as they are, with these levels of quality, can produce output that leads to more productive post-editing…
This is a case study with WeLocalize who had a particular business need…
For English to Chinese MT…
Used on a daily basis
So this ongoing improvement through the incorporation of client-specific data is related to our third case study, about how these engines that we’ve been building with linguistic engineering can serve as a solid backbone for customized engines…
This is a case with another of our clients who have a substantial patent translation business.
They had a slightly different need in that, rather than the translation of patent documents themselves, they wanted to translate what are known as Written Opinions, essentially reports from patent examiners about the validity of a patent application. From an MT perspective, while a lot of the technical terminology is the same, the register is completely different. These written opinions contain first person, questions, opinions: sentence structures and words that just aren’t in patents and consequently not in our original systems.
If we look at how our systems performed when trying to handle this, we get a BLEU score of around 21, where Google, a system designed for whatever’s thrown at it, gets a score of 20, so around the same.
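For readers unfamiliar with the metric being quoted: BLEU is essentially a geometric mean of modified n-gram precisions against a reference translation, multiplied by a brevity penalty. The following is a minimal single-reference, sentence-level sketch (unsmoothed, so it returns 0 when any n-gram order has no overlap); production scoring uses smoothed, corpus-level implementations.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # modified precision: clip candidate counts by reference counts
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # unsmoothed sentence BLEU collapses to zero
        log_precisions.append(math.log(overlap / total))
    # brevity penalty discourages overly short candidates
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

This also shows why the talk keeps hedging that BLEU scores "don't have too much meaning by themselves": the number is only comparable across systems scored on the same test set and references.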
What we needed to do was modify these systems for this particular type of text. What we had at hand to do this was some TMs from our client, though not much: it amounted to around 0.25% of the amount of data we’d trained our original engines with.
We also developed a couple of processes to add to our ensemble architecture to handle specifics of these Reports, such as consistent references to PCT (patent cooperation treaty) Regulations.
This resulted in the performance more than doubling….
In terms of how this translated into post-editing productivity for the client, let’s look at this scatter plot.
Each dot is a segment in our test. Along the horizontal axis we have the length of the segment in words. On the vertical axis we have a proprietary score that correlates with post-editing productivity, whereby a score of 0.4 means, roughly, there’ll be some productivity gain from post-editing. Above that means most likely not, and the lower the score, the less editing is required.
So here we can see that only a small portion of the segments fall below the threshold so, basically, the document (which is essentially out of domain) is NOT VIABLE for this MT system.
However, AFTER we do the customisation we see that a large number of the segments drop below the line, a bit over 60% of them, with quite a few hitting the 0 score also.
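The viability check being described reduces to counting what share of segments fall below the 0.4 threshold. A minimal sketch, with invented segment scores standing in for the proprietary metric:

```python
# Sketch of the viability check: a document is worth post-editing when a
# sufficiently large share of its segments score below the 0.4 threshold
# (lower score = less editing needed). Scores are invented for illustration.
THRESHOLD = 0.4

def viable_share(scores):
    below = sum(1 for s in scores if s < THRESHOLD)
    return below / len(scores)

before = [0.55, 0.48, 0.61, 0.39, 0.52]  # mostly above threshold: not viable
after = [0.12, 0.0, 0.35, 0.44, 0.21]    # after customisation: mostly below
```

On these invented numbers the share of viable segments jumps from 20% to 80%, which mirrors the shape of the real result described in the talk (a bit over 60% of segments dropping below the line after customisation).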
When we run the numbers on these, they lead to productivity gains of around 25%.
The heavy lifting has been done
Some of these points may be obvious, but allow me to elaborate
All content is not created equal (to modify a well-known phrase); as such, the (machine) translation process has to be different.
We cannot afford to be dogmatic when it comes to MT; one size does not fit all. If we are practitioners of SMT only, we’re restricting ourselves. Even being “hybrid” is restrictive: it’s SMT + rules, or rule-based + statistical post-editing.
Domain specific MT is about more than just data; a sufficient amount of good quality, clean training data is obviously a key component in the MT training process (especially for SMT), but it’s not everything. To use a cooking analogy, data is to MT what ingredients are to a chef. The chef (in this case the training/development process) needs to know what to do with the ingredients. To bring it back to MT, the training and translation processes need to be informed by the data, by the content type, and by the subject matter.
Training is sensitive to data, so you could have the most refined approach, but data will be the biggest variable affecting quality. Our approach allows us to deliver high quality “out of the box”, which we then refine, as opposed to the great unknown of training from scratch.