Delivered at the 26th LocWorld Conference in North America.
October 31st 2014
Vancouver, Canada.
In this talk, we describe the various strands of knowledge (machine translation, language, and industry) required to develop effective MT software.
Delivered at the TAUS Machine Translation Showcase.
June 6th 2014
Dublin, Ireland.
In this talk, we explain how machine translation systems can be developed for highly technical content types.
Delivered at the Machine Translation Summit during a special workshop on MT for patent and scientific literature.
October 30th 2015
Miami, Florida.
In this talk, we describe how we adapted machine translation for patents to help a translation company improve their productivity.
These slides are a combination of 3 different presentations given at LocWorld 31, the TAUS Industry Leaders Forum, and the TAUS QE Summit, all held in Dublin, Ireland, from June 6-10.
Delivered at the 29th LocWorld conference.
October 16th 2015
Santa Clara, CA, USA.
In this talk, we describe how we carried out a successful large scale evaluation and deployment of machine translation at RWS.
The music-loving Baltic countries are a multilingual hotspot in Europe, with the majority of citizens speaking (and singing) three languages on a daily basis. At the same time, the melodious Baltic languages are famously complex and morphologically rich, containing lots of ambiguity and intricate word agreements. Taken together, these factors make the region a prime spot for driving innovation in language technologies. Tilde, a language technology company specializing in custom MT and terminology services, has leveraged its extensive linguistic experience in the Baltic region to create custom MT systems for a wide variety of languages and domains, helping EU and global companies to boost translation productivity and make their applications multilingual. Tilde recently embarked on the challenging task of building a large-scale MT service for the Latvian government, Hugo.lv. This service was adapted to create a communication tool for the 2015 EU Presidency. The presentation will introduce the audience to languages and MT in the Baltic region and highlight these two case studies, which showcase the crucial role of language technology in enabling multilingual communication in the digital age.
Tony O’Dowd (KantanMT). KantanMT enables its community to generate meaningful business intelligence that helps them identify the scope of their customised machine translation projects. More importantly, it helps them schedule and scale those projects to achieve maximum translation productivity and a positive ROI.
Predicting the quality of an MT engine without an existing target reference is one of the trickiest parts of MT technology. It plays an essential role in making MT usable in real-life scenarios. Perspective by Gábor Bessenyei (CEO of MorphoLogic Localisation Ltd.).
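As a toy illustration of what reference-free quality estimation looks like in code (a sketch with invented feature names, not MorphoLogic's method), two classic surface features, length ratio and source-copy rate, can be computed without any target reference:

```python
def qe_heuristics(source, translation):
    """Naive reference-free quality-estimation features: length ratio and
    untranslated-token overlap. Real QE systems learn from many such signals."""
    src, tgt = source.split(), translation.split()
    length_ratio = len(tgt) / max(len(src), 1)
    # Tokens copied verbatim from the source often indicate untranslated spans
    copied = len(set(src) & set(tgt)) / max(len(src), 1)
    return {"length_ratio": round(length_ratio, 2), "source_copy_rate": round(copied, 2)}

print(qe_heuristics("der Hund schläft", "the dog sleeps"))
# {'length_ratio': 1.0, 'source_copy_rate': 0.0}
```

A very short or very long output relative to the source, or a high source-copy rate, would flag a segment for human review.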
In all of our translation production activities we are producing data, lots of data. We are not talking here about the actual translations that are stored as translation memory data. Translation memory data have proven very valuable over the years, and recently again as training data for machine translation engines. In this session we are talking about the other data: data about the translation process. How much time was spent on different tasks, for different languages and content types, per project? What was the quality score for the translator, for the vendor? What was the user feedback on this machine-translated support article? How is our MT engine performing, and has it improved since last year, since we added 13 million more words to the training set? Some buyers and providers of translation are further ahead than others in the use of all this translation management data. The TAUS Dynamic Quality Framework (DQF) tracks translation management data through plug-ins that are already available for various translation tools and platforms. The vision is becoming very clear: the translation industry can have its own “Big Data”. In the past couple of months TAUS enterprise members have contributed their wishes and requirements for an industry benchmarking platform for translation quality and productivity. In this session several TAUS members will share and discuss their plans for using DQF and the Quality Dashboard. What data would you like to track?
Session host: Daniel Goldschmidt (Microsoft)
Presenters and panelists are: Annya Sedakova-Bertram (EMC), Fred Tuinstra (Lionbridge), Achim Ruopp (TAUS)
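As a rough sketch of the kind of process-data roll-up the session describes (the field names and figures are hypothetical, not DQF's actual schema), per-task records can be aggregated by language and content type:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical process records: (language, content_type, task, minutes, quality_score)
records = [
    ("de", "support", "post-editing", 42, 4.1),
    ("de", "support", "review", 15, 4.4),
    ("ja", "ui", "post-editing", 65, 3.7),
    ("ja", "ui", "review", 20, 3.9),
]

def summarize(records):
    """Aggregate total minutes and mean quality per (language, content type)."""
    buckets = defaultdict(list)
    for lang, ctype, task, minutes, score in records:
        buckets[(lang, ctype)].append((minutes, score))
    return {
        key: {
            "total_minutes": sum(m for m, _ in vals),
            "mean_quality": round(mean(s for _, s in vals), 2),
        }
        for key, vals in buckets.items()
    }

print(summarize(records)[("de", "support")])
# {'total_minutes': 57, 'mean_quality': 4.25}
```

Comparing such roll-ups across months or vendors is exactly the benchmarking the Quality Dashboard is meant to make routine.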
This was a pitch for Iconic's neural machine translation technology given at the TAUS Annual Conference in Portland, Oregon, on October 24th, 2016.
There has been a lot of talk, and a lot of hype, about neural machine translation in the press, but not a lot of practical application. Let's change the conversation.
In this session, with clear focus on Machine Translation (MT) quality, we will discuss different ways to improve MT engines. Which engine do you use and how do you measure improvement? What are the right metrics to evaluate MT quality for the specific content types? How do you interpret and act on the evaluation results? It's fine when errors are labeled and analyzed, but how can that help improve your engine? Are there best practices available? And how about Neural MT? Should we measure that differently? After some use cases shared by the speakers, these questions will be addressed in the break-out session.
This was a presentation given at the European Patent Office's annual Patent Information Conference in Madrid, Spain on November 10th, 2016.
In it, we give an overview of how machine translation works, the latest advances in neural MT, and how these can be applied to patents and intellectual property content, not only for translation but also for information extraction and other NLP applications.
Delivered at the European Patent Office's annual Patent Information Conference (EPOPIC 2014)
November 5th 2014
Warsaw, Poland.
In this talk, we give an introduction as to how machine translation works and what makes certain content types and languages more difficult than others.
Delivered at the TAUS Quality Evaluation Summit.
May 28th 2015
Dublin, Ireland.
In this talk, we describe how to carry out machine translation evaluation in order to extract meaningful business intelligence.
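One hedged example of an automatic metric that can feed such business intelligence (a simplified, shift-free variant of translation edit rate, not necessarily the metric used in the talk):

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            # prev holds the diagonal (old d[j-1]); d[j] is still the old value here
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (h != r))
    return d[len(ref)]

def ter(hypothesis, reference):
    """Translation edit rate: word edits / reference length (shift ops omitted)."""
    hyp, ref = hypothesis.split(), reference.split()
    return edit_distance(hyp, ref) / max(len(ref), 1)

score = ter("the patent claims a new method", "the patent claims a novel method")
print(round(score, 3))  # 0.167, i.e. one substitution over six reference words
```

Lower scores mean less post-editing effort, which is the quantity that translates directly into cost and turnaround figures.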
This was a presentation given at the conference of the Association for Machine Translation in the Americas (AMTA) in Austin, Texas on October 31st, 2016. This is a predominantly academic event, and this presentation was a condensed version of our "MT Success Blog Series" on our website, where we aimed to give the community an idea of the practical considerations around commercial machine translation.
http://iconictranslation.com/2016/07/8-steps-to-mt-success-series-introduction/
Delivered at the European Patent Office's Patent Information Conference.
November 11th 2015
Miami, Florida.
In this talk, we discuss recent advances in MT for patents and introduce our IPTranslator.com application for on-demand translation.
Past, Present, and Future: Machine Translation & Natural Language Processing (John Tinsley)
There are a number of current approaches to developing commercial machine translation systems, ranging from do-it-yourself platforms to fully customized development as a professional service. While these various approaches have their relative merits, they all present a number of drawbacks for the end user, be it the inability to handle complex content or a long and expensive period of development and testing.
At Iconic Translation Machines, our approach goes beyond basic engineering of data to build MT systems and overcome these drawbacks. We combine deep domain knowledge and linguistic expertise to deliver highly focused MT engines for targeted domains and languages. Our IPTranslator service, for example, has been developed using this approach to produce intelligent MT systems adapted for patent and legal content. We demonstrate how this approach has delivered significant value to end users and describe how these systems serve as an ideal launchpad for ongoing adaptation and optimization.
This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit. MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme.
For the latest updates go to http://www.statmt.org/mosescore/
or follow us on Twitter - #MosesCore
Delivered at the biannual conference of the Association for Machine Translation in the Americas (AMTA 2014)
October 24th 2014
Vancouver, Canada.
In this talk, we describe how state-of-the-art research led to the establishment of Iconic Translation Machines.
The Pangea Machine Translation platform from Pangeanic. A product presentation by Manuel Herranz, Elia Yuste, and Andi Frank showcasing automated cleaning cycles, automated engine retraining, and machine translation engine creation.
Internationalizing a Complex B2B Application (Bob Donaldson)
This is a joint presentation made at Localization World 2013 in Singapore. It developed out of a client engagement seeking an optimal solution to taking a large complex service across language borders.
Living Multiple Lives: The New Technical Communicator (Scott Abel)
Presented by Noz Urbina at the Documentation and Training West 2008 conference (www.doctrain.com), May 6-9, 2008, in Vancouver, BC.
This presentation is for team leaders, information managers, technical communicators, and product managers who care about maximizing efficiency and return on investment in the information-heavy parts of their product cycle.
We will discuss current developments in the field of technical communications and how the role of the technical communicator has been rapidly and fundamentally evolving. The world is becoming more and more tech-savvy by the picosecond, and more savvy means more demanding. Today, organizations need to juggle management of customer-generated content, maximize the use of cross-departmental contributions, and still deliver quality technical communication products to their user base; their ability to balance internal and external management of supporting technical information has gone from being a burdensome nuisance to a central, strategic must for market competitiveness.
This presentation takes a low-tech, cross-industry look at why strategies are changing and how organizations are adapting (or not!) to these challenges. Best practices for approach, organizing teams, planning for change, DITA/XML, and departmental integration will all be addressed.
Panelists: Yoshiyasu Yamakawa (Intel), JP Barraza (Systran), Konstantin Dranch (Memsource), David Koot (TAUS)
The focus of this session will be on predictions and risk management. What kinds of things can you predict, and how can you manage risks by analyzing your translation data or monitoring your productivity and quality? Tracking translation data in different cycles of the translation process (translation, post-editing, review, proofreading) offers tremendous value when it comes to predicting future trends or making informed choices. What type of data can be valuable, and what kind of predictions can we make using this data? How can we make more efficient use of already available data? How can we use this type of data to improve machine translation, automatic QA, error recognition, sampling, or quality estimation? How can academia and industry work together towards a common goal?
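A minimal sketch of the kind of prediction the session has in mind (the throughput figures are invented): fit a least-squares trend to monthly post-editing speed and extrapolate one month ahead.

```python
from statistics import mean

# Hypothetical monthly post-editing throughput (words per hour)
months = [1, 2, 3, 4, 5, 6]
throughput = [520, 535, 560, 580, 590, 610]

def linear_trend(xs, ys):
    """Least-squares slope and intercept for a simple trend forecast."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

slope, intercept = linear_trend(months, throughput)
forecast_month_7 = slope * 7 + intercept
print(round(slope, 1), round(forecast_month_7))  # trend of ~18 words/hour gained per month
```

The same fit over quality scores or error rates supports the risk-management side: a downward trend is an early warning before a service-level breach.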
Joaquin Fagundo presents on Technology Transfer Impact. He is an information technology professional with deep expertise in software development and testing.
Data and Linguistics: Delivering Machine Translation with Subject Matter Expertise
1. “Data & Linguistics”
Delivering Machine Translation with Subject Matter Expertise
John Tinsley
Director / Co-Founder
Localization World, 31st Oct 2014, Vancouver
5. The world’s first and only patent-specific MT system that’s ready to go
6. Data Engineering: What is Linguistic Engineering?
[Diagram: Input → Pre-processing → MT engine (Training Data) → Post-processing → Output]
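The slide's pipeline (input, pre-processing, translation, post-processing, output) can be sketched as composed stages. The steps below are trivial stand-ins assumed for illustration, not Iconic's actual components:

```python
def pre_process(text):
    # Example normalisation step: collapse whitespace and lowercase
    return " ".join(text.lower().split())

def translate(text):
    # Stand-in for the trained MT engine (identity here)
    return text

def post_process(text):
    # Example restoration step: capitalise the first letter
    return text[:1].upper() + text[1:]

def pipeline(text):
    """Input -> pre-processing -> MT engine -> post-processing -> Output."""
    return post_process(translate(pre_process(text)))

print(pipeline("  THE  claimed   invention "))  # The claimed invention
```

The point of the architecture is that the engine in the middle stays generic while the language- and domain-specific intelligence lives in the surrounding stages.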
7. Patents: an MT nightmare
Technical constructions: “L is an organic group selected from -CH2-(OCH2CH2)n-, -CO-NR'-, with R'=H or C1-C4 alkyl group; n=0-8; Y=F, CF3 …”; “maximum stress of 1.2 to 3.5 N/mm<2> and a maximum elongation of 700 to 1,300% at 0[deg.] C.”
Long sentences: largest single document: 249,322 words; longest sentence: 1,417 words
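One common mitigation for such extreme sentence lengths (an illustrative assumption; the talk does not claim this is Iconic's exact approach) is to split overlong sentences at clause boundaries before translation:

```python
import re

MAX_WORDS = 50  # hypothetical length threshold

def split_long_sentence(sentence, max_words=MAX_WORDS):
    """Split an overlong sentence at clause boundaries (';' or ',') so that
    each chunk stays under max_words, a common MT pre-processing step.
    Chunks are flushed at clause granularity, so a single giant clause
    still becomes one chunk."""
    if len(sentence.split()) <= max_words:
        return [sentence]
    chunks, current = [], []
    for clause in re.split(r"(?<=[;,])\s+", sentence):
        if current and len(" ".join(current + [clause]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(clause)
    if current:
        chunks.append(" ".join(current))
    return chunks

# A synthetic 100-word patent-style claim built from 20 five-word clauses
claim = "; ".join(["the device comprises a housing"] * 20)
parts = split_long_sentence(claim)
print(all(len(p.split()) <= MAX_WORDS for p in parts))  # True
```

Translating the chunks independently and rejoining them keeps the engine inside the sentence lengths it was trained on.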
8. “Most of these things are not like the other”
Many languages aren’t a dream either: Spanish–Italian, English–Spanish, Arabic–English
Arabic–English, literal word order: “And teaches the teacher her students language the Arabic”
9. Data Engineering: What is Linguistic Engineering?
[Diagram: Input → Pre-processing → MT engine (Training Data) → Post-processing → Output]
10. Data Engineering + Linguistic Engineering: an “ensemble” architecture
[Diagram: Input → patent input classifier → language/domain-specific pre-processing (Chinese pre-ordering rules; Spanish med-device entity recognizer; Korean pharma tokenizer; Japanese script normalisation; German compounding rules; client TM/terminology (optional)) → MT engines trained on Training Data (Moses, Moses, Moses, RBMT) → multi-output combination → statistical post-editing → Output]
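A hedged sketch of how such an ensemble might be wired together. All component names and behaviours here are illustrative stand-ins, not Iconic's real modules; the "engines" are trivial text transforms and the combiner is a naive scorer:

```python
def classify(text):
    # Stand-in input classifier: route patent-looking text to the patent path
    return "patent" if "claim" in text.lower() else "general"

PRE_PROCESSORS = {
    # Real systems would plug in e.g. reference-sign normalisation here
    "patent": [str.strip],
    "general": [str.strip],
}

ENGINES = {
    "moses_a": lambda t: t.upper(),   # stand-in statistical engine 1
    "moses_b": lambda t: t.title(),   # stand-in statistical engine 2
    "rbmt": lambda t: t,              # stand-in rule-based engine
}

def combine(outputs):
    """Multi-output combination: pick the candidate a scorer likes best
    (here, trivially, the longest output; ties go to the first engine)."""
    return max(outputs.values(), key=len)

def ensemble_translate(text):
    domain = classify(text)
    for step in PRE_PROCESSORS[domain]:
        text = step(text)
    outputs = {name: engine(text) for name, engine in ENGINES.items()}
    return combine(outputs)

print(ensemble_translate("  the claim covers a composition  "))
```

The structural point is the routing: classification picks the pre-processing chain, several engines run in parallel, and a combination step selects or merges their outputs before post-editing.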
11. Easier said than done: “a very particular set of skills”
• MT Knowledge (from a scientific perspective)
• Domain Knowledge (the nature of the content)
• Linguistic Knowledge (the characteristics of the language)
12. MT Knowledge
Implementation:
• Computer science!
• Programming
• Data structures
• Algorithms
Science:
• Machine learning
• Probability theory
• Bayesian statistics
• Markov Models
13. Domain Knowledge
What’s important?
• Chemical names
• References to figures
• Claim cross-references
Where do we learn?
• Commercial partners
• LSPs & Translators
• Research
Consistent across languages?
• Japanese abstract order
• Numbering / bullets
• Document layout
Document types?
• Patents: applications, reports
• Pharmaceutical: IFUs, labels
14. Linguistic Knowledge
Number agreement: the house / the houses vs. la maison / les maisons
Gender agreement: the house / the cheese vs. la maison / le fromage
English - Spanish
English - French
15. Linguistic Knowledge
English - German
English - Chinese
种水果的农民
The farmer who grows fruit
[Lit: “grow fruit (particle) farmer”]
16. If you don’t understand it, you can’t translate it
MT with Subject Matter Expertise
“Allopurinol-induced serious cutaneous adverse
reactions (SCAR), including Steven Johnson’s syndrome
(SJS) and toxic epidermal necrolysis (TEN), are
associated with a genetic marker, the HLA-B*5801
allele.”
“IPTranslator is perfect for someone who needs to search [patents]
across multiple languages and is useful in the case of both
patentability and infringement searches.”
– Aalt van de Kuilen, Global Head of Patent Information, Abbott
Machine Translation for Patents
17. What is the value for users?
Specialist solutions deliver more useable outcomes for the user
Post-editing = Increased productivity
For information purposes = Extract more meaning
Multilingual search = Retrieve more relevant results
18. De-risking the machine translation proposition
What is the value for users?
Typical Prerequisites: + Data, + Time, + €€€ = ???
New Prerequisites: + No data needed, + Systems are ready to go, + No upfront cost = Evaluate immediately
Customisation. Refinement.
» Incorporation of user feedback
» Incremental training with post-edits
» Tuning for specific input types
19. Case Studies
1. What this approach means straight up in terms of quality…
2. Productivity gains from using these systems…
3. As a foundation for client customization…
21. Case 1: Quality
[Chart: German to English adequacy ratings on a 1-5 scale. Evaluator 1: 2.83, Evaluator 2: 4.00, Evaluator 3: 3.86, Average: 3.56]
German to English Translation
22. Case 2: Productivity
Iconic had a domain-specific MT solution for that industry
Machine Translation technology for the legal industry
Business Need
23. Case 2: Productivity
Delivered immediately and initial results were positive
Translation samples required for initial evaluation
Process (1)
24. Case 2: Productivity
“The complexities and unforeseen but inevitable surprises of MT
integration in large scale production processes were handled both
competently and efficiently.”
Integrate Iconic with GlobalSight for productivity pilot
Process (2)
25. Case 2: Productivity
>20% productivity increase for translator post-editing Iconic output
“Measurable productivity gains delivered from the outset”
Performance
26. Case 2: Productivity
• Ongoing improvement through feedback from translators
• Ongoing improvement through the incorporation of post-edits
• More than 5 million words translated to date for Asian languages
• Periodic roll-out of new languages over time
Looking forward
27. Case 3: Customization
- Modify our patent machine translation engines for
“Written Opinions” on patents
- 0.25% new data, 2 new ensemble processes
[Chart: Chinese to English BLEU scores, 0-60 scale. Iconic baseline: 21, with modification: +27 (48 total); Google: 20]
Chinese to English
30. All content is not created equal
We cannot afford to be dogmatic when it
comes to MT
Know your subject matter!
Domain specific MT is about more than just
data
Take home messages…
+ Linguistics!
In this presentation, I’m going to talk about our experience of developing machine translation engines for complex content and languages. I’ll look at where we get to when we reach the limitations of existing technology and approaches, particularly focusing on WHY we reached that ceiling and WHAT it was about the content and the language that could be overcome.
From there, I’ll look at what we need to do to advance the technology and, FROM OUR PERSPECTIVE as MT technology developers and providers, tell you what we discovered we needed to know, and what skillsets and knowhow we needed in our team to achieve this. I’ll then WRAP UP with some case studies which will serve to illustrate the benefits that can be seen as a result of taking this approach.
For DEVELOPERS, I hope we can share our experiences with you, and for BUYERS OR USERS OF MT, my hope is that, from your perspective, this talk will pull back the curtain a little bit on MT development, which has been a bit of a black box.
Just a little bit by way of an overview of Iconic Translation Machines to introduce the concepts I'm going to talk about. We develop what we call “MT with Subject Matter Expertise”
The concept is that if you are hiring a professional translator for a job, beyond their language skills they also need to have subject matter expertise, particularly for technical content.
*And the same applies to MT technology*
Our philosophy
High quality data is essential for the most effective approaches to MT. Cleaning and preparing that data to build MT systems is engineering. But data is just an ingredient.
You still need to cook the data for the specific language, the specific content type, and the writing style. This varies from language to language and domain to domain.
We need to know how to cook it: we need to understand the language, the content, and the style, and not only take these into account, but make them integral to the development process. This is linguistic engineering.
How do you go about building such a concept? To answer this, I want to introduce the concept of the ensemble architecture for machine translation
As a developer, you cannot be dogmatic when it comes to approaches to MT. There are many approaches: you can be a statistical MT vendor, you can focus on Moses, you can use rule-based MT, or you might do some sort of hybrid MT.
In the “ensemble” approach, WE DO ALL OF THEM. Sometimes we use them all at the same time. Sometimes we only use one. It’s completely dependent on what works best for a given content type, style, and language together.
e.g. for Chinese-English patent MT, maybe you need a statistical decoder with some rules for automatic post-editing.
Maybe for French-English abstract translation, an SMT system alone suffices. Maybe for Japanese-English titles, we can just use some rules, and perhaps some machine learning based pre-processing.
You study. You learn what ensemble works for a particular configuration and that’s what you implement.
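To make the ensemble idea concrete, here is a minimal sketch of how such a configurable pipeline could be wired up: each language pair and domain selects its own chain of pre-processors, decoders, and post-processors. All component names and the pipeline table here are hypothetical illustrations, not Iconic’s actual implementation.

```python
# Minimal sketch of an "ensemble" MT pipeline: each (source, target, domain)
# configuration selects its own chain of processing steps.
# All components below are illustrative stand-ins, not real systems.

def chinese_preorder(text):
    # placeholder: would reorder source words toward target word order
    return text

def mask_formulae(text):
    # placeholder: would protect chemical formulae etc. with placeholders
    return text

def statistical_post_edit(text):
    # placeholder: would apply learned corrections to decoder output
    return text

def smt_decode(text):
    # placeholder for a statistical decoder such as Moses
    return f"<translated:{text}>"

PIPELINES = {
    ("zh", "en", "patent"): [chinese_preorder, mask_formulae,
                             smt_decode, statistical_post_edit],
    ("fr", "en", "abstract"): [smt_decode],  # SMT alone may suffice here
}

def translate(text, src, tgt, domain):
    # fall back to a plain decoder when no specific ensemble is configured
    steps = PIPELINES.get((src, tgt, domain), [smt_decode])
    for step in steps:
        text = step(text)
    return text
```

The point of the sketch is simply that the "system" is the configuration, not any single decoder: swapping, adding, or removing steps per language/domain is a data change, not a rewrite.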
An instance of this approach is our IPTranslator service for patent/IP/legal translation and I’ll mention patents as an example of a highly complex content type as I go through the rest of the presentation.
To understand this Linguistic Engineering approach, let’s first describe DATA ENGINEERING.
Existing approaches to MT typically use the following process: if a client wants a machine translation system for a certain domain, say IT, they provide the vendor with training data and this gets churned through the various generic processes for each language required. The idea is that by pumping in data in the IT domain, an IT machine translation system comes out at the end. It’s true to a certain extent – AND THAT’s WHY IT’S USED, BECAUSE IT CAN WORK - but the reality is that the quality often doesn’t cut the mustard. The problem with the data engineering approach is that you often need A LOT of data (and many clients simply don’t have it). But then you’re completely reliant on the data to capture all of the nuances of language and content, and this isn’t enough.
We’ve developed methods to manipulate the machine translation system through designed processes that are highly specific to the content being translated: often technical nuances, terminology, etc. that need to be specially accounted for.
***We ALSO need to develop special processes for languages…
LET’S LOOK AT WHY
But of course it’s not just that easy.
Patents for example have a range of highly complex linguistic characteristics that make this challenging, both for PROFESSIONAL translators as well as for Translation Software.
Let’s look for example at this patent: what’s highlighted in blue is a SINGLE sentence (which is an individual legal claim).
Additionally, we have to deal with complex technical constructions such as chemical formulae, alphanumeric sequences, even genomic and amino acid sequences.
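One common way to deal with constructions like these is to protect them from the decoder entirely: replace them with placeholders before translation and restore them afterwards. The sketch below illustrates the idea on the stress/alkyl examples from the slide; the regular expressions are invented for illustration and a real system would use far more robust patterns.

```python
import re

# Sketch: protect technical constructions (measurement ranges, alkyl group
# notation) by swapping them for placeholders before decoding, then
# restoring them afterwards. Patterns are illustrative only.
PATTERNS = [
    # stress ranges like "1.2 to 3.5 N/mm<2>"
    re.compile(r"\b\d+(?:\.\d+)?\s*(?:to|-)\s*\d+(?:\.\d+)?\s*N/mm<2>"),
    # alkyl group notation like "C1-C4 alkyl"
    re.compile(r"\b[A-Z]\d+-[A-Z]\d+\s+alkyl\b"),
]

def mask(text):
    slots = []
    for pat in PATTERNS:
        def repl(match):
            slots.append(match.group(0))
            return f"__TERM{len(slots) - 1}__"
        text = pat.sub(repl, text)
    return text, slots

def unmask(text, slots):
    # restore placeholders (fine for small slot counts; a production
    # version would use a single regex pass to avoid prefix collisions)
    for i, original in enumerate(slots):
        text = text.replace(f"__TERM{i}__", original)
    return text
```

The decoder then only ever sees `__TERM0__`-style tokens, which it learns to carry through unchanged, instead of mangling formulae it has no hope of translating.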
To quote Sesame Street…or to slightly modify a line from a famous Sesame street song… “Most of these things are not like the other”
AS A RULE OF THUMB, the more similar languages are to one another, the easier they are for machine translation, particularly in terms of the order of the words in a sentence, and then also grammatically.
The closer they are, the more you can get away with just using statistical MT and throwing lots of data into the system. But most of them are not like one another.
But what if the languages are SO grammatically different from all perspectives?! Like English and Arabic, where Arabic has a different word order, frequently doesn’t have a verb, and affixes pronouns, articles, and conjunctions to verbs (when they ARE there) and nouns.
Look at this example, which shows many of these phenomena together. Firstly, the words are in a totally different order if we read it out word for word… and it manages to say all that in 5 words due to all the affixes, compounding, and morphology.
Data cannot solve these problems either. Each one of these phenomena needs to be addressed. And that’s where the linguistic knowledge and linguistic engineering comes in…
Existing vendors or MT providers use the following process: if a client wants a machine translation system for a certain domain, say IT, they provide the vendor with training data and this gets churned through the various generic processes for each language required. The idea is that by pumping in data in the IT domain, an IT machine translation system comes out at the end. It’s true to a certain extent but the reality is that the quality often doesn’t cut the mustard. The problem with the data engineering approach is that you need A LOT of data and many clients simply don’t have it.
We’ve developed methods to manipulate the machine translation system through designed processes that are highly specific to the CONTENT being translated (often technical nuances, terminology, etc. that need to be specially accounted for), AS WELL AS the LANGUAGE being translated, which again cannot just be a generic process.
Let’s get rid of the concept of a central MT system – statistical, hybrid or whatever.
Yes we have training data and input, we’ll have some output, and some processes, but what is the journey?...
Combining these factors is a delicate balance. Sometimes the smallest change can affect things. Sometimes big changes have no effect. It really depends on your training data. That presents a challenge when the training data changes for each system that’s built.
LATER, I’ll come back to this and look at some examples where we have QUANTIFIED the impact and the value in taking this approach
BUT FIRST, I want to talk about WHY we took this approach and WHAT we learned over the course of the last few years…
**Good if you can develop the systems with the training data that you know you're going to use...
THAT’S WHAT’S REQUIRED AND DEVELOPMENT OF THE VARIOUS COGS IS AN ONGOING PROCESS.
However, as with most areas of natural language processing (like MT itself as the over-arching process), these things aren’t perfect. You know the way MT is improving? Well, so are syntactic parsing of German, named-entity recognition in Japanese, and Arabic morphological analysis, so it’s about constant iterative improvement. THAT’S WHY THERE ARE NO BREAKTHROUGHS, NO SILVER BULLETS IN MT DEVELOPMENT. We work hard, we improve our German parsing, we improve our German systems a bit…
But all of that is easier said than done. When building a technical team to do this, we have to look closely at what sort of skillset we need. Let me tell you, what we came across is quite the high bar. It’s a talent pool that’s thin on the ground for a number of reasons, which I’ll get to…
To quote another movie, from a compatriot of mine, Liam Neeson in the film Taken: “You need a very particular set of skills”. Now, this is not per person, but these are skills you really need to have within your team to get the most you can out of your MT systems.
**NOW START SLIDES** Over the course of our existence, we’ve identified three key areas in which you need to have expertise in order to be able to develop adequate MT engines for different languages and content types…
1…2…3
Let’s look first at MT knowledge. THIS IS NOT JUST KNOWING HOW TO RUN MOSES. You can’t treat it as a black box. I believe MT knowledge here is two-fold. You have to know the science (THEORY), and you have to know how to implement the science (PRACTICE).
They don’t always go hand in hand…we’re talking implementation from a product development perspective, not from a “let’s hack together my idea in some scripts held together by string so that I can write a paper about my results and it doesn’t really matter how efficiently it works!”
So then, if we know the theory, say how to develop a maximum-entropy classifier to identify chemical names in Korean, we then need to understand the mechanics of the MT engine in order to implement this along with all of the other components in an efficient manner.
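To give a feel for what such a classifier component looks like in practice, here is a deliberately tiny sketch: a perceptron over character trigram features that flags chemical-looking tokens. A production system would use a proper maximum-entropy (logistic regression) model trained on real annotated data; the training examples below are invented purely for illustration.

```python
# Toy stand-in for a "chemical name classifier": a perceptron over
# character trigram features. Training data is invented for illustration.

def trigrams(token):
    # pad with boundary markers so prefixes/suffixes become features
    padded = f"^{token.lower()}$"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def train(examples, epochs=20):
    weights = {}
    for _ in range(epochs):
        for token, label in examples:
            score = sum(weights.get(f, 0.0) for f in trigrams(token))
            pred = 1 if score > 0 else 0
            if pred != label:
                # perceptron update: push feature weights toward the label
                delta = 1.0 if label == 1 else -1.0
                for f in trigrams(token):
                    weights[f] = weights.get(f, 0.0) + delta
    return weights

def is_chemical(token, weights):
    return sum(weights.get(f, 0.0) for f in trigrams(token)) > 0

examples = [
    ("allopurinol", 1), ("necrolysis", 1), ("epidermal", 1), ("alkyl", 1),
    ("translation", 0), ("productivity", 0), ("document", 0), ("figure", 0),
]
weights = train(examples)
```

The implementation point from the talk applies here: this component is cheap on its own, but it has to run inside the engine’s pre-processing chain alongside everything else, which is where the engineering discipline comes in.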
Examples of machine learning methods: support vector machines, decision trees, neural networks
Examples of probability models: Bayesian, HMMs, Maximum Likelihood
Examples of programming languages/styles: Java, Python, C++, MapReduce
Examples of data structures: hashmaps, databases
Examples of algorithms: sorting/searching, parsing
OUTRO: one of the biggest challenges in this regard is finding talent with this skillset. MT graduates and postgraduates are thin on the ground, and many of them are on an academic career path. Couple that with the fact that the research groups are dotted around the world, and hiring becomes a real challenge. There was actually an interesting panel about this at the AMTA conference…
So that’s what you need to be able to develop the MT. With that, what is it that you actually need to develop?
Well, we can split this into two sets of components that need to work together: first, those for the DOMAIN, and then those for the LANGUAGE itself.
Looking at the DOMAIN KNOWLEDGE required first, what do we need to know?
1. WHAT’S IMPORTANT IN THIS DOMAIN?
2. WHAT TYPES OF DOCUMENTS ARE THERE?
3. ARE THESE CHARACTERISTICS CONSISTENT ACROSS LANGUAGES?
4. WHERE DO WE FIND THIS INFORMATION OUT?
The last piece in the puzzle is understanding the languages you’re developing MT systems for.
And that’s not understanding them in isolation; it’s understanding THE RELATIONSHIP between the languages you’re translating to and from, and what the differences are between them. e.g. many of the things we need to look out for when developing English-Spanish translation engines we don’t need to do for French-Spanish translation.
With certain language pairs, things get more complex. The processes that we need to develop are harder to build, less studied, and require smarter people!
For Chinese, we need to identify these DE constructions so we know to move the head noun.
There’s no tense, so going into English, how do we know what tense to use?
There’s no article! We have to generate it!
The DE particle has many translations; which one?!
FIRST THINGS FIRST, which ones are the words?! We need to segment the Chinese!
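That segmentation question can be illustrated with the simplest possible approach: greedy left-to-right maximum matching against a lexicon, shown here on the slide’s own example 种水果的农民 (“the farmer who grows fruit”). The toy lexicon below is an assumption covering only this example; real Chinese segmenters are statistical and far more sophisticated.

```python
# Greedy left-to-right maximum matching: at each position, take the longest
# lexicon entry that matches. Toy lexicon covers only the slide's example.
LEXICON = {"种", "水果", "的", "农民", "水", "果"}
MAX_LEN = max(len(w) for w in LEXICON)

def segment(text):
    words, i = [], 0
    while i < len(text):
        # try the longest candidate first; single characters always succeed
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in LEXICON:
                words.append(candidate)
                i += length
                break
    return words
```

Note that the correct split depends on the lexicon: without 水果 (“fruit”) as an entry, the segmenter would wrongly emit 水 (“water”) and 果 (“fruit/result”) separately, which is exactly the kind of ambiguity the talk is pointing at.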
ONLY WITH THESE SKILLS CAN YOU EXPLOIT THE TECHNOLOGY TO ITS FULLEST. AND WHAT DO WE GET IN DOING THIS? MT WITH SUBJECT MATTER EXPERTISE.
The whole motivation for this is the same as if you’re hiring a linguist for translation: they simply need to have technical subject matter expertise. Otherwise, how can they understand everything?
“If you don’t understand it, you can’t translate it”
The same applies to MT. The training and translation process needs to know what it’s dealing with so it can use the right terms, do the right preprocessing, etc.
That’s what we’ve done with our flagship offering, IPTranslator. Systems have subject matter expertise because they were developed with, and evaluated and used by patent information specialists.
General advantages of this approach to MT
ANALOGY of buying fresh fruit…
Obviously one of the issues in adopting machine translation technology is the risk that’s involved. You invest in a program, it doesn’t deliver straight away, it might start bringing you returns, but when? How long? If ever?
If we look specifically at the approach we’ve taken: Our proposition helps to derisk the adoption of MT from A QUALITY PERSPECTIVE and a DELIVERY PERSPECTIVE
Typical setup involves:
data, across all languages. How much do you have? Is that enough? Is it clean? Is it yours to give away?
Time, how long is development going to take? Will MT be good enough straight away after that? If not, when?
What’s the upfront cost for customisation or subscription to the service?
That’s the value for the users for the whole concept, but what if we get down to the nuts and bolts of it and talk about the value in terms of the returns…what does using this type of MT get you?
To give an illustration, I’ll run through 3 quick examples and case studies from our own experiences.
The first of which will look at what this does in terms of straight up quality of the MT output
After that, no pun intended, we’ll see how that translates to productivity when post-editing the output
Finally, we’ll look at what you can do when you have these systems built and ready to go in terms of customisation, with minimal effort.
All of these examples are using our IPTranslator systems which have been developed for patent machine translation.
First, in terms of MT quality and BLEU scores, here are evaluation results for our Portuguese to English engines across 8 different patent technical areas. Now, while the BLEU scores don’t necessarily have too much meaning by themselves, there’s a clear distinction in the quality of the Iconic output compared to Google Translate and an out-of-the-box Systran engine. These engines are comparable here because we assume that the client has no additional data with which to build an engine from scratch, so we need an “existing” option.
These results correlated well with human assessments of adequacy, one of which we can look at here…
For our German to English system, we had 3 evaluators look at around 400 segments each and rate them from 1-5 in terms of how adequately they carried the meaning from the source to the target. Typically, a score of 3 or higher indicates that the segments are “usable”, i.e. readable and understandable.
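The arithmetic behind the adequacy evaluation just described is straightforward; a minimal sketch, with invented ratings, looks like this:

```python
# Sketch of the adequacy evaluation: each evaluator rates segments 1-5;
# a segment counts as "usable" when its score is 3 or higher.
# The ratings below are invented for illustration.

def usable_fraction(ratings):
    return sum(1 for r in ratings if r >= 3) / len(ratings)

def average_rating(ratings):
    return sum(ratings) / len(ratings)

ratings = [4, 3, 2, 5, 3, 1, 4, 4]
```

Reporting both numbers matters: a decent average can still hide a long tail of unusable segments, which is why the per-segment usability threshold is the more operationally meaningful figure.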
So those are just a couple of brief examples to show that this approach is developing systems that can produce good quality output, without the need for additional adaptation for each individual user.
I want to now look at a case study that illustrates how these systems, as they are, with these levels of quality, can produce output that leads to more productive post-editing…
This is a case study with WeLocalize who had a particular business need…
For English to Chinese MT…
Used on a daily basis
So this ongoing improvement through the incorporation of client-specific data is related to our third case study, about how these engines that we’ve been building with linguistic engineering can serve as a solid backbone for customized engines…
This is a case with another of our clients who have a substantial patent translation business.
They had a slightly different need in that, rather than the translation of patent documents themselves, they wanted to translate what are known as Written Opinions, essentially reports from patent examiners about the validity of a patent application. From an MT perspective, while a lot of the technical terminology is the same, the register is completely different. These written opinions contain first person, questions, opinions: sentence structures and words that just aren’t in patents and consequently not in our original systems.
If we look at how our systems performed when trying to handle this, we get a BLEU score of around 21, where Google, a system designed for whatever’s thrown at it, gets a score of 20, so around the same.
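For readers unfamiliar with the metric being quoted: BLEU is essentially a geometric mean of modified n-gram precisions against a reference translation, multiplied by a brevity penalty. The following is a minimal single-reference, sentence-level sketch (unsmoothed, so it returns 0 when any n-gram order has no overlap); production scoring uses smoothed, corpus-level implementations.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # modified precision: clip candidate counts by reference counts
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # unsmoothed sentence BLEU collapses to zero
        log_precisions.append(math.log(overlap / total))
    # brevity penalty discourages overly short candidates
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

This also shows why the talk keeps hedging that BLEU scores "don't have too much meaning by themselves": the number is only comparable across systems scored on the same test set and references.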
What we needed to do was modify these systems for this particular type of text. What we had at hand to do this was some TMs from our client, though not much: it amounted to around 0.25% of the amount of data we’d trained our original engines with.
We also developed a couple of processes to add to our ensemble architecture to handle specifics of these Reports, such as consistent references to PCT (patent cooperation treaty) Regulations.
This resulted in the performance more than doubling….
In terms of how this translated into post-editing productivity for the client, let’s look at this scatter plot.
Each dot is a segment in our test. Along the horizontal axis we have the length of the segment in words. On the vertical axis we have a proprietary score that correlates with post-editing productivity, whereby a score of 0.4 means, roughly, there’ll be some productivity gain from post-editing. Above that means most likely not, and the lower the score, the less editing is required.
So here we can see that only a small portion of the segments fall below the threshold so, basically, the document (which is essentially out of domain) is NOT VIABLE for this MT system.
However, AFTER we do the customisation we see that a large number of the segments drop below the line, a bit over 60% of them, with quite a few hitting the 0 score also.
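The viability check being described reduces to counting what share of segments fall below the 0.4 threshold. A minimal sketch, with invented segment scores standing in for the proprietary metric:

```python
# Sketch of the viability check: a document is worth post-editing when a
# sufficiently large share of its segments score below the 0.4 threshold
# (lower score = less editing needed). Scores are invented for illustration.
THRESHOLD = 0.4

def viable_share(scores):
    below = sum(1 for s in scores if s < THRESHOLD)
    return below / len(scores)

before = [0.55, 0.48, 0.61, 0.39, 0.52]  # mostly above threshold: not viable
after = [0.12, 0.0, 0.35, 0.44, 0.21]    # after customisation: mostly below
```

On these invented numbers the share of viable segments jumps from 20% to 80%, which mirrors the shape of the real result described in the talk (a bit over 60% of segments dropping below the line after customisation).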
When we run the numbers on these, they lead to productivity gains of around 25%.
The heavy lifting has been done
Some of these points may be obvious, but allow me to elaborate
All content is not created equal (to modify a well-known phrase); as such, the (machine) translation process has to be different.
We cannot afford to be dogmatic when it comes to MT; one size does not fit all. If we are practitioners of SMT only, we’re restricting ourselves. Even being “hybrid” is restrictive: it’s SMT + rules, or rule-based + statistical post-editing.
Domain specific MT is about more than just data; a sufficient amount of good quality, clean training data is obviously a key component in the MT training process (especially for SMT), but it’s not everything. To use a cooking analogy, data is to MT what ingredients are to a chef. The chef (in this case the training/development process) needs to know what to do with the ingredients. To bring it back to MT, the training and translation processes need to be informed by the data, by the content type, and by the subject matter.
Training is sensitive to data, so you could have the most refined approach, but data will be the biggest variable affecting quality. Our approach allows us to deliver high quality “out of the box”, which we then refine, as opposed to the great unknown of training from scratch.