Delivered at the biennial conference of the Association for Machine Translation in the Americas (AMTA 2014)
October 24th 2014
Vancouver, Canada.
In this talk, we describe how state-of-the-art research led to the establishment of Iconic Translation Machines.
From the Lab to the Market: Commercialising MT Research
1. “From the Lab to the Market”
Commercialising MT Research
John Tinsley
Director / Co-Founder
AMTA. 24th Oct 2014. Vancouver
2. What is Iconic Translation Machines?
We provide Machine Translation
solutions with Subject Matter Expertise
HQ: Dublin, Ireland
Founded: 2012
3. For the next 20 minutes…
Part 1: The journey from the lab to the market
Part 2: The technology that took the journey
4. Innovation in academia
Typically
Cutting edge techniques not seeing the light of day
More Recently
Basic research → Applied research
New Performance Metrics
Industry collaboration, software licenses, spin-outs
5. Lab to Market: The Starting Point…The Lab
The Lab
Dublin City University
The Funding
European Union (FP7 PSP)
The Goal
Adapt existing technology for patent machine translation
So what now?
6. Lab to Market: Technical Development
The Process
Build with working group: release early, release often
Engagement
Identify broader user base (patent professionals, translators) and field test
The Outcome
Well-developed, well-performing prototype with a user base
Is there a business in this?
7. Lab to Market: Commercial Viability
Feasibility Study Grant
• Is this worth exploring?
Commercialisation Fund
• Develop MVP
• Business model
• Product/Market fit?
HPSU
• License IP
• Spin out
• Support
• Exporting
• Investment
Skillset
8. Lab to Market: …The Market!
Part 1: The journey from the lab to the market
Part 2: The technology that took the journey
11. Data Engineering
What is Linguistic Engineering?
[Diagram: Input → Pre-processing → MT system (Training Data) → Post-processing → Output]
12. The Challenge of Patents
L is an organic group selected from -CH2-
(OCH2CH2)n-, -CO-NR'-, with R'=H or
C1-C4 alkyl group; n=0-8; Y=F, CF3 …
maximum stress of 1.2 to 3.5 N/mm<2>
and a maximum elongation of 700 to
1,300% at 0[deg.] C.
Long Sentences
Technical constructions
Largest single document: 249,322 words
Longest Sentence: 1,417 words
13. Data Engineering
What is Linguistic Engineering?
[Diagram: Input → Pre-processing → MT system (Training Data) → Post-processing → Output]
14. Data Engineering + Linguistic Engineering
An “ensemble” architecture
Input → Patent input classifier → language/domain-specific pre-processes → MT engines (Moses, RBMT) → Multi-output combination → Statistical post-editing → Output
Example ensemble processes:
• Chinese pre-ordering rules
• Spanish med-device entity recognizer
• Korean pharma tokenizer
• Japanese script normalisation
• German compounding rules
Resources: Training Data; client TM/terminology (optional)
15. De-risking the machine translation proposition
What is the value for users?
Typical Prerequisites:
+ Data
+ Time
+ €€€
= ???
Our Prerequisites:
+ No data needed
+ Systems are ready to go
+ No upfront cost
= Evaluate immediately
Customisation. Refinement.
» Incorporation of user feedback
» Incremental training with post-edits
» Tuning for specific input types
16. Case Studies
1. What this approach means straight up in terms of quality…
2. Productivity gains from using these systems…
3. As a foundation for client customization…
18. Case 1: Quality
[Bar chart: German to English translation — human adequacy scores on a 1-5 scale for Evaluator 1, Evaluator 2, Evaluator 3, and the average; scores range from 2.83 to 3.86]
19. Case 2: Productivity
Business Need
Machine Translation technology for the legal industry
Iconic had a domain-specific MT solution for that industry
20. Case 2: Productivity
Process
Translation samples required for initial evaluation
Delivered immediately and initial results were positive
21. Case 2: Productivity
Performance
>20% productivity increase for translators post-editing Iconic output
“MT delivered measurable productivity gains from the outset”
22. Case 3: Customization
- Modify our patent machine translation engines for “Written Opinions” on patents
- 0.25% new data, 2 new ensemble processes
[Bar chart: Chinese to English BLEU scores (0-60 scale) — baseline: Iconic 21, Google 20; Iconic + modification: more than double the baseline]
25. Take home messages…
• Do you have technology that can solve a problem?
– Validate this!
• Is there a market for this technology?
– Find product/market fit!
• Go for it!
• This is what we did in developing domain-adapted MT solutions with subject matter expertise for LSPs and data providers.
• Enjoy the ride!
Make joke about the ‘s’ in “Commercialise” (and “localise”) and working in the industry and being torn about it!
First things first, what is Iconic Translation Machines and what do we do?
We’re a technology provider, we provide MT solutions with subject matter expertise as a cloud-based service. Subject matter expertise means the MT systems have been developed very specifically for certain domains, content types, subject matter, whatever you want to call it.
It is a spinout company from the CNGL research centre at Dublin City University in Ireland. We’ve been on the go for around 2 years – the company was founded by myself and Páraic Sheridan – and we’re a team of 6 combining MT experts, software engineers, and business development. But the technology behind the company has been in development for a lot longer than that. Actually, if you go back through your old AMTA proceedings, you’ll find earlier versions of our work from 2010 in Ottawa and 2012 in San Diego.
So, that process of developing the technology within the university is the line along which we’ll start this presentation…
I guess you can look at this talk as being in two parts: the HOW and the WHAT.
The first part of this talk is the HOW. How did we actually take the fruits of university research and turn them into a company? It’s going to be slightly autobiographical, because the path that the company has taken is obviously intrinsically linked to my own path over the last 5 years, but I’ll describe the process and what was required for it to happen.
Then I’ll talk about the WHAT. What exactly was this technology that we developed. This will give you a bit more detail on what exactly we do at Iconic and our approach to Machine Translation…
The biggest source of innovation in MT is in academia but a lot of it goes unnoticed or just simply isn’t making it out of the various institutions. I broadly attribute this to 2 main causes:
- The talent, the researchers, who have these ideas and are making what breakthroughs there are, are researchers because that’s exactly what they want to do – research. They want to work in academia and so, these bright minds with the ideas, are staying within universities and as such the ideas aren’t seeing “the light of day”.
There were some insights on why this is and how this landscape is changing somewhat in the panel discussion this morning… [say more]
The second reason is that the research often isn’t accessible to outside users, general software engineers, or LSPs, who might want to try it themselves (because it’s not designed to be). Also, there’s a huge body of work. How do they know what might be worth trying and what’s exploratory? Who the good research departments are, etc.? Those of you who are at this conference are the ones who are on the right track!
However, I think in the last few years we’re seeing a bit of a SEA CHANGE in this regard. Typically, a large portion of this research has been basic research, but increasingly, driven by funding bodies, research is leaning towards being more “applied” in nature, whereby areas of broader interest to the industry are targeted. Research topics are being identified based on real-life challenges, perhaps highlighted by a collaborating partner, and where the aims of the research have a more practical outcome (though the funding agencies, often governments, have different motivations for this [JOBS] which I’ll get to).
In Ireland, where we’ve come from, it’s almost a requirement, as opposed to something desirable. When applying for funding, yes, there are regular metrics like publications and graduates, but also new requirements like industry collaboration (and supporting funding) as well as software licenses and spinout companies. This has led to the advent of the commercial development manager within research departments to foster this. Universities have Technology Transfer Offices to facilitate it.
The result is that more academic innovations are making it out AND the results are being filtered, because these new structures, like CDMs, are only putting out the good stuff (or the stuff that worked).
European funding agencies are following suit and that starts our story.
For those of you who don’t know me, I completed a PhD on syntax-based approaches to Machine Translation at Dublin City University in 2009 (just pre-CNGL). Following that, with CNGL, as part of an industry-academia consortium, we received funding from the EU to explore the application of existing research outcomes (in this case, machine translation research from DCU) to an existing issue (in this case, the translation of patents).
This was an R&D project and, as such, to ensure focus, funding was actually structured so that the EU only matches 50% of what is spent by all partners. So it’s not just a case of taking the money and doing whatever research you want. It makes the team focus on developing something of value that can give them a return. Actually, there are specific commercial deliverables in these types of projects, such as licenses and joint ventures between the partners.
In order to achieve this goal, there was a combination of applied research and research and development to determine:
HOW to adapt the existing technology, i.e. what we needed to do from a technical perspective
Build prototypes and ENGAGE users, stakeholders, etc. from an early stage to make sure that a) we’re developing something that people want and b) it’s something usable
Kinda like the “ship early ship often” model of lean software development but without the immediate commercial pressure. So an ideal scenario!
To cut a long story short, we developed a product which came to be known as IPTranslator. We did this by developing MT systems and the vehicles for delivering MT in tandem, while working with a working group of users to test quality, accessibility, and usability.
User engagement is crucial and it’s not just a feature of commercializing research. It applies in all areas where you’re building something for someone. We no longer live by the “if we build it they will come” mantra.
Then we engaged a broader user base, exploiting the position we were in (not scaring people off by being “salesy”, which was a real plus) while still being able to toy with potential pricing models and business models.
So the outcome is a tool that works well, that people like, that people use, but with no real clear picture as to whether there’s a business model that will work.
At this stage in the journey, there’s an awful lot of support available within Ireland. Enterprise Ireland is a government agency tasked with helping indigenous businesses – startups, SMEs, and larger – to get off the ground, to grow, and to export (as opposed, for example, to the IDA, who you might be familiar with, who look after foreign direct investment).
For new businesses, they have various mechanisms and programs, ranging from the man on the street who has an idea to 3rd-level researchers with technology. We’re obviously the latter, so we started to engage with them in the middle of 2012.
The first step is a small feasibility grant, €10-15,000, to explore the market potential of the business and conduct a “go to market” study. We did this and determined there was an opportunity worth exploring, so the next step was EI’s Commercialisation Fund program which can take various forms, but for us involved bringing our technology up to production-ready standards (e.g. getting it off university servers, security, scalability, etc.), getting more people on board with a more commercial focus, and starting to explore sustainable business models for the technology, things like:
- Target markets, customer segments
- Sales channels
- Competitor analysis
When we finally struck on something we thought was going to work, it was time to fly the coop! This involved negotiating a deal with the university TTO to license the technology and then getting down to the practicalities of running a business.
Once you’ve done this within the Enterprise Ireland ecosystem, you become what’s known as a HPSU, or High Potential Start-Up (as opposed to the local businesses they also support, which aren’t necessarily focused on rapid growth, but rather sustainability). With this, they provide a lot of support: from subsidising courses on anything you can imagine when it comes to running a business, to helping you prepare for export through contacts, local offices in other countries, and trade missions, as well as helping companies become investor-ready and participating in funding rounds themselves.
**[Skillset transition, product development/market development, can you do this? Research centres need to support this, TTOs need to make sure that companies have this, thye can provide some support, EI can provide business partners]**
**[It’s a completely different skillset than that of the researcher]**
So that details the formation of Iconic Translation Machines and brings us up to the present day. I’ll now spend a little bit of time talking about what we have developed.
From the technical side, I’m not going to talk about how we ended up here (that’s well documented elsewhere) but rather I’m going to focus on what we ended up WITH.
If you are hiring a professional translator for a job, beyond their language skills they also need to have subject matter expertise, particularly for technical content.
The same applies to MT technology (and its providers)
As a developer, you cannot be dogmatic when it comes to approaches to MT. We’re not a statistical MT vendor, we don’t focus on Moses, we’re not a rule-based MT vendor. We don’t do hybrid MT.
We do all of them. We call this an “ensemble” approach. Sometimes we use them all at the same time. Sometimes we only use one. It’s completely dependent on what works best for a given content type, style, and language together.
e.g. for Chinese-English patent MT, maybe you need a statistical decoder, with some rules for automatic post-editing
Maybe for French-English abstract translation, an SMT system along suffices. Maybe for Japanese-English titles, we can just use some rules, and maybe some machine learning based pre-processes.
We study. We learn what ensemble works for a particular configuration and that’s what we implement.
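The ensemble idea described above can be sketched as a small configurable pipeline: pre-processes, one or more engines, a combination step, and post-processes, chosen per language and content type. This is a hypothetical illustration, not Iconic's actual implementation; all names and the toy "engines" are invented:

```python
# A minimal sketch of an "ensemble" MT pipeline: each language/domain
# configuration gets its own chain of pre-processes, engines, a combination
# step, and post-processes. All names here are hypothetical.
from typing import Callable, List

Step = Callable[[str], str]

class EnsemblePipeline:
    def __init__(self, pre: List[Step], engines: List[Step],
                 combine: Callable[[List[str]], str], post: List[Step]):
        self.pre, self.engines, self.combine, self.post = pre, engines, combine, post

    def translate(self, text: str) -> str:
        for step in self.pre:               # e.g. tokenisation, entity tagging
            text = step(text)
        candidates = [engine(text) for engine in self.engines]  # SMT, RBMT, ...
        best = self.combine(candidates)     # multi-output combination
        for step in self.post:              # e.g. statistical post-editing
            best = step(best)
        return best

# Toy usage: two stand-in "engines", pick the longest candidate, add a period
pipeline = EnsemblePipeline(
    pre=[str.strip],
    engines=[str.upper, lambda s: s],
    combine=lambda outs: max(outs, key=len),
    post=[lambda s: s + "."],
)
print(pipeline.translate("  hello world "))  # HELLO WORLD.
```

The point of the structure is that a Chinese-English patent configuration and a Japanese-English titles configuration can share this skeleton while plugging in entirely different steps.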
Existing vendors or MT providers use the following process: if a client wants a machine translation system for a certain domain, say IT, they provide the vendor with training data and this gets churned through the various generic processes for each language required. The idea is that by pumping in data in the IT domain, an IT machine translation system comes out at the end. It’s true to a certain extent, but the reality is that the quality often doesn’t cut the mustard. The problem with the data engineering approach is that you need A LOT of data, and many clients simply don’t have it. By being completely reliant on the data, quality is capped by how much data the client can supply.
We’ve developed methods to manipulate the machine translation system by designing processes that are highly specific to the content being translated – often technical nuances, terminology, etc. that need to be specially accounted for.
But of course it’s not just that easy.
Patents for example have a range of highly complex linguistic characteristics that make this challenging, both for PROFESSIONAL translators as well as for Translation Software.
Let’s look, for example, at this patent – what’s highlighted in blue is a SINGLE sentence (which is an individual legal claim).
Additionally, we have to deal with complex technical constructions such as chemical formulae, alphanumeric sequences, even genomic and amino acid sequences.
Let’s get rid of the concept of a central MT system – statistical, hybrid or whatever.
Yes we have training data and input, we’ll have some output, and some processes, but what is the journey?...
Combining these factors is a delicate balance. Sometimes the smallest change can affect things. Sometimes big changes have no effect. It really depends on your training data.
That presents a challenge when the training data changes for each system that’s built. I’ll come back to this later…
Obviously, one of the issues in adopting machine translation technology is the risk that’s involved. You invest in a program, it doesn’t deliver straight away, it might start bringing you returns, but when? How long? If ever?
If we look specifically at the approach we’ve taken: Our proposition helps to derisk the adoption
A typical setup involves:
- Data, across all languages. How much do you have? Is that enough? Is it clean? Is it yours to give away?
- Time. How long is development going to take? Will MT be good enough straight away after that? If not, when?
- Cost. What’s the upfront cost for customisation or subscription to the service?
That’s the value for the users for the whole concept, but what if we get down to the nuts and bolts of it and talk about the value in terms of the returns…what does using this type of MT get you?
To give an illustration, I’ll run through 3 quick examples and case studies from our own experiences.
The first of which will look at what this does in terms of straight up quality of the MT output
After that, no pun intended, we’ll see how that translates to productivity when post-editing the output
Finally, we’ll look at what you can do when you have these systems built and ready to go in terms of customisation, with minimal effort..
All of these examples are using our IPTranslator systems which have been developed for patent machine translation.
First, in terms of MT quality and BLEU scores, here are evaluation results for our Portuguese to English engines across 8 different patent technical areas. Now, while the BLEU scores don’t necessarily have too much meaning by themselves, there’s a clear distinction in the quality of the Iconic output compared to Google Translate and an out-of-the-box Systran engine. These engines are comparable here because we work on the assumption that the client has no additional data with which to build an engine from scratch, so we need an “existing” option.
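BLEU, the metric quoted throughout these case studies, is the geometric mean of n-gram precisions scaled by a brevity penalty. A minimal stdlib sketch of the idea, assuming a single reference and uniform weights (real evaluations use a standard implementation with proper tokenisation):

```python
# A minimal BLEU sketch (n-grams up to 4, one reference), just to make the
# metric concrete. Not a replacement for a standard implementation.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        overlap = sum((ngrams(hyp, n) & ngrams(ref, n)).values())
        total = max(sum(ngrams(hyp, n).values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes hypotheses shorter than the reference
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 1))  # 100.0
print(round(bleu("the cat sat on a mat", "the cat sat on the mat"), 1))   # 53.7
```

As the second example shows, a single changed word costs far more than one sixth of the score, because it breaks every n-gram that spans it.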
These results correlated well with human assessment of adequacy, another of which we can look at here…
For our German to English system, we had 3 evaluators look at around 400 segments each and rank them from 1-5 in terms of how adequately they carried the meaning from the source to the target. Typically, a score of 3 or higher indicates that the segments are “usable” – i.e. readable and understandable.
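That evaluation boils down to simple arithmetic: average the per-segment scores and count the share of segments at or above the "usable" threshold of 3. A sketch with invented scores (the real study used around 400 segments per evaluator):

```python
# Hypothetical per-segment adequacy scores (1-5) from three evaluators.
scores = {
    "evaluator_1": [4, 3, 5, 2, 4],
    "evaluator_2": [3, 3, 4, 2, 5],
    "evaluator_3": [4, 2, 5, 3, 4],
}

def average(xs):
    return sum(xs) / len(xs)

per_evaluator = {name: average(s) for name, s in scores.items()}
overall = average([v for s in scores.values() for v in s])

# A segment counts as "usable" if its mean score across evaluators is >= 3
segment_means = [average(col) for col in zip(*scores.values())]
usable = sum(m >= 3 for m in segment_means) / len(segment_means)

print(per_evaluator)          # mean score per evaluator
print(round(overall, 2))      # overall mean adequacy
print(usable)                 # fraction of usable segments
```

Averaging per evaluator as well as overall is what lets you spot a rater who scores systematically harsher or more leniently than the others.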
So they’re just a couple of brief examples to show that this approach is developing systems that can produce good quality output, without the need for additional adaption for each individual user.
I want to now look at a case study that illustrates how these systems, as they are, with these levels of quality, can produce output that leads to more productive post-editing…
This is a case with another of our clients who have a substantial patent translation business.
They had a slightly different need in that, rather than the translation of patent documents themselves, they wanted to translate what are known as Written Opinions: essentially, reports from patent examiners about the validity of a patent application. From an MT perspective, while a lot of the technical terminology is the same, the register is completely different. These written opinions contain first-person statements, questions, opinions – sentence structures and words that just aren’t in patents and consequently not in our original systems.
If we look at how our systems perform when trying to handle this, we get a BLEU score of around 21, where Google, a system designed for whatever’s thrown at it, gets a score of 20 – so around the same.
What we needed to do was modify these systems for this particular type of text. What we had at hand to do this was some TMs from our client – not much, though; it amounted to around 0.25% of the amount of data we’d trained our original engines with.
We also developed a couple of processes to add to our ensemble architecture to handle specifics of these Reports, such as consistent references to PCT (patent cooperation treaty) Regulations.
This resulted in the performance more than doubling….
In terms of how this translated into post-editing productivity for the client, let’s look at this scatter plot.
Each dot is a segment in our test. Along the horizontal axis we have the length of the segment in words. On the vertical axis we have a proprietary score that correlates with post-editing productivity, whereby a score of 0.4 means, roughly, there’ll be some productivity gain from post-editing. Above that means most likely not, and the lower the score, the less editing is required.
So here we can see that only a small portion of the segments fall below the threshold, so, basically, the document (which is essentially out of domain) is NOT VIABLE for this MT system.
However, AFTER we do the customisation we see that a large number of the segments drop below the line, a bit over 60% of them, with quite a few hitting the 0 score also.
When we run the numbers on these, they lead to productivity gains of around 25%.
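The scatter-plot reasoning above reduces to a threshold rule: count the share of segments whose score falls at or below 0.4. A sketch with invented scores (the actual scoring metric is proprietary; these numbers are illustrative only):

```python
# Hypothetical per-segment scores; in the described setup, a score <= 0.4
# predicts a post-editing productivity gain for that segment.
THRESHOLD = 0.4

def viable_fraction(segment_scores):
    """Share of segments predicted to benefit from post-editing."""
    below = [s for s in segment_scores if s <= THRESHOLD]
    return len(below) / len(segment_scores)

# Invented scores before and after customisation of the engines
before = [0.55, 0.7, 0.45, 0.6, 0.38, 0.8, 0.5, 0.65, 0.72, 0.41]
after  = [0.3, 0.0, 0.42, 0.25, 0.1, 0.38, 0.55, 0.2, 0.0, 0.35]

print(viable_fraction(before))  # only a small minority viable before
print(viable_fraction(after))   # a clear majority after customisation
```

The document-level decision is then a simple aggregate: if most segments clear the threshold, post-editing the MT output is expected to be faster than translating from scratch.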
Your core technology will be the same, but how it looks, how it’s used, and by whom can change radically!
FROM: Patent searches, web interface, read foreign patents in their language
TO: professional translators/LSPs, via API and CAT tool connectors, translating patents into foreign languages faster
It’s hard work but it’s rewarding!