1. Terminology Management Revisited
Nizar Ghoula1,2, Jacques Guyot1 and Gilles Falquet1,2
1 The Olanto Foundation, 10 Chemin de Champ-Claude, 1214 Vernier (Geneva), Switzerland
jacques@olanto.org, nizar@olanto.org
2 University of Geneva Centre Universitaire d’Informatique
Gilles.falquet@unige.ch
olanto.org
2. Outline
• Olanto – Presentation
• Terminology management –> Glossary (lexicon) composition
• Validate term derivations –> Using TMX resources
• Correlation Function –> Results
• Using TMX –> Infer a terminological translation
• Discussion
• Annex
3. Olanto
• Olanto is a not-for-profit Swiss Foundation (free software)
• Computer-Assisted Translation (CAT)
• Machine Translation (MT)
• Multilingual search
• 4 integrators in Switzerland
• SimpleShift
• Answer
• Neurones
• University of Geneva
4. Olanto
• The foundation is open to:
• Translators, terminologists, computer scientists, CAT researchers
• 3 software distributions and more are under development:
• myCAT: a concordancer, i.e. a full-text search engine which, in addition to
showing the relevant documents, also shows their translation.
• myPREP: a text aligner, i.e. a tool that automatically aligns, pair by pair,
the documents of a multilingual corpus.
• myMT: an automatic translation tool based on Moses (statistical translation)
5. myTERM – Terminology Management
• Based on the TBX formalism
• Supports multiple models
• Prefix
• Column (positions)
• Etc...
• Compatible with multiple systems
• Easy to install
• Open for other tools (Web Service)
7. Glossary (lexicon) composition
• Compose word associations by transitivity:
e.g. from FR -> EN and EN -> DE, derive an association FR -> DE
• Polysemy problems:
e.g. Acte-> Act;
Act-> Handlung
Act-> Gesetz
Acte-> Handlung
Acte-> Gesetz
• How to remove wrong associations (“chimeras”)?
Use examples of aligned translated sentences to detect and remove the chimeras
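The composition step above can be sketched in a few lines of Python. This is a minimal illustration, not the Olanto implementation: the function name `compose` and the dictionary layout (term -> list of translations) are assumptions made for the example. It reproduces the Acte / Act / Handlung / Gesetz case and shows that transitivity alone cannot tell valid associations from chimeras.

```python
from collections import defaultdict


def compose(src_to_pivot, pivot_to_tgt):
    """Derive source -> target candidates through a shared pivot language.

    Both arguments map a term to a list of its translations.
    Every derived pair is only a candidate: polysemy of the pivot
    term can produce wrong associations ("chimeras").
    """
    derived = defaultdict(set)
    for src, pivots in src_to_pivot.items():
        for pivot in pivots:
            for tgt in pivot_to_tgt.get(pivot, ()):
                derived[src].add(tgt)
    return {src: sorted(tgts) for src, tgts in derived.items()}


# Toy glossaries taken from the slide example (FR -> EN, EN -> DE).
fr_en = {"Acte": ["Act"]}
en_de = {"Act": ["Handlung", "Gesetz"]}

fr_de = compose(fr_en, en_de)
print(fr_de)  # {'Acte': ['Gesetz', 'Handlung']}
```

Both German candidates are produced for "Acte"; deciding which ones hold in a given domain requires evidence from aligned sentences.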
12. Implementation
• Align/ Index/ Correlation
• Corpus
• DGT 2014
• 22 languages
• 85 million sentences
• MULTI-UN2
• 7 languages
• 69 million sentences
• Etc…
[Diagram: Corpus 1 ... Corpus k -> myPREP (Convert, Align) -> TMX files -> myCAT (Index & Map) -> How2Say WebServer <-> Client-GUI (IE, Firefox, Safari, ...) via Query/Response; Correlation Measure + Glo1, Glo2 -> Transitive generator & Validator -> myTerm (TBX file)]
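The pipeline relies on TMX files as the exchange format between alignment and indexing. As a rough illustration of what the downstream steps consume, the sketch below reads aligned sentence pairs from a TMX document using only the Python standard library. The function name `read_tmx_pairs` and the sample content are assumptions for this example; the sample is deliberately minimal and omits the header attributes a real TMX file would carry.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs from a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):  # one translation unit per aligned segment
        segs = {}
        for tuv in tu.iter("tuv"):  # one variant per language
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


# Hypothetical two-language sample, not taken from DGT or MultiUN.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="fr"><seg>L'acte est adopté.</seg></tuv>
<tuv xml:lang="en"><seg>The act is adopted.</seg></tuv></tu>
</body></tmx>"""

print(read_tmx_pairs(SAMPLE, "fr", "en"))
```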
13. Experiment 1: Corpus coverage (FR-EN) for Wiktionary 2008
• Corpus-dependent
• Coverage issues
• Terminological signature
• Maximum aggregation extends the coverage
14. Experiment 2: Transitivity and filtering
• Use only the known parts of the dictionaries
• Remove associations with low correlation
• Wrong -> chimera
• Use complete dictionaries
• Remove associations with low correlation
• Wrong -> Out of corpora or chimera?
[Chart: correlation vs precision; series: min01 DGT, no filter DGT, no filter EUBOOK]
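Removing associations with low correlation can be sketched as follows. The correlation function used in the experiments is not defined in this part of the deck, so the example assumes a simple Jaccard-style co-occurrence score over aligned sentence pairs; the names `correlation` and `filter_candidates`, the substring matching, and the threshold value are all assumptions for illustration.

```python
def correlation(src_term, tgt_term, aligned_pairs):
    """Assumed score: co-occurrences / (occurrences of either term).

    This is a stand-in for the correlation function of the talk,
    whose exact definition is not given here.
    """
    n_src = n_tgt = n_both = 0
    for src_sentence, tgt_sentence in aligned_pairs:
        in_src = src_term.lower() in src_sentence.lower()
        in_tgt = tgt_term.lower() in tgt_sentence.lower()
        n_src += in_src
        n_tgt += in_tgt
        n_both += in_src and in_tgt
    union = n_src + n_tgt - n_both
    return n_both / union if union else 0.0


def filter_candidates(candidates, aligned_pairs, threshold):
    """Keep only the associations whose score reaches the threshold."""
    return [
        (s, t) for s, t in candidates
        if correlation(s, t, aligned_pairs) >= threshold
    ]


# Toy FR-DE corpus; in this corpus "acte" never aligns with "Gesetz".
pairs = [
    ("cet acte est illégal", "diese Handlung ist rechtswidrig"),
    ("un acte volontaire", "eine freiwillige Handlung"),
    ("la loi est adoptée", "das Gesetz wird verabschiedet"),
]
candidates = [("acte", "Handlung"), ("acte", "Gesetz")]
print(filter_candidates(candidates, pairs, threshold=0.1))
# [('acte', 'Handlung')]
```

A score of zero does not prove an association is a chimera: the pair may simply be absent from the corpus, which is the ambiguity raised in the last bullet of this slide.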
15. Experiments: observation
• We filtered the candidate associations before applying transitivity
• The quality did not improve
• Worse: we discarded many correct term associations
-> filtering before transitivity is not a good idea after all
[Chart: positions of correlation intervals; series: pos DGT nofilter, pos min01 DGT]
16. Experiments: observation
• Correlation is useful for filtering wrong associations
• BUT we cannot be sure of having the right translation (this depends on the corpus’s coverage)
• E.g. Juridical Dictionary (UNOG 2000): not covered by the corpus MULTI-UN
• Idea: infer the most cited n-grams and calculate their correlations
How2Say
19. How2Say
• Condition
• Term frequency > 1 (better > 10)
• Supports and generates associations for all languages
• DGT-2014: 22 languages (462 associations, i.e. 22 × 21 directed language pairs)
• No need for a translation model (vs SMT, Moses, etc.)
• Retrieves the most frequent target expression based on the corpus
• Displays examples for the association (context)
• No need to scan many documents to be sure about the association
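The idea of inferring a target expression from aligned sentences, without a translation model, can be illustrated with a small sketch. It is not the How2Say implementation: the function names, the whitespace tokenization, the n-gram length limit, and the Jaccard-style score are assumptions chosen for the example. The `min_freq` parameter mirrors the frequency condition stated above.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Yield all word n-grams of length 1..max_n as strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def suggest(expression, aligned_pairs, max_n=2, min_freq=1):
    """Rank target n-grams that co-occur with a source expression.

    Returns [] unless the expression occurs in more than min_freq
    source sentences. The score is an assumed co-occurrence ratio.
    """
    expression = expression.lower()
    hits = 0
    co_counts = Counter()  # n-gram counts in sentences matching the expression
    totals = Counter()     # n-gram counts over all target sentences
    for src, tgt in aligned_pairs:
        grams = set(ngrams(tgt.lower().split(), max_n))
        totals.update(grams)
        if expression in src.lower():
            hits += 1
            co_counts.update(grams)
    if hits <= min_freq:
        return []
    scored = [
        (both / (hits + totals[g] - both), g)
        for g, both in co_counts.items()
    ]
    # Best score first; on ties prefer the longer n-gram, then alphabetical.
    scored.sort(key=lambda item: (-item[0], -len(item[1].split()), item[1]))
    return [(g, round(score, 3)) for score, g in scored]


# Hypothetical EN-FR sentence pairs.
pairs = [
    ("the general assembly met", "l'assemblée générale s'est réunie"),
    ("the general assembly voted", "l'assemblée générale a voté"),
    ("the committee voted", "le comité a voté"),
]
print(suggest("general assembly", pairs)[0])
# ("l'assemblée générale", 1.0)
```

Ranking by raw frequency alone would favour function words, which is why this sketch normalizes by how often each n-gram occurs across the whole corpus.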
20. New applications that will be proposed by Olanto
• Integrate myTerm and How2Say
o Parse an organization's documentation and create terminology interactively
o Automatically add valid word associations to myTerm’s repository
• Integrate How2Say and myCAT
o How2Say -> myCAT: retrieve documents for an example
o myCAT -> How2Say: retrieve statistics for an expression
• Integrate How2Say and mySearch
o Multilingual retrieval (mySearch) uses automatic translation (myMT)
o Queries are expressed using terms -> use How2Say to find associated terms in other languages (within the same corpora)
• … and more ideas…!
All suggestions are welcome!
Thank you for your attention
27. myMT (automatic translation)
• Adapted to the terminology and to the style of the domain
• Training and evaluation phases
• Multi Platforms (IE 8,9,10 – Safari – Firefox - Chrome)
• Multi Formats - Multi Converters (doc,docx, odt, txt, pdf, html, wpf, …)
• Multilingual interface (EN, FR, ES, AR, RU, …)
• Adaptable to the client’s needs
• Scalable (by cloning translation nodes)
• Robust (redundant and self-repairing)
• WebService integration
• Automatic translation
• Choose the translation automata (corpus, languages)
• Multiple formats
• Preserve formatting and style for documents
• Send final result by email
29. Why a foundation?
• Open source software is accessible to all
• No licence fees: the client has unlimited usage
• Open to service companies (integration, installation, maintenance)
• The client pays only for services; usage is unlimited
• The client contributes to enhancing the software
• The whole community benefits from the contributions
• The sustainability of the software is independent of the company
• Grants help to set up new projects
• Easy collaboration with institutions (Universities, other foundations, …)