Presentation ASLIB 2014_Ghoula

Terminology Management Revisited
Nizar Ghoula (1,2), Jacques Guyot (1) and Gilles Falquet (1,2)
(1) The Olanto Foundation, 10 Chemin de Champ-Claude, 1214 Vernier (Geneva), Switzerland
jacques@olanto.org, nizar@olanto.org
(2) University of Geneva, Centre Universitaire d’Informatique
Gilles.falquet@unige.ch
olanto.org

Outline
• Olanto – Presentation
• Terminology management –> Glossary (lexicon) composition
• Validate term derivations –> Using TMX resources
• Correlation Function –> Results
• Using TMX –> Infer a terminological translation
• Discussion
• Annex

Olanto
• Olanto is a not-for-profit Swiss Foundation (free software)
• Computer-Assisted Translation (CAT)
• Machine Translation (MT)
• Multilingual search
• 4 integrators in Switzerland:
o SimpleShift
o Answer
o Neurones
o University of Geneva

Olanto
• The foundation is open to:
o Translators, terminologists, computer scientists, CAT researchers
• 3 software distributions, with more under development:
o myCAT: a concordancer, i.e. a full-text search engine that shows the relevant documents together with their translations.
o myPREP: a text aligner, i.e. a tool that automatically aligns, pair by pair, the documents of a multilingual corpus.
o myMT: an automatic translation tool based on Moses (statistical machine translation)

myTERM – Terminology Management
• Based on the TBX formalism
• Supports multiple models
o Prefix
o Column (positions)
o Etc.
• Compatible with multiple systems
• Easy to install
• Open for other tools (Web Service)

Glossary (lexicon) composition
• Compose word associations by transitivity:
e.g. FR -> EN and EN -> DE => derive an association FR -> DE
• Polysemy problems:
e.g. Acte -> Act;
Act -> Handlung
Act -> Gesetz
=> Acte -> Handlung
=> Acte -> Gesetz
• How to remove wrong associations ("chimeras")?
Use examples of aligned translated sentences to remove the chimeras
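
The composition step can be illustrated with a minimal Python sketch. It assumes each glossary is a plain list of term pairs. The function name `compose` and this data layout are illustrative assumptions, not myTerm's actual API. It joins a FR -> EN glossary with an EN -> DE glossary on the shared EN pivot and keeps the pivot with each derived candidate.

```python
from collections import defaultdict


def compose(fr_en, en_de):
    """Derive FR -> DE candidates through the shared EN pivot term."""
    # Index the second glossary by its source (pivot) term.
    by_pivot = defaultdict(set)
    for en, de in en_de:
        by_pivot[en].add(de)

    derived = set()
    for fr, en in fr_en:
        # Every DE translation of the pivot becomes a candidate for fr.
        for de in by_pivot.get(en, ()):
            derived.add((fr, en, de))
    return derived


# Pairs taken from the slides; polysemy of "act" yields several candidates.
fr_en = [("acte", "act"), ("loi", "act"), ("agir", "act")]
en_de = [("act", "Handlung"), ("act", "Gesetz")]

candidates = sorted(compose(fr_en, en_de))
for fr, en, de in candidates:
    print(fr, en, de)
```

Each FR term inherits every DE translation of the pivot, so the candidate set still contains the wrong associations. A separate validation step is needed to remove them.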

Correlation measure
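For each sentence, record only whether a term is present (1) or absent (0):
• Count the sentences where each term appears (n1, n2)
• Count the sentences where both terms appear (the intersection, n12)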
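
The slides do not state the formula behind the correlation. One plausible reading of "0 and 1" is a Pearson correlation over binary occurrence vectors (the phi coefficient), which needs only n1, n2, n12 and the total number of aligned sentence pairs. The sketch below follows that assumption. The function names, the whitespace tokenisation and the toy sentences are all illustrative and not taken from the Olanto implementation.

```python
import math


def occurrence_counts(pairs, src_term, tgt_term):
    """Count aligned pairs containing the source term, the target term, and both."""
    src_term, tgt_term = src_term.lower(), tgt_term.lower()
    n1 = n2 = n12 = 0
    for src, tgt in pairs:
        in_src = src_term in src.lower().split()
        in_tgt = tgt_term in tgt.lower().split()
        n1 += in_src
        n2 += in_tgt
        n12 += in_src and in_tgt
    return n1, n2, n12, len(pairs)


def phi(n1, n2, n12, n):
    """Pearson correlation of two 0/1 occurrence vectors of length n (assumed measure)."""
    denom = math.sqrt(n1 * n2 * (n - n1) * (n - n2))
    return (n * n12 - n1 * n2) / denom if denom else 0.0


# Toy aligned sentences, invented for illustration only.
pairs = [
    ("la loi est adoptée", "das Gesetz wird angenommen"),
    ("cette loi entre en vigueur", "dieses Gesetz tritt in Kraft"),
    ("un acte de violence", "eine Handlung der Gewalt"),
    ("le rapport est publié", "der Bericht wird veröffentlicht"),
]

print(phi(*occurrence_counts(pairs, "loi", "Gesetz")))
print(phi(*occurrence_counts(pairs, "loi", "Handlung")))
```

For a very large corpus this value approaches n12 / sqrt(n1 · n2). That ratio is close to the figures in the observation table, e.g. 431 / sqrt(8937 · 703) ≈ 0.172 for acte/Handlung, but the exact measure used remains an assumption.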

Experiments: transitivity
FR    EN (pivot)  DE
acte  act         Handlung
acte  act         Gesetz
loi   act         Handlung
loi   act         Gesetz
agir  act         Handlung
agir  act         Gesetz

Experiments: observation
FR    DE        n1    n2     n12   Correlation
acte  Handlung  8937  703    431   0.171933832
acte  Gesetz    8937  2678   52    0.010580932
agir  Handlung  1779  703    14    0.012507763
agir  Gesetz    1779  2678   1     4.36E-04
loi   Gesetz    8844  2678   2412  0.49559854
loi   Recht     8844  10000  851   0.090405434
loi   Handlung  8844  703    14    0.005590025
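
To show how these observations could drive the validation step, here is a small sketch that applies a cut-off to the reported correlations. The values are copied from the table. The threshold of 0.05 is a hypothetical choice made for this example, not one given in the slides.

```python
# (FR, DE, correlation) as reported on the slide.
observations = [
    ("acte", "Handlung", 0.171933832),
    ("acte", "Gesetz", 0.010580932),
    ("agir", "Handlung", 0.012507763),
    ("agir", "Gesetz", 4.36e-04),
    ("loi", "Gesetz", 0.49559854),
    ("loi", "Recht", 0.090405434),
    ("loi", "Handlung", 0.005590025),
]

THRESHOLD = 0.05  # hypothetical cut-off, chosen only for illustration

kept = [(fr, de) for fr, de, corr in observations if corr >= THRESHOLD]
rejected = [(fr, de) for fr, de, corr in observations if corr < THRESHOLD]

print("kept:", kept)
print("rejected as likely chimeras:", rejected)
```

With this cut-off, acte -> Gesetz and loi -> Handlung are rejected while acte -> Handlung and loi -> Gesetz are kept. A rejected pair may also be a correct association that is simply rare in the corpus.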

Implementation http://88.127.130.21:4444/TermsCorrelation/

Implementation
• Align / Index / Correlation
• Corpus
o DGT 2014: 22 languages, 85 million sentences
o MULTI-UN2: 7 languages, 69 million sentences
o Etc.
[Architecture diagram. Components shown: Corpus 1, Corpus 2, ..., Corpus k; myPREP (Convert, Align); TMX files; myCAT (Index & Map); WebServer; How2Say client GUI (IE, Firefox, Safari, ...) exchanging Query/Response; Correlation Measure; glossaries Glo1 and Glo2; Transitive generator & Validator; myTerm; TBX file]
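
The aligned corpora are exchanged as TMX, so a correlation step needs sentence pairs extracted from TMX translation units. The sketch below reads them with Python's standard `xml.etree.ElementTree`. The helper name `read_tmx_pairs` and the inline sample are assumptions for illustration. Real corpus files are far larger and would call for streaming parsing.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for the two requested languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


# Tiny hand-written sample, not taken from any real corpus.
SAMPLE = """<tmx version="1.4">
  <header srclang="fr" datatype="plaintext" segtype="sentence"
          adminlang="en" o-tmf="none" creationtool="sketch"
          creationtoolversion="0"/>
  <body>
    <tu>
      <tuv xml:lang="fr"><seg>La loi est adoptée.</seg></tuv>
      <tuv xml:lang="de"><seg>Das Gesetz wird angenommen.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="fr"><seg>Sans traduction allemande.</seg></tuv>
    </tu>
  </body>
</tmx>"""

tmx_pairs = read_tmx_pairs(SAMPLE, "fr", "de")
print(tmx_pairs)
```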

Experiment 1: Corpus coverage (FR-EN) for Wiktionary 2008
• Corpora dependent
• Coverage issues
• Terminological signature
• Maximum aggregation extends the coverage

Experiment 2: Transitivity and filtering
• Use only the known parts of the dictionaries
o Remove associations with low correlation
o Wrong -> chimera
• Use complete dictionaries
o Remove associations with low correlation
o Wrong -> out of corpora or chimera?
[Figure: correlation vs precision. Series: min01 DGT, no filter DGT, no filter EUBOOK. Axis ticks: 0 to 0.25 and 0.0% to 120.0%.]

• Filtering the candidates before applying transitivity:
o The quality did not improve
o Worse: we discarded many correct term associations
-> Filtering before transitivity is not a good idea after all
[Figure: positions of correlation intervals. Series: pos DGT nofilter, pos min01 DGT. Axis ticks: 0 to 0.3 and 0 to 7000.]

• Correlation is useful for filtering out wrong associations
• BUT we cannot be sure of having the right translation (it depends on the corpus coverage)
• E.g. Juridical Dictionary (UNOG 2000): not covered by the MULTI-UN corpus
• Idea: infer the most cited n-grams and calculate their correlations
=> How2Say

How2Say
• Condition
o Term frequency > 1 (better > 10)
• Supports and generates associations for all language pairs
o DGT-2014: 22 languages (22 × 21 = 462 directed associations)
• No need for a translation model (vs SMT, Moses, etc.)
• Retrieves the most frequent target expression based on the corpus
• Displays examples for the association (context)
• No need to scan many documents to be sure about the association
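
The idea of inferring a translation from the aligned corpus alone can be sketched as follows. The code collects the target n-grams that co-occur with a source term, drops those below a minimum frequency, and ranks the rest by a co-occurrence score. The score n12 / sqrt(n1 · n2) is a stand-in chosen for this example. The function names, the whitespace tokenisation, the single-word source term and the toy corpus are also assumptions, not a description of How2Say's internals.

```python
import math
from collections import Counter


def ngrams(tokens, max_n):
    """Yield all word n-grams of length 1..max_n as space-joined strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def suggest(pairs, src_term, max_n=2, min_freq=2):
    """Rank target n-grams by co-occurrence with a single-word source term."""
    src_term = src_term.lower()
    n1 = 0                 # aligned pairs whose source contains the term
    tgt_freq = Counter()   # n2: pairs whose target contains the n-gram
    joint = Counter()      # n12: pairs containing both
    for src, tgt in pairs:
        grams = set(ngrams(tgt.lower().split(), max_n))
        tgt_freq.update(grams)
        if src_term in src.lower().split():
            n1 += 1
            joint.update(grams)

    scored = []
    for gram, n12 in joint.items():
        n2 = tgt_freq[gram]
        if n2 >= min_freq:  # the "term frequency > 1" condition
            scored.append((n12 / math.sqrt(n1 * n2), gram))
    # Best score first; ties broken alphabetically for a stable order.
    return [gram for _, gram in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Toy aligned corpus, invented for illustration only.
toy_pairs = [
    ("la loi est adoptée", "das gesetz wird angenommen"),
    ("cette loi entre en vigueur", "dieses gesetz tritt in kraft"),
    ("la loi fiscale", "das gesetz über steuern"),
    ("le rapport est adopté", "das dokument wird angenommen"),
]

ranking = suggest(toy_pairs, "loi")
print(ranking[:3])
```

No translation model is trained here. The ranking comes only from counting over aligned sentences, which is why low-frequency terms give unreliable suggestions.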

New applications that will be proposed by Olanto
• Integrate myTerm and How2Say
o Interactively parse an organization’s documentation and create terminology from it
o Automatically add valid word associations to myTerm’s repository
• Integrate How2Say and myCAT
o How2Say -> myCAT: retrieve documents for an example
o myCAT -> How2Say: retrieve statistics for an expression
• Integrate How2Say and mySearch
o Multilingual retrieval (mySearch) uses automatic translation (myMT)
o Queries are expressed using terms -> use How2Say to find associated terms in other languages (within the same corpora)
• … and more ideas!
All suggestions are welcome!
Thank you for your attention

[Diagram. Labels shown: Corpus, Document, Sentence, Term; myTerm, myMT, myPREP, myCAT, How2Say; BITEXT; "Focus Problem?"]

The organization’s corpora, its terminology and its style

myCAT
• Voluminous corpora (OMC 500’000, 3 languages) (Demo Olanto 1’000’000, 27 languages)
• Easy to use (users are easy to train)
• Multi OS (Windows 7, Windows 2008 R2, GNU/Linux (Ubuntu 12.04 LTS))
• Multiple platforms (IE 8, 9, 10 – Safari – Firefox – Chrome)
• Multi Formats – Multi Converters (doc, docx, odt, txt, pdf, html, wpf, …)
• Multilingual interface (EN, FR, ES, AR, RU, …)
• Robust (no reboot for months), economical with resources (100 ms/query)
• WebService integration
• Concordancer (exact or fuzzy expression search, sentence alignment, filtering by collection, display/save original, search by file name, …)
• Referencing (retrieve expressions already translated within the corpora, filtering by collection, display/save the references, statistics, …)
• Auto-referencing (retrieve expressions that are redundant in documents, display/save, statistics, …)

myMT (automatic translation)
• Adapted to the terminology and style of the domain
• Training and evaluation phases
• Multiple platforms (IE 8, 9, 10 – Safari – Firefox – Chrome)
• Multi Formats – Multi Converters (doc, docx, odt, txt, pdf, html, wpf, …)
• Multilingual interface (EN, FR, ES, AR, RU, …)
• Adaptable to the client’s needs
• Scalable (by cloning translation nodes)
• Robust (redundant and self-repairing)
• WebService integration
• Automatic translation
o Choose the translation automaton (corpus, languages)
o Multiple formats
o Preserve the formatting and style of documents
o Send the final result by email

Why a foundation?
• Open-source software is accessible to all
• No more licence fees => the client has unlimited usage
• Open to service companies (integration, installation, maintenance)
• The client pays only for services; usage is unlimited
• The client contributes to enhancing the software
• The whole community benefits from the contributions
• The sustainability of the software does not depend on any one company
• Grants help to set up new projects
• Easy collaboration with institutions (universities, other foundations, …)
