Terminology Management Revisited
Nizar Ghoula1,2, Jacques Guyot1 and Gilles Falquet1,2
1 The Olanto Foundation, 10 Chemin de Champ-Claude, 1214 Vernier (Geneva), Switzerland
jacques@olanto.org, nizar@olanto.org
2 University of Geneva Centre Universitaire d’Informatique
Gilles.falquet@unige.ch
olanto.org
Outline
• Olanto – Presentation
• Terminology management –> Glossary (lexicon) composition
• Validate term derivations –> Using TMX resources
• Correlation Function –> Results
• Using TMX –> Infer a terminological translation
• Discussion
• Annex
Olanto
• Olanto is a not-for-profit Swiss Foundation (free software)
• Computer-Assisted Translation (CAT)
• Machine Translation (MT)
• Multilingual search
• 4 integrators in Switzerland
• SimpleShift
• Answer
• Neurones
• University of Geneva
Olanto
• The foundation is open to:
• Translators, terminologists, computer scientists, CAT researchers
• 3 software distributions, with more under development:
• myCAT: a concordancer, i.e. a full-text search engine which, in addition to
showing the relevant documents, also shows their translations.
• myPREP: a text aligner, i.e. a tool that automatically aligns the documents
of a multilingual corpus pairwise.
• myMT: an automatic translation tool based on Moses (statistical translation)
myTERM – Terminology Management
• Based on the TBX formalism
• Supports multiple models
• Prefix
• Column (positions)
• Etc...
• Compatible with multiple systems
• Easy to install
• Open to other tools (Web Service)
myTERM – Terminology Management
Glossary (lexicon) composition
• Compose word associations by transitivity:
e.g. FR -> EN and EN -> DE => derive an association FR -> DE
• Polysemy problems:
e.g. Acte -> Act;
Act -> Handlung
Act -> Gesetz
=> Acte -> Handlung
=> Acte -> Gesetz
• How to remove wrong associations ("chimeras")?
Use examples of aligned translated sentences to remove the chimeras
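The composition step can be sketched as follows. This is a minimal illustration, not Olanto's implementation: the function name `compose` and the dictionary layout are assumptions, and the toy glossaries reuse the acte/act example from the slide.

```python
from collections import defaultdict


def compose(src_to_pivot, pivot_to_tgt):
    """Derive source -> target candidates through a shared pivot language.

    Every pivot translation of a source term contributes all of its own
    target translations, so polysemous pivots produce chimeras.
    """
    candidates = defaultdict(set)
    for src, pivots in src_to_pivot.items():
        for pivot in pivots:
            for tgt in pivot_to_tgt.get(pivot, ()):
                candidates[src].add(tgt)
    return dict(candidates)


# Toy glossaries (hypothetical layout: term -> set of translations).
fr_en = {"acte": {"act"}, "loi": {"act"}, "agir": {"act"}}
en_de = {"act": {"Handlung", "Gesetz"}}

fr_de = compose(fr_en, en_de)
for fr_term in sorted(fr_de):
    print(fr_term, "->", sorted(fr_de[fr_term]))
```

Because "act" is polysemous, every French term inherits both German candidates, which is why a validation step is needed afterwards.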
Correlation measure
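Count only presence or absence (1 or 0) of each term in a sentence:
• Count the sentences where each term appears (n1, n2)
• Count the intersection, i.e. the sentences where both terms appear (n12)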
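The formula shown on this slide did not survive extraction. One plausible reading of "count only the 0 and 1" is the Pearson correlation of two binary indicators (the phi coefficient), sketched below as an assumption. The total number of aligned sentences `n` is a parameter the slides do not state, and both function names are hypothetical.

```python
from math import sqrt


def phi_correlation(n1, n2, n12, n):
    """Pearson correlation of two 0/1 indicators over n aligned sentences.

    n1, n2: sentences containing the source / target term
    n12:    sentences containing both
    n:      total number of aligned sentences (assumed known)
    """
    denom = sqrt(n1 * (n - n1) * n2 * (n - n2))
    if denom == 0:
        return 0.0
    return (n * n12 - n1 * n2) / denom


def large_corpus_approx(n1, n2, n12):
    """Limit of phi when n is much larger than n1 and n2."""
    if n1 == 0 or n2 == 0:
        return 0.0
    return n12 / sqrt(n1 * n2)


# Two terms that always co-occur are perfectly correlated.
print(phi_correlation(10, 10, 10, 100))
# Two terms that never co-occur get a negative score.
print(phi_correlation(10, 10, 0, 100))
```

For rare terms in a very large corpus the two functions give nearly the same value, so the approximation is convenient when `n` is unknown.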
Experiments: transitivity
FR    EN   DE
acte  act  Handlung
acte  act  Gesetz
loi   act  Handlung
loi   act  Gesetz
agir  act  Handlung
agir  act  Gesetz
Experiments: observation
FR    DE        n1    n2     n12   Correlation
acte  Handlung  8937  703    431   0.171933832
acte  Gesetz    8937  2678   52    0.010580932
agir  Handlung  1779  703    14    0.012507763
agir  Gesetz    1779  2678   1     4.36E-04
loi   Gesetz    8844  2678   2412  0.49559854
loi   Recht     8844  10000  851   0.090405434
loi   Handlung  8844  703    14    0.005590025
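The sketch below filters the observed associations with a cut-off. The rows are copied from the table; the threshold of 0.05 is a hypothetical value chosen for illustration, not one given in the slides. As a side check, the reported correlations are close to n12 / sqrt(n1 · n2); this is an observation about the numbers, not a confirmed statement of the formula used.

```python
from math import sqrt

# (FR, DE, n1, n2, n12, reported correlation), copied from the table.
ROWS = [
    ("acte", "Handlung", 8937, 703, 431, 0.171933832),
    ("acte", "Gesetz", 8937, 2678, 52, 0.010580932),
    ("agir", "Handlung", 1779, 703, 14, 0.012507763),
    ("agir", "Gesetz", 1779, 2678, 1, 4.36e-04),
    ("loi", "Gesetz", 8844, 2678, 2412, 0.49559854),
    ("loi", "Recht", 8844, 10000, 851, 0.090405434),
    ("loi", "Handlung", 8844, 703, 14, 0.005590025),
]


def keep_above(rows, threshold):
    """Keep the (FR, DE) pairs whose reported correlation reaches the threshold."""
    return [(fr, de) for fr, de, _n1, _n2, _n12, corr in rows if corr >= threshold]


def approx_correlation(n1, n2, n12):
    """Co-occurrence ratio; an assumed approximation of the slide's measure."""
    return n12 / sqrt(n1 * n2)


kept = keep_above(ROWS, threshold=0.05)  # hypothetical cut-off
max_gap = max(
    abs(approx_correlation(n1, n2, n12) - corr)
    for _fr, _de, n1, n2, n12, corr in ROWS
)
print(kept)
print(max_gap < 1e-3)
```

With this cut-off the chimeras acte -> Gesetz, agir -> Gesetz and loi -> Handlung are discarded, but so is agir -> Handlung, which shows how sensitive the result is to the chosen threshold.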
Implementation http://88.127.130.21:4444/TermsCorrelation/
Implementation
• Align / Index / Correlation
• Corpus
• DGT 2014
• 22 languages
• 85 million sentences
• MULTI-UN2
• 7 languages
• 69 million sentences
• Etc.
[Architecture diagram: Corpus1, Corpus2, ..., Corpusk -> myPREP (Convert, Align) -> TMX files -> myCAT (Index & Map) -> WebServer (How2Say) <-> Client GUI (IE, Firefox, Safari, ...) via Query / Response. The glossaries Glo1 and Glo2 feed the transitive generator & validator, which relies on the correlation measure and produces a TBX file for myTerm.]
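The counting step that feeds the correlation measure can be illustrated on a small TMX document. This is a sketch under stated assumptions: it uses Python's standard `xml.etree.ElementTree` rather than the myCAT index, matches terms by naive substring search, and the function name and sample sentences are made up for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Toy TMX content (invented sentences, minimal header).
SAMPLE_TMX = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="fr"><seg>La loi est claire.</seg></tuv>
    <tuv xml:lang="de"><seg>Das Gesetz ist klar.</seg></tuv></tu>
<tu><tuv xml:lang="fr"><seg>Un acte juridique.</seg></tuv>
    <tuv xml:lang="de"><seg>Eine Handlung.</seg></tuv></tu>
<tu><tuv xml:lang="fr"><seg>Cette loi s'applique.</seg></tuv>
    <tuv xml:lang="de"><seg>Dieses Recht gilt.</seg></tuv></tu>
</body></tmx>"""


def count_cooccurrences(tmx_text, src_lang, src_term, tgt_lang, tgt_term):
    """Return (n1, n2, n12) for a term pair over all translation units."""
    root = ET.fromstring(tmx_text)
    n1 = n2 = n12 = 0
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or "").lower()
            seg = tuv.find("seg")
            text = "".join(seg.itertext()) if seg is not None else ""
            segments[lang] = text.lower()
        # Naive substring matching; a real system would tokenize.
        in_src = src_term.lower() in segments.get(src_lang, "")
        in_tgt = tgt_term.lower() in segments.get(tgt_lang, "")
        n1 += in_src
        n2 += in_tgt
        n12 += in_src and in_tgt
    return n1, n2, n12


print(count_cooccurrences(SAMPLE_TMX, "fr", "loi", "de", "Gesetz"))
```

On corpora of the size listed above, a linear scan like this would be too slow for interactive use, which is presumably why the pipeline indexes the TMX files first.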
Experiment 1: Corpus coverage (FR-EN) for Wiktionary 2008
• Corpus-dependent
• Coverage issues
• Terminological signature
• Maximum aggregation extends the coverage
Experiment 2: Transitivity and filtering
• Use only the known parts of the
dictionaries
• Remove associations with low correlation
• Wrong -> chimera
• Use complete dictionaries
• Remove associations with low correlation
• Wrong -> Out of corpora or chimera?
[Chart: correlation vs precision. Series: min01 DGT, no filter DGT, no filter EUBOOK.]
Experiments: observation
• When candidates were filtered before applying transitivity:
• The quality did not improve
• Worse: many correct term associations were discarded
-> Filtering before transitivity is not a good idea after all
[Chart: positions of correlation intervals. Series: pos DGT nofilter, pos min01 DGT.]
Experiments: observation
• Correlation is useful for filtering out wrong associations
• BUT it does not guarantee the right translation (this depends on the corpus's coverage)
• E.g. Juridical Dictionary (UNOG 2000): not covered by the MULTI-UN corpus
• Idea: infer the most cited n-grams and calculate their correlations
-> How2Say
How2Say: DGT-2014
How2Say: MULTI-UN
How2Say
• Condition
• Term frequency > 1 (better > 10)
• Supports and generates associations for all language pairs
• DGT-2014: 22 languages (462 associations, i.e. 22 × 21 ordered pairs)
• No need for a translation model (unlike SMT, Moses, etc.)
• Retrieves the most frequent target expression based on the corpus
• Displays examples for the association (context)
• No need to scan many documents to be sure about the association
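The idea behind How2Say (collect the target n-grams of sentences that contain a source expression, then rank them by correlation) can be sketched as below. This is an illustration only: the function names, the co-occurrence score n12 / sqrt(n1 · n2) and the toy sentence pairs are assumptions, and `min_freq=2` mirrors the "term frequency > 1" condition.

```python
from collections import Counter
from math import sqrt


def ngrams(tokens, max_n):
    """Yield all word n-grams of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def how2say_sketch(pairs, source_expr, max_n=2, min_freq=2):
    """Rank target n-grams by their co-occurrence score with source_expr."""
    source_expr = source_expr.lower()
    n1 = 0            # sentences containing the source expression
    n2 = Counter()    # sentences containing each target n-gram
    n12 = Counter()   # sentences containing both
    for src, tgt in pairs:
        grams = set(ngrams(tgt.lower().split(), max_n))
        n2.update(grams)
        # Whole-word match by padding with spaces.
        if f" {source_expr} " in f" {src.lower()} ":
            n1 += 1
            n12.update(grams)
    scored = [
        (gram, count, count / sqrt(n1 * n2[gram]))
        for gram, count in n12.items()
        if count >= min_freq
    ]
    return sorted(scored, key=lambda item: (-item[2], item[0]))


# Invented aligned sentences (FR, DE), already tokenized by spaces.
PAIRS = [
    ("la loi est claire", "das gesetz ist klar"),
    ("la loi s'applique", "das gesetz gilt"),
    ("la maison est grande", "das haus ist gross"),
    ("le chat dort", "die katze ruht"),
]

ranking = how2say_sketch(PAIRS, "loi")
for gram, count, score in ranking:
    print(f"{gram}\t{count}\t{score:.3f}")
```

Ranking by correlation rather than raw frequency pushes the frequent article "das" below "gesetz", since "das" also appears in sentences that do not contain "loi".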
New applications that will be proposed by Olanto
• Integrate myTerm and How2Say
o Interactively parse an organization's documentation to create terminology
o Automatically add valid word associations to myTerm’s repository
• Integrate How2Say and myCAT
o How2Say -> myCAT: retrieve documents for an example
o myCAT -> How2Say: retrieve statistics for an expression
• Integrate How2Say and mySearch
o Multilingual retrieval (mySearch) uses automatic translation (myMT)
o Queries are expressed using terms -> use How2Say to find associated terms in other languages
(within the same corpora)
• … and more ideas…!
All suggestions are welcome!
Thank you for your attention
[Diagram: the levels Corpus, Document, Sentence, Term and ID set against the tools myTerm, myMT, myPREP, myCAT, How2Say and BITEXT.]
Focus problem? The organization’s corpora, its terminology and its style
myCAT
• Voluminous corpora (OMC 500’000, 3 languages) (Demo Olanto 1’000’000, 27 languages)
• Easy to use (users are easy to train)
• Multi-OS (Windows 7, Windows 2008 R2, GNU/Linux (Ubuntu 12.04 LTS))
• Multiple platforms (IE 8, 9, 10 – Safari – Firefox – Chrome)
• Multiple formats and converters (doc, docx, odt, txt, pdf, html, wpf, …)
• Multilingual interface (EN, FR, ES, AR, RU, …)
• Robust (no reboot for months), resource-efficient (100 ms/query)
• WebService integration
• Concordancer (exact or fuzzy expression search, sentence alignment, filtering by collection,
display/save original, search by file name, …)
• Referencing (retrieve expressions already translated within the corpora, filtering by collection,
display/save the references, statistics, …)
• Auto-referencing (retrieve expressions that are redundant within documents, display/save, statistics, …)
Concordancer
Referencing
Auto referencing
myMT (automatic translation)
• Adapted to the terminology and to the style of the domain
• Training and evaluation phases
• Multi Platforms (IE 8,9,10 – Safari – Firefox - Chrome)
• Multi Formats - Multi Converters (doc,docx, odt, txt, pdf, html, wpf, …)
• Multilingual interface (EN, FR, ES, AR, RU, …)
• Adaptable to the client’s needs
• Scalable (by cloning translation nodes)
• Robust (redundant and self-repairing)
• WebService integration
• Automatic translation
• Choose the translation automata (corpus, languages)
• Multiple formats
• Preserve formatting and style for documents
• Send final result by email
Why a foundation?
• Open source software is accessible to all
• No more paid licences -> the client has unlimited usage
• Open to service companies (integration, installation, maintenance)
• The client pays only for services; usage is unlimited
• The client contributes to enhancing the software
• The whole community benefits from the contributions
• The sustainability of the software is independent of any company
• Grants help to set up new projects
• Easy collaboration with institutions (universities, other foundations, …)

Presentation ASLIB 2014_Ghoula


Editor's Notes

  1. aims at creating and distributing free software in the field of
     Open source software is accessible to all
     The client pays only for services; usage is unlimited
     The client contributes to enhancing the software
     The sustainability of the software is independent of any company
     Easy collaboration with institutions (universities, other foundations, …)