Franco-Thai Workshop 2010

Lingua et Machina
Research & Development

1
About me

● Estelle Delpech
● Research engineer at Lingua et Machina, France
  ● CAT tools provider
  ● ed(at)lingua-et-machina(dot)com
  ● www.lingua-et-machina.com
● Ph.D. candidate at LINA, France
  ● TALN team: specialises in NLP
  ● estelle.delpech(at)univ-nantes(dot)fr

2
LINGUA ET MACHINA

● French company
● Founded by Dr E. Planas
● Led by Dr F. De Colstoun
● Small but innovative
  ● 8 people
  ● 2 R&D engineers / Ph.D. candidates
  ● NLP
  ● Computational Linguistics
  ● Translation Studies

3
LINGUA ET MACHINA

● 2002: SIMILIS
  ● 2nd generation translation memories
  ● Based on Ph.D. work
● 2007: LIBELLEX
  ● Access to TM for non-professionals
  ● Translation and terminology management platform

4
They trust us

5
Partners

6
SIMILIS

● Computer-aided translation
  ● Freelance translators
  ● Translation agencies
● Translation memories
  ● Pre-translations
● Terminology extraction
● 7 languages: FR, EN, IT, ES, PT, DE, NL
  → rule-based

7
Similis


8
SIMILIS technology

Based on the Ph.D. work of E. Planas
● First generation translation memory
  ● Works with segments, sentences
● Second generation translation memory
  ● Works with chunks
  ● [the driver] [steps] [on the gas pedal]
● Chunking
  ● Rules written by linguists
● Fuzzy matching
  ● Modified edit-distance
  ● Several linguistic levels

9
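The idea of matching on chunks rather than on whole sentences can be illustrated with a minimal sketch. This is not the SIMILIS implementation: it assumes chunks are already bracketed as in the example above, it uses a plain Levenshtein distance over chunk sequences (the slides mention a modified edit distance whose details are not given), and every function name is hypothetical.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two sequences (here: lists of chunks)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def chunk_similarity(source, candidate):
    """Similarity in [0, 1]; 1.0 means identical chunk sequences."""
    longest = max(len(source), len(candidate))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(source, candidate) / longest


def best_match(query, memory):
    """Return the stored chunk sequence closest to the query, with its score."""
    stored = max(memory, key=lambda entry: chunk_similarity(query, entry))
    return stored, chunk_similarity(query, stored)


# Hypothetical translation memory content (source side only, already chunked).
memory = [
    ["the driver", "steps", "on the gas pedal"],
    ["the pilot", "lands", "the plane"],
]
query = ["the driver", "steps", "on the brake pedal"]
match, score = best_match(query, memory)
```

Here two of the three chunks are shared with the first memory entry, so it is returned as the fuzzy match. A sentence-level memory would treat the same pair as a single non-identical segment.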
From SIMILIS to LIBELLEX

[Diagram. Labels: Source Text; Translated Text; French Documents; English Documents; Memory (TMX); Glossary (lexicon); Moderator; Translators, linguists; Business experts]
10
LIBELLEX

● Translation memories meet corporate content management
● Target: global companies
  ● Many languages
    ● Customers
    ● Partners
    ● Employees
  ● Speakers
    ● Non-native
    ● Not language professionals
  ● Terminology and translation needs
    ● Official documentation
    ● Day-to-day internal communication
11
Libellex

● Terminology management platform
  ● Builds a corporate TM
  ● Extracts / checks terminology
  ● Helps employees communicate
● Translation management platform
  ● Manages translation jobs
  ● Terminologies for translation agencies
  ● Chunk matches for MT

12
Libellex

● Look up a word, a term, an expression
● Manage terminology
● Have a document translated
● Check translations
● Check text
● Add new documents
13
R-D-I at Lingua et Machina

Ongoing
● Statistical term extraction
  ● « Cheap and quick » addition of new languages
● Consider hybridization with rule-based methods
● Term alignment in comparable corpora
● Model the translation process
Planned
● Development of rule-based chunking for Chinese
● Extraction of « Knowledge-rich contexts » for terminologies
14
Research partnerships

● Statistical term extraction and alignment
  ● A. Lardilleux, Y. Lepage (Caen/Waseda)
● Chinese processing
  ● EDF, Kinep
● Comparable corpora
  ● National project + Ph.D. candidate
● KRC extraction
  ● European project submission
● Translation studies
  ● Ph.D. candidate: Stendhal University

15
Statistical term extraction and alignment

● Algorithm developed by A. Lardilleux in a Ph.D. thesis
  ● http://users.info.unicaen.fr/~alardill/
● Uses “perfect alignments”
  ● Source and target words that only occur in the same source and target sentences
  [Figure: toy example. adf ↔ AD; b ↔ BE; b ↔ CF; a e ↔ AE; d, D]
● Randomly build small samples of the corpus
  ● Perfect alignments add up
16
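A simplified sketch of the "perfect alignment" idea follows. It is not A. Lardilleux's implementation: it only assumes that words are whitespace-separated, that a perfect alignment is a group of source and target words occurring in exactly the same sentence pairs, and that counts are accumulated over random samples. All names and the toy corpus are hypothetical.

```python
import random
from collections import Counter, defaultdict


def perfect_alignments(pairs):
    """Group source and target words that occur in exactly the same sentence pairs."""
    occurrences = defaultdict(set)
    for index, (source, target) in enumerate(pairs):
        for word in source.split():
            occurrences[("src", word)].add(index)
        for word in target.split():
            occurrences[("tgt", word)].add(index)

    # Words sharing the same set of sentence indices share a "profile".
    by_profile = defaultdict(lambda: {"src": set(), "tgt": set()})
    for (side, word), indices in occurrences.items():
        by_profile[frozenset(indices)][side].add(word)

    # Keep only profiles that have words on both sides.
    return [(tuple(sorted(group["src"])), tuple(sorted(group["tgt"])))
            for group in by_profile.values()
            if group["src"] and group["tgt"]]


def sample_alignments(pairs, sample_size, rounds, seed=0):
    """Accumulate perfect alignments found in random samples of the corpus."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(rounds):
        counts.update(perfect_alignments(rng.sample(pairs, sample_size)))
    return counts


# Hypothetical toy corpus of aligned sentence pairs.
corpus = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats", "le chat mange"),
]
found = set(perfect_alignments(corpus))
counts = sample_alignments(corpus, sample_size=3, rounds=4)
```

In a small sample, many words share the same distribution by chance, so each sample yields some alignments. Summing the counts over many samples is one way to read "perfect alignments add up".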
Chinese and other languages

● Chinese processing
  ● EDF uses Libellex
  ● Needs ZH ↔ FR and ZH ↔ EN translation
● Currently:
  ● Statistical term alignment and extraction
● Planned:
  ● Chinese chunking rules
  ● Develop hybrid statistical/rule-based chunk alignment
● Other languages:
  ● Asian
  ● Northern European
  ● Eastern European
17
Metricc project

● Scope: national
● Bilingual terminology mining from comparable corpora
  ● CAT
  ● Translation memories
  ● CLIR
● Partners
  ● Syllabs, Sinéqua, LM
  ● IMAG, Valoria

http://www.metricc.com
18
Metricc: term alignment in comparable corpora

● Based on the distributional analysis hypothesis
  ● Words that appear in similar contexts have similar meanings
● Represent the context of a word as a vector:
  ● Co-occurring words + normalized frequencies
● Translate the context vector with a seed lexicon
● Compute the distance between source and target vectors
● The closer, the better

19
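The steps on this slide can be sketched as follows. This is an illustration under stated assumptions, not the Metricc implementation: contexts are a fixed token window, frequencies are normalized to sum to 1, and closeness is measured with cosine similarity (the slide only says "distance"). Function names, the window size, the seed lexicon and the example sentences are all hypothetical.

```python
import math
from collections import Counter


def context_vector(word, sentences, window=2):
    """Normalized co-occurrence frequencies of `word` within a token window."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for t in tokens[lo:hi] if t != word)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def translate_vector(vector, seed_lexicon):
    """Map context words to the target language; drop words missing from the lexicon."""
    translated = Counter()
    for w, weight in vector.items():
        if w in seed_lexicon:
            translated[seed_lexicon[w]] += weight
    return dict(translated)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(source_word, source_sents, target_sents, candidates, seed_lexicon):
    """Rank target candidates by similarity to the translated source context."""
    src = translate_vector(context_vector(source_word, source_sents), seed_lexicon)
    scored = [(cosine(src, context_vector(c, target_sents)), c) for c in candidates]
    return sorted(scored, reverse=True)


# Hypothetical comparable (non-parallel) data and seed lexicon.
french = ["le médecin soigne le patient", "le médecin examine le patient"]
english = ["the doctor treats the patient", "the doctor examines the patient",
           "the car needs fuel"]
seed = {"le": "the", "soigne": "treats", "examine": "examines", "patient": "patient"}
ranked = rank_candidates("médecin", french, english, ["doctor", "car"], seed)
```

With this toy data, "doctor" shares the translated context of "médecin" and ranks above "car". On real comparable corpora the vectors are much sparser and the seed lexicon coverage matters a great deal.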
Knowledge-Rich Contexts Extraction

● Project under submission
● Scope: European
● Partners:
  ● Inbenta, BEO
  ● Ljubljana University, LINA
● Knowledge-rich contexts
  ● Help understand the term
  ● Indicate how to use the term

20
Knowledge-Rich Contexts Extraction

● Examples of KRC:
  ● Contains a definition
  ● Describes a relation between two terms
  ● Indicates a collocation
  ● Illustrates the term
● KRC linguistic description
  ● Examples, definitions in dictionaries
  ● Corpus study
● KRC automatic identification
  ● Morpho-syntactic patterns
  ● Statistical clues
21
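Pattern-based identification of knowledge-rich contexts might look like the sketch below. It is only an illustration: the patterns are hypothetical lexical surface patterns written as regular expressions, a crude stand-in for morpho-syntactic patterns, which would normally rely on part-of-speech tags. Nothing here comes from the project itself.

```python
import re

# Hypothetical surface patterns; {term} is replaced by the escaped term.
# The more specific "relation" pattern is tried before "definition".
KRC_PATTERNS = {
    "relation": r"\b{term} (?:is|are) (?:a )?(?:part|kind|type) of ",
    "definition": r"\b{term} (?:is|are) (?:a|an|the) ",
}


def find_krc(term, sentences):
    """Return (label, sentence) pairs for sentences matching a pattern for `term`."""
    hits = []
    for sentence in sentences:
        for label, template in KRC_PATTERNS.items():
            pattern = template.format(term=re.escape(term))
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                hits.append((label, sentence))
                break  # keep only the first matching label per sentence
    return hits


# Hypothetical example sentences.
sentences = [
    "A translation memory is a database of previously translated segments.",
    "A translation memory is part of most CAT tools.",
    "We bought a translation memory last year.",
]
hits = find_krc("translation memory", sentences)
```

The first sentence is labelled as a definition and the second as a relation, while the third matches no pattern. Statistical clues, as mentioned on the slide, could then be used to filter or rank such candidates.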
Modelling the translation process

● Research engineer / Ph.D. thesis
  ● Department of Translation Studies
  ● Université Stendhal, Grenoble
● How do we translate?
● What knowledge is helpful to translators?
● What is a good translation?
● Do non-professionals translate differently?
● How do you improve software usability?

22
More information

● Lingua et Machina
  ● www.lingua-et-machina.com/
  ● contact(a)lingua-et-machina.com
● Libellex
  ● http://libellex.fr/
● Download Similis
  ● http://similis.org/Download/SimilisFreelance-2.16.04-Setup.exe

23
Franco-Thai Workshop 2010
Thank you
ed(a)lingua-et-machina.com

24
