Introduction
Methodology
Experiment
Conclusion
Assessing Comparable Corpora
through Distributional Similarity Measures
Hernani Costa
hercos@uma.es
University of Malaga
June, 2015
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 1 / 20
Introduction
Overview
• Comparable Corpora (CC) are considered an important resource
in many research areas
automatic and assisted translation
language teaching
terminology
• Describing, comparing and evaluating CC are key issues in
these areas for which there is still a notable lack of standards
• Bearing this in mind, this work aims at investigating the use
of textual Distributional Similarity Measures (DSMs) as a tool
to assess the relatedness between docs in CC
Motivation
• An inherent problem for those who deal with CC on a daily
basis is the uncertainty about the data they are dealing with
usually, tags like “casual speech transcripts” or “tourism
specialised comparable corpus” are not enough to describe a
corpus
• Most of the resources at our disposal are
built and shared without deep analysis of their content
used without knowing anything about the relatedness quality of the
corpus
Objectives
Investigate the use of textual DSMs in the context of CC
• automatically measure the relatedness between docs
• analyse which input features perform better
• describe CC through the DSMs output scores
Methodology
1) Data Preprocessing
• Sentence Detector and Tokeniser – OpenNLP [1]
• POS tagger and lemmatisation – TT4J [2]
• Stemming – Snowball [3]
• Stopword list [4]
2) Identifying the list of common entities between docs
• Three co-occurrence matrices
common tokens, common lemmas and common stems
[1] https://opennlp.apache.org
[2] http://reckart.github.io/tt4j/
[3] http://snowball.tartarus.org
[4] https://github.com/hpcosta/stopwords
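Step 2 can be sketched as follows. This is a minimal illustration, assuming each doc has already been reduced to a list of tokens (or lemmas, or stems) with stopwords removed. The function names and the toy docs are hypothetical and are not taken from the author's implementation.

```python
from itertools import combinations


def common_entities(doc_a, doc_b):
    """Return the set of entities (tokens, lemmas or stems) shared by two docs."""
    return set(doc_a) & set(doc_b)


def common_entity_matrix(docs):
    """Map each unordered pair of doc ids to the set of entities they share."""
    return {
        (a, b): common_entities(docs[a], docs[b])
        for a, b in combinations(sorted(docs), 2)
    }


# Toy, already preprocessed docs (hypothetical data).
docs = {
    "d1": ["hotel", "beach", "spa", "room"],
    "d2": ["hotel", "room", "pool"],
    "d3": ["flight", "hotel"],
}

shared = common_entity_matrix(docs)
for pair, entities in shared.items():
    print(pair, sorted(entities))
```

Running the same function over token, lemma and stem lists would give the three matrices mentioned in step 2.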
3) Computing the similarity between docs
• Input: list of common tokens, lemmas and stems
• DSMs = {DSM_NCE, DSM_SCC, DSM_χ²}
NCE: Number of Common Entities
SCC: Spearman’s Rank Correlation Coefficient
χ²: Chi-Square
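The slides name the three measures but do not give their formulas, so the following is only a sketch under stated assumptions: NCE is the count of shared entities, SCC is Spearman's coefficient computed on the frequencies of the shared entities in each doc, and χ² is the Pearson chi-square statistic over the 2 × k table of those frequencies. All function names are hypothetical.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from largest (rank 1) to smallest; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            ranks[order[k]] = (pos + end) / 2 + 1  # mean 1-based position of the tie group
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def dsm_scores(doc_a, doc_b):
    """Return (NCE, SCC, chi-square) for two lists of entities (assumed definitions)."""
    freq_a, freq_b = Counter(doc_a), Counter(doc_b)
    common = sorted(set(freq_a) & set(freq_b))
    xa = [freq_a[e] for e in common]
    xb = [freq_b[e] for e in common]

    nce = len(common)
    # Spearman = Pearson correlation of the rank vectors.
    scc = pearson(average_ranks(xa), average_ranks(xb)) if nce > 1 else 0.0

    # Chi-square: observed vs. expected frequency of each common entity.
    total_a, total_b = sum(xa), sum(xb)
    chi2 = 0.0
    for a, b in zip(xa, xb):
        expected_a = (a + b) * total_a / (total_a + total_b)
        expected_b = (a + b) * total_b / (total_a + total_b)
        chi2 += (a - expected_a) ** 2 / expected_a + (b - expected_b) ** 2 / expected_b
    return nce, scc, chi2


# Toy docs (hypothetical): doc_b uses the shared words in the same proportions.
doc_a = ["hotel"] * 3 + ["room"] * 2 + ["spa"]
doc_b = ["hotel"] * 6 + ["room"] * 4 + ["spa"] * 2 + ["pool"]
nce, scc, chi2 = dsm_scores(doc_a, doc_b)
print(nce, scc, chi2)
```

Under these definitions, higher NCE and SCC indicate more related docs, whereas a lower χ² indicates more similar frequency profiles.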
4) Computing the doc final score
where
n: total number of docs
DSM_i(d_l, d_i): the resulting similarity score between doc d_l and each of the
other docs d_i
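One way to read step 4 is as an average: the final score of doc d_l is the mean of its pairwise scores against every other doc. The sketch below assumes exactly that, including the normalisation by n - 1, which is an assumption rather than something stated in the text. The helper names and toy docs are hypothetical, and NCE stands in for any of the DSMs.

```python
def final_scores(docs, measure):
    """Average pairwise score of each doc against all other docs (assumed definition)."""
    ids = list(docs)
    n = len(ids)
    if n < 2:
        raise ValueError("at least two docs are needed")
    scores = {}
    for l in ids:
        total = sum(measure(docs[l], docs[i]) for i in ids if i != l)
        scores[l] = total / (n - 1)  # assumed normalisation: the n - 1 other docs
    return scores


def nce(doc_a, doc_b):
    """Number of Common Entities between two docs."""
    return len(set(doc_a) & set(doc_b))


# Toy docs (hypothetical data).
docs = {
    "d1": ["hotel", "beach", "spa", "room"],
    "d2": ["hotel", "room", "pool"],
    "d3": ["flight", "hotel"],
}

scores = final_scores(docs, nce)
print(scores)
```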
The Corpus
Results & Analysis
The INTELITERM Corpus
Statistical information about the various INTELITERM [5] subcorpora
subcorpus  nDocs  types  tokens  types/tokens  description
en_to      151    11.6k  508.9k  0.023         original
en_totd    61     6.9k   88.5k   0.078         translated
es_to      225    12.6k  253.4k  0.049         original
es_totd    27     3.4k   19.7k   0.174         translated
[5] http://www.lexytrad.es/proyectos.html
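The types/tokens column is the number of distinct word forms divided by the total number of running words. The sketch below recomputes it from the rounded counts in the table, so the third decimal may differ slightly from the reported values. The function name is hypothetical.

```python
def type_token_ratio(tokens):
    """Number of distinct tokens (types) divided by the total number of tokens."""
    if not tokens:
        raise ValueError("empty token list")
    return len(set(tokens)) / len(tokens)


# Rounded (types, tokens) counts taken from the table.
subcorpora = {
    "en_to": (11_600, 508_900),
    "en_totd": (6_900, 88_500),
    "es_to": (12_600, 253_400),
    "es_totd": (3_400, 19_700),
}

ratios = {name: types / tokens for name, (types, tokens) in subcorpora.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

A higher ratio means fewer repetitions of the same word forms, which is the sense in which the small es_totd subcorpus looks more lexically diverse.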
Results
[Figure: Common Tokens. x-axis: subcorpus (en_to, en_totd, es_to, es_totd); y-axis: average of common tokens per document (0 to 300)]
Average (av) and standard deviation (σ) of the number of common
tokens (NCT) between docs, per subcorpus
subcorpus  av      σ
en_to      163.70  83.87
en_totd    67.54   35.35
es_to      31.97   23.48
es_totd    17.93   8.46
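Given one final score per doc, the per-subcorpus average and standard deviation can be obtained as sketched below. The scores are toy values, not the INTELITERM data, and the use of the population standard deviation is an assumption, since the text does not say which variant was used.

```python
from statistics import mean, pstdev


def summarise(scores):
    """Return (average, population standard deviation) of per-doc scores."""
    return mean(scores), pstdev(scores)


# Toy per-doc scores for two hypothetical subcorpora.
per_doc_scores = {
    "sub_a": [10, 20, 30],
    "sub_b": [4, 4, 4, 4],
}

summary = {name: summarise(scores) for name, scores in per_doc_scores.items()}
for name, (av, sigma) in summary.items():
    print(f"{name}: av={av:.2f} sigma={sigma:.2f}")
```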
General Findings
[Figure: three panels, Common Tokens, Common Lemmas and Common Stems. x-axis: subcorpus (en_to, en_totd, es_to, es_totd); y-axis: average of common tokens, lemmas or stems per document (0 to 300)]
the distributions of the three features are quite similar
→ it is possible to achieve acceptable results using tokens alone
the scores for each subcorpus are roughly symmetric
→ this suggests the data is approximately normally distributed
original docs vs. translated docs: original docs have a higher NCE
→ different translators use different vocabulary, which
lowers the NCE between the translated docs
English - Original Docs (en_to)
NCT per doc is higher on average + large IQR + long whiskers
→ data is more spread out + average NCT per doc is more
variable + wide range of docs (either highly or only loosely
correlated with the rest of the docs)
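The box-plot terms used here (IQR, whiskers) can be made concrete with a short sketch. It assumes the common Tukey convention, where whiskers extend to the most extreme scores within 1.5 × IQR of the quartiles; the text does not say which convention the plots follow. The function name and the scores are hypothetical.

```python
from statistics import quantiles


def box_plot_summary(scores):
    """Quartiles, IQR, whisker ends and outliers under the 1.5 * IQR convention."""
    data = sorted(scores)
    q1, median, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


# Toy per-doc scores with one unusually high value (hypothetical data).
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 40]
summary = box_plot_summary(scores)
print(summary)
```

A large IQR and long whiskers, as reported for en_to, correspond to large values of `iqr` and a wide gap between `whisker_low` and `whisker_high`.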
English - Original Docs (en_to)
NCT: data is skewed right + SCC: high average scores +
χ²: long whisker outside the upper quartile
→ docs have a high degree of relatedness to each other
English - Translated Docs (en_totd)
the NCT, SCC and χ² scores suggest that the data is
either normally distributed or skewed left
→ docs are highly related
Spanish vs. English
lower NCT when compared with en_to and en_totd
→ Spanish has richer morphology (a larger number of inflected
forms per lemma)
→ fewer common tokens per doc
Spanish - Translated Docs (es_totd)
NCT: low + SCC: data varies inside and outside the IQR +
χ²: σ higher than its av
→ inconsistency in the data
− subcorpus size (27 docs)
− low number of types & tokens
− high types/tokens ratio suggests that a more diverse form of language is employed
Summary
From the statistical and theoretical evidence
• en_to, en_totd and es_to
assemble highly correlated docs
• es_totd
the small number of docs and the scarcity of evidence only
allow us to not reject the idea that this subcorpus is
composed of similar docs
Conclusion
Future Work
Conclusion
• This work presents and studies various DSMs for the purpose
of describing specialised CC
• As input for these DSMs, we used three different input
features (lists of common tokens, lemmas and stems)
for the data at hand, these features had similar performance
for all the tested DSMs
• The high number of entities shared by its docs, the positive
average scores obtained with the SCC measure and the χ²
scores support the claim that the INTELITERM corpus is composed of
highly correlated docs
Future Work
• Perform more experiments with DSMs
use other languages and analyse whether translated docs always
decrease the general relatedness score
evaluate other DSMs (e.g. Jaccard, Lin and Cosine)
add noisy docs (i.e. off-topic docs) to the corpus and
analyse the DSMs' performance
→ Use this approach to automatically filter out docs with a
low level of relatedness
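The filtering idea could look something like the sketch below. It is purely illustrative: the cut-off rule (mean minus k standard deviations of the final scores), the parameter `k` and the toy scores are all assumptions, and the rule only makes sense for measures where a higher score means more related (NCE or SCC, not χ²).

```python
from statistics import mean, pstdev


def filter_by_relatedness(final_scores, k=1.0):
    """Split docs into kept and dropped using a mean - k * sigma cut-off (assumed rule)."""
    values = list(final_scores.values())
    threshold = mean(values) - k * pstdev(values)
    kept = {doc: s for doc, s in final_scores.items() if s >= threshold}
    dropped = sorted(doc for doc in final_scores if doc not in kept)
    return kept, dropped


# Toy final scores (hypothetical): d4 plays the role of an off-topic doc.
final_scores = {"d1": 0.82, "d2": 0.79, "d3": 0.85, "d4": 0.10}
kept, dropped = filter_by_relatedness(final_scores, k=1.0)
print(sorted(kept), dropped)
```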
Acknowledgements
Hernani Costa is supported by the People Programme (Marie Curie Actions) of the
European Union’s Framework Programme (FP7/2007-2013) under REA grant
agreement no 317471. Also, the research reported in this work has been partially
carried out in the framework of the Educational Innovation Project TRADICOR (PIE
13-054, 2014-2015); the R&D project INTELITERM (ref. no FFI2012-38881,
2012-2015); and the R&D Project for Excellence TERMITUR (ref. no HUM2754,
2014-2017). I would like to thank Prof. Gloria Corpas Pastor, Prof. Ruslan Mitkov and
Dr. Miriam Seghiri for their valuable comments and suggestions to improve this work.
“If we knew what it was we were doing,
it wouldn’t be called ‘research’,
would it?”
Albert Einstein
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 20 / 20

More Related Content

Viewers also liked

Gabriella Gonzalez - eTRAD
Gabriella Gonzalez - eTRAD Gabriella Gonzalez - eTRAD
Gabriella Gonzalez - eTRAD
RIILP
 
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
RIILP
 
Sandra de luca - Acclaro
Sandra de luca - AcclaroSandra de luca - Acclaro
Sandra de luca - Acclaro
RIILP
 
ESR6 Varvara Logacheva - EXPERT Summer School - Malaga 2015
ESR6 Varvara Logacheva - EXPERT Summer School - Malaga 2015ESR6 Varvara Logacheva - EXPERT Summer School - Malaga 2015
ESR6 Varvara Logacheva - EXPERT Summer School - Malaga 2015
RIILP
 
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015
RIILP
 
Carolina Scarton - ESR 7 - USFD
Carolina Scarton - ESR 7 - USFD  Carolina Scarton - ESR 7 - USFD
Carolina Scarton - ESR 7 - USFD
RIILP
 
Manuel Herranz - Pangeanic
Manuel Herranz - Pangeanic Manuel Herranz - Pangeanic
Manuel Herranz - Pangeanic
RIILP
 
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
RIILP
 
Hernani Costa - ESR 3 - UMA
Hernani Costa - ESR 3 - UMA Hernani Costa - ESR 3 - UMA
Hernani Costa - ESR 3 - UMA
RIILP
 
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
RIILP
 
ESR7 Carolina Scarton - EXPERT Summer School - Malaga 2015
ESR7 Carolina Scarton - EXPERT Summer School - Malaga 2015ESR7 Carolina Scarton - EXPERT Summer School - Malaga 2015
ESR7 Carolina Scarton - EXPERT Summer School - Malaga 2015
RIILP
 
Liangyou Li - ESR 8 - DCU
Liangyou Li - ESR 8 - DCU Liangyou Li - ESR 8 - DCU
Liangyou Li - ESR 8 - DCU
RIILP
 
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015
RIILP
 
2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction
RIILP
 
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
RIILP
 
ESR12 Hanna Bechara - EXPERT Summer School - Malaga 2015
ESR12 Hanna Bechara - EXPERT Summer School - Malaga 2015ESR12 Hanna Bechara - EXPERT Summer School - Malaga 2015
ESR12 Hanna Bechara - EXPERT Summer School - Malaga 2015
RIILP
 
7. Intellectual Property - Alberto Massidda (Translated)
7. Intellectual Property - Alberto Massidda (Translated)7. Intellectual Property - Alberto Massidda (Translated)
7. Intellectual Property - Alberto Massidda (Translated)
RIILP
 
11. manuel leiva & juanjo arevalillo (hermes) evaluation of machine translation
11. manuel leiva & juanjo arevalillo (hermes) evaluation of machine translation11. manuel leiva & juanjo arevalillo (hermes) evaluation of machine translation
11. manuel leiva & juanjo arevalillo (hermes) evaluation of machine translation
RIILP
 
10. Lucia Specia (USFD) Evaluation of Machine Translation
10. Lucia Specia (USFD) Evaluation of Machine Translation10. Lucia Specia (USFD) Evaluation of Machine Translation
10. Lucia Specia (USFD) Evaluation of Machine Translation
RIILP
 

Viewers also liked (19)

Gabriella Gonzalez - eTRAD
Gabriella Gonzalez - eTRAD Gabriella Gonzalez - eTRAD
Gabriella Gonzalez - eTRAD
 
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
 
Sandra de luca - Acclaro
Sandra de luca - AcclaroSandra de luca - Acclaro
Sandra de luca - Acclaro
 
ESR6 Varvara Logacheva - EXPERT Summer School - Malaga 2015
ESR6 Varvara Logacheva - EXPERT Summer School - Malaga 2015ESR6 Varvara Logacheva - EXPERT Summer School - Malaga 2015
ESR6 Varvara Logacheva - EXPERT Summer School - Malaga 2015
 
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015
 
Carolina Scarton - ESR 7 - USFD
Carolina Scarton - ESR 7 - USFD  Carolina Scarton - ESR 7 - USFD
Carolina Scarton - ESR 7 - USFD
 
Manuel Herranz - Pangeanic
Manuel Herranz - Pangeanic Manuel Herranz - Pangeanic
Manuel Herranz - Pangeanic
 
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
 
Hernani Costa - ESR 3 - UMA
Hernani Costa - ESR 3 - UMA Hernani Costa - ESR 3 - UMA
Hernani Costa - ESR 3 - UMA
 
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
 
ESR7 Carolina Scarton - EXPERT Summer School - Malaga 2015
ESR7 Carolina Scarton - EXPERT Summer School - Malaga 2015ESR7 Carolina Scarton - EXPERT Summer School - Malaga 2015
ESR7 Carolina Scarton - EXPERT Summer School - Malaga 2015
 
Liangyou Li - ESR 8 - DCU
Liangyou Li - ESR 8 - DCU Liangyou Li - ESR 8 - DCU
Liangyou Li - ESR 8 - DCU
 
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015
 
2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction
 
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
 
ESR12 Hanna Bechara - EXPERT Summer School - Malaga 2015
ESR12 Hanna Bechara - EXPERT Summer School - Malaga 2015ESR12 Hanna Bechara - EXPERT Summer School - Malaga 2015
ESR12 Hanna Bechara - EXPERT Summer School - Malaga 2015
 
7. Intellectual Property - Alberto Massidda (Translated)
7. Intellectual Property - Alberto Massidda (Translated)7. Intellectual Property - Alberto Massidda (Translated)
7. Intellectual Property - Alberto Massidda (Translated)
 
11. manuel leiva & juanjo arevalillo (hermes) evaluation of machine translation
11. manuel leiva & juanjo arevalillo (hermes) evaluation of machine translation11. manuel leiva & juanjo arevalillo (hermes) evaluation of machine translation
11. manuel leiva & juanjo arevalillo (hermes) evaluation of machine translation
 
10. Lucia Specia (USFD) Evaluation of Machine Translation
10. Lucia Specia (USFD) Evaluation of Machine Translation10. Lucia Specia (USFD) Evaluation of Machine Translation
10. Lucia Specia (USFD) Evaluation of Machine Translation
 

Similar to ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015

Writing a scientific manuscript
Writing a scientific manuscriptWriting a scientific manuscript
Writing a scientific manuscript
Martin McMorrow
 
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Christophe Tricot
 
A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...
Francisco Manuel Rangel Pardo
 
Language adaptability and performance evaluation of historical text normaliza...
Language adaptability and performance evaluation of historical text normaliza...Language adaptability and performance evaluation of historical text normaliza...
Language adaptability and performance evaluation of historical text normaliza...
DH Benelux
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
Lifeng (Aaron) Han
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
Lifeng (Aaron) Han
 
XXIII Curso Greta
XXIII Curso GretaXXIII Curso Greta
XXIII Curso Greta
Antonio Rafael Roldán Tapia
 
May 2024 - Top10 Cited Articles in Natural Language Computing
May 2024 - Top10 Cited Articles in Natural Language ComputingMay 2024 - Top10 Cited Articles in Natural Language Computing
May 2024 - Top10 Cited Articles in Natural Language Computing
kevig
 
Applications of CL to FLT
Applications of CL to FLTApplications of CL to FLT
Applications of CL to FLT
Pascual Pérez-Paredes
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
Ali Kabbadj
 
A statistical approach to term extraction.pdf
A statistical approach to term extraction.pdfA statistical approach to term extraction.pdf
A statistical approach to term extraction.pdf
Jasmine Dixon
 
Lesson plans Italy
Lesson plans ItalyLesson plans Italy
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
Dimitris Kontokostas
 
taghelper-final.doc
taghelper-final.doctaghelper-final.doc
taghelper-final.doc
butest
 
Educational Research II, II Bimestre
Educational Research II, II BimestreEducational Research II, II Bimestre
Educational Research II, II Bimestre
Videoconferencias UTPL
 
Revised version of design, production, application and analysis of tblt march...
Revised version of design, production, application and analysis of tblt march...Revised version of design, production, application and analysis of tblt march...
Revised version of design, production, application and analysis of tblt march...
Clara Clavijo Encalada
 
TIFLE Proposal
TIFLE ProposalTIFLE Proposal
TIFLE Proposal
maheyman
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Fwdays
 
rental business
rental businessrental business
rental business
Phamay Nocillado
 
Ph.D. Methodology, Procedures and Behaviour
Ph.D. Methodology, Procedures and BehaviourPh.D. Methodology, Procedures and Behaviour
Ph.D. Methodology, Procedures and Behaviour
Ana Loureiro
 

Similar to ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015 (20)

Writing a scientific manuscript
Writing a scientific manuscriptWriting a scientific manuscript
Writing a scientific manuscript
 
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
 
A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...
 
Language adaptability and performance evaluation of historical text normaliza...
Language adaptability and performance evaluation of historical text normaliza...Language adaptability and performance evaluation of historical text normaliza...
Language adaptability and performance evaluation of historical text normaliza...
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
 
XXIII Curso Greta
XXIII Curso GretaXXIII Curso Greta
XXIII Curso Greta
 
May 2024 - Top10 Cited Articles in Natural Language Computing
May 2024 - Top10 Cited Articles in Natural Language ComputingMay 2024 - Top10 Cited Articles in Natural Language Computing
May 2024 - Top10 Cited Articles in Natural Language Computing
 
Applications of CL to FLT
Applications of CL to FLTApplications of CL to FLT
Applications of CL to FLT
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
 
A statistical approach to term extraction.pdf
A statistical approach to term extraction.pdfA statistical approach to term extraction.pdf
A statistical approach to term extraction.pdf
 
Lesson plans Italy
Lesson plans ItalyLesson plans Italy
Lesson plans Italy
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
 
taghelper-final.doc
taghelper-final.doctaghelper-final.doc
taghelper-final.doc
 
Educational Research II, II Bimestre
Educational Research II, II BimestreEducational Research II, II Bimestre
Educational Research II, II Bimestre
 
Revised version of design, production, application and analysis of tblt march...
Revised version of design, production, application and analysis of tblt march...Revised version of design, production, application and analysis of tblt march...
Revised version of design, production, application and analysis of tblt march...
 
TIFLE Proposal
TIFLE ProposalTIFLE Proposal
TIFLE Proposal
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
 
rental business
rental businessrental business
rental business
 
Ph.D. Methodology, Procedures and Behaviour
Ph.D. Methodology, Procedures and BehaviourPh.D. Methodology, Procedures and Behaviour
Ph.D. Methodology, Procedures and Behaviour
 

More from RIILP

Carla Parra Escartin - ER2 Hermes Traducciones
Carla Parra Escartin - ER2 Hermes Traducciones Carla Parra Escartin - ER2 Hermes Traducciones
Carla Parra Escartin - ER2 Hermes Traducciones
RIILP
 
Juanjo Arevelillo - Hermes Traducciones
Juanjo Arevelillo - Hermes Traducciones Juanjo Arevelillo - Hermes Traducciones
Juanjo Arevelillo - Hermes Traducciones
RIILP
 
Gianluca Giulinin - FAO
Gianluca Giulinin - FAO Gianluca Giulinin - FAO
Gianluca Giulinin - FAO
RIILP
 
Tony O'Dowd - KantanMT
Tony O'Dowd -  KantanMT Tony O'Dowd -  KantanMT
Tony O'Dowd - KantanMT
RIILP
 
Santanu Pal - ESR 2 USAAR
Santanu Pal - ESR 2 USAARSantanu Pal - ESR 2 USAAR
Santanu Pal - ESR 2 USAAR
RIILP
 
Chris Hokamp - ESR 9 DCU
Chris Hokamp - ESR 9 DCU Chris Hokamp - ESR 9 DCU
Chris Hokamp - ESR 9 DCU
RIILP
 
Anna Zaretskaya - ESR 1 UMA
Anna Zaretskaya - ESR 1 UMAAnna Zaretskaya - ESR 1 UMA
Anna Zaretskaya - ESR 1 UMA
RIILP
 
Rohit Gupta - ESR 4 - UoW
Rohit Gupta - ESR 4 - UoW Rohit Gupta - ESR 4 - UoW
Rohit Gupta - ESR 4 - UoW
RIILP
 
Liling Tan - ESR 5 USAAR
Liling Tan - ESR 5 USAARLiling Tan - ESR 5 USAAR
Liling Tan - ESR 5 USAAR
RIILP
 
ESR4 Rohit Gupta - EXPERT Summer School - Malaga 2015
ESR4 Rohit Gupta - EXPERT Summer School - Malaga 2015ESR4 Rohit Gupta - EXPERT Summer School - Malaga 2015
ESR4 Rohit Gupta - EXPERT Summer School - Malaga 2015
RIILP
 
ESR5 Liling Tan - EXPERT Summer School - Malaga 2015
ESR5 Liling Tan - EXPERT Summer School - Malaga 2015ESR5 Liling Tan - EXPERT Summer School - Malaga 2015
ESR5 Liling Tan - EXPERT Summer School - Malaga 2015
RIILP
 
ESR8 Liangyou Li - EXPERT Summer School - Malaga 2015
ESR8 Liangyou Li - EXPERT Summer School - Malaga 2015ESR8 Liangyou Li - EXPERT Summer School - Malaga 2015
ESR8 Liangyou Li - EXPERT Summer School - Malaga 2015
RIILP
 
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
RIILP
 
9. Ethics - Juan Jose Arevalillo Doval (Hermes)
9. Ethics - Juan Jose Arevalillo Doval (Hermes)9. Ethics - Juan Jose Arevalillo Doval (Hermes)
9. Ethics - Juan Jose Arevalillo Doval (Hermes)
RIILP
 
8. Transfer of Technology to Market and Commercial Exploitation of Results - ...
8. Transfer of Technology to Market and Commercial Exploitation of Results - ...8. Transfer of Technology to Market and Commercial Exploitation of Results - ...
8. Transfer of Technology to Market and Commercial Exploitation of Results - ...
RIILP
 

More from RIILP (15)

Carla Parra Escartin - ER2 Hermes Traducciones
Carla Parra Escartin - ER2 Hermes Traducciones Carla Parra Escartin - ER2 Hermes Traducciones
Carla Parra Escartin - ER2 Hermes Traducciones
 
Juanjo Arevelillo - Hermes Traducciones
Juanjo Arevelillo - Hermes Traducciones Juanjo Arevelillo - Hermes Traducciones
Juanjo Arevelillo - Hermes Traducciones
 
Gianluca Giulinin - FAO
Gianluca Giulinin - FAO Gianluca Giulinin - FAO
Gianluca Giulinin - FAO
 
Tony O'Dowd - KantanMT
Tony O'Dowd -  KantanMT Tony O'Dowd -  KantanMT
Tony O'Dowd - KantanMT
 
Santanu Pal - ESR 2 USAAR
Santanu Pal - ESR 2 USAARSantanu Pal - ESR 2 USAAR
Santanu Pal - ESR 2 USAAR
 
Chris Hokamp - ESR 9 DCU
Chris Hokamp - ESR 9 DCU Chris Hokamp - ESR 9 DCU
Chris Hokamp - ESR 9 DCU
 
Anna Zaretskaya - ESR 1 UMA
Anna Zaretskaya - ESR 1 UMAAnna Zaretskaya - ESR 1 UMA
Anna Zaretskaya - ESR 1 UMA
 
Rohit Gupta - ESR 4 - UoW
Rohit Gupta - ESR 4 - UoW Rohit Gupta - ESR 4 - UoW
Rohit Gupta - ESR 4 - UoW
 
Liling Tan - ESR 5 USAAR
Liling Tan - ESR 5 USAARLiling Tan - ESR 5 USAAR
Liling Tan - ESR 5 USAAR
 
ESR4 Rohit Gupta - EXPERT Summer School - Malaga 2015
ESR4 Rohit Gupta - EXPERT Summer School - Malaga 2015ESR4 Rohit Gupta - EXPERT Summer School - Malaga 2015
ESR4 Rohit Gupta - EXPERT Summer School - Malaga 2015
 
ESR5 Liling Tan - EXPERT Summer School - Malaga 2015
ESR5 Liling Tan - EXPERT Summer School - Malaga 2015ESR5 Liling Tan - EXPERT Summer School - Malaga 2015
ESR5 Liling Tan - EXPERT Summer School - Malaga 2015
 
ESR8 Liangyou Li - EXPERT Summer School - Malaga 2015
ESR8 Liangyou Li - EXPERT Summer School - Malaga 2015ESR8 Liangyou Li - EXPERT Summer School - Malaga 2015
ESR8 Liangyou Li - EXPERT Summer School - Malaga 2015
 
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
 
9. Ethics - Juan Jose Arevalillo Doval (Hermes)
9. Ethics - Juan Jose Arevalillo Doval (Hermes)9. Ethics - Juan Jose Arevalillo Doval (Hermes)
9. Ethics - Juan Jose Arevalillo Doval (Hermes)
 
8. Transfer of Technology to Market and Commercial Exploitation of Results - ...
8. Transfer of Technology to Market and Commercial Exploitation of Results - ...8. Transfer of Technology to Market and Commercial Exploitation of Results - ...
8. Transfer of Technology to Market and Commercial Exploitation of Results - ...
 

ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015

  • 1. Introduction Methodology Experiment Conclusion Assessing Comparable Corpora through Distributional Similarity Measures Hernani Costa hercos@uma.es University of Malaga June, 2015 Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 1 / 20
  • 2. Introduction: Overview. Comparable Corpora (CC) are considered an important resource in many research areas, such as automatic and assisted translation, language teaching and terminology. Describing, comparing and evaluating CC are key issues in these areas, for which there is still a notable lack of standards. Bearing this in mind, this work investigates the use of textual Distributional Similarity Measures (DSMs) as a tool to assess the relatedness between docs in CC.
  • 3. Introduction: Motivation. An inherent problem for those who deal with CC on a daily basis is the uncertainty about the data they are dealing with: usually, tags like “casual speech transcripts” or “tourism specialised comparable corpus” are not enough to describe a corpus. Most of the resources at our disposal are built and shared without a deep analysis of their content and without any knowledge of the relatedness quality of the corpus.
  • 4. Introduction: Objectives. Investigate the use of textual DSMs in the context of CC: automatically measure the relatedness between docs; analyse which input features perform better; describe CC through the DSMs' output scores.
  • 5. Methodology. 1) Data preprocessing: sentence detection and tokenisation (OpenNLP [1]); POS tagging and lemmatisation (TT4J [2]); stemming (Snowball [3]); stopword list [4]. 2) Identifying the list of common entities between docs: three co-occurrence matrices (common tokens, common lemmas and common stems).
    [1] https://opennlp.apache.org
    [2] http://reckart.github.io/tt4j/
    [3] http://snowball.tartarus.org
    [4] https://github.com/hpcosta/stopwords
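
As a rough illustration of step 2, the sketch below lists the entities shared by each pair of docs. It is only a stand-in: the regular-expression tokeniser, the tiny `STOPWORDS` set and the helper names are assumptions made for the example, not the OpenNLP/TT4J/Snowball pipeline named on the slide, and the ASCII-only pattern would need adapting for Spanish text.

```python
import re
from itertools import combinations

# Hypothetical stopword list; the slide points to a separate stopword resource.
STOPWORDS = {"the", "a", "of", "in", "and", "is", "to"}

def tokens(text):
    """Lowercase ASCII word tokens with stopwords removed (stand-in for the real pipeline)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def common_entities(docs):
    """Map each unordered doc pair to the sorted list of entities both docs contain."""
    vocab = {name: set(tokens(text)) for name, text in docs.items()}
    return {
        (a, b): sorted(vocab[a] & vocab[b])
        for a, b in combinations(sorted(docs), 2)
    }

# Toy documents, invented for the example.
docs = {
    "d1": "The hotel is in the centre of Malaga",
    "d2": "A hotel and a beach in Malaga",
    "d3": "Flights to Lisbon",
}
pairs = common_entities(docs)
print(pairs[("d1", "d2")])
```

Swapping `tokens` for a lemmatiser or stemmer would yield the common-lemma and common-stem lists in the same way.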
  • 6. Methodology. 3) Computing the similarity between docs. Input: lists of common tokens, lemmas and stems. DSMs = {DSM_NCE, DSM_SCC, DSM_χ²}, where NCE is the Number of Common Entities, SCC is Spearman's Rank Correlation Coefficient and χ² is Chi-Square.
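
The three measures can be sketched over per-doc frequency counts of the common entities. The slide does not spell out the exact formulations, so the choices here are assumptions: SCC is computed as the Pearson correlation of average ranks, and χ² compares observed frequencies with those expected if both docs drew from the same distribution (under this formulation a lower χ² means more similar distributions). All function names are hypothetical.

```python
from collections import Counter

def nce(freq_a, freq_b):
    """Number of Common Entities: size of the shared vocabulary."""
    return len(freq_a.keys() & freq_b.keys())

def _ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def scc(freq_a, freq_b):
    """Spearman's rank correlation over the frequencies of the common entities."""
    common = sorted(freq_a.keys() & freq_b.keys())
    if len(common) < 2:
        return 0.0
    ra = _ranks([freq_a[w] for w in common])
    rb = _ranks([freq_b[w] for w in common])
    n = len(common)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    if va == 0 or vb == 0:
        return 0.0  # all ranks tied: correlation undefined, reported as 0 here
    return cov / (va * vb) ** 0.5

def chi_square(freq_a, freq_b):
    """Chi-square over common entities (assumed formulation, see lead-in)."""
    common = freq_a.keys() & freq_b.keys()
    total_a = sum(freq_a[w] for w in common)
    total_b = sum(freq_b[w] for w in common)
    total = total_a + total_b
    score = 0.0
    for w in common:
        row = freq_a[w] + freq_b[w]
        for observed, col in ((freq_a[w], total_a), (freq_b[w], total_b)):
            expected = row * col / total
            score += (observed - expected) ** 2 / expected
    return score

# Toy counts, invented for the example.
a = Counter({"hotel": 3, "beach": 2, "malaga": 1})
b = Counter({"hotel": 6, "beach": 4, "malaga": 2, "flight": 1})
print(nce(a, b), scc(a, b), chi_square(a, b))
```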
  • 7. Methodology. 4) Computing the final score of each doc [formula not preserved in this transcript], where n is the total number of docs and DSM_i(d_l, d_i) is the resulting similarity score between the doc d_l and all the docs.
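
The aggregation formula itself is not legible in this transcript. One plausible reading, stated purely as an assumption, is that the final score of a doc is the mean of its DSM scores against every other doc in the subcorpus. The sketch below implements that reading; `final_scores` and `shared` are hypothetical names.

```python
def final_scores(docs, dsm):
    """Mean DSM score of each doc against every other doc (assumed aggregation)."""
    names = sorted(docs)
    scores = {}
    for name in names:
        others = [dsm(docs[name], docs[o]) for o in names if o != name]
        scores[name] = sum(others) / len(others) if others else 0.0
    return scores

def shared(a, b):
    """Toy DSM: number of entities the two docs have in common."""
    return len(set(a) & set(b))

# Toy entity lists, invented for the example.
docs = {
    "d1": ["hotel", "beach", "malaga"],
    "d2": ["hotel", "malaga"],
    "d3": ["flight"],
}
scores = final_scores(docs, shared)
print(scores)
```

Any pairwise measure with the same two-argument shape can be passed as `dsm`.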
  • 8. The INTELITERM Corpus. Statistical information about the various INTELITERM [5] subcorpora:
        subcorpus   nDocs   types   tokens   types/tokens   description
        en_to       151     11.6k   508.9k   0.023          original
        en_totd     61      6.9k    88.5k    0.078          translated
        es_to       225     12.6k   253.4k   0.049          original
        es_totd     27      3.4k    19.7k    0.174          translated
    [5] http://www.lexytrad.es/proyectos.html
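
The types/tokens column can be checked with simple arithmetic. The sketch below recomputes it from the rounded counts in the table; because those counts are rounded to the nearest hundred, the recomputed ratios are expected to match the reported ones only approximately.

```python
# (types, tokens, reported types/tokens ratio) per subcorpus, taken from the table.
subcorpora = {
    "en_to": (11_600, 508_900, 0.023),
    "en_totd": (6_900, 88_500, 0.078),
    "es_to": (12_600, 253_400, 0.049),
    "es_totd": (3_400, 19_700, 0.174),
}

def type_token_ratio(types, tokens):
    """Distinct word forms divided by running words."""
    return types / tokens

ratios = {
    name: type_token_ratio(types, tokens)
    for name, (types, tokens, _reported) in subcorpora.items()
}

for name, (_types, _tokens, reported) in subcorpora.items():
    print(f"{name}: recomputed {ratios[name]:.3f}, reported {reported:.3f}")
```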
  • 9. Results. [Figure: Common Tokens; x-axis: subcorpus (en_to, en_totd, es_to, es_totd); y-axis: average of common tokens per document, 0 to 300.] Average (av) and standard deviation (σ) of the number of common tokens (NCT) scores between docs, per subcorpus:
        subcorpus   av       σ
        en_to       163.70   83.87
        en_totd     67.54    35.35
        es_to       31.97    23.48
        es_totd     17.93    8.46
  • 10. General Findings. [Figures: Common Tokens, Common Lemmas and Common Stems; x-axis: subcorpus (en_to, en_totd, es_to, es_totd); y-axis: average of common tokens, lemmas and stems per document, 0 to 300.] The distributions of the three features are quite similar → it is possible to achieve acceptable results using only tokens. The scores for each subcorpus are roughly symmetric → the data is normally distributed. Original docs vs. translated docs: original docs show a higher NCE → different translators use different vocabulary, and consequently the NCE between translated docs is lower.
  • 11. English - Original Docs (en_to). NCT per doc is higher on average + large IQR + long whiskers → the data is more spread out + the average NCT per doc is more variable + a wide variety of docs (either highly or only roughly correlated with the rest of the docs).
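
The IQR and whiskers mentioned here are box-plot statistics. As a minimal sketch, the function below computes them for a list of per-doc scores, assuming the common Tukey convention of whiskers reaching the most extreme values within 1.5 × IQR of the quartiles; the slides do not say which convention the plots used, and the sample scores are invented.

```python
import statistics

def box_stats(scores):
    """Quartiles, IQR and Tukey-style whisker ends for a list of scores."""
    data = sorted(scores)
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

# Hypothetical per-doc scores.
stats = box_stats([1, 2, 3, 4, 5, 6, 7])
print(stats)
```

A large `iqr` together with whisker ends far from the quartiles corresponds to the "more spread out" reading given on the slide.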
  • 12. English - Original Docs (en_to). NCT: data is skewed right + SCC: high average scores + χ²: long whisker outside the upper quartile → docs have a high degree of relatedness with each other.
  • 13. English - Translated Docs (en_totd). The NCT, SCC and χ² scores suggest that the data is either normally distributed or skewed left → docs are highly related.
  • 14. Spanish vs. English. Lower NCT when compared with en_to and en_totd → Spanish has richer morphology (a larger number of inflected forms per lemma) → fewer common tokens per doc.
  • 15. Spanish - Translated Docs (es_totd). NCT: low + SCC: data varies inside and outside the IQR + χ²: σ higher than its av. → inconsistency in the data: small subcorpus size (27 docs); low number of types and tokens; high types/tokens ratio, which suggests that a more diverse form of language is employed.
  • 16. Summary. From the statistical and theoretical evidence: en_to, en_totd and es_to assemble highly correlated docs; for es_totd, the small number of docs and the scarcity of evidence only allow us not to reject the idea that this subcorpus is composed of similar docs.
  • 17. Conclusion. This work presents and studies various DSMs for the purpose of describing specialised CC. As input for these DSMs, we used three different input features (lists of common tokens, lemmas and stems); for the data in hand, these features had similar performance for all the tested DSMs. The high number of entities shared by its docs, the positive average scores obtained with the SCC measure and their χ² scores support the claim that the INTELITERM corpus is composed of highly correlated docs.
  • 18. Future Work. Perform more experiments with DSMs: use other languages and analyse whether translated docs always decrease the general relatedness score; evaluate other DSMs (e.g. Jaccard, Lin and Cosine); add noisy docs (i.e. out-of-topic docs) to the corpus and analyse the DSMs' performance → use this approach to automatically filter out docs with a low level of relatedness.
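
The filtering idea can be sketched with Jaccard, one of the measures listed as future work. Everything here is illustrative: the mean-over-other-docs aggregation, the `threshold` value and the function names are assumptions, not results or settings reported in the slides.

```python
def jaccard(a, b):
    """Jaccard coefficient: shared entities over all entities of the two docs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def filter_docs(docs, threshold):
    """Keep docs whose mean Jaccard score against the other docs reaches the threshold."""
    kept = {}
    for name, entities in docs.items():
        others = [jaccard(entities, e) for o, e in docs.items() if o != name]
        mean = sum(others) / len(others) if others else 0.0
        if mean >= threshold:
            kept[name] = mean
    return kept

# Toy corpus with one out-of-topic doc (d3), invented for the example.
docs = {
    "d1": ["hotel", "beach", "malaga"],
    "d2": ["hotel", "beach", "spain"],
    "d3": ["tax", "invoice"],
}
kept = filter_docs(docs, threshold=0.2)  # hypothetical cut-off
print(sorted(kept))
```

In practice the cut-off would have to be chosen empirically, for example by adding noisy docs as the slide proposes and observing how their scores separate from the rest.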
  • 19. Acknowledgements. Hernani Costa is supported by the People Programme (Marie Curie Actions) of the European Union’s Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471. Also, the research reported in this work has been partially carried out in the framework of the Educational Innovation Project TRADICOR (PIE 13-054, 2014-2015); the R&D project INTELITERM (ref. no. FFI2012-38881, 2012-2015); and the R&D Project for Excellence TERMITUR (ref. no. HUM2754, 2014-2017). I would like to thank Prof. Gloria Corpas Pastor, Prof. Ruslan Mitkov and Dr. Miriam Seghiri for their valuable comments and suggestions to improve this work.
  • 20. “If we knew what it was we were doing, it wouldn’t be called ‘research’, would it?” Albert Einstein