SlideShare a Scribd company logo
Introduction
Methodology
Experiment
Conclusion
Assessing Comparable Corpora
through Distributional Similarity Measures
Hernani Costa
hercos@uma.es
University of Malaga
June, 2015
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 1 / 20
Introduction
Methodology
Experiment
Conclusion
Introduction
Overview
• Comparable Corpora (CC) is considered an important resource
in many research areas
automatic and assisted translation
language teaching
terminology
• Describing, comparing and evaluating CC are key issues in
these areas for which there is still a notable lack of standards
• Bearing this in mind, this work aims at investigating the use
of textual Distributional Similarity Measures (DSMs) as a tool
to assess the relatedness between docs in CC
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 2 / 20
Introduction
Methodology
Experiment
Conclusion
Introduction
Motivation
• An inherent problem to those who deal with CC in a daily
basis is the uncertainty about the data they are dealing with
usually, tags like “casual speech transcripts” or “tourism
specialised comparable corpus” are not enough to describe a
corpus
• Most of the resources at our disposal are
built and shared without deep analysis of their content
without knowing nothing about the relatedness quality of the
corpus
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 3 / 20
Introduction
Methodology
Experiment
Conclusion
Introduction
Objectives
Investigate the use of textual DSMs in the context of CC
• automatically measure the relatedness between docs
• analyse which input features perform better
• describe CC through the DSMs output scores
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 4 / 20
Introduction
Methodology
Experiment
Conclusion
Methodology
Methodology
1) Data Preprocessing
• Sentence Detector and Tokeniser – OpenNLP1
• POS tagger and lemmatisation – TT4J2
• Stemming – Snowball3
• Stopword list4
2) Identifying the list of common entities between docs
• Three co-occurrence matrices
common tokens, common lemmas and common stems
1
https://opennlp.apache.org
2
http://reckart.github.io/tt4j/
3
http://snowball.tartarus.org
4
https://github.com/hpcosta/stopwords
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 5 / 20
Introduction
Methodology
Experiment
Conclusion
Methodology
Methodology
3) Computing the similarity between docs
110 001
111 001
000 010
001 001
111 101
111 010
100 011
101 010
111 000
001 101
110 100
011 100
…
• Input: list of common tokens, lemmas and stems
• DSMs = {DSMNCE , DSMSCC , DSMχ2 }
NCE: Number of Common Entities
SCC: Spearman’s Rank Correlation Coefficient
χ2
: Chi-Square
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 6 / 20
Introduction
Methodology
Experiment
Conclusion
Methodology
Methodology
4) Computing the doc final score
110 001
111 001
000 010
001 001
111 101
111 010
100 011
101 010
111 000
001 101
110 100
011 100
…
where
n: total number of docs
DSMi (dl , di ): the resulted similarity score between the doc dl with all the
docs
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 7 / 20
Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
The INTELITERM Corpus
Statistical information about the various INTELITERM5
subcorpora
nDocs types tokens types
tokens description
en to 151 11,6k 508,9k 0.023 original
en totd 61 6,9k 88,5k 0.078 translated
es to 225 12,6k 253,4k 0.049 original
es totd 27 3,4k 19,7k 0.174 translated
5
http://www.lexytrad.es/proyectos.html
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 8 / 20
Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
Results
en_to en_totd es_to es_totd
050100150200250300
Common Tokens
Subcorpus
Averageofcommontokensperdocument
Average and standard deviation of
common tokens scores between doc
per subcorpus
NCT
en to
av 163.70
σ 83.87
en totd
av 67.54
σ 35.35
es to
av 31.97
σ 23.48
es totd
av 17.93
σ 8.46
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 9 / 20
Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
General Findings
en_to en_totd es_to es_totd
050100150200250300
Common Tokens
Subcorpus
Averageofcommontokensperdocument
en_to en_totd es_to es_totd
050100150200250300
Common Lemmas
Subcorpus
Averageofcommonlemmasperdocument
en_to en_totd es_to es_totd
050100150200250300
Common Stemms
Subcorpus
Averageofcommonstemmsperdocument
distributions between the features are quite similar
→ it is possible to achieve acceptable results only using tokens
scores for each subcorpus is roughly symmetric
→ data is normally distributed
original docs vs. translated docs: original docs high NCE
→ different translators use different vocabulary and
consequently lower the NCE between the docs will be
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 10 / 20
Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
English - Original Docs (en to)
NCT per doc on average is higher + large IQR + long whiskers
→ data is more spread + average of NCT per doc is more
variable + wide type of docs (either highly or roughly
correlated to the rest of the docs)
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 11 / 20
Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
English - Original Docs (en to)
NCT: data is skewed right + SCC: high average scores +
χ2
: long whisker outside the upper quartile
→ docs have a high degree of relatedness between each other
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 12 / 20
Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
English - Translated Docs (en totd)
the NCT, the SCC and the χ2
scores suggest that the data is
either normally distributed or skewed left
→ docs are highly related
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 13 / 20
Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
Spanish vs. English
lower NCT when compared with en to and en totd
→ Spanish has richer morphology (bigger number of inflection
forms per lemma)
→ less common tokens per doc
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 14 / 20
Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
Spanish - Translated Docs (es totd)
NCT: low + SCC: data varies inside and outside the IQR +
χ2
: σ higher than its av.
→ inconstancy in the data
− subcorpus size (27 docs)
− low number of types & tokens
− high types
tokens
ratio suggests that a more diverse form of language is employed
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 15 / 20
Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
Summary
From the statistical and theoretical evidences
• en to, en totd and es to
assemble highly correlated docs
• es totd
the small number of docs and the scarceness of evidences only
allow was to not reject the idea that this subcorpus is
composed of similar docs
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 16 / 20
Introduction
Methodology
Experiment
Conclusion
Conclusion
Future Work
Conclusion
• This work presents and studies various DSMs for the purpose
of describing specialised CC
• As input for these DSMs, we used three different input
features (lists of common tokens, lemmas and stems)
for the data in hand, these features had similar performance
for all the tested DSMs
• The high number of entities shared by its docs, the positive
average scores obtained with the SCC measure and their χ2
scores sustain that the INTELITERM corpus is composed of
highly correlated docs
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 17 / 20
Introduction
Methodology
Experiment
Conclusion
Conclusion
Future Work
Future Work
• Perform more experiments with DSMs
use other languages and analyse if translated docs always
decrease the general relatedness score
evaluate other DSMs (e.g. Jaccard, Lin and Cosine)
add noisy docs (i.e. out of topic docs) to the corpus and
analyse the DSMs performance
→ Use this approach to automatically filter out docs with a
low level of relatedness
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 18 / 20
Introduction
Methodology
Experiment
Conclusion
Conclusion
Future Work
Acknowledgements
Hernani Costa is supported by the People Programme (Marie Curie Actions) of the
European Union’s Framework Programme (FP7/2007-2013) under REA grant
agreement no 317471. Also, the research reported in this work has been partially
carried out in the framework of the Educational Innovation Project TRADICOR (PIE
13-054, 2014-2015); the R&D project INTELITERM (ref. no FFI2012-38881,
2012-2015); and the R&D Project for Excellence TERMITUR (ref. no HUM2754,
2014-2017). I would like to thank Prof. Gloria Corpas Pastor, Prof. Ruslan Mitkov and
Dr. Miriam Seghiri for their valuable comments and suggestions to improve this work.
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 19 / 20
Introduction
Methodology
Experiment
Conclusion
Conclusion
Future Work
“If we knew what it was we were doing,
it wouldn’t be called ‘research’,
would it?”
Albert Einstein
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 20 / 20

More Related Content

Viewers also liked

Purusharth paramdev
Purusharth paramdevPurusharth paramdev
Purusharth paramdevgurusewa
 
Weekly news 9
Weekly news 9Weekly news 9
Weekly news 9samankit
 
Sf Core Summit
Sf Core SummitSf Core Summit
Sf Core Summitcyberswat
 
Ley 1014
Ley 1014Ley 1014
Anmol bhajansangrah
Anmol bhajansangrahAnmol bhajansangrah
Anmol bhajansangrahgurusewa
 
Gagar meinsagar
Gagar meinsagarGagar meinsagar
Gagar meinsagargurusewa
 
Mapata Atatu (Chichewa)
Mapata Atatu (Chichewa)Mapata Atatu (Chichewa)
Mapata Atatu (Chichewa)
rmilanzi01
 
Kavya pushpanjali
Kavya pushpanjaliKavya pushpanjali
Kavya pushpanjaligurusewa
 
D7 entities fields
D7 entities fieldsD7 entities fields
D7 entities fields
cyberswat
 
Ppt0000001
Ppt0000001Ppt0000001
Ppt0000001
Ángeles Cuéllar
 
Vip First Class Turkey 2012
Vip First Class Turkey 2012Vip First Class Turkey 2012
Vip First Class Turkey 2012
g_emrahay
 
טקס יום הזיכרון לחללי צהל
טקס יום הזיכרון לחללי צהלטקס יום הזיכרון לחללי צהל
טקס יום הזיכרון לחללי צהלguest60cd323
 
Workplace Economy Slides September 2011 9.7.2011
Workplace Economy Slides September 2011 9.7.2011Workplace Economy Slides September 2011 9.7.2011
Workplace Economy Slides September 2011 9.7.2011Michael Shafran
 
The bible; how to study and understand it#2
The bible; how to study and understand it#2The bible; how to study and understand it#2
The bible; how to study and understand it#2Rick Dupperon
 
Yog yatra3
Yog yatra3Yog yatra3
Yog yatra3gurusewa
 
Mehakate phool
Mehakate phoolMehakate phool
Mehakate phoolgurusewa
 
Power Point
Power PointPower Point
Power PointJoseba
 
Blogs and wikis keynote
Blogs and wikis keynoteBlogs and wikis keynote
Blogs and wikis keynoteTeresa Wells
 
Integral politics
Integral politicsIntegral politics
Integral politics
Integral Vision
 
SITO Presentaion - by Paul
SITO Presentaion - by PaulSITO Presentaion - by Paul
SITO Presentaion - by Paul
Paul Hiscox
 

Viewers also liked (20)

Purusharth paramdev
Purusharth paramdevPurusharth paramdev
Purusharth paramdev
 
Weekly news 9
Weekly news 9Weekly news 9
Weekly news 9
 
Sf Core Summit
Sf Core SummitSf Core Summit
Sf Core Summit
 
Ley 1014
Ley 1014Ley 1014
Ley 1014
 
Anmol bhajansangrah
Anmol bhajansangrahAnmol bhajansangrah
Anmol bhajansangrah
 
Gagar meinsagar
Gagar meinsagarGagar meinsagar
Gagar meinsagar
 
Mapata Atatu (Chichewa)
Mapata Atatu (Chichewa)Mapata Atatu (Chichewa)
Mapata Atatu (Chichewa)
 
Kavya pushpanjali
Kavya pushpanjaliKavya pushpanjali
Kavya pushpanjali
 
D7 entities fields
D7 entities fieldsD7 entities fields
D7 entities fields
 
Ppt0000001
Ppt0000001Ppt0000001
Ppt0000001
 
Vip First Class Turkey 2012
Vip First Class Turkey 2012Vip First Class Turkey 2012
Vip First Class Turkey 2012
 
טקס יום הזיכרון לחללי צהל
טקס יום הזיכרון לחללי צהלטקס יום הזיכרון לחללי צהל
טקס יום הזיכרון לחללי צהל
 
Workplace Economy Slides September 2011 9.7.2011
Workplace Economy Slides September 2011 9.7.2011Workplace Economy Slides September 2011 9.7.2011
Workplace Economy Slides September 2011 9.7.2011
 
The bible; how to study and understand it#2
The bible; how to study and understand it#2The bible; how to study and understand it#2
The bible; how to study and understand it#2
 
Yog yatra3
Yog yatra3Yog yatra3
Yog yatra3
 
Mehakate phool
Mehakate phoolMehakate phool
Mehakate phool
 
Power Point
Power PointPower Point
Power Point
 
Blogs and wikis keynote
Blogs and wikis keynoteBlogs and wikis keynote
Blogs and wikis keynote
 
Integral politics
Integral politicsIntegral politics
Integral politics
 
SITO Presentaion - by Paul
SITO Presentaion - by PaulSITO Presentaion - by Paul
SITO Presentaion - by Paul
 

Similar to EXPERT_Malaga-ESR03

Writing a scientific manuscript
Writing a scientific manuscriptWriting a scientific manuscript
Writing a scientific manuscript
Martin McMorrow
 
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...Christophe Tricot
 
A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...
Francisco Manuel Rangel Pardo
 
Language adaptability and performance evaluation of historical text normaliza...
Language adaptability and performance evaluation of historical text normaliza...Language adaptability and performance evaluation of historical text normaliza...
Language adaptability and performance evaluation of historical text normaliza...
DH Benelux
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...Lifeng (Aaron) Han
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Lifeng (Aaron) Han
 
XXIII Curso Greta
XXIII Curso GretaXXIII Curso Greta
XXIII Curso Greta
Antonio Rafael Roldán Tapia
 
May 2024 - Top10 Cited Articles in Natural Language Computing
May 2024 - Top10 Cited Articles in Natural Language ComputingMay 2024 - Top10 Cited Articles in Natural Language Computing
May 2024 - Top10 Cited Articles in Natural Language Computing
kevig
 
Applications of CL to FLT
Applications of CL to FLTApplications of CL to FLT
Applications of CL to FLT
Pascual Pérez-Paredes
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
Ali Kabbadj
 
A statistical approach to term extraction.pdf
A statistical approach to term extraction.pdfA statistical approach to term extraction.pdf
A statistical approach to term extraction.pdf
Jasmine Dixon
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
Dimitris Kontokostas
 
taghelper-final.doc
taghelper-final.doctaghelper-final.doc
taghelper-final.docbutest
 
Educational Research II, II Bimestre
Educational Research II, II BimestreEducational Research II, II Bimestre
Educational Research II, II Bimestre
Videoconferencias UTPL
 
Revised version of design, production, application and analysis of tblt march...
Revised version of design, production, application and analysis of tblt march...Revised version of design, production, application and analysis of tblt march...
Revised version of design, production, application and analysis of tblt march...
Clara Clavijo Encalada
 
TIFLE Proposal
TIFLE ProposalTIFLE Proposal
TIFLE Proposal
maheyman
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Fwdays
 
Ph.D. Methodology, Procedures and Behaviour
Ph.D. Methodology, Procedures and BehaviourPh.D. Methodology, Procedures and Behaviour
Ph.D. Methodology, Procedures and Behaviour
Ana Loureiro
 

Similar to EXPERT_Malaga-ESR03 (20)

Writing a scientific manuscript
Writing a scientific manuscriptWriting a scientific manuscript
Writing a scientific manuscript
 
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
 
A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...
 
Language adaptability and performance evaluation of historical text normaliza...
Language adaptability and performance evaluation of historical text normaliza...Language adaptability and performance evaluation of historical text normaliza...
Language adaptability and performance evaluation of historical text normaliza...
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
 
XXIII Curso Greta
XXIII Curso GretaXXIII Curso Greta
XXIII Curso Greta
 
May 2024 - Top10 Cited Articles in Natural Language Computing
May 2024 - Top10 Cited Articles in Natural Language ComputingMay 2024 - Top10 Cited Articles in Natural Language Computing
May 2024 - Top10 Cited Articles in Natural Language Computing
 
Applications of CL to FLT
Applications of CL to FLTApplications of CL to FLT
Applications of CL to FLT
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
 
A statistical approach to term extraction.pdf
A statistical approach to term extraction.pdfA statistical approach to term extraction.pdf
A statistical approach to term extraction.pdf
 
Lesson plans Italy
Lesson plans ItalyLesson plans Italy
Lesson plans Italy
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
 
taghelper-final.doc
taghelper-final.doctaghelper-final.doc
taghelper-final.doc
 
Educational Research II, II Bimestre
Educational Research II, II BimestreEducational Research II, II Bimestre
Educational Research II, II Bimestre
 
Revised version of design, production, application and analysis of tblt march...
Revised version of design, production, application and analysis of tblt march...Revised version of design, production, application and analysis of tblt march...
Revised version of design, production, application and analysis of tblt march...
 
TIFLE Proposal
TIFLE ProposalTIFLE Proposal
TIFLE Proposal
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
 
rental business
rental businessrental business
rental business
 
Ph.D. Methodology, Procedures and Behaviour
Ph.D. Methodology, Procedures and BehaviourPh.D. Methodology, Procedures and Behaviour
Ph.D. Methodology, Procedures and Behaviour
 

EXPERT_Malaga-ESR03

  • 1. Introduction Methodology Experiment Conclusion Assessing Comparable Corpora through Distributional Similarity Measures Hernani Costa hercos@uma.es University of Malaga June, 2015 Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 1 / 20
  • 2. Introduction Methodology Experiment Conclusion Introduction Overview • Comparable Corpora (CC) is considered an important resource in many research areas automatic and assisted translation language teaching terminology • Describing, comparing and evaluating CC are key issues in these areas for which there is still a notable lack of standards • Bearing this in mind, this work aims at investigating the use of textual Distributional Similarity Measures (DSMs) as a tool to assess the relatedness between docs in CC Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 2 / 20
  • 3. Introduction Methodology Experiment Conclusion Introduction Motivation • An inherent problem to those who deal with CC in a daily basis is the uncertainty about the data they are dealing with usually, tags like “casual speech transcripts” or “tourism specialised comparable corpus” are not enough to describe a corpus • Most of the resources at our disposal are built and shared without deep analysis of their content without knowing nothing about the relatedness quality of the corpus Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 3 / 20
  • 4. Introduction Methodology Experiment Conclusion Introduction Objectives Investigate the use of textual DSMs in the context of CC • automatically measure the relatedness between docs • analyse which input features perform better • describe CC through the DSMs output scores Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 4 / 20
  • 5. Introduction Methodology Experiment Conclusion Methodology Methodology 1) Data Preprocessing • Sentence Detector and Tokeniser – OpenNLP1 • POS tagger and lemmatisation – TT4J2 • Stemming – Snowball3 • Stopword list4 2) Identifying the list of common entities between docs • Three co-occurrence matrices common tokens, common lemmas and common stems 1 https://opennlp.apache.org 2 http://reckart.github.io/tt4j/ 3 http://snowball.tartarus.org 4 https://github.com/hpcosta/stopwords Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 5 / 20
  • 6. Introduction Methodology Experiment Conclusion Methodology Methodology 3) Computing the similarity between docs 110 001 111 001 000 010 001 001 111 101 111 010 100 011 101 010 111 000 001 101 110 100 011 100 … • Input: list of common tokens, lemmas and stems • DSMs = {DSMNCE , DSMSCC , DSMχ2 } NCE: Number of Common Entities SCC: Spearman’s Rank Correlation Coefficient χ2 : Chi-Square Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 6 / 20
  • 7. Introduction Methodology Experiment Conclusion Methodology Methodology 4) Computing the doc final score 110 001 111 001 000 010 001 001 111 101 111 010 100 011 101 010 111 000 001 101 110 100 011 100 … where n: total number of docs DSMi (dl , di ): the resulted similarity score between the doc dl with all the docs Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 7 / 20
  • 8. Introduction Methodology Experiment Conclusion The Corpus Results & Analysis The INTELITERM Corpus Statistical information about the various INTELITERM5 subcorpora nDocs types tokens types tokens description en to 151 11,6k 508,9k 0.023 original en totd 61 6,9k 88,5k 0.078 translated es to 225 12,6k 253,4k 0.049 original es totd 27 3,4k 19,7k 0.174 translated 5 http://www.lexytrad.es/proyectos.html Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 8 / 20
  • 9. Introduction Methodology Experiment Conclusion The Corpus Results & Analysis Results en_to en_totd es_to es_totd 050100150200250300 Common Tokens Subcorpus Averageofcommontokensperdocument Average and standard deviation of common tokens scores between doc per subcorpus NCT en to av 163.70 σ 83.87 en totd av 67.54 σ 35.35 es to av 31.97 σ 23.48 es totd av 17.93 σ 8.46 Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 9 / 20
  • 10. Introduction Methodology Experiment Conclusion The Corpus Results & Analysis General Findings en_to en_totd es_to es_totd 050100150200250300 Common Tokens Subcorpus Averageofcommontokensperdocument en_to en_totd es_to es_totd 050100150200250300 Common Lemmas Subcorpus Averageofcommonlemmasperdocument en_to en_totd es_to es_totd 050100150200250300 Common Stemms Subcorpus Averageofcommonstemmsperdocument distributions between the features are quite similar → it is possible to achieve acceptable results only using tokens scores for each subcorpus is roughly symmetric → data is normally distributed original docs vs. translated docs: original docs high NCE → different translators use different vocabulary and consequently lower the NCE between the docs will be Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 10 / 20
  • 11. Introduction Methodology Experiment Conclusion The Corpus Results & Analysis English - Original Docs (en to) NCT per doc on average is higher + large IQR + long whiskers → data is more spread + average of NCT per doc is more variable + wide type of docs (either highly or roughly correlated to the rest of the docs) Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 11 / 20
  • 12. Introduction Methodology Experiment Conclusion The Corpus Results & Analysis English - Original Docs (en to) NCT: data is skewed right + SCC: high average scores + χ2 : long whisker outside the upper quartile → docs have a high degree of relatedness between each other Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 12 / 20
  • 13. Introduction Methodology Experiment Conclusion The Corpus Results & Analysis English - Translated Docs (en totd) the NCT, the SCC and the χ2 scores suggest that the data is either normally distributed or skewed left → docs are highly related Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 13 / 20
  • 14. Introduction Methodology Experiment Conclusion The Corpus Results & Analysis Spanish vs. English lower NCT when compared with en to and en totd → Spanish has richer morphology (bigger number of inflection forms per lemma) → less common tokens per doc Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 14 / 20
  • 15. Introduction Methodology Experiment Conclusion The Corpus Results & Analysis Spanish - Translated Docs (es totd) NCT: low + SCC: data varies inside and outside the IQR + χ2 : σ higher than its av. → inconstancy in the data − subcorpus size (27 docs) − low number of types & tokens − high types tokens ratio suggests that a more diverse form of language is employed Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 15 / 20
  • 16. Introduction Methodology Experiment Conclusion The Corpus Results & Analysis Summary From the statistical and theoretical evidences • en to, en totd and es to assemble highly correlated docs • es totd the small number of docs and the scarceness of evidences only allow was to not reject the idea that this subcorpus is composed of similar docs Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 16 / 20
  • 17. Introduction Methodology Experiment Conclusion Conclusion Future Work Conclusion • This work presents and studies various DSMs for the purpose of describing specialised CC • As input for these DSMs, we used three different input features (lists of common tokens, lemmas and stems) for the data in hand, these features had similar performance for all the tested DSMs • The high number of entities shared by its docs, the positive average scores obtained with the SCC measure and their χ2 scores sustain that the INTELITERM corpus is composed of highly correlated docs Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 17 / 20
  • 18. Introduction Methodology Experiment Conclusion Conclusion Future Work Future Work • Perform more experiments with DSMs use other languages and analyse if translated docs always decrease the general relatedness score evaluate other DSMs (e.g. Jaccard, Lin and Cosine) add noisy docs (i.e. out of topic docs) to the corpus and analyse the DSMs performance → Use this approach to automatically filter out docs with a low level of relatedness Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 18 / 20
  • 19. Introduction Methodology Experiment Conclusion Conclusion Future Work Acknowledgements Hernani Costa is supported by the People Programme (Marie Curie Actions) of the European Union’s Framework Programme (FP7/2007-2013) under REA grant agreement no 317471. Also, the research reported in this work has been partially carried out in the framework of the Educational Innovation Project TRADICOR (PIE 13-054, 2014-2015); the R&D project INTELITERM (ref. no FFI2012-38881, 2012-2015); and the R&D Project for Excellence TERMITUR (ref. no HUM2754, 2014-2017). I would like to thank Prof. Gloria Corpas Pastor, Prof. Ruslan Mitkov and Dr. Miriam Seghiri for their valuable comments and suggestions to improve this work. Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 19 / 20
  • 20. Introduction Methodology Experiment Conclusion Conclusion Future Work “If we knew what it was we were doing, it wouldn’t be called ‘research’, would it?” Albert Einstein Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 20 / 20