EXPERT_Malaga-ESR03

Introduction
Methodology
Experiment
Conclusion
Assessing Comparable Corpora
through Distributional Similarity Measures
Hernani Costa
hercos@uma.es
University of Malaga
June, 2015
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 1 / 20

Introduction
Methodology
Experiment
Conclusion
Introduction
Overview
• Comparable Corpora (CC) is considered an important resource
in many research areas
automatic and assisted translation
language teaching
terminology
• Describing, comparing and evaluating CC are key issues in
these areas for which there is still a notable lack of standards
• Bearing this in mind, this work aims at investigating the use
of textual Distributional Similarity Measures (DSMs) as a tool
to assess the relatedness between docs in CC

Introduction
Methodology
Experiment
Conclusion
Introduction
Motivation
• An inherent problem to those who deal with CC in a daily
basis is the uncertainty about the data they are dealing with
usually, tags like “casual speech transcripts” or “tourism
specialised comparable corpus” are not enough to describe a
corpus
• Most of the resources at our disposal are
built and shared without deep analysis of their content
without knowing nothing about the relatedness quality of the
corpus

Introduction
Methodology
Experiment
Conclusion
Introduction
Objectives
Investigate the use of textual DSMs in the context of CC
• automatically measure the relatedness between docs
• analyse which input features perform better
• describe CC through the DSMs output scores

Introduction
Methodology
Experiment
Conclusion
Methodology
Methodology
1) Data Preprocessing
• Sentence Detector and Tokeniser – OpenNLP1
• POS tagger and lemmatisation – TT4J2
• Stemming – Snowball3
• Stopword list4
2) Identifying the list of common entities between docs
• Three co-occurrence matrices
common tokens, common lemmas and common stems
1
https://opennlp.apache.org
2
http://reckart.github.io/tt4j/
3
http://snowball.tartarus.org
4
https://github.com/hpcosta/stopwords

Introduction
Methodology
Experiment
Conclusion
Methodology
Methodology
3) Computing the similarity between docs
110 001
111 001
000 010
001 001
111 101
111 010
100 011
101 010
111 000
001 101
110 100
011 100
…
• Input: list of common tokens, lemmas and stems
• DSMs = {DSMNCE , DSMSCC , DSMχ2 }
NCE: Number of Common Entities
SCC: Spearman’s Rank Correlation Coeﬃcient
χ2
: Chi-Square

Introduction
Methodology
Experiment
Conclusion
Methodology
Methodology
4) Computing the doc ﬁnal score
110 001
111 001
000 010
001 001
111 101
111 010
100 011
101 010
111 000
001 101
110 100
011 100
…
where
n: total number of docs
DSMi (dl , di ): the resulted similarity score between the doc dl with all the
docs

Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
The INTELITERM Corpus
Statistical information about the various INTELITERM5
subcorpora
nDocs types tokens types
tokens description
en to 151 11,6k 508,9k 0.023 original
en totd 61 6,9k 88,5k 0.078 translated
es to 225 12,6k 253,4k 0.049 original
es totd 27 3,4k 19,7k 0.174 translated
5
http://www.lexytrad.es/proyectos.html

Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
Results
en_to en_totd es_to es_totd
050100150200250300
Common Tokens
Subcorpus
Averageofcommontokensperdocument
Average and standard deviation of
common tokens scores between doc
per subcorpus
NCT
en to
av 163.70
σ 83.87
en totd
av 67.54
σ 35.35
es to
av 31.97
σ 23.48
es totd
av 17.93
σ 8.46

Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
General Findings
050100150200250300
Common Tokens
Subcorpus
Averageofcommontokensperdocument
050100150200250300
Common Lemmas
Subcorpus
Averageofcommonlemmasperdocument
050100150200250300
Common Stemms
Subcorpus
Averageofcommonstemmsperdocument
distributions between the features are quite similar
→ it is possible to achieve acceptable results only using tokens
scores for each subcorpus is roughly symmetric
→ data is normally distributed
original docs vs. translated docs: original docs high NCE
→ diﬀerent translators use diﬀerent vocabulary and
consequently lower the NCE between the docs will be

Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
English - Original Docs (en to)
NCT per doc on average is higher + large IQR + long whiskers
→ data is more spread + average of NCT per doc is more
variable + wide type of docs (either highly or roughly
correlated to the rest of the docs)

Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
English - Original Docs (en to)
NCT: data is skewed right + SCC: high average scores +
χ2
: long whisker outside the upper quartile
→ docs have a high degree of relatedness between each other

Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
English - Translated Docs (en totd)
the NCT, the SCC and the χ2
scores suggest that the data is
either normally distributed or skewed left
→ docs are highly related

Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
Spanish vs. English
lower NCT when compared with en to and en totd
→ Spanish has richer morphology (bigger number of inﬂection
forms per lemma)
→ less common tokens per doc

Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
Spanish - Translated Docs (es totd)
NCT: low + SCC: data varies inside and outside the IQR +
χ2
: σ higher than its av.
→ inconstancy in the data
− subcorpus size (27 docs)
− low number of types & tokens
− high types
tokens
ratio suggests that a more diverse form of language is employed

Introduction
Methodology
Experiment
Conclusion
The Corpus
Results & Analysis
Summary
From the statistical and theoretical evidences
• en to, en totd and es to
assemble highly correlated docs
• es totd
the small number of docs and the scarceness of evidences only
allow was to not reject the idea that this subcorpus is
composed of similar docs

Introduction
Methodology
Experiment
Conclusion
Conclusion
Future Work
Conclusion
• This work presents and studies various DSMs for the purpose
of describing specialised CC
• As input for these DSMs, we used three diﬀerent input
features (lists of common tokens, lemmas and stems)
for the data in hand, these features had similar performance
for all the tested DSMs
• The high number of entities shared by its docs, the positive
average scores obtained with the SCC measure and their χ2
scores sustain that the INTELITERM corpus is composed of
highly correlated docs

Introduction
Methodology
Experiment
Conclusion
Conclusion
Future Work
Future Work
• Perform more experiments with DSMs
use other languages and analyse if translated docs always
decrease the general relatedness score
evaluate other DSMs (e.g. Jaccard, Lin and Cosine)
add noisy docs (i.e. out of topic docs) to the corpus and
analyse the DSMs performance
→ Use this approach to automatically ﬁlter out docs with a
low level of relatedness

Introduction
Methodology
Experiment
Conclusion
Conclusion
Future Work
Acknowledgements
Hernani Costa is supported by the People Programme (Marie Curie Actions) of the
European Union’s Framework Programme (FP7/2007-2013) under REA grant
agreement no 317471. Also, the research reported in this work has been partially
carried out in the framework of the Educational Innovation Project TRADICOR (PIE
13-054, 2014-2015); the R&D project INTELITERM (ref. no FFI2012-38881,
2012-2015); and the R&D Project for Excellence TERMITUR (ref. no HUM2754,
2014-2017). I would like to thank Prof. Gloria Corpas Pastor, Prof. Ruslan Mitkov and
Dr. Miriam Seghiri for their valuable comments and suggestions to improve this work.

Introduction
Methodology
Experiment
Conclusion
Conclusion
Future Work
“If we knew what it was we were doing,
it wouldn’t be called ‘research’,
would it?”
Albert Einstein

EXPERT_Malaga-ESR03

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (20)

Similar to EXPERT_Malaga-ESR03

Similar to EXPERT_Malaga-ESR03 (20)

EXPERT_Malaga-ESR03