SlideShare a Scribd company logo
1 of 20
Download to read offline
Introduction
Methodology
Experiment
Conclusion
Measuring the Relatedness between Documents in
Comparable Corpora
Hernani Costaa, Gloria Corpas Pastora and Ruslan Mitkovb
{hercos,gcorpas}@uma.es, r.mitkov@wlv.ac.uk
aLEXYTRAD, University of Malaga, Spain
bRIILP, University of Wolverhampton, UK
November, 2015
Hernani Costa hercos@uma.es TIA | Granada, Spain 1 / 20
Introduction
Methodology
Experiment
Conclusion
Introduction
Overview
• Comparable Corpora (CC)
automatic and assisted translation
language teaching
terminology
• Describing, comparing and evaluating CC
lack of standards
• This work aims at investigating the use of Distributional
Similarity Measures (DSMs) as a tool to assess CC by
extracting
measuring
ranking
Hernani Costa hercos@uma.es TIA | Granada, Spain 2 / 20
Introduction
Methodology
Experiment
Conclusion
Introduction
Motivation
• An inherent problem to those who deal with CC in a daily
basis is the uncertainty about the data they are dealing with
tags like “casual speech transcripts” or “tourism specialised
comparable corpus” are not enough to describe a corpus
• Most of the resources at our disposal are
built and shared without deep analysis of their content
used without knowing nothing about the relatedness quality of
the corpus
Hernani Costa hercos@uma.es TIA | Granada, Spain 3 / 20
Introduction
Methodology
Experiment
Conclusion
Introduction
Objectives
Investigate the use of textual DSMs in the context of CC
• automatically measure the relatedness between docs
• describe CC through the DSMs output scores
• analyse which features perform better
• rank docs by their degree of relatedness
Hernani Costa hercos@uma.es TIA | Granada, Spain 4 / 20
Introduction
Methodology
Experiment
Conclusion
Methodology
Methodology
1) Data Preprocessing
• Sentence Detector and Tokeniser – OpenNLP1
• POS tagger and lemmatisation – TT4J2
• Stemming – Snowball3
• Stopword list4
2) Identifying the list of common entities between docs
• Three co-occurrence matrices
common tokens, common lemmas and common stems
1
https://opennlp.apache.org
2
http://reckart.github.io/tt4j/
3
http://snowball.tartarus.org
4
https://github.com/hpcosta/stopwords
Hernani Costa hercos@uma.es TIA | Granada, Spain 5 / 20
Introduction
Methodology
Experiment
Conclusion
Methodology
Methodology
3) Computing the similarity between docs
110 001
111 001
000 010
001 001
111 101
111 010
100 011
101 010
111 000
001 101
110 100
011 100
…
• Input: list of common tokens, lemmas and stems
• DSMs = {DSMCE , DSMSCC , DSMχ2 }
CE: number of Common Entities
SCC: Spearman’s Rank Correlation Coefficient
χ2
: Chi-Square
Hernani Costa hercos@uma.es TIA | Granada, Spain 6 / 20
Introduction
Methodology
Experiment
Conclusion
Methodology
Methodology
4) Computing the doc final score
110 001
111 001
000 010
001 001
111 101
111 010
100 011
101 010
111 000
001 101
110 100
011 100
…
where
n: total number of docs
DSMi (dl , di ): the resulted similarity score between the doc dl with all the
docs
5) Ranking docs
• descending order according to their DSMs scores
Hernani Costa hercos@uma.es TIA | Granada, Spain 7 / 20
Introduction
Methodology
Experiment
Conclusion
Corpora
Results & Analysis
Corpora
Statistical information about the various subcorpora
nDocs types tokens types
tokens
int en 151 11,6k 496,2k 0.023
eur en 30 3.4k 29,8k 0.116
int es 224 13,2k 207,3k 0.063
eur es 44 5,6k 43,5k 0.129
int it 150 19,9k 386,2k 0.052
eur it 30 4,7k 29,6k 0.159
• int en, int es and int it: INTELITERM’s docs in English,
Spanish and Italian
• eur en, eur es and eur it: docs randomly selected from the
“one per day” Europarl v.7
Hernani Costa hercos@uma.es TIA | Granada, Spain 8 / 20
Introduction
Methodology
Experiment
Conclusion
Corpora
Results & Analysis
INTELITERM corpus
Descriptive Statistics
int_en int_es int_it
050100150200250300
Common Tokens
Subcorpora
Averageofcommontokensperdocument
Average and standard deviation of
common tokens scores between docs
per subcorpus
NCT
int en
av 163.70
σ 83.87
int es
av 31.97
σ 23.48
int it
av 101.08
σ 55.71
Hernani Costa hercos@uma.es TIA | Granada, Spain 9 / 20
Introduction
Methodology
Experiment
Conclusion
Corpora
Results & Analysis
INTELITERM corpus
General Findings
int_en int_es int_it
050100150200250300
Common Tokens
Subcorpora
Averageofcommontokensperdocument
int_en int_es int_it
050100150200250300
Common Lemmas
Subcorpora
Averageofcommonlemmasperdocument
int_en int_es int_it
050100150200250300
Common Stems
Subcorpora
Averageofcommonstemsperdocument
scores for each subcorpus is roughly symmetric
→ data is normally distributed
distributions between the features are quite similar
→ it is possible to achieve acceptable results only using tokens
Hernani Costa hercos@uma.es TIA | Granada, Spain 10 / 20
Introduction
Methodology
Experiment
Conclusion
Corpora
Results & Analysis
INTELITERM corpus
EN vs. ES & IT
NCT per doc on average is higher + large IQR + long
whiskers + skewed left
→ data is more spread + average of NCT per doc is more
variable + wide type of docs (either highly or roughly
correlated to the rest of the docs)
→ but, in general, docs have a high degree of relatedness
between each other
Hernani Costa hercos@uma.es TIA | Granada, Spain 11 / 20
Introduction
Methodology
Experiment
Conclusion
Corpora
Results & Analysis
INTELITERM corpus
EN & IT vs. ES
From the statistical and theoretical evidences
→ NCT: high + SCC: high average scores +
χ2
: long whisker outside the upper quartile
→ EN and IT subcorpora look like they assemble highly
correlated docs
→ docs have a high degree of relatedness between each other
Is int es composed by low related docs?
Hernani Costa hercos@uma.es TIA | Granada, Spain 12 / 20
Introduction
Methodology
Experiment
Conclusion
Corpora
Results & Analysis
Measuring DSMs Performance
Goal
• How do the DSMs perform the task of filtering out docs with
a low level of relatedness?
• Set-up
inject different sets of out-of-domain docs, randomly selected
from the Europarl corpus to the INTELITERM subcorpora
Hernani Costa hercos@uma.es TIA | Granada, Spain 13 / 20
Introduction
Methodology
Experiment
Conclusion
Corpora
Results & Analysis
Measuring DSMs Performance
Average scores between docs when injecting 5%, 10%, 15% and
20% of noise
int_en05 int_en10 int_en15 int_en20 int_es05 int_es10 int_es15 int_es20 int_it05 int_it10 int_it15 int_it20
050100150200250300
Common Tokens
Subcorpora with 5%, 10%, 15% and 20% of noise
Averageofcommontokensperdocument
the more noisy docs are injected, the lower is the NCT
Next step: rank docs in a descending order according to their
DSMs scores and evaluate their precision
Hernani Costa hercos@uma.es TIA | Granada, Spain 14 / 20
Introduction
Methodology
Experiment
Conclusion
Corpora
Results & Analysis
Measuring DSMs Performance
DSMs precision when injecting different amounts of noise to the
various subcorpora
SubC Noise NCT SCC χ2
int en
5% 0.89 0.22 1.00
10% 0.73 0.33 1.00
15% 0.73 0.36 0.95
20% 0.80 0.37 0.90
int es
5% 0.00 0.00 0.38
10% 0.07 0.07 0.20
15% 0.09 0.09 0.17
20% 0.14 0.18 0.23
int it
5% 0.88 0.13 0.88
10% 0.82 0.06 0.82
15% 0.74 0.09 0.83
20% 0.73 0.13 0.87
• none of the DSMs got acceptable results for Spanish
due to the pre-existing low level of relatedness
• promising results for English and Italian
NCT and χ2
performed well
Hernani Costa hercos@uma.es TIA | Granada, Spain 15 / 20
Introduction
Methodology
Experiment
Conclusion
Corpora
Results & Analysis
Summary
From the statistical and theoretical evidences
• int en and int it
assemble highly correlated docs
• int es
scarceness of evidences only allow was to not reject the idea
that this subcorpus is composed of similar docs
• NCT & χ2
suitable for the task of filtering out low related docs with a
high precision degree
Hernani Costa hercos@uma.es TIA | Granada, Spain 16 / 20
Introduction
Methodology
Experiment
Conclusion
Conclusion
Current Work
Conclusion
• DSMs can be used to describe and measure the relatedness
between docs in specialised CC
three different input features were used (lists of common
tokens, lemmas and stems)
for the data in hand, these features had similar performance
for all the tested DSMs
• INTELITERM corpus seems to be composed of highly
correlated docs
high number of CE and positive average SCC and χ2
scores
Hernani Costa hercos@uma.es TIA | Granada, Spain 17 / 20
Introduction
Methodology
Experiment
Conclusion
Conclusion
Current Work
Current Work
• Perform more experiments with DSMs
use other languages
evaluate other DSMs (e.g. Jaccard, Lin and Cosine)
compare corpora manual with semi-automatic compiled
→ Using this approach to automatically filter out docs with a
low level of relatedness
→ will improve the precision of terminology extraction
Hernani Costa hercos@uma.es TIA | Granada, Spain 18 / 20
Introduction
Methodology
Experiment
Conclusion
Conclusion
Current Work
Acknowledgements
Hernani Costa is supported by the People Programme (Marie Curie Actions) of the
European Union’s Framework Programme (FP7/2007-2013) under REA grant
agreement no 317471. Also, the research reported in this work has been partially
carried out in the framework of the Educational Innovation Project TRADICOR (PIE
13-054, 2014-2015); the R&D project INTELITERM (ref. no FFI2012-38881,
2012-2015); the R&D Project for Excellence TERMITUR (ref. no HUM2754,
2014-2017); and the LATEST project (ref. 327197-FP7-PEOPLE-2012-IEF).
Hernani Costa hercos@uma.es TIA | Granada, Spain 19 / 20
Introduction
Methodology
Experiment
Conclusion
Conclusion
Current Work
Hernani Costa hercos@uma.es TIA | Granada, Spain 20 / 20

More Related Content

Viewers also liked

Marketing daily news c21 back in superbowl
Marketing daily news c21 back in superbowlMarketing daily news c21 back in superbowl
Marketing daily news c21 back in superbowlCENTURY 21
 
How to Use the C21 Social Plugin
How to Use the C21 Social PluginHow to Use the C21 Social Plugin
How to Use the C21 Social PluginCENTURY 21
 
Viatge de fi d’estudis 2013
Viatge de fi d’estudis 2013Viatge de fi d’estudis 2013
Viatge de fi d’estudis 2013Jordi Mateus
 
From distance learning and odl to o de l unisa library's journey to transform...
From distance learning and odl to o de l unisa library's journey to transform...From distance learning and odl to o de l unisa library's journey to transform...
From distance learning and odl to o de l unisa library's journey to transform...Ina Smith
 
6th weekly news
6th weekly news6th weekly news
6th weekly newssamankit
 
Webinar on webinars
Webinar on webinarsWebinar on webinars
Webinar on webinarsIna Smith
 
Desarro curricular
Desarro curricularDesarro curricular
Desarro curricularmoreliano68
 
CENTURY 21 Lemoneighborhoods
CENTURY 21 LemoneighborhoodsCENTURY 21 Lemoneighborhoods
CENTURY 21 LemoneighborhoodsCENTURY 21
 
Introduction to the Directory of Open Access Journals
Introduction to the Directory of Open Access JournalsIntroduction to the Directory of Open Access Journals
Introduction to the Directory of Open Access JournalsIna Smith
 
Take Back Control of your CPD
Take Back Control of your CPDTake Back Control of your CPD
Take Back Control of your CPDHamdi Nsir
 
Mythology Presentation
Mythology  PresentationMythology  Presentation
Mythology PresentationMONAGO
 
Open Access (OA) - Introduction
Open Access (OA) - IntroductionOpen Access (OA) - Introduction
Open Access (OA) - IntroductionIna Smith
 
Facebook VS Google plus, où investir en emarketing pour une stratégie rentable
Facebook VS Google plus, où investir en emarketing pour une stratégie rentableFacebook VS Google plus, où investir en emarketing pour une stratégie rentable
Facebook VS Google plus, où investir en emarketing pour une stratégie rentableAxel Aigret
 
Towards an Effective CPD for EL Teachers
Towards an Effective CPD for EL TeachersTowards an Effective CPD for EL Teachers
Towards an Effective CPD for EL TeachersHamdi Nsir
 
Vip First Class Turkey 2012
Vip First Class Turkey 2012Vip First Class Turkey 2012
Vip First Class Turkey 2012g_emrahay
 
C21 socialplugin configuration
C21 socialplugin configurationC21 socialplugin configuration
C21 socialplugin configurationCENTURY 21
 

Viewers also liked (20)

Ibiz center
Ibiz centerIbiz center
Ibiz center
 
Marketing daily news c21 back in superbowl
Marketing daily news c21 back in superbowlMarketing daily news c21 back in superbowl
Marketing daily news c21 back in superbowl
 
How to Use the C21 Social Plugin
How to Use the C21 Social PluginHow to Use the C21 Social Plugin
How to Use the C21 Social Plugin
 
Viatge de fi d’estudis 2013
Viatge de fi d’estudis 2013Viatge de fi d’estudis 2013
Viatge de fi d’estudis 2013
 
From distance learning and odl to o de l unisa library's journey to transform...
From distance learning and odl to o de l unisa library's journey to transform...From distance learning and odl to o de l unisa library's journey to transform...
From distance learning and odl to o de l unisa library's journey to transform...
 
6th weekly news
6th weekly news6th weekly news
6th weekly news
 
Webinar on webinars
Webinar on webinarsWebinar on webinars
Webinar on webinars
 
Desarro curricular
Desarro curricularDesarro curricular
Desarro curricular
 
Cronica cretense ed1
Cronica cretense ed1Cronica cretense ed1
Cronica cretense ed1
 
CENTURY 21 Lemoneighborhoods
CENTURY 21 LemoneighborhoodsCENTURY 21 Lemoneighborhoods
CENTURY 21 Lemoneighborhoods
 
Introduction to the Directory of Open Access Journals
Introduction to the Directory of Open Access JournalsIntroduction to the Directory of Open Access Journals
Introduction to the Directory of Open Access Journals
 
Take Back Control of your CPD
Take Back Control of your CPDTake Back Control of your CPD
Take Back Control of your CPD
 
Nmnt 2014 workshop 1
Nmnt 2014 workshop 1Nmnt 2014 workshop 1
Nmnt 2014 workshop 1
 
Mythology Presentation
Mythology  PresentationMythology  Presentation
Mythology Presentation
 
Open Access (OA) - Introduction
Open Access (OA) - IntroductionOpen Access (OA) - Introduction
Open Access (OA) - Introduction
 
Facebook VS Google plus, où investir en emarketing pour une stratégie rentable
Facebook VS Google plus, où investir en emarketing pour une stratégie rentableFacebook VS Google plus, où investir en emarketing pour une stratégie rentable
Facebook VS Google plus, où investir en emarketing pour une stratégie rentable
 
Towards an Effective CPD for EL Teachers
Towards an Effective CPD for EL TeachersTowards an Effective CPD for EL Teachers
Towards an Effective CPD for EL Teachers
 
Vip First Class Turkey 2012
Vip First Class Turkey 2012Vip First Class Turkey 2012
Vip First Class Turkey 2012
 
C21 socialplugin configuration
C21 socialplugin configurationC21 socialplugin configuration
C21 socialplugin configuration
 
Portugal
PortugalPortugal
Portugal
 

Similar to 201511-TIA_Presentation

ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015RIILP
 
Interannotator Agreement
Interannotator AgreementInterannotator Agreement
Interannotator Agreementjohn6938
 
lecture_mooney.ppt
lecture_mooney.pptlecture_mooney.ppt
lecture_mooney.pptbutest
 
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...Editor IJCATR
 
70 C o m m u n i C at i o n s o f t h E a C m j u.docx
70    C o m m u n i C at i o n s  o f  t h E  a C m       j u.docx70    C o m m u n i C at i o n s  o f  t h E  a C m       j u.docx
70 C o m m u n i C at i o n s o f t h E a C m j u.docxevonnehoggarth79783
 
Course report-islam-taharimul (1)
Course report-islam-taharimul (1)Course report-islam-taharimul (1)
Course report-islam-taharimul (1)TANVIRAHMED611926
 
Multimodal Learning Analytics
Multimodal Learning AnalyticsMultimodal Learning Analytics
Multimodal Learning AnalyticsXavier Ochoa
 
Week11-EvaluationMethods.ppt
Week11-EvaluationMethods.pptWeek11-EvaluationMethods.ppt
Week11-EvaluationMethods.pptKamranAli649587
 
Cross-lingual Information Retrieval
Cross-lingual Information RetrievalCross-lingual Information Retrieval
Cross-lingual Information RetrievalShadi Saleh
 
ECAI 2014 poster - Unsupervised semantic clustering of Twitter hashtags
ECAI 2014 poster - Unsupervised semantic clustering of Twitter hashtagsECAI 2014 poster - Unsupervised semantic clustering of Twitter hashtags
ECAI 2014 poster - Unsupervised semantic clustering of Twitter hashtagsAntonio Moreno
 
Significance of Speech Intelligibility Assessors in Medium Classroom Using An...
Significance of Speech Intelligibility Assessors in Medium Classroom Using An...Significance of Speech Intelligibility Assessors in Medium Classroom Using An...
Significance of Speech Intelligibility Assessors in Medium Classroom Using An...TELKOMNIKA JOURNAL
 
Clustering Financial Time Series: How Long is Enough?
Clustering Financial Time Series: How Long is Enough?Clustering Financial Time Series: How Long is Enough?
Clustering Financial Time Series: How Long is Enough?Gautier Marti
 
VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL RE...
VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL RE...VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL RE...
VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL RE...niranjan kumar
 
Analyzing experimental research data
Analyzing experimental research dataAnalyzing experimental research data
Analyzing experimental research dataAtula Ahuja
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptbutest
 
Creating a dataset of peer review in computer science conferences published b...
Creating a dataset of peer review in computer science conferences published b...Creating a dataset of peer review in computer science conferences published b...
Creating a dataset of peer review in computer science conferences published b...Aliaksandr Birukou
 
Analyzing experimental research data
Analyzing experimental research dataAnalyzing experimental research data
Analyzing experimental research dataAtula Ahuja
 
Large strain solid dynamics in OpenFOAM
Large strain solid dynamics in OpenFOAMLarge strain solid dynamics in OpenFOAM
Large strain solid dynamics in OpenFOAMJibran Haider
 
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...Christophe Tricot
 

Similar to 201511-TIA_Presentation (20)

ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
 
Interannotator Agreement
Interannotator AgreementInterannotator Agreement
Interannotator Agreement
 
lecture_mooney.ppt
lecture_mooney.pptlecture_mooney.ppt
lecture_mooney.ppt
 
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
 
70 C o m m u n i C at i o n s o f t h E a C m j u.docx
70    C o m m u n i C at i o n s  o f  t h E  a C m       j u.docx70    C o m m u n i C at i o n s  o f  t h E  a C m       j u.docx
70 C o m m u n i C at i o n s o f t h E a C m j u.docx
 
Course report-islam-taharimul (1)
Course report-islam-taharimul (1)Course report-islam-taharimul (1)
Course report-islam-taharimul (1)
 
Multimodal Learning Analytics
Multimodal Learning AnalyticsMultimodal Learning Analytics
Multimodal Learning Analytics
 
Week11-EvaluationMethods.ppt
Week11-EvaluationMethods.pptWeek11-EvaluationMethods.ppt
Week11-EvaluationMethods.ppt
 
Cross-lingual Information Retrieval
Cross-lingual Information RetrievalCross-lingual Information Retrieval
Cross-lingual Information Retrieval
 
ECAI 2014 poster - Unsupervised semantic clustering of Twitter hashtags
ECAI 2014 poster - Unsupervised semantic clustering of Twitter hashtagsECAI 2014 poster - Unsupervised semantic clustering of Twitter hashtags
ECAI 2014 poster - Unsupervised semantic clustering of Twitter hashtags
 
Significance of Speech Intelligibility Assessors in Medium Classroom Using An...
Significance of Speech Intelligibility Assessors in Medium Classroom Using An...Significance of Speech Intelligibility Assessors in Medium Classroom Using An...
Significance of Speech Intelligibility Assessors in Medium Classroom Using An...
 
Clustering Financial Time Series: How Long is Enough?
Clustering Financial Time Series: How Long is Enough?Clustering Financial Time Series: How Long is Enough?
Clustering Financial Time Series: How Long is Enough?
 
VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL RE...
VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL RE...VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL RE...
VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL RE...
 
Analyzing experimental research data
Analyzing experimental research dataAnalyzing experimental research data
Analyzing experimental research data
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.ppt
 
Creating a dataset of peer review in computer science conferences published b...
Creating a dataset of peer review in computer science conferences published b...Creating a dataset of peer review in computer science conferences published b...
Creating a dataset of peer review in computer science conferences published b...
 
Analyzing experimental research data
Analyzing experimental research dataAnalyzing experimental research data
Analyzing experimental research data
 
2204.01637.pdf
2204.01637.pdf2204.01637.pdf
2204.01637.pdf
 
Large strain solid dynamics in OpenFOAM
Large strain solid dynamics in OpenFOAMLarge strain solid dynamics in OpenFOAM
Large strain solid dynamics in OpenFOAM
 
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
 

201511-TIA_Presentation

  • 1. Introduction Methodology Experiment Conclusion Measuring the Relatedness between Documents in Comparable Corpora Hernani Costaa, Gloria Corpas Pastora and Ruslan Mitkovb {hercos,gcorpas}@uma.es, r.mitkov@wlv.ac.uk aLEXYTRAD, University of Malaga, Spain bRIILP, University of Wolverhampton, UK November, 2015 Hernani Costa hercos@uma.es TIA | Granada, Spain 1 / 20
  • 2. Introduction Methodology Experiment Conclusion Introduction Overview • Comparable Corpora (CC) automatic and assisted translation language teaching terminology • Describing, comparing and evaluating CC lack of standards • This work aims at investigating the use of Distributional Similarity Measures (DSMs) as a tool to assess CC by extracting measuring ranking Hernani Costa hercos@uma.es TIA | Granada, Spain 2 / 20
  • 3. Introduction Methodology Experiment Conclusion Introduction Motivation • An inherent problem to those who deal with CC in a daily basis is the uncertainty about the data they are dealing with tags like “casual speech transcripts” or “tourism specialised comparable corpus” are not enough to describe a corpus • Most of the resources at our disposal are built and shared without deep analysis of their content used without knowing nothing about the relatedness quality of the corpus Hernani Costa hercos@uma.es TIA | Granada, Spain 3 / 20
  • 4. Introduction Methodology Experiment Conclusion Introduction Objectives Investigate the use of textual DSMs in the context of CC • automatically measure the relatedness between docs • describe CC through the DSMs output scores • analyse which features perform better • rank docs by their degree of relatedness Hernani Costa hercos@uma.es TIA | Granada, Spain 4 / 20
  • 5. Introduction Methodology Experiment Conclusion Methodology Methodology 1) Data Preprocessing • Sentence Detector and Tokeniser – OpenNLP1 • POS tagger and lemmatisation – TT4J2 • Stemming – Snowball3 • Stopword list4 2) Identifying the list of common entities between docs • Three co-occurrence matrices common tokens, common lemmas and common stems 1 https://opennlp.apache.org 2 http://reckart.github.io/tt4j/ 3 http://snowball.tartarus.org 4 https://github.com/hpcosta/stopwords Hernani Costa hercos@uma.es TIA | Granada, Spain 5 / 20
  • 6. Introduction Methodology Experiment Conclusion Methodology Methodology 3) Computing the similarity between docs 110 001 111 001 000 010 001 001 111 101 111 010 100 011 101 010 111 000 001 101 110 100 011 100 … • Input: list of common tokens, lemmas and stems • DSMs = {DSMCE , DSMSCC , DSMχ2 } CE: number of Common Entities SCC: Spearman’s Rank Correlation Coefficient χ2 : Chi-Square Hernani Costa hercos@uma.es TIA | Granada, Spain 6 / 20
  • 7. Introduction Methodology Experiment Conclusion Methodology Methodology 4) Computing the doc final score 110 001 111 001 000 010 001 001 111 101 111 010 100 011 101 010 111 000 001 101 110 100 011 100 … where n: total number of docs DSMi (dl , di ): the resulted similarity score between the doc dl with all the docs 5) Ranking docs • descending order according to their DSMs scores Hernani Costa hercos@uma.es TIA | Granada, Spain 7 / 20
  • 8. Introduction Methodology Experiment Conclusion Corpora Results & Analysis Corpora Statistical information about the various subcorpora nDocs types tokens types tokens int en 151 11,6k 496,2k 0.023 eur en 30 3.4k 29,8k 0.116 int es 224 13,2k 207,3k 0.063 eur es 44 5,6k 43,5k 0.129 int it 150 19,9k 386,2k 0.052 eur it 30 4,7k 29,6k 0.159 • int en, int es and int it: INTELITERM’s docs in English, Spanish and Italian • eur en, eur es and eur it: docs randomly selected from the “one per day” Europarl v.7 Hernani Costa hercos@uma.es TIA | Granada, Spain 8 / 20
  • 9. Introduction Methodology Experiment Conclusion Corpora Results & Analysis INTELITERM corpus Descriptive Statistics int_en int_es int_it 050100150200250300 Common Tokens Subcorpora Averageofcommontokensperdocument Average and standard deviation of common tokens scores between docs per subcorpus NCT int en av 163.70 σ 83.87 int es av 31.97 σ 23.48 int it av 101.08 σ 55.71 Hernani Costa hercos@uma.es TIA | Granada, Spain 9 / 20
  • 10. Introduction Methodology Experiment Conclusion Corpora Results & Analysis INTELITERM corpus General Findings int_en int_es int_it 050100150200250300 Common Tokens Subcorpora Averageofcommontokensperdocument int_en int_es int_it 050100150200250300 Common Lemmas Subcorpora Averageofcommonlemmasperdocument int_en int_es int_it 050100150200250300 Common Stems Subcorpora Averageofcommonstemsperdocument scores for each subcorpus is roughly symmetric → data is normally distributed distributions between the features are quite similar → it is possible to achieve acceptable results only using tokens Hernani Costa hercos@uma.es TIA | Granada, Spain 10 / 20
  • 11. Introduction Methodology Experiment Conclusion Corpora Results & Analysis INTELITERM corpus EN vs. ES & IT NCT per doc on average is higher + large IQR + long whiskers + skewed left → data is more spread + average of NCT per doc is more variable + wide type of docs (either highly or roughly correlated to the rest of the docs) → but, in general, docs have a high degree of relatedness between each other Hernani Costa hercos@uma.es TIA | Granada, Spain 11 / 20
  • 12. Introduction Methodology Experiment Conclusion Corpora Results & Analysis INTELITERM corpus EN & IT vs. ES From the statistical and theoretical evidences → NCT: high + SCC: high average scores + χ2 : long whisker outside the upper quartile → EN and IT subcorpora look like they assemble highly correlated docs → docs have a high degree of relatedness between each other Is int es composed by low related docs? Hernani Costa hercos@uma.es TIA | Granada, Spain 12 / 20
  • 13. Introduction Methodology Experiment Conclusion Corpora Results & Analysis Measuring DSMs Performance Goal • How do the DSMs perform the task of filtering out docs with a low level of relatedness? • Set-up inject different sets of out-of-domain docs, randomly selected from the Europarl corpus to the INTELITERM subcorpora Hernani Costa hercos@uma.es TIA | Granada, Spain 13 / 20
  • 14. Introduction Methodology Experiment Conclusion Corpora Results & Analysis Measuring DSMs Performance Average scores between docs when injecting 5%, 10%, 15% and 20% of noise int_en05 int_en10 int_en15 int_en20 int_es05 int_es10 int_es15 int_es20 int_it05 int_it10 int_it15 int_it20 050100150200250300 Common Tokens Subcorpora with 5%, 10%, 15% and 20% of noise Averageofcommontokensperdocument the more noisy docs are injected, the lower is the NCT Next step: rank docs in a descending order according to their DSMs scores and evaluate their precision Hernani Costa hercos@uma.es TIA | Granada, Spain 14 / 20
  • 15. Introduction Methodology Experiment Conclusion Corpora Results & Analysis Measuring DSMs Performance DSMs precision when injecting different amounts of noise to the various subcorpora SubC Noise NCT SCC χ2 int en 5% 0.89 0.22 1.00 10% 0.73 0.33 1.00 15% 0.73 0.36 0.95 20% 0.80 0.37 0.90 int es 5% 0.00 0.00 0.38 10% 0.07 0.07 0.20 15% 0.09 0.09 0.17 20% 0.14 0.18 0.23 int it 5% 0.88 0.13 0.88 10% 0.82 0.06 0.82 15% 0.74 0.09 0.83 20% 0.73 0.13 0.87 • none of the DSMs got acceptable results for Spanish due to the pre-existing low level of relatedness • promising results for English and Italian NCT and χ2 performed well Hernani Costa hercos@uma.es TIA | Granada, Spain 15 / 20
  • 16. Introduction Methodology Experiment Conclusion Corpora Results & Analysis Summary From the statistical and theoretical evidences • int en and int it assemble highly correlated docs • int es scarceness of evidences only allow was to not reject the idea that this subcorpus is composed of similar docs • NCT & χ2 suitable for the task of filtering out low related docs with a high precision degree Hernani Costa hercos@uma.es TIA | Granada, Spain 16 / 20
  • 17. Introduction Methodology Experiment Conclusion Conclusion Current Work Conclusion • DSMs can be used to describe and measure the relatedness between docs in specialised CC three different input features were used (lists of common tokens, lemmas and stems) for the data in hand, these features had similar performance for all the tested DSMs • INTELITERM corpus seems to be composed of highly correlated docs high number of CE and positive average SCC and χ2 scores Hernani Costa hercos@uma.es TIA | Granada, Spain 17 / 20
  • 18. Introduction Methodology Experiment Conclusion Conclusion Current Work Current Work • Perform more experiments with DSMs use other languages evaluate other DSMs (e.g. Jaccard, Lin and Cosine) compare corpora manual with semi-automatic compiled → Using this approach to automatically filter out docs with a low level of relatedness → will improve the precision of terminology extraction Hernani Costa hercos@uma.es TIA | Granada, Spain 18 / 20
  • 19. Introduction Methodology Experiment Conclusion Conclusion Current Work Acknowledgements Hernani Costa is supported by the People Programme (Marie Curie Actions) of the European Union’s Framework Programme (FP7/2007-2013) under REA grant agreement no 317471. Also, the research reported in this work has been partially carried out in the framework of the Educational Innovation Project TRADICOR (PIE 13-054, 2014-2015); the R&D project INTELITERM (ref. no FFI2012-38881, 2012-2015); the R&D Project for Excellence TERMITUR (ref. no HUM2754, 2014-2017); and the LATEST project (ref. 327197-FP7-PEOPLE-2012-IEF). Hernani Costa hercos@uma.es TIA | Granada, Spain 19 / 20