This document summarizes a study that investigates the use of distributional similarity measures (DSMs) to assess the relatedness between documents in comparable corpora. The methodology involves preprocessing the text, identifying the entities common to documents, computing similarity scores with DSMs, and analysing the results. The experiment uses subcorpora from the INTELITERM corpus. The results show that the DSMs can describe the relatedness between documents in the corpora, with original documents generally showing higher relatedness than translated ones. Future work could involve additional experiments and languages.
Outline: Introduction, Methodology, Experiment, Conclusion
Introduction
Overview
• Comparable Corpora (CC) are considered an important resource in many research areas:
  automatic and assisted translation
  language teaching
  terminology
• Describing, comparing and evaluating CC are key issues in these areas, for which there is still a notable lack of standards
• Bearing this in mind, this work investigates the use of textual Distributional Similarity Measures (DSMs) as a tool to assess the relatedness between docs in CC
Hernani Costa hercos@uma.es EXPERT-STW | Malaga, Spain 2 / 20
Introduction
Motivation
• An inherent problem for those who deal with CC on a daily basis is the uncertainty about the data they are working with
  usually, tags like “casual speech transcripts” or “tourism specialised comparable corpus” are not enough to describe a corpus
• Most of the resources at our disposal are built and shared without deep analysis of their content, i.e. without knowing anything about the relatedness quality of the corpus
Methodology
1) Data Preprocessing
• Sentence Detector and Tokeniser – OpenNLP (https://opennlp.apache.org)
• POS tagger and lemmatisation – TT4J (http://reckart.github.io/tt4j/)
• Stemming – Snowball (http://snowball.tartarus.org)
• Stopword list (https://github.com/hpcosta/stopwords)
2) Identifying the list of common entities between docs
• Three co-occurrence matrices: common tokens, common lemmas and common stems
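Steps 1) and 2) can be sketched as follows. This is a minimal stand-in, not the actual OpenNLP/TreeTagger/Snowball pipeline: the tokeniser, stopword list and suffix stemmer below are simplified placeholders for those tools.

```python
# Hypothetical sketch of steps 1)-2): a toy tokeniser and suffix stemmer
# stand in for OpenNLP, TT4J/TreeTagger and Snowball.
import re
from itertools import combinations

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}  # tiny stand-in list

def tokenise(text):
    """Lowercase word tokens with stopwords removed (stand-in for OpenNLP)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def stem(token):
    """Crude suffix stripping (stand-in for the Snowball stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def common_entities(docs):
    """For every document pair, the set of shared stems (one of the three
    feature types; common lemmas and common tokens work the same way)."""
    stems = {name: {stem(t) for t in tokenise(text)} for name, text in docs.items()}
    return {(a, b): stems[a] & stems[b] for a, b in combinations(sorted(stems), 2)}

docs = {
    "d1": "The hotel offers guided tours and cooking classes.",
    "d2": "Guided tour bookings include a cooking class at the hotel.",
    "d3": "Flight delays affected the airport all morning.",
}
shared = common_entities(docs)  # d1 and d2 share e.g. "hotel", "cook", "tour"
```

In the study the three matrices (tokens, lemmas, stems) are built over the whole subcorpus; here only the pairwise shared-entity sets are shown.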
3) Computing the similarity between docs
[Figure: binary co-occurrence matrices marking which entities each pair of docs shares]
• Input: lists of common tokens, lemmas and stems
• DSMs = {DSM_NCE, DSM_SCC, DSM_χ²}
  NCE: Number of Common Entities
  SCC: Spearman’s Rank Correlation Coefficient
  χ²: Chi-Square
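A hedged sketch of the three DSMs, applied to the frequencies of the entities two docs share. These are standard textbook formulations; the slides do not spell out the exact variants used in the study.

```python
# Sketch of the three DSMs over the frequency profiles of shared entities.
# Standard definitions are assumed, not the study's exact formulations.

def rank(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def dsm_nce(freq_a, freq_b):
    """Number of Common Entities: size of the shared entity list."""
    return len(freq_a)

def dsm_scc(freq_a, freq_b):
    """Spearman's rank correlation between the two frequency profiles."""
    ra, rb = rank(freq_a), rank(freq_b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

def dsm_chi2(freq_a, freq_b):
    """Chi-square statistic comparing the two frequency profiles."""
    total_a, total_b = sum(freq_a), sum(freq_b)
    stat = 0.0
    for fa, fb in zip(freq_a, freq_b):
        expected_a = (fa + fb) * total_a / (total_a + total_b)
        expected_b = (fa + fb) * total_b / (total_a + total_b)
        stat += (fa - expected_a) ** 2 / expected_a
        stat += (fb - expected_b) ** 2 / expected_b
    return stat

# Frequencies of the four entities common to two hypothetical docs:
a = [10, 4, 3, 1]
b = [8, 5, 2, 1]
```

With identically ranked profiles, as here, SCC is 1.0; χ² grows as the two profiles diverge.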
4) Computing the doc final score
[Equation: final score of doc d_l, aggregating its pairwise DSM similarity scores]
where
  n: total number of docs
  DSM_i(d_l, d_i): the resulting similarity score between doc d_l and each other doc d_i
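The equation itself did not survive extraction. A plausible reconstruction, consistent with the variable definitions above (an assumption, not necessarily the slide's exact formula), averages each doc's pairwise similarity scores:

```latex
\mathrm{FinalScore}(d_l) \;=\; \frac{1}{n-1} \sum_{\substack{i=1 \\ i \neq l}}^{n} \mathrm{DSM}(d_l, d_i)
```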
Experiment
The INTELITERM Corpus
Statistical information about the various INTELITERM subcorpora (http://www.lexytrad.es/proyectos.html):

subcorpus  nDocs  types   tokens   types/tokens  description
en_to      151    11.6k   508.9k   0.023         original
en_totd    61     6.9k    88.5k    0.078         translated
es_to      225    12.6k   253.4k   0.049         original
es_totd    27     3.4k    19.7k    0.174         translated
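The types/tokens column is the type-token ratio (TTR), a quick vocabulary-diversity measure. A minimal check using the rounded counts from the table (the English rows reproduce exactly; the Spanish rows differ slightly because the counts above are themselves rounded):

```python
def ttr(n_types, n_tokens):
    """Type/token ratio: vocabulary size divided by corpus length."""
    return n_types / n_tokens

# en_to (original): 11.6k types over 508.9k tokens
ratio_en_to = round(ttr(11_600, 508_900), 3)   # 0.023
# en_totd (translated): 6.9k types over 88.5k tokens
ratio_en_totd = round(ttr(6_900, 88_500), 3)   # 0.078
```

Higher TTR, as in the small translated subcorpora, typically signals more diverse vocabulary relative to corpus size.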
Results
[Figure: boxplots of the average number of common tokens per document, by subcorpus (en_to, en_totd, es_to, es_totd)]

Average (av) and standard deviation (σ) of common-token (NCT) scores between docs per subcorpus:

subcorpus  av      σ
en_to      163.70  83.87
en_totd    67.54   35.35
es_to      31.97   23.48
es_totd    17.93   8.46
General Findings
[Figure: boxplots of the average number of common tokens, lemmas and stems per document, by subcorpus (en_to, en_totd, es_to, es_totd)]

• the distributions of the three features are quite similar
  → it is possible to achieve acceptable results using tokens alone
• the scores for each subcorpus are roughly symmetric
  → data is normally distributed
• original docs vs. translated docs: original docs have a higher NCE
  → different translators use different vocabulary, which consequently lowers the NCE between the docs
English - Original Docs (en_to)
• NCT per doc on average is higher + large IQR + long whiskers
  → data is more spread + the average NCT per doc is more variable + a wide range of doc types (either highly or only roughly correlated to the rest of the docs)
English - Original Docs (en_to)
• NCT: data is skewed right + SCC: high average scores + χ²: long whisker outside the upper quartile
  → docs have a high degree of relatedness to each other
English - Translated Docs (en_totd)
• the NCT, SCC and χ² scores suggest that the data is either normally distributed or skewed left
  → docs are highly related
Spanish vs. English
• lower NCT when compared with en_to and en_totd
  → Spanish has richer morphology (a greater number of inflected forms per lemma)
  → fewer common tokens per doc
Spanish - Translated Docs (es_totd)
• NCT: low + SCC: data varies inside and outside the IQR + χ²: σ higher than its average
  → inconsistency in the data
  − small subcorpus size (27 docs)
  − low number of types & tokens
  − a high types/tokens ratio suggests that a more diverse form of language is employed
Summary
From the statistical and theoretical evidence:
• en_to, en_totd and es_to assemble highly correlated docs
• es_totd: the small number of docs and the scarcity of evidence only allow us not to reject the idea that this subcorpus is composed of similar docs
Conclusion
• This work presents and studies various DSMs for the purpose of describing specialised CC
• As input for these DSMs, we used three different feature types (lists of common tokens, lemmas and stems); for the data at hand, these features performed similarly for all the tested DSMs
• The high number of entities shared by its docs, the positive average scores obtained with the SCC measure and their χ² scores support the claim that the INTELITERM corpus is composed of highly correlated docs
Future Work
• Perform more experiments with DSMs
  use other languages and analyse whether translated docs always decrease the general relatedness score
  evaluate other DSMs (e.g. Jaccard, Lin and Cosine)
  add noisy docs (i.e. off-topic docs) to the corpus and analyse the DSMs’ performance
  → use this approach to automatically filter out docs with a low level of relatedness
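Two of the alternative measures named above, Jaccard and Cosine, can be sketched over binary entity vectors (1 = entity present in the doc); Lin's measure needs information-theoretic weights and is omitted here. A minimal sketch, not the study's implementation:

```python
# Jaccard and cosine similarity over binary entity-presence vectors.
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the entity sets encoded by a and b."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0

def cosine(a, b):
    """Cosine of the angle between the two entity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# Two hypothetical docs over a five-entity vocabulary:
a = [1, 1, 0, 1, 0]
b = [1, 0, 0, 1, 1]
```

These drop into the same pairwise framework as the NCE, SCC and χ² measures above.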
Acknowledgements
Hernani Costa is supported by the People Programme (Marie Curie Actions) of the
European Union’s Framework Programme (FP7/2007-2013) under REA grant
agreement no 317471. Also, the research reported in this work has been partially
carried out in the framework of the Educational Innovation Project TRADICOR (PIE
13-054, 2014-2015); the R&D project INTELITERM (ref. no FFI2012-38881,
2012-2015); and the R&D Project for Excellence TERMITUR (ref. no HUM2754,
2014-2017). I would like to thank Prof. Gloria Corpas Pastor, Prof. Ruslan Mitkov and
Dr. Miriam Seghiri for their valuable comments and suggestions to improve this work.