This document describes three techniques for automatically extracting multiword expressions from Arabic texts:
1. Using crosslingual correspondences between Arabic Wikipedia titles and their translations in other languages, assuming MWEs are less likely to have one-to-one translations.
2. Translating nominal MWEs from Princeton WordNet into Arabic using Google Translate and validating using search engine frequency counts.
3. Applying association measures like PMI and chi-square to n-grams in the Arabic Gigaword corpus after lemmatization and POS filtering.
The combination of techniques utilizing multilingual data, dictionaries and corpora enriched the extracted Arabic MWE lexicon with over 33,000 MWEs and 39,000 named entities (NEs).
Sentence-level translation quality estimation with cross-lingual transformers.
Please consider citing our paper:
@InProceedings{transquest:2020,
author = {Ranasinghe, Tharindu and Orasan, Constantin and Mitkov, Ruslan},
title = {TransQuest: Translation Quality Estimation with Cross-lingual Transformers},
booktitle = {Proceedings of the 28th International Conference on Computational Linguistics},
year = {2020}
}
ATAR: Attention-based LSTM for Arabizi transliteration (IJECE, IAES)
A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in this direction. The first is the collection and public release of the first large-scale “Arabizi to Arabic script” parallel corpus, focusing on the Jordanian dialect and consisting of more than 25k pairs carefully created and inspected by native speakers to ensure the highest quality. Second, we present ATAR, an ATtention-based LSTM model for ARabizi transliteration. Training and testing this model on our dataset yields impressive accuracy (79%) and BLEU score (88.49).
Natural language processing for requirements engineering: ICSE 2021 Technical Briefing (alessio_ferrari)
These are the slides for the technical briefing given at ICSE 2021 by Alessio Ferrari, Liping Zhao, and Waad Alhoshan.
It covers RE tasks to which NLP is applied, an overview of a recent systematic mapping study on the topic, and a hands-on tutorial on using transfer learning for requirements classification.
Please find the links to the Colab notebooks here:
https://colab.research.google.com/drive/158H-lEJE1pc-xHc1ISBAKGDHMt_eg4Gn?usp=sharing
https://colab.research.google.com/drive/1B_5ow3rvS0Qz1y-KyJtlMNnmgmx9w3kJ?usp=sharing
https://colab.research.google.com/drive/1Xrm0gNaa41YwlM5g2CRYYXcRvpbDnTRT?usp=sharing
Detecting Urgency Status of Crisis Tweets: A Transfer Learning Approach for Low-Resource Languages (Efsun Kayi)
We release an urgency dataset consisting of English tweets related to natural crises, each annotated with its urgency status. Additionally, we release evaluation datasets for two low-resource languages, Sinhala and Odia, and demonstrate an effective zero-shot transfer from English to these two languages by training cross-lingual classifiers. We adopt cross-lingual embeddings constructed using different methods to extract features of the tweets, including state-of-the-art contextual embeddings such as BERT, RoBERTa and XLM-R. We train a variety of classifier architectures, supervised and semi-supervised, on the extracted features, and further experiment with ensembling the various classifiers. With very limited amounts of labeled data in English and zero data in the low-resource languages, we show a successful framework for training monolingual and cross-lingual classifiers using deep learning methods, which are known to be data-hungry. Specifically, we show that recent deep contextual embeddings are also helpful when dealing with very small-scale datasets. Classifiers that incorporate RoBERTa yield the best performance for the English urgency detection task, with a 25% absolute F1 improvement over the baselines. For the zero-shot transfer to low-resource languages, classifiers that use LASER features perform best for Sinhala, while XLM-R features benefit the Odia transfer the most.
P04 - Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation (iwan_rg)
By:
Wajdi Zaghouani and Dana Awad
Abstract
We present our effort to build a large-scale punctuated corpus for Arabic. We describe in detail our punctuation annotation guidelines, designed to improve the annotation workflow and the inter-annotator agreement. We summarize the guidelines created, discuss the annotation framework and show the peculiarities of Arabic punctuation. Our guidelines were used by trained annotators, and regular inter-annotator agreement measurements were performed to ensure the annotation quality. We highlight the main difficulties related to Arabic punctuation annotation that arose during this project.
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
Deep Learning Study Group @ Komachi Lab: "Learning Character-level Representations for Part-of-Speech Tagging" (Yuki Tomo)
Presented "Learning Character-level Representations for Part-of-Speech Tagging" by Cícero Nogueira dos Santos and Bianca Zadrozny at the Deep Learning Study Group @ Komachi Lab on 12/22.
This presentation was provided by William Mattingly of the Smithsonian Institution, for the fifth session of NISO's 2023 Training Series on Text and Data Mining. Session five, "Text Processing for Library Data" was held on Thursday, November 9, 2023.
An engaging workshop intended to showcase community efforts to implement the LGR Procedure for current and potential Generation Panel members. The workshop will also discuss how Generation Panels of related scripts should coordinate with each other going forward.
This project is about building a corpus for the Sinhala language. This is the presentation of its literature review, covering the previous literature we referred to in this project.
Language Identifier for Languages of Pakistan Including Arabic and Persian (Waqas Tariq)
A language recognizer/identifier/guesser is a basic application used to identify the language of a text document. It simply takes a file as input and, after processing its text, decides the language of the document using LIJ-I, LIJ-II and LIJ-III. LIJ-I alone yields poor accuracy; it is strengthened by LIJ-II and boosted to a higher level of accuracy by LIJ-III. The system also calculates digram probabilities and average accuracy percentages. LIJ-I considers the complete character set of each language, while LIJ-II considers only the differences. A Java-based language recognizer is developed and presented in detail in this paper.
A brief survey presentation about Arabic Question Answering, covering the different Natural Language Processing and Information Retrieval approaches to question analysis, passage retrieval and answer extraction, in addition to a listing of the different NLP tools used in AQA and the challenges and future trends in this area.
If you want to cite this paper, you can download it here:
http://www.acit2k.org/ACIT/2012Proceedings/13106.pdf
Artificial Neural Networks have proved their efficiency in a large number of research domains. In this paper, we apply Artificial Neural Networks to Arabic text for language modeling, text generation, and missing-text prediction. On the one hand, we adapt Recurrent Neural Network architectures to model the Arabic language in order to generate correct Arabic sequences. On the other hand, Convolutional Neural Networks are parameterized, based on some specific features of Arabic, to predict missing text in Arabic documents. We demonstrate the power of our adapted models in generating and predicting correct Arabic text compared to the standard model. The models were trained and tested on well-known free Arabic datasets. Results have been promising, with sufficient accuracy.
1. Automatic Extraction of Arabic
Multiword Expressions
*Mohammed Attia, Antonio Toral, Lamia Tounsi, Pavel Pecina and
Josef van Genabith
School of Computing, Dublin City University, Ireland
2. Outline
● Introduction
● Data Resources
● Methodology
● Crosslingual Correspondence Asymmetries
● Translation-Based Approach
● Corpus-Based Approach
● Discussion of experiments and results
● Conclusion
3. Introduction
● Criteria of MWEs
● Ubiquity
● Diversity
● Low polysemy
● Statistically significant co-occurrence
● Focus
● Arabic
● Nominal MWEs
● The purpose is to build an MWE lexicon for Arabic
4. Data Resources
✔ Multilingual, bilingual and monolingual settings
✔ Availability of rich resources that have not been
exploited in similar tasks before.
● Arabic Wikipedia (March 2010)
● 117,491 titles, of which 89,623 are multiword titles
● Arabic is ranked 27th according to size (article count) and
17th according to usage
● Information helpful for linguistic processing
5. Data Resources
● Princeton WordNet 3.0
● An electronic lexical database for English
● Arabic WordNet contains only 11,269 synsets (including
2,348 MWEs)
6. Data Resources
● Arabic Gigaword
● Unannotated corpus distributed by the Linguistic Data
Consortium (LDC).
● Articles from news agencies and newspapers from different
Arab regions, such as Al-Ahram in Egypt, An Nahar in
Lebanon and Assabah in Tunisia.
● Largest publicly available corpus of Arabic to date.
● Contains 848 million words.
7. Methodology
3 different techniques for 3 different data sources
Motivation for using different techniques
● The extraction of MWEs is too complex a problem to be
dealt with by one simple solution.
● The choice of technique depends on the nature of the task
and the type of the resources used.
9. Technique 1: Crosslingual Asymmetries
● Data: Titles of Wikipedia Articles in Arabic and corresponding
titles in 21 languages.
● Assumption: we rely on many-to-one correspondence relations
● The non-compositionality of MWEs makes it unlikely for them
to have a mirrored representation in the other languages.
● Compositionality varies:
● highly compositional: "قاعدة عسكرية" ("military base"),
● with a degree of idiomaticity: "مدينة الملاهي" ("amusement park"), lit. "city of amusements",
● extremely opaque: "فرس النبي" ("grasshopper"), lit. "the horse of the Prophet".
10. Technique 1: Crosslingual Asymmetries
● Steps
(1) Candidate Selection. All Arabic Wikipedia multiword titles
are taken as candidates.
(2) Filtering. We exclude titles of disambiguation and
administrative pages.
(3) Validation. We check if there is a single-word translation in
any of 21 selected languages.
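The validation step can be pictured as a short filter over a title's interlanguage links. Below is a minimal Python sketch, assuming the links have already been extracted into a plain dict; the function name and data shapes are illustrative, not the authors' implementation, and whitespace splitting is a crude stand-in for proper word counting.

```python
def validate_as_mwe(arabic_title: str, interlang_links: dict) -> bool:
    """Validate a multiword Arabic Wikipedia title as an MWE if any of
    its translations is a single word: such a many-to-one correspondence
    suggests the Arabic expression is non-compositional."""
    if len(arabic_title.split()) < 2:
        return False  # candidate selection: multiword titles only
    # Whitespace splitting is a simplification; it fails for scripts
    # written without spaces (e.g. Chinese or Japanese).
    return any(len(t.split()) == 1 for t in interlang_links.values())

# Hypothetical example: "فرس النبي" ("grasshopper") has one-word
# translations, so the heuristic accepts it as an MWE.
print(validate_as_mwe("فرس النبي", {"en": "grasshopper", "fr": "sauterelle"}))  # True
```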
11. Technique 1: Crosslingual Asymmetries
● Evaluation:
● 1100 multiword titles are randomly selected from Arabic
Wikipedia and manually tagged as: MWEs, non-MWEs, or
NEs.
● Baseline: all multiword titles are considered MWEs
● Results
14. Technique 2: Translation-Based
● Data: Princeton WordNet
● Assumption: MWEs in one language are likely to be
translated as MWEs in another language.
● Ontological advantage
● Steps
● Extracting the list of nominal MWEs from PWN 3.0.
● Translating the list into Arabic using Google Translate.
● Validating the results using pure frequency counts from three
search engines: Al-Jazeera, BBC Arabic and AWK.
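As a rough illustration of the whole pipeline, the sketch below chains the three steps; translate() and hit_count() are hypothetical placeholders for Google Translate and the site-restricted frequency lookups, and the threshold is a free parameter rather than the paper's actual setting.

```python
def translation_based_mwes(pwn_nominal_mwes, translate, hit_count, threshold=1):
    """Translate English nominal MWEs into Arabic and keep only those
    attested frequently enough on the validation sites."""
    validated = []
    for en_mwe in pwn_nominal_mwes:
        ar = translate(en_mwe, src="en", tgt="ar")
        if ar.lower() == en_mwe.lower():
            continue  # source copied verbatim into the target (see slide 17)
        hits = sum(hit_count(ar, site=s)
                   for s in ("aljazeera", "bbc-arabic", "awk"))
        if hits >= threshold:
            validated.append((en_mwe, ar))
    return validated
```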
15. Technique 2: Translation-Based
● Evaluation (automatic)
● Gold standard: PWN MWEs that are found in English Wikipedia and
have a correspondence in Arabic: 6,322 expressions.
● We test the Google translation without any filtering, and consider this as
the baseline.
● Then we filter the output based on the number of combined hits from the
search engines.
● Results
17. Technique 2: Translation-Based
● Notes on Google Translate
● Word-order errors:
– shark repellent => القرش طارد
– accordion door => الكورديون الباب
● Source terms transferred untranslated to the target:
– acroclinium roseum => acroclinium roseum
– actitis hypoleucos => actitis hypoleucos
18. Technique 3: Corpus-Based
● Data: Arabic Gigaword corpus
● Association Measures used:
● Pointwise Mutual Information (PMI)
● Pearson’s chi-square
● Steps
(1) Computing the frequency of all unigrams, bigrams and trigrams
(2) Computing the association measures for all bigrams and trigrams (frequency threshold of 50)
(3) Ranking the bigrams and trigrams
(4) Lemmatizing the Arabic words using MADA
(5) Filtering the list using the MADA POS tagger. The patterns kept are NN and NA for bigrams, and NNN, NNA and NAA for trigrams
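To make the association measures concrete, here is a self-contained Python sketch of PMI and Pearson's chi-square over bigram counts, with the frequency threshold applied. It assumes the tokens are already lemmatized and POS-filtered, and it is a simplified illustration rather than the authors' implementation.

```python
import math
from collections import Counter

def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information: log2(P(x,y) / (P(x) * P(y)))."""
    return math.log2((c_xy * n) / (c_x * c_y))

def chi_square(c_xy, c_x, c_y, n):
    """Pearson's chi-square on the bigram's 2x2 contingency table."""
    o11, o12 = c_xy, c_x - c_xy
    o21, o22 = c_y - c_xy, n - c_x - c_y + c_xy
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    return num / den

def rank_bigrams(tokens, measure=pmi, min_freq=50):
    """Steps (1)-(3): count n-grams, apply the threshold, rank by score."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = [(bg, measure(c, unigrams[bg[0]], unigrams[bg[1]], n))
              for bg, c in bigrams.items() if c >= min_freq]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)
```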
19. Technique 3: Corpus-Based
● Why is lemmatization important?
● Al>mm AlmtHdp
(the-nations united) “the United Nations”
Al>mm@>um~ap_1@N@1#AlmtHdp@mut~aHid_1@AJ@2#
● ll>mm AlmtHdp
(to-the-nations united) “to the United Nations”
ll>mm@>um~ap_1@N@3#AlmtHdp@mut~aHid_1@AJ@3#
● wAl>mm AlmtHdp
(and-the-nations united) “and the United Nations”
wAl>mm@>um~ap_1@c-N@3#AlmtHdp@mut~aHid_1@AJ@3#
● bAl>mm AlmtHdp
(by-the-nations united) “by the United Nations”
bAl>mm@>um~ap_1@N@3#AlmtHdp@mut~aHid_1@AJ@3#
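Under the assumption that each analysis line follows the token@lemma@POS@n# pattern shown above, the short sketch below pools the four surface variants into a single lemma bigram, which is exactly why lemmatization makes the counts (and hence the association scores) reliable.

```python
from collections import Counter

def lemma_bigrams(mada_line):
    """Extract the lemma sequence from a MADA-style analysis line
    (format inferred from the examples above)."""
    lemmas = [field.split("@")[1] for field in mada_line.split("#") if field]
    return zip(lemmas, lemmas[1:])

lines = [
    "Al>mm@>um~ap_1@N@1#AlmtHdp@mut~aHid_1@AJ@2#",
    "ll>mm@>um~ap_1@N@3#AlmtHdp@mut~aHid_1@AJ@3#",
    "wAl>mm@>um~ap_1@c-N@3#AlmtHdp@mut~aHid_1@AJ@3#",
    "bAl>mm@>um~ap_1@N@3#AlmtHdp@mut~aHid_1@AJ@3#",
]
counts = Counter(bg for line in lines for bg in lemma_bigrams(line))
print(counts)  # Counter({('>um~ap_1', 'mut~aHid_1'): 4}) -- one bigram, count 4
```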
20. Technique 3: Corpus-Based
● Evaluation: 3,600 expressions are randomly selected and classified
as MWE or non-MWE by a human annotator.
● Results
22. Discussion of results
● Similarities and dissimilarities of output
The set of collocations detected by the association measures
may differ from those that capture the interest of
lexicographers and Wikipedians:
● مناحم مازوز “Menachem Mazuz”
● خضروات طازجة “fresh vegetables”
● سيداتي وسادتي “Ladies and gentlemen”
23. Conclusion
● Applicability to other languages
● The heterogeneity of the data sources helps to enrich
the MWE lexicon.
● A lexical resource of:
● 33,000 MWEs
● 39,000 NEs