Automatic Extraction of Arabic
       Multiword Expressions
*Mohammed Attia, Antonio Toral, Lamia Tounsi, Pavel Pecina and
                    Josef van Genabith
     School of Computing, Dublin City University, Ireland
Outline
●   Introduction
●   Data Resources
●   Methodology
    ●   Crosslingual Correspondence Asymmetries
    ●   Translation-Based Approach
    ●   Corpus-Based Approach
●   Discussion of experiments and results
●   Conclusion
Introduction
●   Criteria of MWEs
    ●   Ubiquity
    ●   Diversity
    ●   Low polysemy
    ●   Statistically significant co-occurrence
●   Focus
    ●   Arabic
    ●   Nominal MWEs
●   Purpose is building an MWE lexicon for Arabic
Data Resources
✔   Multilingual, bilingual and monolingual settings
✔   Availability of rich resources that have not been
    exploited in similar tasks before.
●   Arabic Wikipedia (March 2010)
    ●   117,491 titles, of which 89,623 are multiword titles
    ●   Arabic is ranked 27th according to size (article count) and
        17th according to usage
    ●   Information helpful for linguistic processing
Data Resources
●   Princeton WordNet 3.0
    ●   An electronic lexical database for English
    ●   Arabic WordNet contains only 11,269 synsets (including
        2,348 MWEs)
Data Resources
●   Arabic Gigaword
    ●   Unannotated corpus distributed by the Linguistic Data
        Consortium (LDC).
    ●   Articles from news agencies and newspapers from different
        Arab regions, such as Al-Ahram in Egypt, An Nahar in
        Lebanon and Assabah in Tunisia.
    ●   Largest publicly available corpus of Arabic to date.
    ●   Contains 848 million words.
Methodology
3 different techniques for 3 different data sources


Motivation for using different techniques
   ●   The extraction of MWEs is a problem more complex than
       can be dealt with by one simple solution.
   ●   The choice of technique depends on the nature of the task
       and the type of the resources used.
Pipeline
Technique 1: Crosslingual Asymmetries

●   Data: Titles of Wikipedia Articles in Arabic and corresponding
    titles in 21 languages.
●   Definition: We rely on many-to-one correspondence relations, where
    an Arabic multiword title corresponds to a single-word title in
    another language.
●   The non-compositionality of MWEs makes it unlikely that they have
    a mirrored (word-for-word) representation in other languages.
●   Compositionality varies:
    ●   highly compositional: "قاعدة عسكرية", "military base".
    ●   with a degree of idiomaticity: "مدينة الملاهي", "amusement
        park", lit. "city of amusements".
    ●   extremely opaque: "فرس النبي", "grasshopper", lit. "the horse of the
        Prophet".
Technique 1: Crosslingual Asymmetries

●   Steps
    (1) Candidate Selection. All Arabic Wikipedia multiword titles
       are taken as candidates.
    (2) Filtering. We exclude titles of disambiguation and
       administrative pages.
    (3) Validation. We check if there is a single-word translation in
       any of 21 selected languages.
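
The validation step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout (Arabic title mapped to titles per language code) and the sample entries are all assumptions made for the example, and whitespace splitting is assumed to be an adequate word count.

```python
from typing import Dict, List


def is_single_word(title: str) -> bool:
    """A title counts as one word if it is a single whitespace-delimited token."""
    return len(title.split()) == 1


def validate_candidates(candidates: Dict[str, Dict[str, str]]) -> List[str]:
    """Keep Arabic multiword titles that have a single-word title in at
    least one other language (a many-to-one correspondence)."""
    accepted = []
    for arabic_title, translations in candidates.items():
        if len(arabic_title.split()) < 2:
            continue  # only multiword titles are candidates
        if any(is_single_word(t) for t in translations.values()):
            accepted.append(arabic_title)
    return accepted


# Hypothetical interlanguage links: Arabic title -> {language code: title}.
sample = {
    "فرس النبي": {"en": "Grasshopper"},
    "قاعدة عسكرية": {"en": "Military base"},
}
mwes = validate_candidates(sample)
```

Checking more languages raises the chance that some language lexicalizes the concept as a single word, which is why the slides use 21 languages rather than one.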
Technique 1: Crosslingual Asymmetries

●   Evaluation:
    ●   1100 multiword titles are randomly selected from Arabic
        Wikipedia and manually tagged as: MWEs, non-MWEs, or
        NEs.
    ●   Baseline: all multi-word titles are considered as MWEs
●   Results
Example
Language Ranking

How likely is each language to yield a many-to-one correspondence?
Technique 2: Translation-Based

●   Data: Princeton WordNet
    ●   Assumption: MWEs in one language are likely to be
        translated as MWEs in another language.
    ●   Ontological advantage
●   Steps
    ●   Extracting the list of nominal MWEs from PWN 3.0.
    ●   Translating the list into Arabic using Google Translate.
    ●   Validating the results using pure frequency counts from three
        search engines: Al-Jazeera, BBC Arabic and AWK.
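
A rough sketch of the validation step, assuming each search engine can be wrapped as a function that returns a hit count for a phrase. The counter functions, the stub counts and the `min_hits` cutoff are hypothetical; the slides do not state the cutoff used or how the engines were queried.

```python
from typing import Callable, Dict, List

# A hit counter maps a phrase to the number of hits one search engine reports.
HitCounter = Callable[[str], int]


def combined_hits(phrase: str, counters: List[HitCounter]) -> int:
    """Sum the hit counts for a phrase across all search engines."""
    return sum(counter(phrase) for counter in counters)


def validate_translations(
    translations: Dict[str, str],
    counters: List[HitCounter],
    min_hits: int,
) -> Dict[str, str]:
    """Keep English -> Arabic pairs whose Arabic side reaches min_hits combined hits."""
    return {
        english: arabic
        for english, arabic in translations.items()
        if combined_hits(arabic, counters) >= min_hits
    }


# Stub counters standing in for real search engines (made-up counts).
site_a = {"قاعدة عسكرية": 120}
site_b = {"قاعدة عسكرية": 45}
counters = [lambda p: site_a.get(p, 0), lambda p: site_b.get(p, 0)]

translations = {
    "military base": "قاعدة عسكرية",
    "shark repellent": "القرش طارد",
}
kept = validate_translations(translations, counters, min_hits=10)
```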
Technique 2: Translation-Based

●   Evaluation (automatic)
    ●   Gold Standard: PWN MWEs that are found in English Wikipedia and have a
        corresponding Arabic title: 6322 expressions.
    ●   We test the Google translation without any filtering, and consider this as
        the baseline.
    ●   Then we filter the output based on the number of combined hits from the
        search engines.


●   Results
Technique 2: Translation-Based

●   Evaluation (Manual)
    ●   On 200 MWE candidates
    ●   Precision
         –   Baseline (before validation): 45.5%
         –   After validation: 83%
Technique 2: Translation-Based

●   Notes on Google Translate
    ●   Word Order
         –   shark repellent      =>        ‫القرش طارد‬
         –   accordion door       =>        ‫الكورديون الباب‬
    ●   Transferring source word to target
         –   acroclinium roseum =>             acroclinium roseum
         –   actitis hypoleucos        =>       actitis hypoleucos
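
The second problem (source words copied into the target) is easy to detect automatically. The sketch below is one possible check, not something described in the slides: it flags a translation if it is identical to the source or contains no character from the basic Arabic Unicode block (U+0600 to U+06FF). The function names are illustrative.

```python
def contains_arabic(text: str) -> bool:
    """True if any character falls in the basic Arabic Unicode block."""
    return any("\u0600" <= ch <= "\u06FF" for ch in text)


def is_untranslated(source: str, target: str) -> bool:
    """Flag outputs that copy the source or contain no Arabic script."""
    same = target.strip().lower() == source.strip().lower()
    return same or not contains_arabic(target)
```

This check does not catch word-order errors such as the "shark repellent" example, since the output there is still Arabic text; those are left to the frequency-based validation.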
Technique 3: Corpus-Based

●   Data: Arabic Gigaword corpus
●   Association Measures used:
    ●   Pointwise Mutual Information (PMI)
    ●   Pearson’s chi-square
●   Steps
    (1) Computing the frequency of all unigrams, bigrams, and trigrams.
    (2) Computing the association measures for all bigrams and trigrams (frequency threshold set to 50).
    (3) Ranking bigrams and trigrams.
    (4) Conducting lemmatization of Arabic words using MADA.
    (5) Filtering the list using the MADA POS-tagger. The patterns included for bigrams are NN and NA, and for
      trigrams NNN, NNA and NAA.
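
As an illustration of steps (1), (2) and (5) for bigrams, the sketch below computes PMI and Pearson's chi-square from a 2x2 contingency table of bigram counts. It is a toy version under stated assumptions: `min_freq` stands in for the threshold on the slide (assumed here to be a frequency cutoff), the single-letter tags in `ALLOWED_PATTERNS` mirror the slide's pattern names rather than MADA's actual tag set, and trigram scoring is omitted.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

# POS patterns from the slide: N = noun, A = adjective (tag names assumed).
ALLOWED_PATTERNS = {
    ("N", "N"), ("N", "A"),
    ("N", "N", "N"), ("N", "N", "A"), ("N", "A", "A"),
}


def keep_by_pos(tags: Tuple[str, ...]) -> bool:
    """True if the tag sequence matches one of the allowed patterns."""
    return tuple(tags) in ALLOWED_PATTERNS


def bigram_scores(
    tokens: List[str], min_freq: int = 1
) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Return {bigram: (PMI, chi-square)} for bigrams seen at least min_freq times."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    first, second = Counter(), Counter()
    for (w1, w2), count in bigrams.items():
        first[w1] += count   # w1 in first position
        second[w2] += count  # w2 in second position
    n = sum(bigrams.values())

    scores = {}
    for (w1, w2), o11 in bigrams.items():
        if o11 < min_freq:
            continue
        o12 = first[w1] - o11        # w1 followed by something else
        o21 = second[w2] - o11       # w2 preceded by something else
        o22 = n - o11 - o12 - o21    # neither w1 first nor w2 second
        pmi = math.log2(o11 * n / (first[w1] * second[w2]))
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
        scores[(w1, w2)] = (pmi, chi2)
    return scores


tokens = "a b a b c d".split()
scores = bigram_scores(tokens)
# Rank by PMI, highest first.
ranked = sorted(scores, key=lambda bg: scores[bg][0], reverse=True)
```

On a corpus the size of Arabic Gigaword the counting would need to be done out of core; the formulas stay the same.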
Technique 3: Corpus-Based

●   Why is lemmatization important?
    ●   Al>mm AlmtHdp
        (the-nations united) “the United Nations”
        Al>mm@>um~ap_1@N@1#AlmtHdp@mut~aHid_1@AJ@2#


    ●   ll>mm AlmtHdp
        (to-the-nations united) “to the United Nations”
        ll>mm@>um~ap_1@N@3#AlmtHdp@mut~aHid_1@AJ@3#


    ●   wAl>mm AlmtHdp
        (and-the-nations united) “and the United Nations”
        wAl>mm@>um~ap_1@c-N@3#AlmtHdp@mut~aHid_1@AJ@3#


    ●   bAl>mm AlmtHdp
        (by-the-nations united) “by the United Nations”
        bAl>mm@>um~ap_1@N@3#AlmtHdp@mut~aHid_1@AJ@3#
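
The effect of lemmatization can be shown by collapsing the four surface variants above onto one lemma sequence. The sketch assumes the layout seen in these examples (tokens separated by "#", fields separated by "@", lemma in the second field); this layout is inferred from the slide, not taken from MADA's documentation, and `lemma_key` is an illustrative name.

```python
from collections import Counter
from typing import Tuple


def lemma_key(analysis: str) -> Tuple[str, ...]:
    """Extract the lemma of each token from a 'surface@lemma@POS@n#' string."""
    lemmas = []
    for token in analysis.split("#"):
        if not token:
            continue  # skip the empty piece after the trailing '#'
        fields = token.split("@")
        lemmas.append(fields[1])
    return tuple(lemmas)


analyses = [
    "Al>mm@>um~ap_1@N@1#AlmtHdp@mut~aHid_1@AJ@2#",
    "ll>mm@>um~ap_1@N@3#AlmtHdp@mut~aHid_1@AJ@3#",
    "wAl>mm@>um~ap_1@c-N@3#AlmtHdp@mut~aHid_1@AJ@3#",
    "bAl>mm@>um~ap_1@N@3#AlmtHdp@mut~aHid_1@AJ@3#",
]
counts = Counter(lemma_key(a) for a in analyses)
```

Without this step the four variants would be counted as four separate bigrams, each with a lower frequency and a weaker association score.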
Technique 3: Corpus-Based

●   Evaluation: 3600 expressions are randomly selected
    and classified into MWE or non-MWE by a human
    annotator.
●   Results
Discussion of results
●   Combination of yields
Discussion of results
●   Similarities and dissimilarities of output
The set of collocations detected by the association
measures may differ from those which capture the
interest of lexicographers and Wikipedians
    ●   مناحم مازوز     “Menachem Mazuz”
    ●   خضروات طازجة    “fresh vegetables”
    ●   سيداتي وسادتي   “Ladies and gentlemen”
Conclusion
●   Applicability to other languages
●   The heterogeneity of the data sources helps to enrich
    the MWE lexicon.
●   A lexical resource of:
    ●   33,000 MWEs
    ●   39,000 NEs
Thank you!
