Lexical Profiling for Arabic


Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi, Josef van Genabith

    National Centre for Language Technology (NCLT),
       School of Computing, Dublin City University

                          Funded by:
   Enterprise Ireland, the Irish Research Council for Science,
         Engineering and Technology (IRCSET), and
        the EU projects PANACEA and META-NET
Overview
• Introduction
• Building the lexical database for Arabic
  – Corpus-based Selection of Entries
  – Morphological Details: Inflectional Paradigms
  – Syntactic Details: Subcategorization Frames
• Web Application
• Conclusion
Introduction
• Modern Standard Arabic vs. Classical
  Arabic
• Current State of Arabic Lexicography
  – Lexicons are not corpus-based
  – Buckwalter Electronic Dictionary and Arabic
    Morphological Analyser
  – No lexica for subcategorization frames
• Importance of Lexical Resources
Introduction
• Arabic Morphotactics
Aim
• Constructing a lexical database of Modern
  Standard Arabic
• Constructing a database for Arabic
  subcategorization frames
Methodology
Lexical Details
• Using a medium-scale manually created lexicon of
  10,799 lemmas
• Using statistics from a 1-billion-word corpus (annotated by MADA)
   – 90% from the LDC's Arabic Gigaword
   – 10% collected from the Al-Jazeera website
Subcategorization Details
• Using a medium-scale manually created lexicon of 2,901
  lemma-frame types
• Using the Penn Arabic Treebank of 22,524 sentences,
  and 587,665 words
Extending the Lexical Database

• Start off with a seed lexicon
  – Three Lexical Databases, manually constructed
     • 5,925 nominal lemmas, with details on:
        – Gender and number
        – Inflection paradigm (13 continuation classes)
        – Humanness
     • 1,529 verb lemmas, with details on:
        – Transitivity
        – Whether passive is allowed or not
        – Whether the imperative is allowed or not
     • 490 patterns (456 for nominals and 34 for verbs)
      • lemma-root lookup database
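
As a rough illustration of the seed lexicons' content (not their actual schema; field names and values below are hypothetical), entries of this kind can be modelled as simple records:

    # Illustrative record layout for the manually built seed lexicons.
    # Field names and values are assumptions, not the project's actual schema.
    from dataclasses import dataclass

    @dataclass
    class NominalEntry:
        lemma: str
        gender: str              # 'masc' or 'fem'
        number: str              # 'sg', 'du' or 'pl'
        continuation_class: str  # one of the 13 inflection paradigms
        human: bool              # semantico-grammatical humanness feature

    @dataclass
    class VerbEntry:
        lemma: str
        transitive: bool
        allows_passive: bool
        allows_imperative: bool

    # Example entries with illustrative values only
    jAnib = NominalEntry('jAnib', 'masc', 'sg', 'N-broken', False)
    iEotamada = VerbEntry('{iEotamada', False, True, True)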
Methodology
Extending the Lexical Database

• Automatically Extending the Lexical
  Database: Lexical Enrichment
  – Data-driven filtering technique
     • 40,648 lemmas (in Buckwalter or SAMA 3.1)
     • Statistics from three web search engines
     • Statistics from the corpus annotated by MADA
     • 29,627 lemmas (left after filtering)
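
A minimal sketch of such a data-driven filter, assuming hypothetical per-lemma frequencies from the MADA-annotated corpus and hit counts from the three web search engines (thresholds below are placeholders, not the settings that produced the 29,627 figure):

    # Keep a candidate lemma only if it is sufficiently attested in the
    # corpus or by enough of the web search engines.
    def keep_lemma(corpus_freq, engine_hits, min_corpus=5, min_engines=2, min_hits=100):
        attesting = sum(1 for h in engine_hits if h >= min_hits)
        return corpus_freq >= min_corpus or attesting >= min_engines

    # candidates: lemma -> (corpus frequency, [hits from 3 search engines])
    candidates = {'jAnib': (5210, [120000, 98000, 87000]),
                  'xyzzy': (0, [3, 0, 1])}
    kept = [lem for lem, (freq, hits) in candidates.items() if keep_lemma(freq, hits)]
    print(kept)   # ['jAnib']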
Extending the Lexical Database

Automatically Extending the Lexical
Database: Feature Enrichment
  – Machine Learning
  – Multilayer Perceptron classification algorithm
  – Training Data: 4,816 nominals and 1,448 verbs
  – Classes for nominals: continuation classes (or inflection
    paths), the semantico-grammatical feature of humanness,
    and POS (noun or adjective)
  – Classes for verbs: transitivity, allowing the passive voice,
    and allowing the imperative mood
  – We enrich these datasets with frequency statistics from the
    corpus and build a feature-vector grid.
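
A minimal sketch of this setup using scikit-learn's MLPClassifier; the feature columns and hyperparameters are assumptions, not the configuration used in the project:

    # Predict a class (e.g. humanness) for a lemma from corpus statistics.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # X: one row per training lemma, e.g. relative frequencies of the lemma
    # occurring with particular morphological tags in the MADA-annotated corpus.
    X = np.array([[0.80, 0.05, 0.15],
                  [0.10, 0.70, 0.20],
                  [0.75, 0.10, 0.15],
                  [0.05, 0.85, 0.10]])
    y = ['human', 'non-human', 'human', 'non-human']   # gold labels from the seed lexicon

    clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
    clf.fit(X, y)

    # Classify a newly acquired lemma from its corpus statistics.
    print(clf.predict([[0.70, 0.15, 0.15]]))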
Extending the Lexical Database

• Extending the Lexical Database
  – Feature enrichment using Machine Learning
Extending the Lexical Database

• Extending the Lexical Database
  – With Machine Learning we add 18,000 new lemmas:
      • 12,974 nominals
      • 5,034 verbs
Extending the Lexical Database

• Handling Broken Plurals
                 jAnib      (side)
                 jawAnib    (sides)

Poor handling of broken plurals in Buckwalter
(4) <lemmaID>jAnib_1</lemmaID> <voc>jAnib</voc>
    <pos>jAnib/NOUN</pos> <gloss>side/aspect</gloss>

(5) <lemmaID>jAnib_1</lemmaID> <voc>jawAnib</voc>
   <pos>jawAnib/NOUN</pos> <gloss>sides/aspects</gloss>

Two differences: voc and gloss
Extending the Lexical Database

• Extracting Broken Plurals
<gloss>side/aspect</gloss>

<gloss>sides/aspects</gloss>


  We use the Levenshtein distance, which measures the difference
between two strings (here, glosses sharing the same lemmaID).

     distance of 2 / length of the first string = 0.15
     (within the threshold 0.4)


    We collect 2,266 candidates
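
A minimal sketch of the gloss comparison, with a plain-Python Levenshtein implementation; the 0.4 threshold follows the slide, the rest is illustrative:

    # Compare glosses of Buckwalter entries sharing the same lemmaID; a small
    # normalised edit distance suggests a singular / broken-plural pair.
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def is_plural_candidate(gloss_a, gloss_b, threshold=0.4):
        return levenshtein(gloss_a, gloss_b) / len(gloss_a) <= threshold

    # 'side/aspect' vs. 'sides/aspects': distance 2, well within the threshold
    print(is_plural_candidate('side/aspect', 'sides/aspects'))   # True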
Extending the Lexical Database

• Validating Broken Plurals
<voc>jAnib</voc>         singular
                         pattern is: fAEil
                         regex is: .A.i.
<voc>jawAnib</voc> plural
                         pattern is: fawAEil
                         regex is: .awA.i.
Pattern database: 135 singular patterns, each selecting from a
set of 82 broken plural patterns

2,266 candidates -> 1,965 are validated (87%)
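
A minimal sketch of this validation step with a tiny hypothetical excerpt of the pattern database (the real one maps 135 singular patterns onto 82 broken plural patterns):

    import re

    # Hypothetical excerpt: singular pattern -> regex, plus the broken plural
    # patterns (also as regexes) that it licenses.
    SINGULAR_PATTERNS = {'fAEil': r'^.A.i.$'}
    LICENSED_PLURALS = {'fAEil': {'fawAEil': r'^.awA.i.$'}}

    def validate(singular, plural):
        """Accept a candidate pair if the singular matches a singular pattern
        and the plural matches one of the plural patterns it licenses."""
        for sg_name, sg_re in SINGULAR_PATTERNS.items():
            if re.match(sg_re, singular):
                for pl_name, pl_re in LICENSED_PLURALS[sg_name].items():
                    if re.match(pl_re, plural):
                        return True, (sg_name, pl_name)
        return False, None

    print(validate('jAnib', 'jawAnib'))   # (True, ('fAEil', 'fawAEil'))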
Extending the Lexical Database

• Interesting statistics on Arabic plurals
Insights from the corpus:
  • 5,570 lemmas have a feminine plural suffix
  • 1,942 lemmas have a masculine plural suffix
  • 2,730 lemmas have broken plural forms
Extraction of Subcat Frames

• Importance of subcategorization frames

• Advantage of Automatic Extraction

• Available Resources on Arabic Subcat Frames:
   – none except the Arabic LFG Parser (Attia, 2008), available as open source
Extraction of Subcat Frames

What are LFG subcat frames?
  • Governable GFs (SUBJ, OBJ, OBJθ, OBLθ, COMP and XCOMP)
  • Non-governable GFs (ADJ and XADJ)

π<gf1, gf2, … gfn>

{iEotamada Al-Tifolu EalaY wAlidati-hi
“The child relied on his mother”

{iEotamada<(↑SUBJ)(↑OBL_EalaY)>
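
As a rough illustration (the internal format is an assumption), such a frame can be represented as a predicate lemma paired with its governable GFs, the preposition attached to the oblique:

    # Illustrative representation of the frame above
    frame = ('{iEotamada', ('SUBJ', 'OBL_EalaY'))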
Extraction of Subcat Frames

Automatic extraction of subcat frames
  • The ATB contains 22,524 sentences
  • LFG annotation algorithm (DCU)
  • Traversing trees and looking for dependencies
  • Lemmatization
  • We extract 7,746 lemma-frame types (for verbs, nouns and adjectives)
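
A minimal sketch of the final counting stage, assuming the annotation algorithm already yields one (lemma, frame) observation per predicate instance (input format and names are hypothetical):

    from collections import Counter

    # Each observation is a lemma plus the governable GFs it was seen to
    # govern in one annotated ATB tree, after lemmatization.
    observations = [
        ('{iEotamada', ('SUBJ', 'OBL_EalaY')),
        ('{iEotamada', ('SUBJ', 'OBL_EalaY')),
        ('{iEotamada', ('SUBJ',)),
    ]

    frame_counts = Counter(observations)                # per lemma-frame type
    lemma_counts = Counter(lem for lem, _ in observations)
    print(len(frame_counts), 'lemma-frame types')       # 2 in this toy sample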
Extraction of Subcat Frames

Estimating the Subcategorization Probability
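
The formula itself is not reproduced on this extracted slide; a standard maximum-likelihood formulation (an assumption, not necessarily the exact one used) conditions each frame on its lemma:

    P(frame | lemma) = count(lemma, frame) / count(lemma)

    e.g. with the toy counts above: P(<SUBJ, OBL_EalaY> | {iEotamada) = 2/3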
Extraction of Subcat Frames

Evaluation of the Subcategorization Extraction
Extraction of Subcat Frames

Evaluation of the Subcategorization Extraction
Web Application
• AraComLex Lexicon Writing Application
         www.cngl.ie/aracomlex
Byproducts of the Work
A number of open-source resources:
  • Finite-state morphological transducer
  • Arabic morphological patterns
  • Subcategorization frames
  • Arabic lemma frequency counts
Conclusion
• We successfully use machine learning to predict
  morpho-syntactic features for newly acquired words.
• We successfully extract subcategorization frames
  from the Penn Arabic Treebank.
• We build specifications and an implementation for
  an Arabic lexicographic web application.
