Lexical Profiling for Arabic


Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi, Josef van Genabith

    National Centre for Language Technology (NCLT),
       School of Computing, Dublin City University

                          Funded by:
   Enterprise Ireland, the Irish Research Council for Science,
         Engineering and Technology (IRCSET), and
        the EU projects PANACEA and META-NET
Overview
• Introduction
• Building the lexical database for Arabic
  – Corpus-based Selection of Entries
  – Morphological Details: Inflectional Paradigms
  – Syntactic Details: Subcategorization Frames
• Web Application
• Conclusion
Introduction
• Modern Standard Arabic vs. Classical
  Arabic
• Current State of Arabic Lexicography
  – Lexicons are not corpus-based
  – Buckwalter Electronic Dictionary and Arabic
    Morphological Analyser
  – No lexica for subcategorization frames
• Importance of Lexical Resources
Introduction
• Arabic Morphotactics
Aim
• Constructing a lexical database of Modern
  Standard Arabic
• Constructing a database for Arabic
  subcategorization frames
Methodology
Lexical Details
• Using a medium-scale manually created lexicon of
  10,799 lemmas
• Using statistics from a 1-billion-word corpus (annotated by MADA)
   – 90% from the LDC's Arabic Gigaword
   – 10% collected from the Al-Jazeera website
Subcategorization Details
• Using a medium-scale manually created lexicon of 2,901
  lemma-frame types
• Using the Penn Arabic Treebank of 22,524 sentences,
  and 587,665 words
Extending the Lexical Database

• Start off with a seed lexicon
  – Three Lexical Databases, manually constructed
     • 5,925 nominal lemmas, with details on:
        – Gender and number
        – Inflection paradigm (13 continuation classes)
        – Humanness
     • 1,529 verb lemmas, with details on:
        – Transitivity
        – Whether passive is allowed or not
        – Whether the imperative is allowed or not
     • 490 patterns (456 for nominals and 34 for verbs)
      • lemma-root lookup database
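
As a rough illustration of the seed lexicons' content (not their actual schema; field names and values below are hypothetical), entries of this kind can be modelled as simple records:

    # Illustrative record layout for the manually built seed lexicons.
    # Field names and values are assumptions, not the project's actual schema.
    from dataclasses import dataclass

    @dataclass
    class NominalEntry:
        lemma: str
        gender: str              # 'masc' or 'fem'
        number: str              # 'sg', 'du' or 'pl'
        continuation_class: str  # one of the 13 inflection paradigms
        human: bool              # semantico-grammatical humanness feature

    @dataclass
    class VerbEntry:
        lemma: str
        transitive: bool
        allows_passive: bool
        allows_imperative: bool

    # Example entries with illustrative values only
    jAnib = NominalEntry('jAnib', 'masc', 'sg', 'N-broken', False)
    iEotamada = VerbEntry('{iEotamada', False, True, True)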
Methodology
Extending the Lexical Database

• Automatically Extending the Lexical
  Database: Lexical Enrichment
  – Data-driven filtering technique
     • 40,648 lemmas (in Buckwalter or SAMA 3.1)
     • Statistics from three web search engines
     • Statistics from the corpus annotated by MADA
     • 29,627 lemmas (left after filtering)
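
A minimal sketch of such a data-driven filter, assuming hypothetical per-lemma frequencies from the MADA-annotated corpus and hit counts from the three web search engines (thresholds below are placeholders, not the settings that produced the 29,627 figure):

    # Keep a candidate lemma only if it is sufficiently attested in the
    # corpus or by enough of the web search engines.
    def keep_lemma(corpus_freq, engine_hits, min_corpus=5, min_engines=2, min_hits=100):
        attesting = sum(1 for h in engine_hits if h >= min_hits)
        return corpus_freq >= min_corpus or attesting >= min_engines

    # candidates: lemma -> (corpus frequency, [hits from 3 search engines])
    candidates = {'jAnib': (5210, [120000, 98000, 87000]),
                  'xyzzy': (0, [3, 0, 1])}
    kept = [lem for lem, (freq, hits) in candidates.items() if keep_lemma(freq, hits)]
    print(kept)   # ['jAnib']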
Extending the Lexical Database

Automatically Extending the Lexical
Database: Feature Enrichment
  – Machine Learning
  – Multilayer Perceptron classification algorithm
  – Training Data: 4,816 nominals and 1,448 verbs
  – Classes for nominals: continuation classes (or inflection
    paths), the semantico-grammatical feature of humanness,
    and POS (noun or adjective)
  – Classes for verbs: transitivity, allowing the passive voice,
    and allowing the imperative mood
  – We enrich these datasets with frequency statistics from the
    corpus and build a feature-vector grid.
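
A minimal sketch of this setup using scikit-learn's MLPClassifier; the feature columns and hyperparameters are assumptions, not the configuration used in the project:

    # Predict a class (e.g. humanness) for a lemma from corpus statistics.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # X: one row per training lemma, e.g. relative frequencies of the lemma
    # occurring with particular morphological tags in the MADA-annotated corpus.
    X = np.array([[0.80, 0.05, 0.15],
                  [0.10, 0.70, 0.20],
                  [0.75, 0.10, 0.15],
                  [0.05, 0.85, 0.10]])
    y = ['human', 'non-human', 'human', 'non-human']   # gold labels from the seed lexicon

    clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
    clf.fit(X, y)

    # Classify a newly acquired lemma from its corpus statistics.
    print(clf.predict([[0.70, 0.15, 0.15]]))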
Extending the Lexical Database

• Extending the Lexical Database
  – Feature enrichment using Machine Learning
Extending the Lexical Database

• Extending the Lexical Database
  – With Machine Learning we add 18,000 new lemmas:
      • 12,974 nominals
      • 5,034 verbs
Extending the Lexical Database

• Handling Broken Plurals
                 jAnib      (side)
                 jawAnib    (sides)

Poor handling of broken plurals in Buckwalter
(4) <lemmaID>jAnib_1</lemmaID> <voc>jAnib</voc>
    <pos>jAnib/NOUN</pos> <gloss>side/aspect</gloss>

(5) <lemmaID>jAnib_1</lemmaID> <voc>jawAnib</voc>
   <pos>jawAnib/NOUN</pos> <gloss>sides/aspects</gloss>

Two differences: voc and gloss
Extending the Lexical Database

• Extracting Broken Plurals
<gloss>side/aspect</gloss>

<gloss>sides/aspects</gloss>


  We use the Levenshtein distance, which measures the difference
between two strings (here, glosses sharing the same lemmaID).

     distance of 2 / length of the first string = 0.15
     (within the threshold 0.4)


    We collect 2,266 candidates
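
A minimal sketch of the gloss comparison, with a plain-Python Levenshtein implementation; the 0.4 threshold follows the slide, the rest is illustrative:

    # Compare glosses of Buckwalter entries sharing the same lemmaID; a small
    # normalised edit distance suggests a singular / broken-plural pair.
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def is_plural_candidate(gloss_a, gloss_b, threshold=0.4):
        return levenshtein(gloss_a, gloss_b) / len(gloss_a) <= threshold

    # 'side/aspect' vs. 'sides/aspects': distance 2, well within the threshold
    print(is_plural_candidate('side/aspect', 'sides/aspects'))   # True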
Extending the Lexical Database

• Validating Broken Plurals
<voc>jAnib</voc>         singular
                         pattern is: fAEil
                         regex is: .A.i.
<voc>jawAnib</voc> plural
                         pattern is: fawAEil
                         regex is: .awA.i.
Pattern database: 135 singular patterns, each selecting from a
set of 82 broken plural patterns

2,266 candidates -> 1,965 are validated (87%)
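
A minimal sketch of this validation step with a tiny hypothetical excerpt of the pattern database (the real one maps 135 singular patterns onto 82 broken plural patterns):

    import re

    # Hypothetical excerpt: singular pattern -> regex, plus the broken plural
    # patterns (also as regexes) that it licenses.
    SINGULAR_PATTERNS = {'fAEil': r'^.A.i.$'}
    LICENSED_PLURALS = {'fAEil': {'fawAEil': r'^.awA.i.$'}}

    def validate(singular, plural):
        """Accept a candidate pair if the singular matches a singular pattern
        and the plural matches one of the plural patterns it licenses."""
        for sg_name, sg_re in SINGULAR_PATTERNS.items():
            if re.match(sg_re, singular):
                for pl_name, pl_re in LICENSED_PLURALS[sg_name].items():
                    if re.match(pl_re, plural):
                        return True, (sg_name, pl_name)
        return False, None

    print(validate('jAnib', 'jawAnib'))   # (True, ('fAEil', 'fawAEil'))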
Extending the Lexical Database

• Interesting statistics on Arabic plurals
Insights from the corpus:
  • 5,570 lemmas have a feminine plural suffix
  • 1,942 lemmas have a masculine plural suffix
  • 2,730 lemmas have broken plural forms
Extraction of Subcat Frames

• Importance of subcategorization frames

• Advantage of Automatic Extraction

• Available Resources on Arabic Subcat Frames:
   – none except the Arabic LFG Parser (Attia, 2008), available as open source
Extraction of Subcat Frames

What are LFG subcat frames?
  • Governable GFs (SUBJ, OBJ, OBJθ, OBLθ, COMP and XCOMP)
  • Non-governable GFs (ADJ and XADJ)

π<gf1, gf2, … gfn>

{iEotamada Al-Tifolu EalaY wAlidati-hi
“The child relied on his mother”

{iEotamada<(↑SUBJ)(↑OBL_EalaY)>
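
As a rough illustration (the internal format is an assumption), such a frame can be represented as a predicate lemma paired with its governable GFs, the preposition attached to the oblique:

    # Illustrative representation of the frame above
    frame = ('{iEotamada', ('SUBJ', 'OBL_EalaY'))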
Extraction of Subcat Frames

Automatic extraction of subcat frames
  • The ATB contains 22,524 sentences
  • LFG annotation algorithm (DCU)
  • Traversing trees and looking for dependencies
  • Lemmatization
  • We extract 7,746 lemma-frame types (for verbs, nouns and adjectives)
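
A minimal sketch of the final counting stage, assuming the annotation algorithm already yields one (lemma, frame) observation per predicate instance (input format and names are hypothetical):

    from collections import Counter

    # Each observation is a lemma plus the governable GFs it was seen to
    # govern in one annotated ATB tree, after lemmatization.
    observations = [
        ('{iEotamada', ('SUBJ', 'OBL_EalaY')),
        ('{iEotamada', ('SUBJ', 'OBL_EalaY')),
        ('{iEotamada', ('SUBJ',)),
    ]

    frame_counts = Counter(observations)                # per lemma-frame type
    lemma_counts = Counter(lem for lem, _ in observations)
    print(len(frame_counts), 'lemma-frame types')       # 2 in this toy sample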
Extraction of Subcat Frames

Estimating the Subcategorization Probability
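
The formula itself is not reproduced on this extracted slide; a standard maximum-likelihood formulation (an assumption, not necessarily the exact one used) conditions each frame on its lemma:

    P(frame | lemma) = count(lemma, frame) / count(lemma)

    e.g. with the toy counts above: P(<SUBJ, OBL_EalaY> | {iEotamada) = 2/3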
Extraction of Subcat Frames

Evaluation of the Subcategorization Extraction
Extraction of Subcat Frames

Evaluation of the Subcategorization Extraction
Web Application
• AraComLex Lexicon Writing Application
         www.cngl.ie/aracomlex
Byproducts of the Work
A number of open-source resources:
  • Finite-state morphological transducer
  • Arabic morphological patterns
  • Subcategorization frames
  • Arabic lemma frequency counts
Conclusion
• We successfully use machine learning to predict
  morpho-syntactic features for newly acquired words.
• We successfully extract subcategorization frames
  from the Penn Arabic Treebank.
• We build specifications and an implementation for
  an Arabic lexicographic web application.
