An Open-Source Finite State
Morphological Transducer for Modern
         Standard Arabic

 Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi,
                   Josef van Genabith

        National Centre for Language Technology (NCLT),
            School of Computing, Dublin City University

                               Funded by:
        Enterprise Ireland, the Irish Research Council for Science
              Engineering and Technology (IRCSET), and
              the EU projects PANACEA and META-NET
Contribution
• We develop a finite state morphological
  transducer for Modern Standard Arabic
  1. Open source, distributed under the GPLv3 license
  2. Large scale, more than 30,000 lemmas
  3. Corpus based, truly representative of Modern
     Standard Arabic and not Classical Arabic.
  4. Compatible with Foma, an open-source fst compiler
Short Tutorial
(1) Download Foma
     http://foma.sourceforge.net

(2) Download AraComLex
     http://aracomlex.sourceforge.net

(3) Build the transducer: README
The transducer online
• You can test the transducer online:
    http://www.cngl.ie/aracomlex
Introduction
• Modern Standard Arabic vs. Classical
  Arabic
• Current State of Arabic Lexicography
  – Lexicons are not corpus-based
  – Buckwalter Arabic Morphological Analyser
• Importance of Lexical Resources
Introduction
 • Arabic Morphotactics
   [Figure: Arabic morphotactics diagram; surviving label: lemma]
Aim
• Building a finite-state morphological
  transducer
• Constructing a lexical database of Modern
  Standard Arabic
Methodology
• Using Open-Source Finite State
  Technology
• Using statistics from a 1 billion word
  corpus
  – 90% from the LDC's Arabic Gigaword
  – 10% collected from the Al-Jazeera website
• Using a medium-scale manually created
  lexicon of 10,799 lemmas
Methodology
• Using Finite State Technology (XFST)
  – Bidirectional: suitable for analysis and generation
  – Handles concatenative and non-concatenative
    morphotactics
  – Speed and efficiency in dealing with millions of
    paths
  – Handles separated dependencies.
  – Handles phonological and orthographic changes
    through alteration rules.
Methodology
• Design Approach:
  Three approaches
  – Root-based Morphology
  Xerox Arabic FTM
  – Stem-based morphology
  Buckwalter
  $kr   $akar   PV thank;give thanks
  $kr   $okur   IV   thank;give thanks
  – Lemma-based morphology
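
The two Buckwalter lines above can be read as a small illustration of why a stem-based lexicon needs several entries for one lexical item. The sketch below is only an illustration, not the AraComLex data model: it parses the two lines from the slide and groups them under one key. The column interpretation (unvocalised entry, vocalised stem, category, gloss) and the grouping key are assumptions made for this example.

```python
from collections import defaultdict

# The two stem entries quoted on the slide (Buckwalter transliteration).
STEM_ENTRIES = [
    "$kr   $akar   PV thank;give thanks",
    "$kr   $okur   IV   thank;give thanks",
]

def group_stems(lines):
    """Group stem entries that share the same unvocalised form and gloss.

    Assumed column order: unvocalised entry, vocalised stem, category, gloss.
    """
    grouped = defaultdict(dict)
    for line in lines:
        unvocalised, stem, category, gloss = line.split(None, 3)
        grouped[(unvocalised, gloss)][category] = stem
    return dict(grouped)

items = group_stems(STEM_ENTRIES)
for key, stems in items.items():
    print(key, stems)
```

In the stem-based listing the verb takes two rows (one per category); after grouping there is a single item carrying both stems, which is closer in spirit to a one-entry-per-lemma design.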
Methodology
Our Approach:
Lemma-based morphology
Alteration Rules:
Alteration Rules are used for handling discrepancies
between surface forms and underlying representation or
lemmas. We have 130 replace rules.
      a -> b || L _ R
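
The rule template above reads "rewrite a as b when it is preceded by L and followed by R". As a rough sketch of that behaviour outside a finite-state toolkit, the function below emulates a single rule with regular-expression lookarounds. It is not how Foma or XFST compile rules, it only handles literal (fixed-length) contexts, and the sample rule and strings are hypothetical.

```python
import re

def apply_replace_rule(text, a, b, left="", right=""):
    """Emulate `a -> b || left _ right` on a plain string.

    Contexts are treated as literal strings; an empty context means
    "no restriction on that side".
    """
    pattern = re.escape(a)
    if left:
        pattern = "(?<=" + re.escape(left) + ")" + pattern
    if right:
        pattern = pattern + "(?=" + re.escape(right) + ")"
    # A function replacement avoids backslash interpretation in `b`.
    return re.sub(pattern, lambda match: b, text)

# Hypothetical toy rule: a -> b || x _ y
result = apply_replace_rule("xay xaz", "a", "b", left="x", right="y")
print(result)  # xby xaz
```

Only the first "a" is rewritten, because the second one is followed by "z" rather than the required right context "y".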
Results to Date
• Start off with a seed lexicon
  – Four Lexical Databases, manually constructed
     •   5,925 nominal lemmas
     •   1,529 verb lemmas
     •   490 patterns (456 for nominals and 34 for verbs)
     •   lemma-root look up database
Results to Date
• Automatically Extending the Lexical
  Database: Lexical Enrichment
  – Data-driven filtering technique
     •   40,648 lemmas (in Buckwalter or SAMA 3.1)
     •   Statistics from three web search engines
     •   Statistics from the corpus annotated by MADA
     •   29,627 lemmas (left after filtering)
Results to Date
Automatically Extending the Lexical
Database: Feature Enrichment
  – Machine Learning
  – Multilayer Perceptron classification algorithm
  – Training Data: 4,816 nominals and 1,448 verbs
  – Classes for nominals: continuation classes (or inflection
    paths), the semantico-grammatical feature of humanness,
    and POS (noun or adjective)
  – Classes for verbs: transitivity, allowing the passive voice,
    and allowing the imperative mood
  – We feed these datasets with frequency statistics from the
    corpus and build a vector grid.
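
One possible reading of "frequency statistics ... vector grid" is that each lemma becomes a fixed-length vector of relative frequencies, one row per lemma. The sketch below illustrates that reading only. The feature names and counts are hypothetical and are not taken from the slides, and the classifier itself is left out.

```python
# Hypothetical feature inventory: how often a lemma is seen with each tag.
FEATURE_ORDER = ["masc_sg", "fem_sg", "masc_pl", "fem_pl"]

def build_vector(tag_counts, feature_order=FEATURE_ORDER):
    """Turn raw corpus counts into relative frequencies in a fixed order."""
    total = sum(tag_counts.get(tag, 0) for tag in feature_order)
    if total == 0:
        return [0.0] * len(feature_order)
    return [tag_counts.get(tag, 0) / total for tag in feature_order]

# Made-up counts for two unnamed lemmas, purely for illustration.
corpus_counts = {
    "lemma_1": {"masc_sg": 6, "fem_pl": 2},
    "lemma_2": {},
}

# One row per lemma; rows like these would be the classifier's input.
grid = {lemma: build_vector(counts) for lemma, counts in corpus_counts.items()}
for lemma, row in grid.items():
    print(lemma, row)
```

Normalising by the lemma's total count keeps frequent and rare lemmas on the same scale, which is one common choice; the slides do not say which normalisation, if any, was used.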
Results to Date
• Extending the Lexical Database
  – Feature enrichment using Machine Learning
Results to Date
• Extending the Lexical Database
  – With Machine Learning we add about 18,000 new lemmas
    (18,008 in total):
     •   12,974 nominals
     •   5,034 verbs
Results to Date
• Handling Broken Plurals
                  jAnib     (side)
                  jawAnib   (sides)

Poor handling of broken plurals in Buckwalter
(4) <lemmaID>jAnib_1</lemmaID> <voc>jAnib</voc>
    <pos>jAnib/NOUN</pos> <gloss>side/aspect</gloss>

(5) <lemmaID>jAnib_1</lemmaID> <voc>jawAnib</voc>
   <pos>jawAnib/NOUN</pos> <gloss>sides/aspects</gloss>

Two differences: voc and gloss
Results to Date
• Extracting Broken Plurals
<gloss>side/aspect</gloss>

<gloss>sides/aspects</gloss>


We use Levenshtein distance, which measures the difference
between two strings (here, glosses sharing the same lemmaID).

      distance of 2 / length of the first string = 0.15
      (within the threshold 0.4)


    We collect 2,266 candidates
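
The gloss comparison can be sketched as follows. This is a minimal illustration of the normalised edit-distance test, not the authors' code. The function names are made up for the example, and normalising by the longer gloss is an assumption: for this pair, 2/13 ≈ 0.15 matches the figure on the slide, while dividing by the 11-character singular gloss would give about 0.18. Both values fall within the 0.4 threshold.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def is_plural_candidate(gloss_1, gloss_2, threshold=0.4):
    """Flag two glosses of one lemmaID as a possible singular/plural pair."""
    distance = levenshtein(gloss_1, gloss_2)
    # Assumption: normalise by the longer gloss.
    ratio = distance / max(len(gloss_1), len(gloss_2))
    return 0 < ratio <= threshold, distance, ratio

flag, distance, ratio = is_plural_candidate("side/aspect", "sides/aspects")
print(flag, distance, round(ratio, 2))  # True 2 0.15
```

Identical glosses are excluded (a ratio of 0) because they do not signal a number difference.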
Results to Date
• Validating Broken Plurals
<voc>jAnib</voc>         singular
                         pattern is: fAEil
                         regex is: .A.i.
<voc>jawAnib</voc> plural
                         pattern is: fawAEil
                         regex is: .awA.i.
Pattern database: 135 singular patterns that choose from a
set of 82 broken plural patterns

2,266 candidates -> 1,965 are validated (87%)
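
The pattern-to-regex step shown above can be sketched like this. It assumes the letters f, E and l in a pattern always stand for the three root consonants and everything else is literal. The one-entry pattern table is a hypothetical excerpt for illustration, not the actual database of 135 singular and 82 plural patterns.

```python
import re

ROOT_SLOTS = {"f", "E", "l"}  # placeholders for the three root consonants

def pattern_to_regex(pattern):
    """fAEil -> .A.i. : root slots become wildcards, the rest stays literal."""
    return "".join("." if ch in ROOT_SLOTS else re.escape(ch) for ch in pattern)

# Hypothetical excerpt: singular pattern -> broken plural patterns it allows.
PLURAL_PATTERNS = {"fAEil": ["fawAEil"]}

def validate(singular, plural, table=PLURAL_PATTERNS):
    """Accept a candidate pair if both forms fit a licensed pattern pair."""
    for singular_pattern, plural_patterns in table.items():
        if re.fullmatch(pattern_to_regex(singular_pattern), singular):
            if any(re.fullmatch(pattern_to_regex(p), plural)
                   for p in plural_patterns):
                return True
    return False

print(pattern_to_regex("fAEil"), pattern_to_regex("fawAEil"))
print(validate("jAnib", "jawAnib"))
```

A stricter check could also require that the consonants filling the slots are the same in both forms; the slides do not say whether the validation does this.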
Results to Date
• Interesting statistics on Arabic plurals
Insights from the corpus:

5,570 lemmas have a feminine plural suffix

1,942 lemmas have a masculine plural suffix

2,730 lemmas have broken plural forms
Results to Date
• AraComLex Lexicon Writing Application
Results to Date
• FST Morphology Coverage and RPW (rate of analyses
  per word) Results
  – a test corpus of 800,000 words, divided as
     • 400,000 words of semi-literary texts
     • 400,000 words of general news texts
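
The two evaluation measures can be computed as in the sketch below. The definitions are assumptions, since the slides do not spell them out: coverage is taken as the share of tokens that receive at least one analysis, and the rate per word as the average number of analyses over the analysed tokens. The input counts are made up.

```python
def coverage_and_rate(analyses_per_token):
    """Return (coverage, analyses per analysed word) for a tokenised corpus.

    `analyses_per_token` holds, for each token, how many analyses the
    transducer returned (0 means the token was not recognised).
    """
    total = len(analyses_per_token)
    analysed = [count for count in analyses_per_token if count > 0]
    coverage = len(analysed) / total if total else 0.0
    rate = sum(analysed) / len(analysed) if analysed else 0.0
    return coverage, rate

# Made-up example: four tokens, one of them unrecognised.
coverage, rate = coverage_and_rate([2, 0, 1, 3])
print(f"coverage={coverage:.2%}, rate per word={rate:.2f}")
```

Under these definitions, a lower rate at comparable coverage means fewer competing analyses per recognised word.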
Future Work
• Going beyond SAMA
• Including Named Entities and MWEs
Conclusion
• Open-source finite state transducer for Modern
  Standard Arabic (AraComLex) distributed under
  the GPLv3 license.
• We successfully use machine learning to predict
  morpho-syntactic features for newly acquired
  words.
• Comparing our morphological transducer to
  SAMA, we find that we achieve comparable
  coverage and a lower rate of analyses per word.
