Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, and Future Trends

A brief survey presentation on Arabic Question Answering (AQA), covering the Natural Language Processing and Information Retrieval approaches to Question Analysis, Passage Retrieval, and Answer Extraction. It also lists the NLP tools used in AQA, along with the challenges and future trends in this area.
To cite this paper, download it here:
http://www.acit2k.org/ACIT/2012Proceedings/13106.pdf

  1. A Survey of Arabic Question Answering: Challenges, Tasks, Approaches, Tools, and Future Trends
     Ahmed Magdy & Dr. Mohamed Shaheen, ACIT 2012
  2. Outline
     ● Motivation
     ● Question Answering Tasks
       – Question Analysis, Passage Retrieval, and Answer Extraction
     ● Arabic Language Challenges
     ● Approaches
       – Stemming, Named Entity Recognition, Language Resources
     ● Tools
     ● Future Trends and Open Issues
  3. Motivation
     ● Arabic is the 6th most important language
     ● More than 300 million speakers
     ● Increasing amounts of Arabic content on the Internet
     ● Increasing demand for information
     ● There is no survey that covers Arabic Question Answering
  4. Question Answering Tasks
     ● Question Analysis
     ● Passage Retrieval
     ● Answer Extraction
  5. Question Analysis
     ● Tokenization & normalization
     ● Stop-word removal
     ● Named Entity Recognition (gazetteer, maxent model)
     ● Stemming of all words except Named Entities
     ● Question focus determination by extracting the main NE
     ● Keyword extraction & expansion
     ● Answer type extraction from question words (Name, Place, Date, Quantity)
     ● Query generation: keywords combined into a Boolean formula
     ● Experiments with cross-language Arabic/English QA
       – Not promising because of translation ambiguity
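
The question analysis steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the pipeline of any surveyed system: the stop-word list, the question-word map, and the function names `normalize` and `analyze_question` are assumptions made for the example, and Named Entity Recognition, stemming, and keyword expansion are left out.

```python
import re

# Hypothetical, tiny resources for illustration only (stored in normalized form).
STOP_WORDS = {"في", "من", "على", "الى", "عن", "هو", "هي"}
ANSWER_TYPES = {"من": "Name", "اين": "Place", "متى": "Date", "كم": "Quantity"}

# Short vowels and related marks (U+064B..U+0652) plus tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")


def normalize(text):
    """Strip diacritics and tatweel, and unify the hamza forms of alef."""
    text = DIACRITICS.sub("", text)
    return re.sub("[أإآ]", "ا", text)


def analyze_question(question):
    """Return the expected answer type, keywords, and a Boolean query."""
    # Tokenize on runs of Arabic letters; punctuation such as "؟" is dropped.
    tokens = re.findall(r"[\u0621-\u064A]+", normalize(question))
    if not tokens:
        return {"answer_type": None, "keywords": [], "query": ""}
    # Only the first token is treated as the question word, because a word
    # such as "من" is ambiguous once diacritics are removed.
    answer_type = ANSWER_TYPES.get(tokens[0])
    rest = tokens[1:] if answer_type else tokens
    keywords = [t for t in rest if t not in STOP_WORDS]
    return {
        "answer_type": answer_type,
        "keywords": keywords,
        "query": " AND ".join(keywords),
    }


if __name__ == "__main__":
    print(analyze_question("أين ولد نجيب محفوظ؟"))
```

A real system would draw the stop words and question-word map from a proper lexicon and would stem the keywords before building the query.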
  6. Passage Retrieval
     ● Systems used:
       – Salton’s vector space model based systems
       – JIRS passage retrieval system
     ● Ranking retrieved passages according to:
       – Answer and question word counts
       – Answer and question word association
       – Query word weights
       – Cosine similarity between document words and question words
       – Distance Density N-gram Model
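
As a rough illustration of the cosine similarity criterion listed on this slide, the sketch below ranks passages against question keywords using raw term counts. It is an assumption-laden simplification: vector space systems typically use weighted terms (for example tf-idf), and the names `cosine_similarity` and `rank_passages` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a_tokens, b_tokens):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_a * norm_b)


def rank_passages(question_tokens, passages):
    """Return (score, passage) pairs sorted from most to least similar."""
    scored = [
        (cosine_similarity(question_tokens, passage.split()), passage)
        for passage in passages
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    question = ["ولد", "نجيب", "محفوظ"]
    passages = ["ولد نجيب محفوظ في القاهرة", "القاهرة عاصمة مصر"]
    for score, passage in rank_passages(question, passages):
        print(round(score, 3), passage)
```

Whitespace splitting stands in for real tokenization here; in practice both sides would go through the same normalization and stemming as the question.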
  7. Answer Extraction
     ● Ranking candidate answers according to:
       – Manual lexical patterns
       – Answer snippet position
       – Question word frequencies in the answer
       – Matching using N-grams
       – Selecting answers whose NEs match the expected answer type
       – Semantic similarity between the question’s focus and the answer
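
Two of the ranking criteria on this slide, filtering by expected answer type and counting question words in the answer snippet, might be combined as in the sketch below. The candidate dictionary shape (`text`, `ne_type`, `snippet`), the function name `rank_answers`, and the sample data are all assumptions for illustration; the NE types are presumed to come from an upstream Named Entity Recognizer.

```python
def rank_answers(candidates, expected_type, keywords):
    """Keep candidates of the expected NE type, ordered by keyword frequency.

    candidates: list of dicts with 'text', 'ne_type', and 'snippet' keys
    (a hypothetical shape chosen for this example).
    """
    matching = [c for c in candidates if c["ne_type"] == expected_type]

    def keyword_frequency(candidate):
        snippet_tokens = candidate["snippet"].split()
        return sum(snippet_tokens.count(k) for k in keywords)

    # sorted() is stable, so ties keep their original retrieval order.
    return sorted(matching, key=keyword_frequency, reverse=True)


if __name__ == "__main__":
    # Made-up sample candidates for the question "أين ولد نجيب محفوظ؟"
    candidates = [
        {"text": "الإسكندرية", "ne_type": "Place", "snippet": "زار محفوظ الإسكندرية"},
        {"text": "1911", "ne_type": "Date", "snippet": "ولد نجيب محفوظ عام 1911"},
        {"text": "القاهرة", "ne_type": "Place", "snippet": "ولد نجيب محفوظ في القاهرة"},
    ]
    for c in rank_answers(candidates, "Place", ["ولد", "نجيب", "محفوظ"]):
        print(c["text"])
```

The other criteria on the slide (lexical patterns, snippet position, semantic similarity) would enter as additional terms in the scoring function.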
  8. Challenges
     ● Arabic morphology is highly inflectional
       – Many affixes (articles, prepositions, pronouns, etc.)
     ● Arabic morphology is highly derivational
       – 10,000 roots and 120 patterns for derivation
     ● No capital letters in Named Entities
       – Unlike languages written in Latin script
     ● Scarcity of Arabic language resources
       – Corpora, lexicons, and machine-readable dictionaries
  9. Approaches
     ● Stemming
       – Removing prefixes, suffixes, and infixes from words
       – Matching roots with patterns
       – Language-dependent rules
       – Identifying the most frequently used affixes statistically
     ● Named Entity Recognition
       – Maximum entropy (maxent) model or Conditional Random Fields (CRF)
       – ANERcorp and ANERgazet
     ● Language Resources
       – Arabic WordNet
       – Penn Arabic Treebank
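
The affix removal part of the stemming approach above can be illustrated with a light stemmer. This is a minimal sketch under stated assumptions: the prefix and suffix lists are short illustrative samples, the function name `light_stem` is hypothetical, and infix removal and root/pattern matching (as done by root-based stemmers) are not attempted.

```python
# Illustrative affix lists, longest prefixes first so that "وال" wins over "و".
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    for w in ["والكتاب", "المعلمون", "ولد"]:
        print(w, "->", light_stem(w))
```

The `min_len` guard keeps short words such as "ولد" intact, where the leading "و" is part of the root rather than a conjunction.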
  10. Tools
     ● NooJ for Arabic NLP
       – C# .NET freeware linguistic engineering development environment
       – Supports regular expressions and context-free grammars
       – Includes Arabic language resources (sample text and dictionary)
     ● Amine Platform
       – Java platform for intelligent systems and multi-agent systems
       – Used for semantic analysis of questions and answers
       – Uses conceptual graphs, knowledge bases, and ontologies
     ● JIRS, a Java passage retrieval system
       – Search based on question n-grams
       – Based on the Vector Space Model
       – Simple N-gram Model (SNM)
       – Term-weight N-gram Model (TNM)
       – Distance N-gram Model
  11. Tools [continued]
     ● Arabic Stemmers
       – Khoja Arabic stemmer (with a root dictionary)
       – AraMorph (uses transliteration into Latin letters)
       – Information Science Research Institute’s (ISRI) stemmer (without a root dictionary)
     ● GATE (General Architecture for Text Engineering)
       – Java-based platform composed of a tokenizer, a gazetteer, a sentence splitter, a part-of-speech tagger, a named entity transducer, and a coreference tagger
       – Plugins for machine learning with Weka, RASP, MAXENT, SVM Light
       – Manages ontologies like WordNet
  12. Tools [continued]
     ● OpenNLP
       – Supports NLP tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution
       – Includes maximum entropy and perceptron-based machine learning
     ● Stanford NLP
       – Java framework with many NLP modules:
       – Dependency parsers and a lexicalized PCFG parser
       – Part-of-speech (POS) tagger
       – CRF-based Named Entity Recognizer
       – CRF-based Word Segmenter
       – Maxent text classifier
       – TokensRegex: regular expressions over tokens
  13. Future Trends and Open Issues
     ● More research on Arabic restricted-domain QA
       – Makes semantic tasks like word sense disambiguation easier
       – Domain rules affect how the question is posed and how the answer is formulated
       – A restricted domain should be circumscribed, practical, and complex
       – E.g. agriculture, architectural engineering, or any field of science
       – But not news and current events, as they have no constraints
     ● Use of deep, application-dependent approaches
       – Use application-dependent constraints and rules to guide question analysis and answer extraction and validation
       – Depends on the available resources
  14. Future Trends and Open Issues [continued]
     ● Intensive usage of semantics
       – Arabic QA has focused on morpho-syntactic approaches
       – Very few systems have used the Arabic WordNet
       – Much remains to be done in word sense disambiguation, coreference resolution, and ontology-based reasoning
     ● Use of theorem proving & deep reasoning
     ● Use of logic-based and inference-based approaches
  15. Summary
     ● Motivation
     ● Question Answering Tasks
       – Question Analysis, Passage Retrieval, and Answer Extraction
     ● Arabic Language Challenges
     ● Approaches
       – Stemming, Named Entity Recognition, Language Resources
     ● Tools
     ● Future Trends and Open Issues
  16. Thank You
     You can view the full paper in the ACIT 2012 Proceedings
