
Information Retrieval Based on Word Sense


  1. Information Retrieval Based on Word Sense. Athman Hajhamou, Computer and Modeling Laboratory, USMBA, FSDM, Fès
  2. 2. Summary Research domain Characteristics of classical arabic Morphological processing Research problem Semantic approches 2
  3. 3. Research domain Natural Language Processing (NLP) : is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications. 3
  4. 4. Research domain Levels of Natural Language Processing : Phonology. Morphology. Lexical. Syntactic. Semantic. 4
  5. 5. Research domain Levels of Natural Language Processing :  Phonology : this level deals with the interpretation of speech sounds within and across words. In a NLP system that accept spoken input, the sound waves are analyzed and encoded into digitized signal for interpretation. 5
  6. 6. Research domain Levels of Natural Language Processing :  Morphology : this level deals with the componential nature of words, which are composed of morphemes – the smallest units of meaning. For example the word can be morphologically analyzed into three separate morphemes: the prefix , the root , and the suffix . NLP system can recognize the meaning conveyed by each morpheme in order to gain and represent meaning. 6
  7. 7. Research domain Levels of Natural Language Processing :  Lexical : At this level, the words that have only one possible sense or meaning can be replaced by a semantic representation of that meanings. The nature of the representation varies according to the semantic theory utilized in the NLP system. The lexical level may require a lexicon an the particular approach taken by NLP system will determine whether a lexicon will be utilized, as well as the nature and extent of the information that is encoded in the lexicon. 7
  8. 8. Research domain Levels of Natural Language Processing :  Syntactic : This level focuses on analyzing the words in a sentence and so as to uncover the grammatical structure of the sentence. The output of this level of processing is a representation of the sentence that reveals the structural dependency relationships between the words. Syntax conveys meaning in most languages because order and dependency contribute to meaning. 8
  9. 9. Research domain Levels of Natural Language Processing :  Semantic : This is the level at witch most people think mining is determined, however, as we can see in the above defining of the levels, it is all the levels that contribute to meaning. Semantic processing determines the possible meanings of a sentence by focusing on the interactions among word-level meanings in the sentences. This level of processing can include the semantic disambiguation of words with multiple senses. Semantic disambiguation permits one and only one sense of polysemous words to be selected. 9
  10. 10. Research domain Information Retrieval (IR): Can be defined as a study of how to determine and retrieve from a corpus of stored information the portion witch are relevant to particular information need. The information may be stored in a structured form or in a unstructured form, depending upon its applications 10
  11. 11. Research domain Information Retrieval (IR): A user of the store has to express his information need as a request for information in one form or another. Thus IR is concerned with the determining and retrieving of information that is relevant to his information need as expressed by his request and translated into a query witch conforms to a specific information retrieval system (IRS). An IRS normally stores surrogates of the actually documents in the system to represent the documents and the information stored in them. 11
  12. Characteristics of classical Arabic. The Arabic language raises several challenges for Natural Language Processing (NLP), largely due to its rich morphology. Morphological processing becomes particularly important for Information Retrieval (IR), because IR needs to determine an appropriate form of each word to use as an index term.
  13. Characteristics of classical Arabic. Arabic is a Semitic language with a composite morphology. Arabic words are categorized as particles, nouns, or verbs. Unlike most Western languages, Arabic script is written from right to left. The alphabet has 28 letters. The letters are connected and there are no capital letters. Most letters change shape depending on their position in the word and on the adjacent letters.
  14. 14. Morphological processing Almost all information retrieval systems work in the same way and pass several steps before retrieve the most relevant documents in the field of some formulated queries. These steps deal with a set of documents and its text contents deal with representations of documents. 14
  15. 15. Morphological processing Pre-processing : document content is pre-processed before search process. Pre-processing can be divided into four text operations :  Lexical analysis of the text with the objective of treating digits, hyphens, punctuation marks. Elimination of the stop words. Remove diacritics. Normalization of the word. Stemming. Selection of index term. 15
  16. 16. Morphological processingPre-processing : Lexical analysis of the text : the text of every text file is converted into a stream of words (the candidate words to be adopted as index). The following three case have to be considered with care : not Arabic word, punctuation marks, digits. 16
  17. 17. Morphological processing Pre-processing : Elimination of the stop words : Stop words are words which are too frequent among text files which do not carry a particular and useful meaning for IR. Elimination of stop words reduces the size of the indexing structure. 17
  18. 18. Morphological processing Pre-processing : Remove diacritics : short vowels and other diacritics are removed from every text file. Short vowels include the fatha, domma, and kasra. Others diacritics such as the shadda, sikkun, and tanween. 18
  19. 19. Morphological processing Pre-processing : Normalization of the words: is the process of unification of different form of the same letter. 19
  20. 20. Morphological processing Pre-processing : Stemming : stemming of the remaining words with objective of remaining affixes (prefixes and suffixes) and allowing the retrieval of documents containing syntactic variations of query terms. (Mountassire) 20
  21. 21. Morphological processing Pre-processing : Selection of index term : Index term or Keyword a pre-selected term which can be used to refer to the content of a document. 21
  22. 22. Morphological processing Search method: is based on the root of the word, each word of the user query is go back to the previous phase (text files pre- processing) and do all pre-processing steps. Each root words of the user query is matched to the root word in the index table and retrieve documents or portions of documents that have the same root word. 22
  23. 23. Research problem Synonymy and polysemy are two important areas in linguistics that present a problem for computational linguistics. They complicate the task of natural language processing because it‟s difficult to know when two names mean the same thing and it‟s difficult to know the sense of a name that has multiple meanings (doing so requires word-sense disambiguation). 23
  24. 24. Research problem Synonymy : is the phenomenon where different words describe the same idea. Thus, a query in a search engine may fail to retrieve a relevant document that does not contain the words which appeared in the query. For example, a search for " " may not return a document containing the word " ", even though the words have the same meaning. 24
  25. 25. Research problem Polysemy : is the phenomenon where the same word has multiple meanings. So a search may retrieve irrelevant documents containing the desired words in the wrong meaning. For example, a botanist and a computer scientist looking for the word "tree" probably desire different sets of documents. 25
  26. 26. Semantic approches Automatic discovery of similar words : the underlying goal of this approach is in general the automatic discovery of synonyms. Most methods provide words that are “similar” to each other, with some vague notion of semantic similarity. 26
  27. 27. Semantic approches Automatic discovery of similar words : among the existing methods we find : techniques that, upon input of a word, automatically compile a list of good synonyms or near- synonyms, and techniques that generate a thesaurus (from some source, they built a complete lexicon of related words ). 27
  28. 28. Semantic approches Automatic discovery of similar words : the basic assumption of most of these approaches is that words are similar if they are used in the same contexts. The methods differ in the way the contexts are defined and the way the similarity function is computed. 28
  29. 29. Semantic approches Automatic discovery of similar words : the basic assumption of most of these approaches is that words are similar if they are used in the same contexts. The methods differ in the way the contexts are defined and the way the similarity function is computed. 29
  30. 30. Semantic approches Term Selection : one approches of term selection problem is based on the co-occurrence of “similar” terms in “the same context”. We use the notion of term profile to calculate term quality and select the best quality index terms. The quality of a term t is based on distribution of terms “similar” to t and co-occurring in sentences across the document collection. 30
  31. 31. Semantic approches Synonyms based search method: this search method is based on the synonyms of the words. Each word of the user query go to an arabic thesaurus and get the synonyms of each word. Each synonyms word of the user query is marched to the same word in the index table. 31
  32. 32. References P. Senellart and V. D. Blondel, „Automatic discovery of similar words‟, Survey of text mining book,pp. 26-44. 2003. A. T. Al-Taani and A. M. Al-Gharaibeh, „Searching Concepts and Keywords in the holy Quran‟, Yarmou University, Jordan. I. Dhillon and J. Kogan and C. Nicholas, „Feature selection and document clustering‟, Survey of text mining book,pp. 73-100. 2003. ED Liddy, Natural language processing- Introduction. 2001. 32
