Information Retrieval Based on Word Sense

Athman Hajhamou
Computer and Modeling Laboratory, USMBA, FSDM, Fès

1

Summary
• Research domain
• Characteristics of Classical Arabic
• Morphological processing
• Research problem
• Semantic approaches

2

Research domain
• Natural Language Processing (NLP):
a theoretically motivated range of computational
techniques for analyzing and representing naturally
occurring texts at one or more levels of linguistic
analysis, for the purpose of achieving human-like
language processing for a range of tasks or
applications.

3

Research domain
• Levels of Natural Language Processing:
Phonology.
Morphology.
Lexical.
Syntactic.
Semantic.

4

Research domain
• Phonology:
this level deals with the interpretation of speech
sounds within and across words. In an NLP
system that accepts spoken input, the sound
waves are analyzed and encoded into a digitized
signal for interpretation.

5

Research domain
• Morphology:
this level deals with the componential nature of
words, which are composed of morphemes, the
smallest units of meaning. For example, a word
can be morphologically analyzed into
three separate morphemes: a prefix, a root,
and a suffix. An NLP system can recognize
the meaning conveyed by each morpheme in
order to gain and represent meaning.

6

Research domain
• Lexical:
at this level, words that have only one
possible sense or meaning can be replaced by a
semantic representation of that meaning. The
nature of the representation varies according to
the semantic theory used in the NLP system.
The lexical level may require a lexicon, and the
particular approach taken by an NLP system
determines whether a lexicon is used, as
well as the nature and extent of the information
encoded in it.

7

Research domain
• Syntactic:
this level focuses on analyzing the words in a
sentence so as to uncover the grammatical
structure of the sentence. The output of this level
of processing is a representation of the sentence
that reveals the structural dependency
relationships between the words. Syntax
conveys meaning in most languages because
order and dependency contribute to meaning.

8

Research domain
• Semantic:
this is the level at which most people think meaning is
determined; however, as the definitions of the other
levels show, all the levels contribute
to meaning. Semantic processing determines the
possible meanings of a sentence by focusing on the
interactions among word-level meanings in the
sentence. This level of processing can include the
semantic disambiguation of words with multiple
senses. Semantic disambiguation permits one and
only one sense of a polysemous word to be selected.

9

Research domain
• Information Retrieval (IR):
can be defined as the study of how to
determine and retrieve, from a corpus
of stored information, the portions which
are relevant to a particular information
need. The information may be stored
in a structured or an
unstructured form, depending on its
application.

10

Research domain
• Information Retrieval (IR):
a user of the store has to express an information
need as a request for information in one form or
another. IR is thus concerned with
determining and retrieving the information that is
relevant to the user's information need, as expressed by
the request and translated into a query which
conforms to a specific information retrieval
system (IRS). An IRS normally stores surrogates
of the actual documents in the system to
represent the documents and the information
stored in them.

11

Characteristics of Classical Arabic
• The Arabic language raises several
challenges for Natural Language
Processing (NLP), largely due to its
rich morphology. Morphological
processing becomes particularly
important for Information Retrieval (IR),
because IR needs to determine an
appropriate form of each word to use as an index term.

12

Characteristics of Classical Arabic
• Arabic is a Semitic
language with a composite morphology.
Arabic words are categorized as
particles, nouns, or verbs. Unlike most
Western languages, Arabic script is written
from right to left. The Arabic alphabet has
28 letters. The letters
are connected and have no
capital forms. Most letters
change shape depending on their position in
the word and on the adjacent letters.

13

Morphological processing

• Almost all information retrieval
systems work in the same way and
go through several steps before retrieving the
documents most relevant to
a formulated query. These steps
operate on a set of documents and their
text content, and produce the representations
of the documents.

14


• Pre-processing:
document content is pre-processed
before the search process. Pre-processing
can be divided into six text operations:
  - Lexical analysis of the text, with the objective
    of treating digits, hyphens, and punctuation
    marks.
  - Elimination of stop words.
  - Removal of diacritics.
  - Normalization of the words.
  - Stemming.
  - Selection of index terms.

15


Pre-processing:
• Lexical analysis of the text:
the text of every text file is converted
into a stream of words (the candidate
words to be adopted as index terms). The
following three cases have to be
handled with care: non-Arabic
words, punctuation marks, and digits.
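
The lexical analysis step could be sketched as follows. This is a minimal illustration, not the system described in the slides: the function name `tokenize` and the choice of Unicode range are assumptions made for the example.

```python
import re

# Runs of Arabic letters (U+0621-U+064A) and diacritics (U+064B-U+0652).
# Latin words, digits, and punctuation fall outside this range and are skipped.
ARABIC_WORD = re.compile(r"[\u0621-\u0652]+")


def tokenize(text: str) -> list[str]:
    """Return the Arabic word tokens of `text` (hypothetical helper)."""
    return ARABIC_WORD.findall(text)
```

Calling `tokenize` on a mixed string keeps only the Arabic words, which then become the candidate index terms.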

16


• Elimination of stop words:
stop words are words that are too
frequent across text files and do not
carry a particular and useful meaning
for IR. Eliminating stop words
reduces the size of the indexing
structure.
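
A possible sketch of stop-word elimination is shown below. The stop list is a tiny illustrative sample chosen for this example; a real system would load a much larger list.

```python
# Illustrative stop list (assumption): a few common Arabic prepositions.
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop every token that appears in the stop list, keeping the order."""
    return [token for token in tokens if token not in STOP_WORDS]
```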

17


• Removal of diacritics:
short vowels and other diacritics are
removed from every text file. Short
vowels include the fatha, damma, and
kasra. Other diacritics include the
shadda, sukun, and tanween.
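
Diacritic removal can be sketched with a single regular expression. The sketch assumes the text is Unicode, where the tanween forms, short vowels, shadda, and sukun occupy the code points U+064B to U+0652; the function name is hypothetical.

```python
import re

# U+064B-U+064D: tanween; U+064E-U+0650: fatha, damma, kasra;
# U+0651: shadda; U+0652: sukun.
DIACRITICS = re.compile(r"[\u064B-\u0652]")


def remove_diacritics(word: str) -> str:
    """Strip short vowels and other diacritics from an Arabic word."""
    return DIACRITICS.sub("", word)
```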

18


• Normalization of the words:
the process of unifying the different
forms of the same letter.
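
The slides do not list which letter forms are unified. The sketch below assumes a mapping that is common in Arabic IR work (alef variants to bare alef, alef maqsura to ya, ta marbuta to ha); it is an illustration, not the mapping used by the author.

```python
# Assumed normalization table, written with code points to avoid ambiguity.
NORMALIZATION_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0622": "\u0627",  # alef with madda       -> alef
    "\u0649": "\u064A",  # alef maqsura          -> ya
    "\u0629": "\u0647",  # ta marbuta            -> ha
})


def normalize(word: str) -> str:
    """Replace variant letter forms with one canonical form."""
    return word.translate(NORMALIZATION_MAP)
```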

19


• Stemming:
the remaining words are stemmed with the
objective of removing affixes (prefixes
and suffixes) and allowing the retrieval
of documents containing syntactic
variations of query terms.
(Mountassire)
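
As a rough illustration of affix removal, the sketch below implements a simple light stemmer. The affix lists and the minimum stem length are assumptions chosen for the example, and light stemming is not the same as extracting a true Arabic root, which requires morphological pattern analysis.

```python
# Illustrative affix lists (assumption), ordered so longer affixes are tried first.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping at least `min_len` letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word
```

With these lists, a word carrying a conjunction, the definite article, or a plural ending is reduced to the same stem as its bare form, so both index to the same entry.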

20


• Selection of index terms:
an index term, or keyword, is a pre-selected
term which can be used to refer to the
content of a document.
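
One common way to store index terms is an inverted index that maps each term to the documents containing it. The sketch below assumes the documents are already pre-processed into lists of terms; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def build_index(docs: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each index term to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return dict(index)
```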

21


• Search method:
the search is based on the root of the word. Each
word of the user query goes through the same
pre-processing steps as the text
files. Each root word of the user query
is then matched against the root words in the index
table, and the documents or portions
of documents that contain the same root
word are retrieved.
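
The root-based lookup could be sketched as below. The sketch assumes an index table stored as a dictionary from root to document ids and takes the pre-processing function as a parameter (`to_root`, a hypothetical name). Returning the union of matches is also an assumption; the slides do not say how several query words are combined.

```python
from typing import Callable


def search(query_words: list[str],
           index: dict[str, set[str]],
           to_root: Callable[[str], str]) -> set[str]:
    """Return ids of documents whose index entries share a root with the query."""
    results: set[str] = set()
    for word in query_words:
        root = to_root(word)              # same pre-processing as the documents
        results |= index.get(root, set())  # unknown roots contribute nothing
    return results
```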

22

Research problem

• Synonymy and polysemy are two
important areas in linguistics that
present a problem for computational
linguistics. They complicate the task of
natural language processing because
it is difficult to know when two words
mean the same thing, and it is difficult
to know the sense of a word that has
multiple meanings (doing so requires
word-sense disambiguation).

23

Research problem

• Synonymy:
the phenomenon where different
words describe the same idea. Thus, a
query in a search engine may fail to
retrieve a relevant document that does
not contain the words which appeared in
the query. For example, a search for one word
may not return a document that contains
only a synonym of it, even though the two words
have the same meaning.
24

Research problem

• Polysemy:
the phenomenon where the same
word has multiple meanings. A
search may therefore retrieve irrelevant
documents that contain the desired
words with the wrong meaning. For
example, a botanist and a computer
scientist searching for the word "tree"
probably want different sets of
documents.
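
To make the "tree" example concrete, the sketch below picks a sense by counting the words a context shares with each sense description, in the spirit of the simplified Lesk algorithm. The slides do not prescribe this technique, and the sense inventory here is invented for illustration.

```python
# Hypothetical sense inventory: sense label -> short description (gloss).
SENSES = {
    "tree": {
        "botany": "a woody perennial plant with trunk branches and leaves",
        "computing": "a hierarchical data structure made of nodes with a root and children",
    }
}


def disambiguate(word: str, context: str, senses: dict = SENSES) -> str:
    """Return the sense whose gloss shares the most words with the context."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        overlap = len(context_words & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```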

25

Semantic approaches
• Automatic discovery of similar words:
the underlying goal of this approach is,
in general, the automatic discovery of
synonyms. Most methods provide
words that are “similar” to each other,
with some vague notion of semantic
similarity.

26

Semantic approaches
Among the existing methods we find:
techniques that, given a
word, automatically compile a list of
good synonyms or near-synonyms,
and techniques that
generate a thesaurus (from some
source, they build a complete lexicon
of related words).

27

Semantic approaches
The basic assumption of most of these
approaches is that words are similar if
they are used in the same contexts.
The methods differ in the way the
contexts are defined and the way the
similarity function is computed.
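
One way to make this assumption concrete is sketched below. It takes the context of a word to be the other words of the same sentence and uses cosine similarity between co-occurrence counts. Both choices are assumptions for the example; the methods surveyed differ precisely on these two points.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences: list[list[str]]) -> dict[str, Counter]:
    """For each word, count the other words that share a sentence with it."""
    vectors: dict[str, Counter] = {}
    for sentence in sentences:
        for i, word in enumerate(sentence):
            context = sentence[:i] + sentence[i + 1:]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```

Two words that appear in identical contexts get a similarity close to 1, while words that share only function words score much lower.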

28


Semantic approaches
• Term selection:
one approach to the term selection problem
is based on the co-occurrence of
“similar” terms in “the same context”. We
use the notion of a term profile to calculate
term quality and select the best-quality
index terms. The quality of a term t is
based on the distribution of terms “similar” to
t and co-occurring in sentences across
the document collection.

30

Semantic approaches
• Synonym-based search method:
this search method is based on the
synonyms of the words. Each word of
the user query is looked up in an Arabic
thesaurus to get its synonyms.
Each synonym of a query word
is then matched against the same
word in the index table.
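
The synonym-based lookup could be sketched as follows. The thesaurus and the index table are modelled as plain dictionaries, which is an assumption, and the sketch also matches the original query word itself, which the slides do not state explicitly.

```python
def synonym_search(query_words: list[str],
                   thesaurus: dict[str, list[str]],
                   index: dict[str, set[str]]) -> set[str]:
    """Return ids of documents that contain a query word or one of its synonyms."""
    results: set[str] = set()
    for word in query_words:
        # Expand the word with its synonyms (none if the thesaurus lacks it).
        candidates = {word, *thesaurus.get(word, ())}
        for candidate in candidates:
            results |= index.get(candidate, set())
    return results
```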

31

References
• P. Senellart and V. D. Blondel, "Automatic
discovery of similar words", Survey of Text
Mining, pp. 26-44, 2003.
• A. T. Al-Taani and A. M. Al-Gharaibeh,
"Searching Concepts and Keywords in the
Holy Quran", Yarmouk University, Jordan.
• I. Dhillon, J. Kogan, and C. Nicholas,
"Feature selection and document clustering",
Survey of Text Mining, pp. 73-100, 2003.
• E. D. Liddy, "Natural Language Processing:
Introduction", 2001.

32
