Information retrieval

 Tokenization is the process of breaking a stream of text up into words, phrases, symbols,
or other meaningful elements called tokens.
 A tokenizer relies on simple heuristics, for example:
 A continuous run of alphabetic characters forms one token
 Tokens are separated by whitespace or punctuation
 Punctuation and whitespace may or may not be included in the resulting list of tokens
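
The heuristics above can be sketched in a few lines of Python. This is only an illustrative sketch: the pattern, the function name `tokenize`, and the `keep_punctuation` flag are assumptions made for this example, and the pattern handles ASCII letters and digits only.

```python
import re

# Assumed heuristic: a run of letters or a run of digits is one token,
# and any other non-space character (punctuation) is a token of its own.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|\d+|[^\w\s]")


def tokenize(text, keep_punctuation=False):
    """Split text into tokens; punctuation is optionally kept in the result."""
    tokens = TOKEN_PATTERN.findall(text)
    if keep_punctuation:
        return tokens
    # Drop tokens that are pure punctuation.
    return [t for t in tokens if t.isalnum()]


print(tokenize("Flight to London, please!"))
print(tokenize("Flight to London, please!", keep_punctuation=True))
```

Whether punctuation is kept is a design choice: dropping it gives a cleaner term list for indexing, while keeping it preserves sentence boundaries for later processing.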

 Stop words are words filtered out before or after processing of natural language data
 There is no single definitive stop word list
 Features
 Extremely common words
 Contribute little to selecting relevant documents
 Most common example
 Function words such as the, is, at, which, on
 Most common words including lexical words

 Strategy:
 Sort the terms by collection frequency
 Take the most frequent terms as the stop list
 Advantage – using a stop list greatly reduces the number of postings a system has to store
 Exception – phrase search (“Flight to London”), where dropping “to” changes the meaning
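
As a rough sketch of the strategy above, the following Python counts collection frequency with `collections.Counter` and takes the top terms as the stop list. The sample documents, the function names, and the cut-off `size=2` are made up for illustration; a real collection would need proper tokenization and a much larger cut-off.

```python
from collections import Counter


def build_stop_list(documents, size):
    """Return the `size` terms with the highest collection frequency."""
    collection_frequency = Counter()
    for doc in documents:
        # Naive whitespace tokenization, enough for this sketch.
        collection_frequency.update(doc.lower().split())
    return {term for term, _ in collection_frequency.most_common(size)}


def remove_stop_words(tokens, stop_list):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t.lower() not in stop_list]


# Hypothetical toy collection.
docs = [
    "the flight to London is on time",
    "the train to Paris is late",
    "the bus is at the station",
]
stop_list = build_stop_list(docs, size=2)
print(stop_list)
print(remove_stop_words(docs[0].split(), stop_list))
```

With a larger cut-off the word "to" would also be removed, which is exactly what breaks a phrase search such as “Flight to London”.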

The goal of stemming and lemmatisation is to reduce inflectional forms, and sometimes
derivationally related forms, of a word to a common base form. Example:
am, are, is => be
car, cars, car’s, cars’ => car
Applying this mapping to a text gives
The boy’s cars are different colours => the boy car be differ colour

 Stemming usually refers to a crude heuristic process that chops off the ends of words. It often
includes removal of derivational affixes
 Lemmatisation refers to doing things properly with the use of a vocabulary and
morphological analysis of words
 Aims to remove only the inflectional endings
 Returns the base or dictionary form of a word, called the lemma
Example for the word saw:
stemming => s
lemmatisation => see (if the token is a verb) or saw (if it is a noun)

 Porter’s algorithm consists of 5 phases of word reductions, applied sequentially
 Within each phase there are various conventions to select rules
 Measure of a word (m) – loosely checks the number of syllables to see whether a word is long
enough that it is reasonable to regard the matching portion of the rule as a suffix rather
than as part of the stem of the word
(m>1) EMENT -> (empty)
Would map replacement to replac but not cement to c
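
To make the measure concrete, here is a small Python sketch of this single rule. It assumes Porter's usual definition: a stem has the form [C](VC)^m[V], where the vowels are a, e, i, o, u, plus y when it follows a consonant. The function names are chosen for this example, and this is not a full Porter stemmer.

```python
def is_consonant(word, i):
    """True if word[i] counts as a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a vowel only when the previous letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_is_vowel:
            m += 1
        prev_is_vowel = not consonant
    return m


def strip_ement(word):
    """Apply only the rule (m>1) EMENT -> (empty)."""
    suffix = "ement"
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if measure(stem) > 1:
            return stem
    return word


print(measure("replac"), strip_ement("replacement"))
print(measure("c"), strip_ement("cement"))
```

The stem replac has measure 2, so the suffix is removed; the stem c has measure 0, so cement is left untouched.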

 The Porter stemmer stems all of the following words to oper: operate, operating, operates,
operation, operative, operatives, operational
 We will lose considerable precision on queries such as
 operational AND research
 operating AND system
 operative AND dentistry

 Lookup algorithm
 Looks for the inflected form in a lookup table
 Simple, fast, and handles exceptions easily
 New or unfamiliar words are not handled
 The production technique
 The lookup table is generally produced semi-automatically
 Ex. run => running, runs, runned, runly
 The last two forms are valid constructions but unlikely
 Suffix-stripping algorithm
 A set of rules provides a path for the algorithm
 if the word ends in 'ed', remove the 'ed'
 if the word ends in 'ing', remove the 'ing'
 if the word ends in 'ly', remove the 'ly'
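
The two approaches can be combined in a short Python sketch: consult the lookup table first, then fall back to the three suffix rules listed above. The table entries and function names are hypothetical and exist only for this example.

```python
# Hypothetical lookup table mapping inflected forms to their root.
LOOKUP_TABLE = {"running": "run", "runs": "run", "ran": "run"}

# The three suffix rules from the list above.
SUFFIX_RULES = ("ed", "ing", "ly")


def strip_suffix(word):
    """Remove the first matching suffix, leaving at least one character."""
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def stem(word):
    """Use the lookup table when possible, otherwise strip a suffix."""
    word = word.lower()
    if word in LOOKUP_TABLE:
        return LOOKUP_TABLE[word]
    return strip_suffix(word)


for w in ["running", "Ran", "jumped", "quickly"]:
    print(w, "=>", stem(w))
print(strip_suffix("running"))
```

The last line shows why the lookup table matters for exceptions: suffix stripping alone turns running into runn, not run.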

 Named entity recognition (NER) is a subtask of information extraction
 Seeks to locate and classify elements of text into pre-defined categories such as names of
persons, organizations, locations, quantities, monetary values
 Takes an unannotated block of text, like
Jim bought 300 shares of Acme Corp. in 2006
and produces an annotated block of text that highlights the names of entities
[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time
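
A toy Python sketch can reproduce this annotation with a hand-made name list (a gazetteer) and a regular expression for years. Both the gazetteer and the year pattern are assumptions for this example; practical NER systems rely on statistical or neural models, not fixed lists.

```python
import re

# Hypothetical gazetteer: known entity names and their categories.
GAZETTEER = {"Jim": "Person", "Acme Corp.": "Organization"}

# Assumed pattern: a four-digit year starting with 19 or 20.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def annotate(text):
    """Wrap known names and years in [entity]Category markers."""
    for name, label in GAZETTEER.items():
        text = text.replace(name, f"[{name}]{label}")
    return YEAR_PATTERN.sub(lambda m: f"[{m.group(0)}]Time", text)


print(annotate("Jim bought 300 shares of Acme Corp. in 2006"))
```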

 Part of speech: a linguistic category of words
 Noun – any abstract or concrete entity: a person, place, thing, or idea
 Pronoun – a substitute for a noun or noun phrase
 Adjective – a qualifier of a noun
 Verb – an action, occurrence, or state of being
 Adverb – a qualifier of an adjective, a verb, or another adverb
 Preposition – an establisher of relation and syntactic context
 Conjunction – a syntactic connector
 Interjection – an emotional greeting or exclamation

 Parsing is the process of analysing a string of symbols according to the rules of a grammar
 Analysis of a sentence by a computer into its constituents
 Results in a parse tree showing the syntactic relations of the constituents to each other
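
As a minimal illustration, the Python sketch below parses sentences of one fixed shape into a nested-tuple parse tree. The grammar (S -> NP VP, NP -> Det N, VP -> V NP) and the tiny lexicon are assumptions made up for this example and cover almost nothing of real English.

```python
# Hypothetical lexicon mapping words to part-of-speech tags.
LEXICON = {"the": "Det", "a": "Det", "boy": "N", "car": "N",
           "sees": "V", "drives": "V"}


def parse_np(tagged, i):
    """NP -> Det N. Return (subtree, next index) or (None, i) on failure."""
    if i + 1 < len(tagged) and tagged[i][1] == "Det" and tagged[i + 1][1] == "N":
        return ("NP", tagged[i], tagged[i + 1]), i + 2
    return None, i


def parse_vp(tagged, i):
    """VP -> V NP. Return (subtree, next index) or (None, i) on failure."""
    if i < len(tagged) and tagged[i][1] == "V":
        np, j = parse_np(tagged, i + 1)
        if np is not None:
            return ("VP", tagged[i], np), j
    return None, i


def parse_sentence(sentence):
    """S -> NP VP. Return the parse tree, or None if the sentence does not fit."""
    tagged = [(w, LEXICON.get(w)) for w in sentence.lower().split()]
    np, i = parse_np(tagged, 0)
    if np is None:
        return None
    vp, j = parse_vp(tagged, i)
    if vp is None or j != len(tagged):
        return None
    return ("S", np, vp)


print(parse_sentence("The boy drives a car"))
```

Each tuple starts with the constituent label, so the nesting of the tuples mirrors the shape of the parse tree.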
