Chapter Two: Text Operations and Term Weighting
Introduction to
Information Storage and Retrieval
1
Statistical Properties of Text
 How is the frequency of different words distributed?
 How fast does vocabulary size grow with the size of a
corpus?
 Such factors affect the performance of IR system & can be used to select
suitable term weights & other aspects of the system.
 A few words are very common.
 The two most frequent words (e.g. "the", "of") can account for about 10% of word occurrences in a document.
 Most words are very rare.
 Half the words in a corpus appear only once (such words are sometimes called hapax legomena, i.e. "read only once").
 This is called a "heavy-tailed" distribution, since most of the probability mass is in the "tail".
2
Sample Word Frequency Data
3
Zipf’s distributions
Rank-Frequency Distribution
For all the words in a document collection, for each word w:
• f : the frequency with which w appears
• r : the rank of w in order of frequency
(the most commonly occurring word has rank 1, etc.)
[Figure: distribution of sorted word frequencies according to Zipf's law; the axes are frequency f versus rank r, where word w has rank r and frequency f]
4
Word distribution: Zipf's Law
 Zipf's Law, named after the Harvard linguistics professor George Kingsley Zipf (1902-1950),
 attempts to capture the distribution of the frequencies (i.e. number of occurrences) of the words within a text.
 Zipf's Law states that when the distinct words in a text are arranged in decreasing order of their frequency of occurrence (most frequent words first), the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf:
Frequency * Rank = constant
 If the words, w, in a collection are ranked, r, by their frequency, f, they roughly fit the relation: r * f = c
 Different collections have different constants c.
5
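A quick way to see the rank-frequency law in practice is to count words and multiply each word's rank by its frequency. The sketch below is illustrative and not part of the original slides; the tiny sample text is an assumption for demonstration only:

```python
# Minimal sketch: check Zipf's rank * frequency ~ constant on a small sample.
from collections import Counter

text = "the cat sat on the mat and the dog sat on the log"
counts = Counter(text.lower().split())

# Sort words by decreasing frequency and compute rank * frequency for each.
for rank, (word, freq) in enumerate(sorted(counts.items(),
                                           key=lambda kv: kv[1],
                                           reverse=True), start=1):
    print(f"{rank:>2} {word:<5} f={freq} r*f={rank * freq}")
```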
Example: Zipf's Law
 The table shows the most frequently occurring words from a collection of 336,310 documents containing 125,720,891 total words, of which 508,209 are unique words
6
Zipf’s law: modeling word distribution
 The collection frequency of the ith most common term
is proportional to 1/i
 If the most frequent term occurs f1 then the second most
frequent term has half as many occurrences, the third most
frequent term has a third as many, etc
 Zipf's Law states that the frequency of the i-th most
frequent word is 1/iӨ times that of the most frequent
word
 occurrence of some event ( P ), as a function of the rank (i)
when the rank is determined by the frequency of occurrence, is
a power-law function Pi ~ 1/i Ө with the exponent Ө close to
unity.
f_i ∝ 1/i   (the frequency of the i-th ranked term is proportional to 1/i)
7
More Example: Zipf’s Law
 Illustration of Rank-Frequency Law. Let the total number of word
occurrences in the sample N = 1,000,000
Rank (R)  Term  Frequency (F)  R·(F/N)
1         the        69,971    0.070
2         of         36,411    0.073
3         and        28,852    0.086
4         to         26,149    0.104
5         a          23,237    0.116
6         in         21,341    0.128
7         that       10,595    0.074
8         is         10,099    0.081
9         was         9,816    0.088
10        he          9,543    0.095
8
Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words
(upper cut-off). Used by almost all systems.
• Significant words: Take words in between the
most frequent (upper cut-off) and least frequent
words (lower cut-off).
• Term weighting: Give differing weights to
terms based on their frequency, with most
frequent words weighted less. Used by almost all
ranking methods.
9
Explanations for Zipf’s Law
 The law has been explained by the "principle of least effort", which makes it easier for a speaker or writer of a language to repeat certain words instead of coining new and different words.
 Zipf's explanation was his "principle of least effort", which balances the speaker's desire for a small vocabulary against the hearer's desire for a large one.
Zipf’s Law Impact on IR
 Good News: Stopwords will account for a large fraction of text so
eliminating them greatly reduces inverted-index storage costs.
 Bad News: For most words, gathering sufficient data for
meaningful statistical analysis (e.g. for correlation analysis for
query expansion) is difficult since they are extremely rare.
10
Word significance: Luhn’s Ideas
Luhn's idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word significance.
Luhn suggested that both extremely common
and extremely uncommon words were not
very useful for indexing.
11
For this, Luhn specified two cutoff points: an upper and a lower cutoff, based on which non-significant words are excluded
 The words exceeding the upper cutoff were considered to be common
 The words below the lower cutoff were considered to be rare
 Hence they are not contributing significantly to the content of the text
 The ability of words to discriminate content reaches a peak at a rank-order position halfway between the two cutoffs
 Let f be the frequency of occurrence of words in a text, and r their rank in decreasing order of word frequency; a plot relating f and r yields the following curve
Word significance: Luhn’s Ideas
12
Luhn’s Ideas
Luhn (1958) suggested that both extremely common and
extremely uncommon words were not very useful for document
representation & indexing.
13
Vocabulary size : Heaps’ Law
 Heap’s law:
How to estimates the number of vocabularies in a given corpus
 Dictionaries
 600,000 words and above
 But they do not include names of people, locations, products etc
14
Vocabulary Growth: Heaps’ Law
 How does the size of the overall vocabulary (number of unique
words) grow with the size of the corpus?
 This determines how the size of the inverted index will scale with the size
of the corpus.
 Heap’s law: estimates the number of vocabularies in a given
corpus
 The vocabulary size grows by O(nβ), where β is a constant between 0 – 1.
 If V is the size of the vocabulary and n is the length of the corpus in words,
Heap’s provides the following equation:
 Where constants:
 K  10100
   0.40.6 (approx. square-root)

Kn
V 
15
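A minimal sketch of the formula above; the particular constants K and β used here are illustrative values inside the quoted ranges, not measured ones:

```python
# Minimal sketch of Heaps' law V = K * n**beta.
# K and beta below are assumed illustrative values (K ~ 10-100, beta ~ 0.4-0.6).
def heaps_vocabulary_size(n_tokens: int, K: float = 50.0, beta: float = 0.5) -> float:
    """Estimate the vocabulary size for a corpus of n_tokens words."""
    return K * (n_tokens ** beta)

print(heaps_vocabulary_size(1_000_000))      # ~50,000 with these constants
print(heaps_vocabulary_size(1_000_000_000))  # ~1,580,000 with these constants
```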
Heap’s distributions
• Distribution of size of the vocabulary: there is a linear
relationship between vocabulary size and number of
tokens
• Example: from 1,000,000,000 words, there may be
1,000,000 distinct words. Can you agree?
16
Example
 We want to estimate the size of the vocabulary for a corpus of
1,000,000 words. However, we only know the statistics computed
on smaller corpora sizes:
 For 100,000 words, there are 50,000 unique words
 For 500,000 words, there are 150,000 unique words
 Estimate the vocabulary size for a 1,000,000-word corpus.
 How about for a corpus of 1,000,000,000 words?
17
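One way to approach this exercise, assuming the corpus follows Heaps' law V = K·n^β exactly, is to solve for K and β from the two known points and then extrapolate; the sketch below only sets up that computation:

```python
# Sketch: fit Heaps' law V = K * n**beta to the two known (n, V) pairs,
# then extrapolate to larger corpora. The fit assumes an exact power law.
import math

n1, v1 = 100_000, 50_000
n2, v2 = 500_000, 150_000

beta = math.log(v2 / v1) / math.log(n2 / n1)   # slope on a log-log plot
K = v1 / (n1 ** beta)

for n in (1_000_000, 1_000_000_000):
    print(f"n={n:>13,}  estimated V ~ {K * n ** beta:,.0f}")
```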
Relevance Measure
                 relevant                      irrelevant
retrieved        retrieved & relevant          retrieved & irrelevant
not retrieved    not retrieved but relevant    not retrieved & irrelevant
18
Issues: recall and precision
 breaking up hyphenated terms increases recall but decreases precision
 preserving case distinctions enhances precision but decreases recall
 commercial information systems usually take a recall-enhancing approach (numbers and words containing digits are index terms, and all are case insensitive)
19
Text Operations
 Not all words in a document are equally significant for representing the contents/meaning of the document
 Some words carry more meaning than others
 Nouns are the most representative of a document's content
 Therefore, we need to preprocess the text of the documents in a collection to select index terms
 Using the set of all words in a collection to index documents creates too much noise for the retrieval task
 Reducing noise means reducing the number of words that can be used to refer to the document
 Preprocessing is the process of controlling the size of the vocabulary, i.e. the number of distinct words used as index terms
 Preprocessing usually leads to an improvement in information retrieval performance
 However, some search engines on the Web omit preprocessing
 Every word in the document is an index term
20
Text Operations
 Text operations are the process of transforming text into logical representations
 The main operations for selecting index terms, i.e. choosing words/stems (or groups of words) to be used as indexing terms, are:
 Lexical analysis/tokenization of the text - handling digits, hyphens, punctuation marks, and the case of letters
 Elimination of stop words - filter out words which are not useful in the
retrieval process
 Stemming words - remove affixes (prefixes and suffixes)
 Construction of term categorization structures such as thesaurus, to
capture relationship for allowing the expansion of the original query
with related terms
21
Generating Document Representatives
 Text Processing System
 Input text – full text, abstract or title
 Output – a document representative adequate for use in an automatic retrieval
system
 The document representative consists of a list of class names, each name
representing a class of words occurring in the total input text. A document
will be indexed by a name if one of its significant words occurs as a member of that
class.
[Pipeline: documents → tokenization → stop word removal → stemming → thesaurus → index terms]
22
Lexical Analysis/Tokenization of Text
 Change text of the documents into words to be adopted as
index terms
 Objective - identify words in the text
 Digits, hyphens, punctuation marks, case of letters
 Numbers are usually not good index terms (like 1910, 1999); but some, e.g. 510 B.C., are unique and meaningful
 Hyphens – break up hyphenated words (e.g. state-of-the-art = state of the art), but some words, e.g. gilt-edged, B-49, are unique words which require their hyphens
 Punctuation marks – remove totally unless significant, e.g. in program code: x.exe vs. xexe
 Case of letters – usually not important; convert all to upper or lower case
23
Tokenization
 Analyze text into a sequence of discrete tokens (words).
 Input: "Friends, Romans and Countrymen"
 Output: Tokens (an instance of a sequence of characters that are grouped together as a useful semantic unit for processing)
 Friends
 Romans
 and
 Countrymen
 Each such token is now a candidate for an index entry, after
further processing
 But what are valid tokens to emit?
24
Issues in Tokenization
 One word or multiple: How do you decide it is one token or two
or more?
Hewlett-Packard  Hewlett and Packard as two tokens?
 state-of-the-art: break up hyphenated sequence.
 San Francisco, Los Angeles
lowercase, lower-case, lower case ?
 data base, database, data-base
• Numbers:
 dates (3/12/91 vs. Mar. 12, 1991);
 phone numbers,
 IP addresses (100.2.86.144)
25
Issues in Tokenization
 How to handle special cases involving apostrophes, hyphens etc?
C++, C#, URLs, emails, …
 Sometimes punctuation (e-mail), numbers (1999), and case (Republican
vs. republican) can be a meaningful part of a token.
 However, frequently they are not.
 Simplest approach is to ignore all numbers and punctuation and
use only case-insensitive unbroken strings of alphabetic characters
as tokens.
 Generally, don't index numbers as text, but they are often very useful. Systems will often index "meta-data", including creation date, format, etc., separately
 Issues of tokenization are language specific
 Requires the language to be known
 The application area for which the IR system is developed also dictates the nature of valid tokens
26
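A minimal tokenizer sketch following the "simplest approach" described above (lowercase, case-insensitive, alphabetic strings only); the regex and behaviour are illustrative assumptions, and a real system would make language- and application-specific choices about numbers, hyphens and apostrophes:

```python
# Sketch of the simplest tokenization strategy: keep only unbroken
# lowercase alphabetic strings.
import re

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("Mr. O'Neill thinks that the boys' stories aren't amusing."))
# ['mr', 'o', 'neill', 'thinks', 'that', 'the', 'boys', 'stories', 'aren', 't', 'amusing']
```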
Exercise: Tokenization
 The cat slept peacefully in the living room. It’s a very old
cat.
 Mr. O’Neill thinks that the boys’ stories about Chile’s capital
aren’t amusing.
27
Elimination of Stopwords
 Stopwords are extremely common words across document collections that have no discriminatory power
 They may occur in 80% of the documents in a collection.
 They appear to be of little value in helping select documents matching a user need and need to be filtered out of the index list
 Examples of stopwords:
 articles (a, an, the);
 pronouns: (I, he, she, it, their, his)
 prepositions (on, of, in, about, besides, against),
 conjunctions/ connectors (and, but, for, nor, or, so, yet),
 verbs (is, are, was, were),
 adverbs (here, there, out, because, soon, after) and
 adjectives (all, any, each, every, few, many, some)
 Stopwords are language dependent.
28
Stopwords
Intuition:
Stopwords have little semantic content; It is typical to remove
such high-frequency words
Stopwords take up 50% of the text. Hence, document size
reduces by 30-50%
Smaller indices for information retrieval
 Good compression techniques for indices: the 30 most common words account for 30% of the tokens in written text
 Better approximation of importance for classification,
summarization, etc.
29
How to determine a list of stopwords?
 One method: Sort terms (in decreasing order) by collection
frequency and take the most frequent ones
In a collection about insurance practices,“insurance” would be a
stop word
Another method: Build a stop word list that contains a set of
articles, pronouns, etc.
 Why do we need stop lists? With a stop list, we can exclude the commonest words entirely from the index terms.
 With the removal of stopwords, we can measure better
approximation of importance for classification, summarization,
etc.
30
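A small sketch of the first method above (sorting terms by collection frequency and taking the most frequent ones as stopwords); the toy corpus and the cut-off of three terms are assumptions for illustration, and note how a domain word can land on the stop list, echoing the "insurance" example:

```python
# Sketch: build a stop list from collection frequency.
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "a cat and a dog"]

collection_freq = Counter(t for d in docs for t in d.split())
stopwords = {t for t, _ in collection_freq.most_common(3)}  # top-3 as the stop list

filtered = [[t for t in d.split() if t not in stopwords] for d in docs]
print(stopwords)
print(filtered)
```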
Stop words
 Stop word elimination used to be standard in older IR systems.
 But the trend is away from doing this. Most web search engines
index stop words:
Good query optimization techniques mean you pay little at
query time for including stop words.
You need stopwords for:
 Phrase queries: "King of Denmark"
 Various song titles, etc.: "Let it be", "To be or not to be"
 "Relational" queries: "flights to London"
 Elimination of stopwords might reduce recall (e.g. "To be or not to be" – all eliminated except "be" – no or irrelevant retrieval)
31
Normalization
• Normalization is canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
 Need to "normalize" terms in the indexed text as well as query terms into the same form
 Example: we want to match U.S.A. and USA, by deleting periods in a term
 Case folding: often best to lowercase everything, since users will use lowercase regardless of "correct" capitalization…
 Republican vs. republican
 Fasil vs. fasil vs. FASIL
 Anti-discriminatory vs. antidiscriminatory
 Car vs. automobile?
32
Normalization issues
 Good for
 Allow instances of Automobile at the beginning of a sentence to match
with a query of automobile
 Helps a search engine when most users type ferrari when they are
interested in a Ferrari car
 Bad for
 Proper names vs. common nouns
 E.g. General Motors, Associated Press, …
 Solution:
 lowercase only words at the beginning of the sentence
 In IR, lowercasing is most practical because of the way users issue
their queries
33
Term conflation
 One of the problems involved in the use of free text for
indexing and retrieval is the variation in word forms that is
likely to be encountered.
 The most common types of variations are
 spelling errors (father, fathor)
 alternative spellings i.e. locality or national usage (color vs
colour, labor vs labour)
 multi-word concepts (database, data base)
 affixes (dependent, independent, dependently)
 abbreviations (i.e., that is).
34
Function of conflation in terms of IR
 Reducing the total number of distinct terms, with a consequent reduction in dictionary size and fewer update problems
Fewer terms to worry about
 Bringing similar words with similar meanings to a common form, with the aim of increasing retrieval effectiveness.
More accurate statistics
35
Stemming/Morphological analysis
Stemming reduces tokens to their "root" form to recognize morphological variation.
The process involves removal of affixes (i.e. prefixes and suffixes) with the aim of reducing variants to the same stem.
It often removes the inflectional and derivational morphology of a word.
 Inflectional morphology: varies the form of words in order to express grammatical features, such as singular/plural or past/present tense. E.g. boy → boys, cut → cutting.
 Derivational morphology: makes new words from old ones. E.g. creation is formed from create, but they are two separate words; similarly destruction → destroy.
 Stemming is language dependent
 Correct stemming is language specific and can be complex.
For example, the text "compressed and compression are both accepted" becomes "compress and compress are both accept" after stemming.
36
Stemming
The final output from a conflation algorithm is a set of classes, one for
each stem detected.
A Stem: the portion of a word which is left after the removal of its affixes
(i.e., prefixes and/or suffixes).
Example: 'connect' is the stem for {connected, connecting, connection, connections}
Thus, [automate, automatic, automation] all reduce to → automat
A class name is assigned to a document if and only if one of its members occurs as a significant word in the text of the document.
 A document representative then becomes a list of class names, which are often referred to as the document's index terms/keywords.
Queries : Queries are handled in the same way.
37
Ways to implement stemming
There are basically two ways to implement stemming.
The first approach is to create a big dictionary that maps words to
their stems.
 The advantage of this approach is that it works perfectly (insofar as the
stem of a word can be defined perfectly); the disadvantages are the space
required by the dictionary and the investment required to maintain the
dictionary as new words appear.
The second approach is to use a set of rules that extract stems
from words.
 The advantages of this approach are that the code is typically small, and it
can gracefully handle new words; the disadvantage is that it occasionally
makes mistakes.
 But, since stemming is imperfectly defined, anyway, occasional mistakes
are tolerable, and the rule-based approach is the one that is generally
chosen.
38
Assumptions in Stemming
 Words with the same stem are semantically
related and have the same meaning to the user of
the text
 The chance of matching increases when the index terms are reduced to their word stems, because a user may search using "retrieve" as readily as "retrieving"
39
Stemming Errors
 Under-stemming
 the error of taking off too small a suffix
 croulons  croulon
 since croulons is a form of the verb crouler
 Over-stemming
 the error of taking off too much
 example: croûtons  croût
 since croûtons is the plural of croûton
 Mis-stemming
 taking off what looks like an ending, but is really part of the stem
 reply  rep
40
Criteria for judging stemmers
 Correctness
 Overstemming: too much of a term is removed.
 Can cause unrelated terms to be conflated  retrieval of non-relevant
documents
 Understemming: too little of a term is removed.
 Prevent related terms from being conflated  relevant documents may not be
retrieved
Term Stem
GASES GAS (correct)
GAS GA (over)
GASEOUS GASE (under)
41
Porter Stemmer
 Stemming is the operation of stripping the suffixes from a word, leaving its stem.
 Google, for instance, uses stemming to search for web pages containing the
words connected, connecting, connection and connections when users ask
for a web page that contains the word connect.
 In 1979, Martin Porter developed a stemming algorithm that uses a
set of rules to extract stems from words, and though it makes some
mistakes, most common words seem to work out right.
 Porter describes his algorithm and provides a reference implementation in
C at http://tartarus.org/~martin/PorterStemmer/index.html
42
Porter stemmer
 Most common algorithm for stemming English words to their
common grammatical root
 It is a simple procedure for removing known affixes in English without using a dictionary. To get rid of plurals and common suffixes, rules such as the following are used:
 SSES → SS      caresses → caress
 IES → Y        ponies → pony
 SS → SS        caress → caress
 S → (null)     cats → cat
 MENT → (delete the final "ment" if what remains is longer than 1 character)
                replacement → replace
                cement → cement
43
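The rules above are a simplification; NLTK ships an implementation of Porter's algorithm, and the sketch below (which assumes the nltk package is installed) simply shows how it behaves on the sample words:

```python
# Sketch using NLTK's implementation of the Porter stemmer. Note that the
# full algorithm is more aggressive than the simplified rules above, so some
# outputs are truncated stems rather than dictionary words.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "caress", "cats", "replacement", "cement"]:
    print(f"{word:>12} -> {stemmer.stem(word)}")
```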
Stemming: challenges
 May produce unusual stems that are not English words:
 Removing 'UAL' from FACTUAL and EQUAL
 May conflate (reduce to the same token) words that are actually distinct:
 "computer", "computational", "computation" are all reduced to the same token "comput"
 May not recognize all morphological derivations.
44
Thesauri
 Full-text searching alone often cannot be accurate, since different authors may select different words to represent the same concept
 Problem: The same meaning can be expressed using different terms that are synonyms, homonyms, or related terms
 How can we ensure that identical terms are used in the index and the query for the same meaning?
 Thesaurus: The vocabulary of a controlled indexing language, formally organized so that a priori relationships between concepts (for example "broader" and "related") are made explicit.
 A thesaurus contains terms and relationships between terms
 IR thesauri typically rely upon the use of symbols such as USE/UF (UF = used for), BT, and RT to demonstrate inter-term relationships.
 e.g., car = automobile, truck, bus, taxi, motor vehicle
 color = colour, paint
45
Aim of Thesaurus
Thesaurus tries to control the use of the vocabulary by showing a
set of related words to handle synonyms and homonyms
The aim of a thesaurus is therefore:
 to provide a standard vocabulary for indexing and searching
 The thesaurus rewrites terms to form equivalence classes, and we index these equivalence classes
 When the document contains automobile, index it under car as well (usually, also
vice-versa)
 to assist users with locating terms for proper query formulation:
When the query contains automobile, look under car as well for
expanding query
 to provide classified hierarchies that allow the broadening and
narrowing of the current request according to user needs
46
Thesaurus Construction
Example: thesaurus built to assist IR for searching cars and vehicles :
Term: Motor vehicles
UF : Automobiles
Cars
Trucks
BT: Vehicles
RT: Road Engineering
Road Transport
47
More Example
Example: thesaurus built to assist IR in the fields of computer
science:
TERM: natural languages
 UF natural language processing (UF=used for NLP)
 BT languages (BT=broader term is languages)
 TT languages (TT = top term is languages)
 RT artificial intelligence (RT=related term/s)
computational linguistic
formal languages
query languages
speech recognition
48
Language-specificity
 Many of the above features embody transformations that are
 Language-specific and
 Often, application-specific
 These are “plug-in” addenda to the indexing process
 Both open source and commercial plug-ins are available for
handling these
49
INDEX TERM SELECTION
 Index language is the language used to describe documents and
requests
 Elements of the index language are index terms which may be
derived from the text of the document to be described, or may be
arrived at independently.
 If a full text representation of the text is adopted, then all words in
the text are used as index terms = full text indexing
 Otherwise, we need to select the words to be used as index terms, reducing the size of the index file, which is essential to designing an efficient IR search system
50
51
Chapter 2 Part II:
Term weighting and similarity measures
Terms
52
 Terms are usually stems. Terms can also be phrases, such as "Computer Science", "World Wide Web", etc.
Documents as well as queries are represented as vectors or "bags of words" (BOW).
 Each vector holds a place for every term in the collection.
 Position 1 corresponds to term 1, position 2 to term 2, position n
to term n.
Wij=0 if a term is absent
 Documents are represented by binary weights or Non-
binary weighted vectors of terms.
D_i = (w_{di,1}, w_{di,2}, ..., w_{di,n})
Q = (w_{q,1}, w_{q,2}, ..., w_{q,n})
Document Collection
53
A collection of n documents can be represented in the vector
space model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of a term
in the document; zero means the term has no significance in
the document or it simply doesn’t exist in the document.
T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn
Binary Weights
• Only the presence (1) or absence
(0) of a term is included in the
vector
• Binary formula gives every word
that appears in a document equal
relevance.
• It can be useful when frequency
is not important.
• Binary Weights Formula:
docs t1 t2 t3
D1 1 0 1
D2 1 0 0
D3 0 1 1
D4 1 0 0
D5 1 1 1
D6 1 1 0
D7 0 1 0
D8 0 1 0
D9 0 0 1
D10 0 1 1
D11 1 0 1
w_ij = 1 if freq_ij > 0
w_ij = 0 if freq_ij = 0
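A minimal sketch of the binary weighting rule above, applied to a tiny corpus that is purely illustrative:

```python
# Sketch: build a binary term-document matrix (1 = term present, 0 = absent).
docs = ["information retrieval system",
        "database system",
        "retrieval of text"]

vocab = sorted({t for d in docs for t in d.split()})
binary_matrix = [[1 if term in d.split() else 0 for term in vocab] for d in docs]

print(vocab)
for row in binary_matrix:
    print(row)
```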
Why use term weighting?
55
 Binary weights are too limiting.
 terms are either present or absent.
 Does NOT allow ordering documents according to their level of relevance for a given query
 Non-binary weights allow modelling of partial matching
 Partial matching allows retrieval of docs that approximate the query
 Term weighting improves the quality of the answer set
 Term weighting enables ranking of retrieved documents, such that the best matching documents are ordered at the top as they are more relevant than others.
Term Weighting: Term Frequency (TF)
 TF (term frequency) - count the number of times a term occurs in a document.
fij = frequency of term i in doc j
 The more times a term t occurs in document d, the more likely it is that t is relevant to the document, i.e. more indicative of the topic.
 If used alone, it favors common words and long documents.
 It gives too much credit to words that appear more frequently.
 We may want to normalize term frequency (tf) across the entire document:
docs t1 t2 t3
D1 2 0 3
D2 1 0 0
D3 0 4 7
D4 3 0 0
D5 1 6 3
D6 3 5 0
D7 0 8 0
D8 0 10 0
D9 0 0 1
D10 0 3 5
D11 4 0 1
Q 1 0 1
q1 q2 q3
tf_ij = f_ij / max(f_j)
(the raw frequency of term i in document j divided by the frequency of the most frequent term in document j)
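A minimal sketch of this normalization, dividing each raw count by the count of the most frequent term in the same document; the sample document is illustrative:

```python
# Sketch: normalized term frequency within a single document.
from collections import Counter

doc = "to be or not to be"
counts = Counter(doc.split())
max_freq = max(counts.values())

normalized_tf = {term: freq / max_freq for term, freq in counts.items()}
print(normalized_tf)   # {'to': 1.0, 'be': 1.0, 'or': 0.5, 'not': 0.5}
```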
Document Normalization
57
 Long documents have an unfair advantage:
 They use a lot of terms
 So they get more matches than short documents
 And they use the same words repeatedly
 So they have large term frequencies
 Normalization seeks to remove these effects:
 Related somehow to maximum term frequency.
 But also sensitive to the number of terms.
 What would happen if term frequencies in a document were not normalized?
 Short documents may not be recognized as relevant.
Problems with term frequency
 Need a mechanism to reduce the effect of terms that occur too often in the collection to be meaningful for relevance/meaning determination
 Scale down the term weight of terms with high collection
frequency
 Reduce the tf weight of a term by a factor that grows with the collection
frequency
 More common for this purpose is document frequency
 how many documents in the collection contain the term
58
• The example shows that collection frequency (cf) and document frequency (df) behave differently
Document Frequency
59
 It is defined to be the number of documents in the collection
that contain a term
DF = document frequency
 Count the frequency considering the whole
collection of documents.
 The less frequently a term appears in the whole collection, the more discriminating it is.
df i = document frequency of term i
= number of documents containing term i
Inverse Document Frequency (IDF)
60
 IDF measures how rare a term is in collection.
 The IDF is a measure of the general importance of the term
 Inverts the document frequency.
 It diminishes the weight of terms that occur very frequently in
the collection and increases the weight of terms that occur
rarely.
 Gives full weight to terms that occur in one document only.
 Gives lowest weight to terms that occur in all documents. (log2(1)=0)
 Terms that appear in many different documents are less indicative of
overall topic.
idfi = inverse document frequency of term i,
(N: total number of documents)
idf_i = log2(N / df_i)
Inverse Document Frequency
61
• IDF provides high values for rare words and low values
for common words.
• IDF is an indication of a term’s discrimination power.
• The log is used to dampen the effect relative to tf.
• Note the difference between document frequency and collection (corpus) frequency.
• E.g.: given a collection of 1000 documents and the document frequencies below, compute the IDF for each word:
Word N DF idf
the 1000 1000 log2(1000/1000)=log2(1) =0
some 1000 100 log2(1000/100)=log2(10) = 3.322
car 1000 10 log2(1000/10)=log2(100) = 6.644
merge 1000 1 log2(1000/1)=log2(1000) = 9.966
TF*IDF Weighting
62
The most used term-weighting is tf*idf weighting
scheme:
wij = tfij x idfi = tfij x log2 (N/ dfi)
A term occurring frequently in the document but rarely
in the rest of the collection is given high weight.
The tf-idf value for a term will always be greater than
or equal to zero.
 Experimentally, tf*idf has been found to work well.
 It is often used in the vector space model together
with cosine similarity to determine the similarity
between two documents.
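A minimal sketch of the tf*idf scheme above (normalized tf times log2(N/df)); the three-document corpus is an illustrative assumption:

```python
# Sketch: tf*idf weighting, w_ij = (f_ij / max_f_j) * log2(N / df_i).
import math
from collections import Counter

docs = [d.split() for d in ["the brown cow", "the brown fox", "the old cow barn"]]
N = len(docs)

df = Counter(term for doc in docs for term in set(doc))  # document frequency

def tf_idf(doc: list[str]) -> dict[str, float]:
    counts = Counter(doc)
    max_freq = max(counts.values())
    return {t: (f / max_freq) * math.log2(N / df[t]) for t, f in counts.items()}

for doc in docs:
    print(tf_idf(doc))
```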
TF*IDF weighting
 When does TF*IDF register a high weight?
 When a term t occurs many times within a small number of documents
 The highest tf*idf for a term shows that the term has a high term frequency (in the given document) and a low document frequency (in the whole collection of documents);
 the weights hence tend to filter out common terms, giving high discriminating power to those documents
 A lower TF*IDF is registered when the term occurs fewer times in a document, or occurs in many documents
 Thus offering a less pronounced relevance signal
 The lowest TF*IDF is registered when the term occurs in virtually all documents
Computing TF-IDF: An Example
64
Assume the collection contains 10,000 documents and statistical analysis shows that the document frequencies (DF) of three terms are: A(50), B(1300), C(250). The term frequencies (TF) of these terms are: A(3), B(2), C(1).
Compute TF*IDF for each term.
A: tf = 3/3 = 1.00; idf = log2(10000/50) = 7.644; tf*idf = 7.644
B: tf = 2/3 = 0.67; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
C: tf = 1/3 = 0.33; idf = log2(10000/250) = 5.322; tf*idf = 1.774
 The query vector is typically treated as a document and also tf-idf weighted.
More Example
65
 Consider a document containing 100 words wherein the word cow
appears 3 times. Now, assume we have 10 million documents and cow
appears in one thousand of these.
 The term frequency (TF) for cow :
3/100 = 0.03
 The inverse document frequency is
log2(10,000,000 / 1,000) = log2(10,000) ≈ 13.288
 The TF*IDF score is the product of these frequencies: 0.03 × 13.288 ≈ 0.397
Exercise
66
• Let C = number of
times a given word
appears in a
document;
• TW = total number of
words in a document;
• TD = total number of
documents in a
corpus, and
• DF = total number of
documents containing
a given word;
• compute TF, IDF and
TF*IDF score for each
term
Word C TW TD DF TF IDF TFIDF
airplane 5 46 3 1
blue 1 46 3 1
chair 7 46 3 3
computer 3 46 3 1
forest 2 46 3 1
justice 7 46 3 3
love 2 46 3 1
might 2 46 3 1
perl 5 46 3 2
rose 6 46 3 3
shoe 4 46 3 1
thesis 2 46 3 2
Concluding remarks
67
 Suppose that from a set of English documents, we wish to determine which ones are the most relevant to the query "the brown cow."
 A simple way to start is by eliminating documents that do not contain all three words "the," "brown," and "cow," but this still leaves many documents.
 To further distinguish them, we might count the number of times each term occurs in each document and sum them all together;
 the number of times a term occurs in a document is called its TF. However, because the term "the" is so common, this will tend to incorrectly emphasize documents which happen to use the word "the" more often, without giving enough weight to the more meaningful terms "brown" and "cow".
 Also, the term "the" is not a good keyword for distinguishing relevant from non-relevant documents, while terms like "brown" and "cow" that occur rarely are good keywords for distinguishing relevant documents from the non-relevant ones.
Concluding remarks
 Hence IDF is incorporated which diminishes the weight of
terms that occur very frequently in the collection and increases
the weight of terms that occur rarely.
 This leads to the use of TF*IDF as a better weighting technique
 On top of that we apply similarity measures to calculate the distance between document i and query j.
• There are a number of similarity measures; the most common are Euclidean distance, inner (dot) product, cosine similarity, Dice similarity, Jaccard similarity, etc.
Similarity Measure
69
 We now have vectors for all documents in the
collection, a vector for the query, how to
compute similarity?
 A similarity measure is a function that
computes the degree of similarity or distance
between document vector and query vector.
 Using a similarity measure between the query
and each document:
It is possible to rank the retrieved
documents in the order of presumed
relevance.
It is possible to enforce a certain threshold so
that the size of the retrieved set can be
controlled (minimized).
2
t3
t1
t2
D1
D2
Q
1
Intuition
Postulate: Documents that are "close together" in the vector space talk about the same things and are more similar than others.
[Figure: documents d1–d5 plotted in a space of terms t1, t2, t3, with angles θ and φ between vectors]
Similarity Measure
71
Assumptions for proximity
1. If d1 is near d2, then d2 is near d1.
2. If d1 is near d2, and d2 near d3, then d1 is not far from d3.
3. No document is closer to d than d itself.
 Sometimes it is a good idea to determine the maximum possible
similarity as the “distance” between a document d and itself
 A similarity measure attempts to compute the distance between
document vector wj and query vector wq.
 The assumption here is that documents whose vectors are close to the query vector are more relevant to the query than documents whose vectors are far from the query vector.
Similarity Measure: Techniques
72
• Euclidean distance
It is the most common similarity measure. Euclidean distance
examines the root of square differences between coordinates of a
pair of document and query terms.
• Dot product
 The dot product is also known as the scalar product or inner product
 the dot product is defined as the sum of the products of the corresponding term weights of the query and document vectors
Cosine similarity
 Also called the normalized inner product
 It projects document and query vectors into the term space and calculates the cosine of the angle between them.
Euclidean distance
73
 Similarity between vectors for the document dj and query q can be computed as:
sim(dj, q) = |dj − q| = sqrt( Σ_{i=1..n} (w_ij − w_iq)² )
where w_ij is the weight of term i in document j and w_iq is the weight of term i in the query
• Example: determine the Euclidean distance between the document vector (0, 3, 2, 1, 10) and the query vector (2, 7, 1, 0, 0). 0 means the corresponding term is not found in the document or query.
sqrt( (0−2)² + (3−7)² + (2−1)² + (1−0)² + (10−0)² ) = sqrt(122) ≈ 11.05
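A minimal sketch of the Euclidean distance measure, applied to the example vectors above:

```python
# Sketch: Euclidean distance between a document vector and a query vector.
import math

def euclidean_distance(d: list[float], q: list[float]) -> float:
    return math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q)))

doc = [0, 3, 2, 1, 10]
query = [2, 7, 1, 0, 0]
print(euclidean_distance(doc, query))   # sqrt(122) ~ 11.05
```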
Inner Product
74
 Similarity between vectors for the document dj and query q can be computed as the vector inner product:
sim(dj, q) = dj • q = Σ_{i=1..n} (w_ij · w_iq)
where w_ij is the weight of term i in document j and w_iq is the weight of term i in the query
 For binary vectors, the inner product is the number of matched query terms in the document (size of the intersection).
 For weighted term vectors, it is the sum of the products of the weights of the matched terms.
Properties of Inner Product
75
 Favors long documents with a large number of
unique terms.
Why?....
Again, the issue of normalization needs to be
considered
 Measures how many terms matched but not how
many terms are not matched.
 The unmatched terms will have value zero as either
the document vector or the query vector will have a
weight of ZERO
Inner Product -- Examples
Binary weight :
 Size of vector = size of vocabulary = 7
sim(D1 , Q) = 1*1 + 1*0 + 1*1 + 0*0 + 1*0 + 1*1 + 0*1 = 3
sim(D 2, Q) = 0*1 + 1*0 + 0*1 + 0*0 + 0*0 + 1*1 + 0*1 = 1
• Term Weighted:
sim(D1 , Q) = 2*1 + 3*0 + 5*2 = 12
sim(D2 , Q) = 3*1 + 7*0 + 1*2 = 5
Retrieval Database Term Computer Text Manage Data
D1 1 1 1 0 1 1 0
D2 0 1 0 0 0 1 0
Q 1 0 1 0 0 1 1
Retrieval Database Architecture
D1 2 3 5
D2 3 7 1
Q 1 0 2
Inner Product:
Example 1
k1 k2 k3 q  dj
d1 1 0 1 2
d2 1 0 0 1
d3 0 1 1 2
d4 1 0 0 1
d5 1 1 1 3
d6 1 1 0 2
d7 0 1 0 1
q 1 1 1
[Figure: documents d1–d7 plotted by the index terms k1, k2, k3 they contain]
77
Inner Product:
Exercise
k1 k2 k3 q  dj
d1 1 0 1 ?
d2 1 0 0 ?
d3 0 1 1 ?
d4 1 0 0 ?
d5 1 1 1 ?
d6 1 1 0 ?
d7 0 1 0 ?
q 1 2 3
78
Cosine similarity
 Measures the similarity between a document and a query (or between two documents) as the cosine of the angle between their vectors:
sim(dj, q) = (dj • q) / (|dj| · |q|) = Σ_{i=1..n} (w_ij · w_iq) / ( sqrt(Σ_{i=1..n} w_ij²) · sqrt(Σ_{i=1..n} w_iq²) )
 Or, between two documents dj and dk:
sim(dj, dk) = (dj • dk) / (|dj| · |dk|) = Σ_{i=1..n} (w_ij · w_ik) / ( sqrt(Σ_{i=1..n} w_ij²) · sqrt(Σ_{i=1..n} w_ik²) )
 The denominator involves the lengths of the vectors: |dj| = sqrt( Σ_{i=1..n} w_ij² )
 The cosine measure is also known as the normalized inner product
Example: Computing Cosine Similarity
• Say we have query vector Q = (0.4, 0.8) and document D1 = (0.2, 0.7). Compute their similarity using the cosine measure.
sim(Q, D1) = (0.4·0.2 + 0.8·0.7) / ( sqrt(0.4² + 0.8²) · sqrt(0.2² + 0.7²) )
           = 0.64 / sqrt(0.8 × 0.53)
           = 0.64 / sqrt(0.42)
           ≈ 0.98
Example: Computing Cosine Similarity
81
• Say we have two documents in our corpus, D1 = (0.8, 0.3) and D2 = (0.2, 0.7). Given the query vector Q = (0.4, 0.8), determine which document is most relevant to the query.
cos θ1 = sim(Q, D1) = (0.4·0.8 + 0.8·0.3) / ( sqrt(0.4² + 0.8²) · sqrt(0.8² + 0.3²) ) ≈ 0.74
cos θ2 = sim(Q, D2) = (0.4·0.2 + 0.8·0.7) / ( sqrt(0.4² + 0.8²) · sqrt(0.2² + 0.7²) ) ≈ 0.98
Since cos θ2 > cos θ1, D2 is more relevant to the query than D1.
[Figure: Q, D1 and D2 plotted in a two-term space, with angles θ1 and θ2 between the query and each document]
Example
82
 Given three documents D1, D2 and D3 with the corresponding TF*IDF weights below, which documents are most similar according to the three measures (Euclidean distance, inner product, cosine similarity)?
Terms D1 D2 D3
affection 2 3 1
Jealous 0 2 4
gossip 1 0 2
Cosine Similarity vs. Inner Product
83
 Cosine similarity measures the cosine of the angle between
two vectors.
 Inner product normalized by the vector lengths.
D1 = 2T1 + 3T2 + 5T3   CosSim(D1, Q) = 10 / sqrt((4+9+25)·(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3   CosSim(D2, Q) = 2 / sqrt((9+49+1)·(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3
D1 is about 6 times better than D2 using cosine similarity, but only 5 times better using the inner product (10 vs. 2).

CosSim(dj, q) = (dj • q) / (|dj| · |q|) = Σ_{i=1..t} (w_ij · w_iq) / ( sqrt(Σ_{i=1..t} w_ij²) · sqrt(Σ_{i=1..t} w_iq²) )

InnerProduct(dj, q) = dj • q = Σ_{i=1..t} (w_ij · w_iq)
Exercises
84
 A database collection consists of 1 million documents, of which 200,000 contain the term holiday while 250,000 contain the term season. A document repeats holiday 7 times and season 5 times. It is known that holiday is repeated more than any other term in the document. Calculate the weight of both terms in this document using different term weighting methods. Try with:
(i) normalized and unnormalized TF;
(ii) TF*IDF based on normalized and unnormalized TF
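One way to set up this exercise, assuming tf*idf = tf × log2(N/df) as in the preceding slides; the sketch computes the requested variants rather than asserting their values:

```python
# Sketch: TF and TF*IDF for 'holiday' and 'season', with and without
# max-tf normalization (holiday, with 7 occurrences, is the most frequent term).
import math

N = 1_000_000
df = {"holiday": 200_000, "season": 250_000}
raw_tf = {"holiday": 7, "season": 5}
max_tf = max(raw_tf.values())

for term in ("holiday", "season"):
    idf = math.log2(N / df[term])
    print(term,
          "tf:", raw_tf[term],
          "normalized tf:", round(raw_tf[term] / max_tf, 3),
          "tf*idf:", round(raw_tf[term] * idf, 3),
          "normalized tf*idf:", round((raw_tf[term] / max_tf) * idf, 3))
```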
End of Chapter Two
85

More Related Content

What's hot

CIDOC CRM Tutorial
CIDOC CRM TutorialCIDOC CRM Tutorial
CIDOC CRM Tutorial
ISLCCIFORTH
 
XML Signature
XML SignatureXML Signature
XML Signature
Prabath Siriwardena
 
Signature files
Signature filesSignature files
Signature files
Deepali Raikar
 
The vector space model
The vector space modelThe vector space model
The vector space model
pkgosh
 
Inverted index
Inverted indexInverted index
Inverted index
Krishna Gehlot
 
Web ontology language (owl)
Web ontology language (owl)Web ontology language (owl)
Web ontology language (owl)
Ameer Sameer
 
The Big Picture: Monitoring and Orchestration of Your Microservices Landscape...
The Big Picture: Monitoring and Orchestration of Your Microservices Landscape...The Big Picture: Monitoring and Orchestration of Your Microservices Landscape...
The Big Picture: Monitoring and Orchestration of Your Microservices Landscape...
confluent
 
CPM AND PERT NETWORK DIAGRAM
CPM AND PERT NETWORK DIAGRAMCPM AND PERT NETWORK DIAGRAM
CPM AND PERT NETWORK DIAGRAM
INDRAKUMAR PADWANI
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Bhaskar Mitra
 
Apache HAWQ Architecture
Apache HAWQ ArchitectureApache HAWQ Architecture
Apache HAWQ Architecture
Alexey Grishchenko
 
Information Retrieval-4(inverted index_&_query handling)
Information Retrieval-4(inverted index_&_query handling)Information Retrieval-4(inverted index_&_query handling)
Information Retrieval-4(inverted index_&_query handling)
Jeet Das
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035
Neelam Rawat
 
Description of soa and SOAP,WSDL & UDDI
Description of soa and SOAP,WSDL & UDDIDescription of soa and SOAP,WSDL & UDDI
Description of soa and SOAP,WSDL & UDDI
TUSHAR VARSHNEY
 
Dspace 7 presentation
Dspace 7 presentationDspace 7 presentation
Dspace 7 presentation
mohamed Elzalabany
 
Text mining
Text miningText mining
Text mining
ThejeswiniChivukula
 
Semantic web meetup – sparql tutorial
Semantic web meetup – sparql tutorialSemantic web meetup – sparql tutorial
Semantic web meetup – sparql tutorial
AdonisDamian
 
The Graph Traversal Programming Pattern
The Graph Traversal Programming PatternThe Graph Traversal Programming Pattern
The Graph Traversal Programming Pattern
Marko Rodriguez
 
Vectorland: Brief Notes from Using Text Embeddings for Search
Vectorland: Brief Notes from Using Text Embeddings for SearchVectorland: Brief Notes from Using Text Embeddings for Search
Vectorland: Brief Notes from Using Text Embeddings for Search
Bhaskar Mitra
 
MapReduce in Cloud Computing
MapReduce in Cloud ComputingMapReduce in Cloud Computing
MapReduce in Cloud Computing
Mohammad Mustaqeem
 
Vector space model of information retrieval
Vector space model of information retrievalVector space model of information retrieval
Vector space model of information retrieval
Nanthini Dominique
 

What's hot (20)

CIDOC CRM Tutorial
CIDOC CRM TutorialCIDOC CRM Tutorial
CIDOC CRM Tutorial
 
XML Signature
XML SignatureXML Signature
XML Signature
 
Signature files
Signature filesSignature files
Signature files
 
The vector space model
The vector space modelThe vector space model
The vector space model
 
Inverted index
Inverted indexInverted index
Inverted index
 
Web ontology language (owl)
Web ontology language (owl)Web ontology language (owl)
Web ontology language (owl)
 
The Big Picture: Monitoring and Orchestration of Your Microservices Landscape...
The Big Picture: Monitoring and Orchestration of Your Microservices Landscape...The Big Picture: Monitoring and Orchestration of Your Microservices Landscape...
The Big Picture: Monitoring and Orchestration of Your Microservices Landscape...
 
CPM AND PERT NETWORK DIAGRAM
CPM AND PERT NETWORK DIAGRAMCPM AND PERT NETWORK DIAGRAM
CPM AND PERT NETWORK DIAGRAM
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
Apache HAWQ Architecture
Apache HAWQ ArchitectureApache HAWQ Architecture
Apache HAWQ Architecture
 
Information Retrieval-4(inverted index_&_query handling)
Information Retrieval-4(inverted index_&_query handling)Information Retrieval-4(inverted index_&_query handling)
Information Retrieval-4(inverted index_&_query handling)
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035
 
Description of soa and SOAP,WSDL & UDDI
Description of soa and SOAP,WSDL & UDDIDescription of soa and SOAP,WSDL & UDDI
Description of soa and SOAP,WSDL & UDDI
 
Dspace 7 presentation
Dspace 7 presentationDspace 7 presentation
Dspace 7 presentation
 
Text mining
Text miningText mining
Text mining
 
Semantic web meetup – sparql tutorial
Semantic web meetup – sparql tutorialSemantic web meetup – sparql tutorial
Semantic web meetup – sparql tutorial
 
The Graph Traversal Programming Pattern
The Graph Traversal Programming PatternThe Graph Traversal Programming Pattern
The Graph Traversal Programming Pattern
 
Vectorland: Brief Notes from Using Text Embeddings for Search
Vectorland: Brief Notes from Using Text Embeddings for SearchVectorland: Brief Notes from Using Text Embeddings for Search
Vectorland: Brief Notes from Using Text Embeddings for Search
 
MapReduce in Cloud Computing
MapReduce in Cloud ComputingMapReduce in Cloud Computing
MapReduce in Cloud Computing
 
Vector space model of information retrieval
Vector space model of information retrievalVector space model of information retrieval
Vector space model of information retrieval
 

Similar to Chapter 2 Text Operation and Term Weighting.pdf

Chapter 2: Text Operation in information stroage and retrieval
Chapter 2: Text Operation in information stroage and retrievalChapter 2: Text Operation in information stroage and retrieval
Chapter 2: Text Operation in information stroage and retrieval
captainmactavish1996
 
Chapter 2 Text Operation.pdf
Chapter 2 Text Operation.pdfChapter 2 Text Operation.pdf
Chapter 2 Text Operation.pdf
Habtamu100
 
Lecture 7- Text Statistics and Document Parsing
Lecture 7- Text Statistics and Document ParsingLecture 7- Text Statistics and Document Parsing
Lecture 7- Text Statistics and Document Parsing
Sean Golliher
 
NLP
NLPNLP
NLP
NLPNLP
Machine transliteration survey
Machine transliteration surveyMachine transliteration survey
Machine transliteration survey
unyil96
 
Corpus study design
Corpus study designCorpus study design
Corpus study design
bikashtaly
 
What are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 RoutledgeWhat are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 Routledge
RajpootBhatti5
 
Textmining
TextminingTextmining
Textmining
sidhunileshwar
 
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...
University of Bari (Italy)
 
Information retrieval chapter 2-Text Operations.ppt
Information retrieval chapter 2-Text Operations.pptInformation retrieval chapter 2-Text Operations.ppt
Information retrieval chapter 2-Text Operations.ppt
SamuelKetema1
 
text
texttext
text
nyomans1
 
Ir 03
Ir   03Ir   03
Survey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionarySurvey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse Dictionary
Editor IJMTER
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
paperpublications3
 
The Usage of Because of-Words in British National Corpus
 The Usage of Because of-Words in British National Corpus The Usage of Because of-Words in British National Corpus
The Usage of Because of-Words in British National Corpus
Research Journal of Education
 
Automated Abstracts and Big Data
Automated Abstracts and Big DataAutomated Abstracts and Big Data
Automated Abstracts and Big Data
Sameer Wadkar
 
Ny3424442448
Ny3424442448Ny3424442448
Ny3424442448
IJERA Editor
 
The role of linguistic information for shallow language processing
The role of linguistic information for shallow language processingThe role of linguistic information for shallow language processing
The role of linguistic information for shallow language processing
Constantin Orasan
 
Document similarity
Document similarityDocument similarity
Document similarity
Hemant Hatankar
 

Similar to Chapter 2 Text Operation and Term Weighting.pdf (20)

Chapter 2: Text Operation in information stroage and retrieval
Chapter 2: Text Operation in information stroage and retrievalChapter 2: Text Operation in information stroage and retrieval
Chapter 2: Text Operation in information stroage and retrieval
 
Chapter 2 Text Operation.pdf
Chapter 2 Text Operation.pdfChapter 2 Text Operation.pdf
Chapter 2 Text Operation.pdf
 
Lecture 7- Text Statistics and Document Parsing
Lecture 7- Text Statistics and Document ParsingLecture 7- Text Statistics and Document Parsing
Lecture 7- Text Statistics and Document Parsing
 
NLP
NLPNLP
NLP
 
NLP
NLPNLP
NLP
 
Machine transliteration survey
Machine transliteration surveyMachine transliteration survey
Machine transliteration survey
 
Corpus study design
Corpus study designCorpus study design
Corpus study design
 
What are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 RoutledgeWhat are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 Routledge
 
Textmining
TextminingTextmining
Textmining
 
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...
 
Information retrieval chapter 2-Text Operations.ppt
Information retrieval chapter 2-Text Operations.pptInformation retrieval chapter 2-Text Operations.ppt
Information retrieval chapter 2-Text Operations.ppt
 
text
texttext
text
 
Ir 03
Ir   03Ir   03
Ir 03
 
Survey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionarySurvey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse Dictionary
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
 
The Usage of Because of-Words in British National Corpus
 The Usage of Because of-Words in British National Corpus The Usage of Because of-Words in British National Corpus
The Usage of Because of-Words in British National Corpus
 
Automated Abstracts and Big Data
Automated Abstracts and Big DataAutomated Abstracts and Big Data
Automated Abstracts and Big Data
 
Ny3424442448
Ny3424442448Ny3424442448
Ny3424442448
 
The role of linguistic information for shallow language processing
The role of linguistic information for shallow language processingThe role of linguistic information for shallow language processing
The role of linguistic information for shallow language processing
 
Document similarity
Document similarityDocument similarity
Document similarity
 

Recently uploaded

PACKAGING OF FROZEN FOODS ( food technology)
PACKAGING OF FROZEN FOODS  ( food technology)PACKAGING OF FROZEN FOODS  ( food technology)
PACKAGING OF FROZEN FOODS ( food technology)
Addu25809
 
Optimizing Post Remediation Groundwater Performance with Enhanced Microbiolog...
Optimizing Post Remediation Groundwater Performance with Enhanced Microbiolog...Optimizing Post Remediation Groundwater Performance with Enhanced Microbiolog...
Optimizing Post Remediation Groundwater Performance with Enhanced Microbiolog...
Joshua Orris
 
在线办理(lboro毕业证书)拉夫堡大学毕业证学历证书一模一样
在线办理(lboro毕业证书)拉夫堡大学毕业证学历证书一模一样在线办理(lboro毕业证书)拉夫堡大学毕业证学历证书一模一样
在线办理(lboro毕业证书)拉夫堡大学毕业证学历证书一模一样
pjq9n1lk
 
快速办理(Calabria毕业证书)卡拉布里亚大学毕业证在读证明一模一样
快速办理(Calabria毕业证书)卡拉布里亚大学毕业证在读证明一模一样快速办理(Calabria毕业证书)卡拉布里亚大学毕业证在读证明一模一样
快速办理(Calabria毕业证书)卡拉布里亚大学毕业证在读证明一模一样
astuz
 
world-environment-day-2024-240601103559-14f4c0b4.pptx
world-environment-day-2024-240601103559-14f4c0b4.pptxworld-environment-day-2024-240601103559-14f4c0b4.pptx
world-environment-day-2024-240601103559-14f4c0b4.pptx
mfasna35
 
Environment Conservation Rules 2023 (ECR)-2023.pptx
Environment Conservation Rules 2023 (ECR)-2023.pptxEnvironment Conservation Rules 2023 (ECR)-2023.pptx
Environment Conservation Rules 2023 (ECR)-2023.pptx
neilsencassidy
 
BASIC CONCEPT OF ENVIRONMENT AND DIFFERENT CONSTITUTENET OF ENVIRONMENT
BASIC CONCEPT OF ENVIRONMENT AND DIFFERENT CONSTITUTENET OF ENVIRONMENTBASIC CONCEPT OF ENVIRONMENT AND DIFFERENT CONSTITUTENET OF ENVIRONMENT
BASIC CONCEPT OF ENVIRONMENT AND DIFFERENT CONSTITUTENET OF ENVIRONMENT
AmitKumar619042
 
RoHS stands for Restriction of Hazardous Substances, which is also known as t...
RoHS stands for Restriction of Hazardous Substances, which is also known as t...RoHS stands for Restriction of Hazardous Substances, which is also known as t...
RoHS stands for Restriction of Hazardous Substances, which is also known as t...
vijaykumar292010
 
Evolving Lifecycles with High Resolution Site Characterization (HRSC) and 3-D...
Evolving Lifecycles with High Resolution Site Characterization (HRSC) and 3-D...Evolving Lifecycles with High Resolution Site Characterization (HRSC) and 3-D...
Evolving Lifecycles with High Resolution Site Characterization (HRSC) and 3-D...
Joshua Orris
 
原版制作(Manitoba毕业证书)曼尼托巴大学毕业证学位证一模一样
原版制作(Manitoba毕业证书)曼尼托巴大学毕业证学位证一模一样原版制作(Manitoba毕业证书)曼尼托巴大学毕业证学位证一模一样
原版制作(Manitoba毕业证书)曼尼托巴大学毕业证学位证一模一样
mvrpcz6
 
Lessons from operationalizing integrated landscape approaches
Lessons from operationalizing integrated landscape approachesLessons from operationalizing integrated landscape approaches
Lessons from operationalizing integrated landscape approaches
CIFOR-ICRAF
 
• The principle balances the speaker's desire for a small vocabulary against the hearer's desire for a large one.

Zipf's Law: Impact on IR
• Good news: stopwords account for a large fraction of the text, so eliminating them greatly reduces inverted-index storage costs.
• Bad news: for most words, gathering sufficient data for meaningful statistical analysis (e.g. correlation analysis for query expansion) is difficult, since they are extremely rare.
Word Significance: Luhn's Ideas
• Luhn's idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word significance.
• Luhn suggested that both extremely common and extremely uncommon words are not very useful for indexing.
Word Significance: Luhn's Ideas (cont.)
• For this, Luhn specifies two cut-off points, an upper and a lower cut-off, based on which non-significant words are excluded:
  - words above the upper cut-off are considered too common
  - words below the lower cut-off are considered too rare
  - hence neither group contributes significantly to the content of the text
• The ability of words to discriminate content reaches a peak at a rank-order position halfway between the two cut-offs.
• Let f be the frequency of occurrence of words in a text and r their rank in decreasing order of word frequency; a plot relating f and r yields the familiar Luhn curve (shown on the original slide), with the two cut-offs marked and the significant words in between.
Luhn's Ideas
• Luhn (1958) suggested that both extremely common and extremely uncommon words are not very useful for document representation and indexing.
Vocabulary Size: Heaps' Law
• Heaps' law estimates the number of distinct vocabulary terms in a given corpus.
• Dictionaries contain on the order of 600,000 words and above, but they do not include names of people, locations, products, etc.
Vocabulary Growth: Heaps' Law
• How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
  - This determines how the size of the inverted index will scale with the size of the corpus.
• Heaps' law estimates the vocabulary size of a given corpus: the vocabulary grows as O(n^β), where β is a constant between 0 and 1.
• If V is the size of the vocabulary and n is the length of the corpus in words, Heaps' law gives:

    V = K * n^β

  with typical constants K ≈ 10–100 and β ≈ 0.4–0.6 (approximately a square root).
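As a quick illustration of the formula above, here is a minimal Python sketch that evaluates Heaps' law for a few corpus sizes. The constants K = 50 and β = 0.5 are arbitrary values chosen from within the typical ranges quoted above, not measurements from any real collection.

    # Minimal sketch of Heaps' law: V = K * n^beta.
    # K and beta are illustrative values within the typical ranges
    # (K ~ 10-100, beta ~ 0.4-0.6), not fitted to a real corpus.

    def heaps_vocabulary_size(n_tokens: int, K: float = 50.0, beta: float = 0.5) -> float:
        """Estimate vocabulary size for a corpus of n_tokens words."""
        return K * (n_tokens ** beta)

    if __name__ == "__main__":
        for n in (10_000, 1_000_000, 1_000_000_000):
            print(f"n = {n:>13,d}  ->  estimated vocabulary ~ {heaps_vocabulary_size(n):,.0f}")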
Heaps' Distributions
• Distribution of vocabulary size: the growth of the vocabulary with the number of tokens is sub-linear; on a log-log plot the relationship between vocabulary size and number of tokens is approximately linear.
• Example: from 1,000,000,000 words there may be about 1,000,000 distinct words. Can you agree?
Example
• We want to estimate the size of the vocabulary for a corpus of 1,000,000 words. However, we only know statistics computed on smaller corpora:
  - for 100,000 words there are 50,000 unique words
  - for 500,000 words there are 150,000 unique words
• Estimate the vocabulary size for the 1,000,000-word corpus.
• How about for a corpus of 1,000,000,000 words?
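One way to answer the exercise is to solve for β and K from the two known (corpus size, vocabulary size) pairs and then extrapolate. The sketch below does exactly that, under the assumption that both sample corpora follow the same Heaps' curve.

    import math

    # Solve V = K * n^beta from the two measurements, then extrapolate.
    # Assumes both sample corpora follow the same Heaps' curve.
    n1, v1 = 100_000, 50_000
    n2, v2 = 500_000, 150_000

    beta = math.log(v2 / v1) / math.log(n2 / n1)   # log(3)/log(5) ~ 0.68
    K = v1 / (n1 ** beta)                          # ~ 19.3

    for n in (1_000_000, 1_000_000_000):
        print(f"n = {n:>13,d}  ->  V ~ {K * n ** beta:,.0f}")
    # With these two data points, the estimate for the 1,000,000-word
    # corpus comes out at roughly 240,000 unique words.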
Relevance Measure
• The collection splits into four groups with respect to a query:

                    Relevant                      Irrelevant
    Retrieved       retrieved & relevant          retrieved & irrelevant
    Not retrieved   not retrieved but relevant    not retrieved & irrelevant
Issues: Recall and Precision
• Breaking up hyphenated terms increases recall but decreases precision.
• Preserving case distinctions enhances precision but decreases recall.
• Commercial information systems usually take a recall-enhancing approach (numbers and words containing digits are index terms, and all terms are case-insensitive).
Text Operations
• Not all words in a document are equally significant in representing its contents/meaning.
  - Some words carry more meaning than others.
  - Noun words are the most representative of a document's content.
• Therefore, the text of the documents in a collection needs to be preprocessed before being used for index terms.
  - Using the set of all words in a collection to index documents creates too much noise for the retrieval task.
  - Reducing noise means reducing the number of distinct words that can be used to refer to the document.
• Preprocessing is the process of controlling the size of the vocabulary, i.e. the number of distinct words used as index terms.
  - Preprocessing usually leads to an improvement in information retrieval performance.
  - However, some search engines on the Web omit preprocessing: every word in the document is an index term.
Text Operations (cont.)
• Text operations are the process of transforming text into a logical representation.
• The main operations for selecting index terms, i.e. choosing words/stems (or groups of words) to be used as indexing terms, are:
  - lexical analysis/tokenization of the text: handling digits, hyphens, punctuation marks, and the case of letters
  - elimination of stop words: filtering out words that are not useful in the retrieval process
  - stemming of words: removing affixes (prefixes and suffixes)
  - construction of term categorization structures such as a thesaurus, to capture term relationships and allow expansion of the original query with related terms
Generating Document Representatives
• Text processing system:
  - input: full text, abstract, or title
  - output: a document representative adequate for use in an automatic retrieval system
• The document representative consists of a list of class names, each name representing a class of words occurring in the total input text. A document will be indexed by a name if one of its significant words occurs as a member of that class.
• Typical pipeline: documents -> tokenization -> stop-word removal -> stemming -> thesaurus -> index terms.
Lexical Analysis/Tokenization of Text
• Change the text of the documents into words to be adopted as index terms.
• Objective: identify words in the text by handling digits, hyphens, punctuation marks, and the case of letters.
  - Digits: numbers are usually not good index terms (like 1910, 1999); but some, such as 510 B.C., are unique and worth keeping.
  - Hyphens: break up hyphenated words (e.g. state-of-the-art = state of the art); but some words, e.g. gilt-edged, B-49, are unique and require their hyphens.
  - Punctuation marks: remove entirely unless significant, e.g. in program code x.exe differs from xexe.
  - Case of letters: usually not important; convert everything to upper or lower case.
Tokenization
• Analyze text into a sequence of discrete tokens (words).
• Input: "Friends, Romans and Countrymen"
• Output: tokens (a token is an instance of a sequence of characters grouped together as a useful semantic unit for processing):
  - Friends
  - Romans
  - and
  - Countrymen
• Each such token is now a candidate for an index entry, after further processing.
• But what are valid tokens to emit?
Issues in Tokenization
• One word or multiple: how do you decide whether something is one token, or two or more?
  - Hewlett-Packard: Hewlett and Packard as two tokens?
  - state-of-the-art: break up the hyphenated sequence?
  - San Francisco, Los Angeles
  - lowercase, lower-case, lower case?
  - data base, database, data-base
• Numbers:
  - dates (3/12/91 vs. Mar. 12, 1991)
  - phone numbers
  - IP addresses (100.2.86.144)
Issues in Tokenization (cont.)
• How to handle special cases involving apostrophes, hyphens, etc.? C++, C#, URLs, emails, ...
  - Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token; frequently, however, they are not.
• The simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens (see the sketch below).
• Generally, don't index numbers as text, although they are often very useful; systems will often index "meta-data" (creation date, format, etc.) separately.
• Issues of tokenization are language-specific; they require the language to be known.
• The application area for which the IR system is developed also dictates the nature of valid tokens.
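The "simplest approach" mentioned above can be sketched in a few lines of Python. The regular expression below is one possible reading of that rule (case-insensitive, unbroken runs of alphabetic characters only), intended as an illustration rather than a production tokenizer.

    import re

    # One reading of the "simplest approach": keep only unbroken runs of
    # alphabetic characters, lowercased; numbers and punctuation disappear.
    TOKEN_RE = re.compile(r"[a-z]+")

    def simple_tokenize(text: str) -> list[str]:
        return TOKEN_RE.findall(text.lower())

    print(simple_tokenize("Friends, Romans and Countrymen"))
    # ['friends', 'romans', 'and', 'countrymen']
    print(simple_tokenize("Mr. O'Neill thinks that the boys' stories aren't amusing."))
    # ['mr', 'o', 'neill', 'thinks', 'that', 'the', 'boys', 'stories', 'aren', 't', 'amusing']

Note how the apostrophe cases from the exercise on the next slide come out badly split, which is exactly the kind of issue the slide is pointing at.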
Exercise: Tokenization
• The cat slept peacefully in the living room. It's a very old cat.
• Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.
Elimination of Stopwords
• Stopwords are extremely common words across document collections that have no discriminatory power.
  - They may occur in 80% of the documents in a collection.
  - They appear to be of little value in helping select documents matching a user need, and need to be filtered out of the index list.
• Examples of stopwords:
  - articles (a, an, the)
  - pronouns (I, he, she, it, their, his)
  - prepositions (on, of, in, about, besides, against)
  - conjunctions/connectors (and, but, for, nor, or, so, yet)
  - verbs (is, are, was, were)
  - adverbs (here, there, out, because, soon, after)
  - adjectives (all, any, each, every, few, many, some)
• Stopwords are language dependent.
Stopwords
• Intuition: stopwords have little semantic content, so it is typical to remove such high-frequency words.
• Stopwords take up about 50% of the text, so removing them reduces document size by 30–50%.
  - Smaller indices for information retrieval.
  - Good compression techniques for indices: the 30 most common words account for about 30% of the tokens in written text.
  - Better approximation of importance for classification, summarization, etc.
How to Determine a List of Stopwords?
• One method: sort terms in decreasing order of collection frequency and take the most frequent ones (see the sketch below).
  - In a collection about insurance practices, "insurance" would be a stop word.
• Another method: build a stop word list that contains a set of articles, pronouns, etc.
• Why do we need stop lists?
  - With a stop list, we can compare and exclude the commonest words from the index terms entirely.
  - With the removal of stopwords, we get a better approximation of importance for classification, summarization, etc.
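The first, frequency-based method can be sketched as follows. The toy corpus and the cut-off of three terms are made up purely for illustration; in practice the cut-off would be tuned per collection.

    from collections import Counter

    # Sketch of the frequency-based method: count term occurrences across a
    # (toy) collection and take the most frequent terms as stopwords.
    collection = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "insurance covers the cost of the claim",
    ]

    counts = Counter(term for doc in collection for term in doc.lower().split())
    stopwords = {term for term, _ in counts.most_common(3)}   # cut-off chosen arbitrarily
    print(stopwords)   # dominated by 'the' in this toy corpus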
Stop Words
• Stop word elimination used to be standard in older IR systems, but the trend is away from doing this. Most web search engines index stop words:
  - good query optimization techniques mean you pay little at query time for including stop words
  - you need stopwords for:
    - phrase queries: "King of Denmark"
    - various song titles, etc.: "Let it be", "To be or not to be"
    - "relational" queries: "flights to London"
• Elimination of stopwords might reduce recall (e.g. "To be or not to be": every term is eliminated except "be", leading to no or irrelevant retrieval).
Normalization
• Normalization is canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.
• We need to "normalize" terms in the indexed text, as well as query terms, into the same form.
  - Example: we want to match U.S.A. and USA by deleting periods in a term.
• Case folding: it is often best to lowercase everything, since users will use lowercase regardless of 'correct' capitalization:
  - Republican vs. republican
  - Fasil vs. fasil vs. FASIL
  - Anti-discriminatory vs. antidiscriminatory
  - Car vs. automobile?
Normalization Issues
• Good for:
  - allowing instances of Automobile at the beginning of a sentence to match a query of automobile
  - helping a search engine when most users type ferrari while they are interested in a Ferrari car
• Bad for:
  - proper names vs. common nouns, e.g. General Motors, Associated Press, ...
• Solution:
  - lowercase only words at the beginning of the sentence
  - in IR, lowercasing everything is most practical because of the way users issue their queries
Term Conflation
• One of the problems involved in using free text for indexing and retrieval is the variation in word forms likely to be encountered. The most common types of variation are:
  - spelling errors (father, fathor)
  - alternative spellings, i.e. locality or national usage (color vs. colour, labor vs. labour)
  - multi-word concepts (database, data base)
  - affixes (dependent, independent, dependently)
  - abbreviations (i.e., that is)
Function of Conflation in Terms of IR
• Reducing the total number of distinct terms, with a consequent reduction in dictionary size and updating problems: fewer terms to worry about.
• Bringing similar words with similar meanings to a common form, with the aim of increasing retrieval effectiveness: more accurate statistics.
Stemming/Morphological Analysis
• Stemming reduces tokens to their "root" form in order to recognize morphological variation.
  - The process involves removal of affixes (i.e. prefixes and suffixes) with the aim of reducing variants to the same stem.
  - It often removes inflectional and derivational morphology of a word.
    - Inflectional morphology varies the form of words to express grammatical features such as singular/plural or past/present tense, e.g. boy -> boys, cut -> cutting.
    - Derivational morphology makes new words from old ones, e.g. creation is formed from create, but they are two separate words; likewise destruction -> destroy.
• Stemming is language dependent.
  - Correct stemming is language specific and can be complex; for example, compressed and compression should both be accepted as equivalent to compress.
Stemming
• The final output from a conflation algorithm is a set of classes, one for each stem detected.
  - A stem is the portion of a word left after the removal of its affixes (i.e. prefixes and/or suffixes).
  - Example: 'connect' is the stem for {connected, connecting, connection, connections}.
  - Thus, [automate, automatic, automation] all reduce to automat.
• A class name is assigned to a document if and only if one of its members occurs as a significant word in the text of the document.
  - A document representative then becomes a list of class names, often referred to as the document's index terms/keywords.
• Queries are handled in the same way.
Ways to Implement Stemming
There are basically two ways to implement stemming:
• The first approach is to create a big dictionary that maps words to their stems.
  - The advantage of this approach is that it works perfectly (insofar as the stem of a word can be defined perfectly); the disadvantages are the space required by the dictionary and the investment required to maintain the dictionary as new words appear.
• The second approach is to use a set of rules that extract stems from words.
  - The advantages of this approach are that the code is typically small and it can gracefully handle new words; the disadvantage is that it occasionally makes mistakes.
  - But since stemming is imperfectly defined anyway, occasional mistakes are tolerable, and the rule-based approach is the one that is generally chosen.
Assumptions in Stemming
• Words with the same stem are semantically related and have the same meaning to the user of the text.
• The chance of matching increases when index terms are reduced to their word stems, because it is normal to search using "retrieve" rather than "retrieving".
Stemming Errors
• Under-stemming: the error of taking off too small a suffix.
  - croulons -> croulon, although croulons is a form of the verb crouler
• Over-stemming: the error of taking off too much.
  - example: croûtons -> croût, although croûtons is the plural of croûton
• Mis-stemming: taking off what looks like an ending but is really part of the stem.
  - reply -> rep
Criteria for Judging Stemmers
• Correctness:
  - Overstemming: too much of a term is removed.
    - Can cause unrelated terms to be conflated -> retrieval of non-relevant documents.
  - Understemming: too little of a term is removed.
    - Prevents related terms from being conflated -> relevant documents may not be retrieved.

    Term     Stem   Judgement
    GASES    GAS    correct
    GAS      GA     over-stemming
    GASEOUS  GASE   under-stemming
Porter Stemmer
• Stemming is the operation of stripping the suffixes from a word, leaving its stem.
  - Google, for instance, uses stemming to search for web pages containing the words connected, connecting, connection, and connections when users ask for a web page that contains the word connect.
• In 1979, Martin Porter developed a stemming algorithm that uses a set of rules to extract stems from words; although it makes some mistakes, most common words come out right.
• Porter describes his algorithm and provides a reference implementation in C at http://tartarus.org/~martin/PorterStemmer/index.html
Porter Stemmer (cont.)
• The most common algorithm for stemming English words to their common grammatical root.
• It is a simple procedure for removing known affixes in English without using a dictionary. To get rid of plurals, the following rules are used:
  - SSES -> SS        caresses -> caress
  - IES  -> Y         ponies -> pony
  - SS   -> SS        caress -> caress
  - S    -> (null)    cats -> cat
  - EMENT -> (null), deleted only if what remains is longer than 1 character
        replacement -> replac
        cement -> cement
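The handful of rules listed above can be sketched directly in Python. This is only a toy reproduction of those specific rules, not the full Porter algorithm, which has several more steps and conditions.

    # A sketch of only the plural / '-ement' rules listed above -- not the
    # full Porter algorithm, just enough to reproduce the slide's examples.

    def strip_plural_suffix(word: str) -> str:
        if word.endswith("sses"):
            return word[:-2]            # caresses -> caress
        if word.endswith("ies"):
            return word[:-3] + "y"      # ponies -> pony
        if word.endswith("ss"):
            return word                 # caress -> caress
        if word.endswith("ement") and len(word) - len("ement") > 1:
            return word[:-5]            # replacement -> replac, cement stays cement
        if word.endswith("s"):
            return word[:-1]            # cats -> cat
        return word

    for w in ["caresses", "ponies", "caress", "cats", "replacement", "cement"]:
        print(w, "->", strip_plural_suffix(w))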
Stemming: Challenges
• May produce unusual stems that are not English words:
  - removing 'UAL' from FACTUAL and EQUAL
• May conflate (reduce to the same token) words that are actually distinct:
  - "computer", "computational", "computation" are all reduced to the same token "comput"
• May not recognize all morphological derivations.
Thesauri
• Mostly, full-text searching cannot be accurate, since different authors may select different words to represent the same concept.
  - Problem: the same meaning can be expressed using different terms that are synonyms, homonyms, or related terms.
  - How can it be ensured that, for the same meaning, identical terms are used in the index and the query?
• Thesaurus: the vocabulary of a controlled indexing language, formally organized so that a priori relationships between concepts (for example "broader" and "related") are made explicit.
• A thesaurus contains terms and relationships between terms.
  - IR thesauri typically rely upon the use of symbols such as USE/UF (UF = used for), BT (broader term), and RT (related term) to demonstrate inter-term relationships.
  - e.g. car = automobile, truck, bus, taxi, motor vehicle; color = colour, paint
Aim of a Thesaurus
• A thesaurus tries to control the use of the vocabulary by showing a set of related words to handle synonyms and homonyms.
• The aims of a thesaurus are therefore:
  - to provide a standard vocabulary for indexing and searching
    - The thesaurus rewrites terms to form equivalence classes, and we index such equivalences: when the document contains automobile, index it under car as well (usually also vice versa).
  - to assist users in locating terms for proper query formulation: when the query contains automobile, look under car as well when expanding the query
  - to provide classified hierarchies that allow broadening and narrowing of the current request according to user needs
Thesaurus Construction
Example: a thesaurus built to assist IR in searching for cars and vehicles:

    Term: Motor vehicles
      UF:  Automobiles
           Cars
           Trucks
      BT:  Vehicles
      RT:  Road Engineering
           Road Transport
More Example
Example: a thesaurus built to assist IR in the field of computer science:

    TERM: natural languages
      UF: natural language processing   (UF = used for NLP)
      BT: languages                     (BT = broader term is languages)
      TT: languages                     (TT = top term is languages)
      RT: artificial intelligence       (RT = related term/s)
          computational linguistics
          formal languages
          query languages
          speech recognition
Language-Specificity
• Many of the above features embody transformations that are:
  - language-specific, and
  - often, application-specific
• These are "plug-in" addenda to the indexing process.
• Both open-source and commercial plug-ins are available for handling these.
Index Term Selection
• An index language is the language used to describe documents and requests.
• Elements of the index language are index terms, which may be derived from the text of the document to be described or may be arrived at independently.
  - If a full-text representation of the text is adopted, then all words in the text are used as index terms (full-text indexing).
  - Otherwise, the words to be used as index terms need to be selected, to reduce the size of the index file, which is basic to designing an efficient searching IR system.
Chapter 2 Part II: Term Weighting and Similarity Measures
Terms
• Terms are usually stems. Terms can also be phrases, such as "Computer Science", "World Wide Web", etc.
• Documents as well as queries are represented as vectors or "bags of words" (BOW).
  - Each vector holds a place for every term in the collection.
  - Position 1 corresponds to term 1, position 2 to term 2, ..., position n to term n; w_ij = 0 if a term is absent.
• Documents are represented by binary weights or non-binary weighted vectors of terms:

    Di = (wi1, wi2, ..., win)
    Q  = (wq1, wq2, ..., wqn)
Document Collection
• A collection of n documents can be represented in the vector space model by a term-document matrix.
• An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document or simply doesn't occur in it.

         T1    T2    ...   Tt
    D1   w11   w21   ...   wt1
    D2   w12   w22   ...   wt2
    :    :     :           :
    Dn   w1n   w2n   ...   wtn
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector.
• The binary formula gives every word that appears in a document equal relevance.
• It can be useful when frequency is not important.
• Binary weights formula: w_ij = 1 if freq_ij > 0, and w_ij = 0 if freq_ij = 0.

    docs  t1  t2  t3
    D1    1   0   1
    D2    1   0   0
    D3    0   1   1
    D4    1   0   0
    D5    1   1   1
    D6    1   1   0
    D7    0   1   0
    D8    0   1   0
    D9    0   0   1
    D10   0   1   1
    D11   1   0   1
Why Use Term Weighting?
• Binary weights are too limiting:
  - terms are either present or absent
  - they do NOT allow ordering documents according to their level of relevance for a given query
• Non-binary weights allow modeling of partial matching:
  - partial matching allows retrieval of documents that approximate the query
• Term weighting improves the quality of the answer set:
  - term weighting enables ranking of retrieved documents, such that the best-matching documents are ordered at the top, as they are more relevant than the others
Term Weighting: Term Frequency (TF)
• TF (term frequency): count the number of times a term occurs in a document.
    f_ij = frequency of term i in document j
• The more times a term t occurs in document d, the more likely it is that t is relevant to the document, i.e. more indicative of the topic.
  - If used alone, it favors common words and long documents.
  - It gives too much credit to words that appear more frequently.
• We may want to normalize term frequency (tf) across the entire document:

    tf_ij = f_ij / max(f_j)

  where max(f_j) is the largest raw term frequency in document j.

    docs  t1  t2  t3
    D1    2   0   3
    D2    1   0   0
    D3    0   4   7
    D4    3   0   0
    D5    1   6   3
    D6    3   5   0
    D7    0   8   0
    D8    0  10   0
    D9    0   0   1
    D10   0   3   5
    D11   4   0   1
    Q     1   0   1
Document Normalization
• Long documents have an unfair advantage:
  - they use a lot of terms, so they get more matches than short documents
  - and they use the same words repeatedly, so they have large term frequencies
• Normalization seeks to remove these effects:
  - it is related somehow to maximum term frequency
  - but it is also sensitive to the number of terms
• What would happen if term frequencies in a document were not normalized?
  - Short documents may not be recognized as relevant.
Problems with Term Frequency
• We need a mechanism to reduce the effect of terms that occur too often in the collection to be meaningful for relevance/meaning determination:
  - scale down the term weight of terms with high collection frequency
  - reduce the tf weight of a term by a factor that grows with the collection frequency
• More common for this purpose is document frequency: how many documents in the collection contain the term.
• Note that collection frequency (cf) and document frequency (df) behave differently (the original slide illustrates this with an example table).
Document Frequency
• Document frequency (DF) is defined as the number of documents in the collection that contain a term.
• The frequency is counted over the whole collection of documents.
• The less frequently a term appears in the whole collection, the more discriminating it is.

    df_i = document frequency of term i = number of documents containing term i
Inverse Document Frequency (IDF)
• IDF measures how rare a term is in the collection.
• IDF is a measure of the general importance of the term.
• It inverts the document frequency:
  - it diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely
  - it gives full weight to terms that occur in one document only
  - it gives the lowest weight to terms that occur in all documents (log2(1) = 0)
  - terms that appear in many different documents are less indicative of the overall topic
• Formula:

    idf_i = log2(N / df_i)

  where idf_i is the inverse document frequency of term i and N is the total number of documents.
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words.
• IDF is an indication of a term's discrimination power.
• The log is used to reduce the effect relative to tf.
• Note the difference between document frequency and corpus (collection) frequency.
• Example: given a collection of 1000 documents and the document frequencies below, compute the IDF of each word:

    Word   N     DF    idf
    the    1000  1000  log2(1000/1000) = log2(1)    = 0
    some   1000  100   log2(1000/100)  = log2(10)   = 3.322
    car    1000  10    log2(1000/10)   = log2(100)  = 6.644
    merge  1000  1     log2(1000/1)    = log2(1000) = 9.966
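The table above can be reproduced in a few lines; the sketch below simply applies idf_i = log2(N / df_i) with N = 1000 and the document frequencies from the example.

    import math

    # Reproduce the IDF table above: idf_i = log2(N / df_i) with N = 1000.
    N = 1000
    df = {"the": 1000, "some": 100, "car": 10, "merge": 1}

    for word, d in df.items():
        print(f"{word:<6} df={d:<5} idf={math.log2(N / d):.3f}")
    # the: 0.000, some: 3.322, car: 6.644, merge: 9.966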
TF*IDF Weighting
• The most used term-weighting scheme is tf*idf:

    w_ij = tf_ij * idf_i = tf_ij * log2(N / df_i)

• A term occurring frequently in the document but rarely in the rest of the collection is given a high weight.
• The tf-idf value for a term is always greater than or equal to zero.
• Experimentally, tf*idf has been found to work well.
• It is often used in the vector space model together with cosine similarity to determine the similarity between two documents.
TF*IDF Weighting (cont.)
• When does TF*IDF register a high weight?
  - when a term t occurs many times within a small number of documents
  - the highest tf*idf for a term indicates a high term frequency (in the given document) and a low document frequency (in the whole collection of documents)
  - the weights hence tend to filter out common terms, giving high discriminating power to those documents
• A lower TF*IDF is registered when the term occurs fewer times in a document, or occurs in many documents, thus offering a less pronounced relevance signal.
• The lowest TF*IDF is registered when the term occurs in virtually all documents.
Computing TF-IDF: An Example
• Assume the collection contains 10,000 documents, and statistical analysis shows that the document frequencies (DF) of three terms are A(50), B(1300), C(250), while the term frequencies (TF) of these terms in a document are A(3), B(2), C(1). Compute TF*IDF for each term:

    A: tf = 3/3 = 1.00;  idf = log2(10000/50)   = 7.644;  tf*idf = 7.644
    B: tf = 2/3 = 0.67;  idf = log2(10000/1300) = 2.943;  tf*idf = 1.962
    C: tf = 1/3 = 0.33;  idf = log2(10000/250)  = 5.322;  tf*idf = 1.774

• The query vector is typically treated as a document and also tf-idf weighted.
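The worked example above can be checked with a short script. As in the slide, tf is normalized by the maximum raw term frequency in the document (3, the count of term A).

    import math

    # Reproduce the TF*IDF example: tf normalized by the document's maximum
    # raw term frequency (3, for term A), idf = log2(N / df).
    N = 10_000
    raw_tf = {"A": 3, "B": 2, "C": 1}
    df = {"A": 50, "B": 1300, "C": 250}
    max_tf = max(raw_tf.values())

    for term in raw_tf:
        tf = raw_tf[term] / max_tf
        idf = math.log2(N / df[term])
        print(f"{term}: tf={tf:.2f}  idf={idf:.3f}  tf*idf={tf * idf:.3f}")
    # A: 7.644, B: 1.962, C: 1.774 -- matching the worked example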
More Example
• Consider a document containing 100 words in which the word cow appears 3 times. Now assume we have 10 million documents and cow appears in one thousand of them.
  - The term frequency (TF) for cow: 3/100 = 0.03
  - The inverse document frequency: log2(10,000,000 / 1,000) = 13.288
  - The TF*IDF score is the product of these quantities: 0.03 * 13.288 = 0.3986
Exercise
• Let C = number of times a given word appears in a document, TW = total number of words in a document, TD = total number of documents in the corpus, and DF = total number of documents containing a given word. Compute the TF, IDF, and TF*IDF score for each term:

    Word      C  TW  TD  DF  TF  IDF  TF*IDF
    airplane  5  46  3   1
    blue      1  46  3   1
    chair     7  46  3   3
    computer  3  46  3   1
    forest    2  46  3   1
    justice   7  46  3   3
    love      2  46  3   1
    might     2  46  3   1
    perl      5  46  3   2
    rose      6  46  3   3
    shoe      4  46  3   1
    thesis    2  46  3   2
Concluding Remarks
• Suppose, from a set of English documents, we wish to determine which ones are most relevant to the query "the brown cow".
  - A simple way to start is to eliminate documents that do not contain all three words "the", "brown", and "cow", but this still leaves many documents.
  - To further distinguish them, we might count the number of times each term occurs in each document and sum the counts; the number of times a term occurs in a document is called its TF.
  - However, because the term "the" is so common, this will tend to incorrectly emphasize documents that happen to use the word "the" more often, without giving enough weight to the more meaningful terms "brown" and "cow".
  - The term "the" is not a good keyword for distinguishing relevant from non-relevant documents, whereas terms like "brown" and "cow" that occur rarely are good keywords for distinguishing relevant documents from the non-relevant ones.
Concluding Remarks (cont.)
• Hence IDF is incorporated, which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.
  - This leads to using TF*IDF as a better weighting technique.
• On top of that, we apply similarity measures to calculate the distance between document i and query j.
  - There are a number of similarity measures; the most common are Euclidean distance, inner (dot) product, cosine similarity, Dice similarity, Jaccard similarity, etc.
Similarity Measure
• We now have vectors for all documents in the collection and a vector for the query; how do we compute similarity?
• A similarity measure is a function that computes the degree of similarity or distance between a document vector and a query vector.
• Using a similarity measure between the query and each document:
  - it is possible to rank the retrieved documents in the order of presumed relevance
  - it is possible to enforce a certain threshold so that the size of the retrieved set can be controlled (minimized)
  (The original slide illustrates this with two document vectors D1, D2 and a query vector Q plotted in a three-term space t1, t2, t3.)
Intuition
• Postulate: documents that are "close together" in the vector space talk about the same things and are more similar to each other than to others.
  (The original slide plots documents d1–d5 in a space spanned by terms t1, t2, t3, with angles between the vectors.)
Similarity Measure (cont.)
• Assumptions for proximity:
  1. If d1 is near d2, then d2 is near d1.
  2. If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
  3. No document is closer to d than d itself.
• Sometimes it is a good idea to determine the maximum possible similarity as the "distance" between a document d and itself.
• A similarity measure attempts to compute the distance between a document vector wj and a query vector wq.
  - The assumption here is that documents whose vectors are close to the query vector are more relevant to the query than documents whose vectors are far from the query vector.
Similarity Measure: Techniques
• Euclidean distance
  - The most common distance measure. Euclidean distance examines the root of the squared differences between the coordinates of a pair of document and query vectors.
• Dot product
  - The dot product is also known as the scalar product or inner product.
  - It is computed as the sum of the products of the corresponding components of the query and document vectors.
• Cosine similarity
  - Also called the normalized inner product.
  - It projects document and query vectors into a term space and calculates the cosine of the angle between them.
Euclidean Distance
• The distance between the vectors for document dj and query q can be computed as:

    sim(dj, q) = |dj - q| = sqrt( sum_i (w_ij - w_iq)^2 )

  where w_ij is the weight of term i in document j and w_iq is the weight of term i in the query.
• Example: determine the Euclidean distance between the document vector D1 = (0, 3, 2, 1, 10) and query vector Q = (2, 7, 1, 0, 0); a 0 means the corresponding term is not found in the document or query.

    sqrt((0-2)^2 + (3-7)^2 + (2-1)^2 + (1-0)^2 + (10-0)^2) = sqrt(4 + 16 + 1 + 1 + 100) = sqrt(122) ≈ 11.05
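A minimal sketch of the Euclidean distance computation, using the exact document and query vectors from the example above:

    import math

    # Euclidean distance between the document and query vectors of the example.
    def euclidean(d: list[float], q: list[float]) -> float:
        return math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q)))

    D1 = [0, 3, 2, 1, 10]
    Q = [2, 7, 1, 0, 0]
    print(round(euclidean(D1, Q), 2))   # 11.05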
Inner Product
• The similarity between the vectors for document dj and query q can be computed as the vector inner product:

    sim(dj, q) = dj . q = sum_{i=1..n} w_ij * w_iq

  where w_ij is the weight of term i in document j and w_iq is the weight of term i in the query.
• For binary vectors, the inner product is the number of matched query terms in the document (the size of the intersection).
• For weighted term vectors, it is the sum of the products of the weights of the matched terms.
Properties of Inner Product
• It favors long documents with a large number of unique terms. Why? Again, the issue of normalization needs to be considered.
• It measures how many terms matched, but not how many terms did not match.
• The unmatched terms contribute a value of zero, since either the document vector or the query vector has a weight of zero for them.
Inner Product: Examples
• Binary weights (size of vector = size of vocabulary = 7):

         Retrieval  Database  Term  Computer  Text  Manage  Data
    D1   1          1         1     0         1     1       0
    D2   0          1         0     0         0     1       0
    Q    1          0         1     0         0     1       1

    sim(D1, Q) = 1*1 + 1*0 + 1*1 + 0*0 + 1*0 + 1*1 + 0*1 = 3
    sim(D2, Q) = 0*1 + 1*0 + 0*1 + 0*0 + 0*0 + 1*1 + 0*1 = 1

• Term-weighted:

         Retrieval  Database  Architecture
    D1   2          3         5
    D2   3          7         1
    Q    1          0         2

    sim(D1, Q) = 2*1 + 3*0 + 5*2 = 12
    sim(D2, Q) = 3*1 + 7*0 + 1*2 = 5
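Both variants of the example can be checked with one small function, since the binary case is just the weighted case with 0/1 weights:

    # Inner (dot) product for the binary and weighted examples above.
    def inner_product(d: list[float], q: list[float]) -> float:
        return sum(wd * wq for wd, wq in zip(d, q))

    # Binary example (Retrieval, Database, Term, Computer, Text, Manage, Data)
    D1 = [1, 1, 1, 0, 1, 1, 0]
    D2 = [0, 1, 0, 0, 0, 1, 0]
    Q  = [1, 0, 1, 0, 0, 1, 1]
    print(inner_product(D1, Q), inner_product(D2, Q))     # 3 1

    # Weighted example (Retrieval, Database, Architecture)
    D1w = [2, 3, 5]
    D2w = [3, 7, 1]
    Qw  = [1, 0, 2]
    print(inner_product(D1w, Qw), inner_product(D2w, Qw))  # 12 5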
Inner Product: Example 1
• With query q = (1, 1, 1) over terms k1, k2, k3:

    doc  k1  k2  k3  q . dj
    d1   1   0   1   2
    d2   1   0   0   1
    d3   0   1   1   2
    d4   1   0   0   1
    d5   1   1   1   3
    d6   1   1   0   2
    d7   0   1   0   1
Inner Product: Exercise
• With query q = (1, 2, 3) over terms k1, k2, k3, compute q . dj for each document:

    doc  k1  k2  k3  q . dj
    d1   1   0   1   ?
    d2   1   0   0   ?
    d3   0   1   1   ?
    d4   1   0   0   ?
    d5   1   1   1   ?
    d6   1   1   0   ?
    d7   0   1   0   ?
Cosine Similarity
• Measures the similarity between dj and q as the cosine of the angle between them:

    sim(dj, q) = (dj . q) / (|dj| * |q|)
               = sum_i (w_ij * w_iq) / ( sqrt(sum_i w_ij^2) * sqrt(sum_i w_iq^2) )

  or, between two documents dj and dk:

    sim(dj, dk) = (dj . dk) / (|dj| * |dk|)

• The denominator involves the lengths of the vectors, where the length of dj is |dj| = sqrt(sum_i w_ij^2).
• The cosine measure is also known as the normalized inner product.
Example: Computing Cosine Similarity
• Say we have a query vector Q = (0.4, 0.8) and a document D1 = (0.2, 0.7). Compute their similarity using the cosine:

    sim(Q, D1) = (0.4*0.2 + 0.8*0.7) / sqrt((0.4^2 + 0.8^2) * (0.2^2 + 0.7^2))
               = 0.64 / sqrt(0.80 * 0.53)
               ≈ 0.98
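A minimal cosine-similarity sketch that reproduces the example above:

    import math

    # Cosine similarity for the example: Q = (0.4, 0.8), D1 = (0.2, 0.7).
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    print(round(cosine([0.4, 0.8], [0.2, 0.7]), 2))   # 0.98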
Example: Computing Cosine Similarity (cont.)
• Say we have two documents in our corpus, D1 = (0.8, 0.3) and D2 = (0.2, 0.7). Given the query vector Q = (0.4, 0.8), determine which document is most relevant to the query:

    cos θ1 = sim(Q, D1) ≈ 0.74
    cos θ2 = sim(Q, D2) ≈ 0.98

  So D2 is more relevant to the query than D1.
Example
• Given three documents D1, D2, and D3 with the corresponding TF*IDF weights below, which documents are more similar to each other using the three measures?

    Terms      D1  D2  D3
    affection  2   3   1
    jealous    0   2   4
    gossip     1   0   2
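One way to answer this is to compute all three measures for each document pair. The sketch below treats the table columns as vectors over (affection, jealous, gossip) and prints pairwise Euclidean distance, inner product, and cosine similarity.

    import math

    # Compare D1, D2, D3 from the table above with the three measures discussed.
    docs = {
        "D1": [2, 0, 1],   # (affection, jealous, gossip)
        "D2": [3, 2, 0],
        "D3": [1, 4, 2],
    }

    def inner(a, b):
        return sum(x * y for x, y in zip(a, b))

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def cosine(a, b):
        return inner(a, b) / (math.sqrt(inner(a, a)) * math.sqrt(inner(b, b)))

    for name_a, name_b in (("D1", "D2"), ("D1", "D3"), ("D2", "D3")):
        a, b = docs[name_a], docs[name_b]
        print(f"{name_a}-{name_b}: euclid={euclidean(a, b):.2f}  "
              f"inner={inner(a, b)}  cos={cosine(a, b):.2f}")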
Cosine Similarity vs. Inner Product
• Cosine similarity measures the cosine of the angle between two vectors; it is the inner product normalized by the vector lengths:

    CosSim(dj, q) = (dj . q) / (|dj| * |q|)
                  = sum_i (w_ij * w_iq) / ( sqrt(sum_i w_ij^2) * sqrt(sum_i w_iq^2) )
    InnerProduct(dj, q) = dj . q

• Example:

    D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81
    D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) = 0.13
    Q  = 0T1 + 0T2 + 2T3

• D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.
Exercises
• A database collection consists of 1 million documents, of which 200,000 contain the term holiday while 250,000 contain the term season. A document repeats holiday 7 times and season 5 times. It is known that holiday is repeated more than any other term in the document. Calculate the weight of both terms in this document using different term-weighting methods. Try with (i) normalized and unnormalized TF; (ii) TF*IDF based on normalized and unnormalized TF. (A worked sketch follows.)
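The exercise can be worked through with the sketch below. The only assumptions are the exercise's own numbers, plus the interpretation that "normalized TF" means dividing by the maximum raw term frequency in the document (7, for holiday), as in the earlier TF slides.

    import math

    # Worked sketch for the exercise: N = 1,000,000 documents,
    # df(holiday) = 200,000, df(season) = 250,000,
    # raw tf(holiday) = 7 (also the document's maximum tf), raw tf(season) = 5.
    N = 1_000_000
    df = {"holiday": 200_000, "season": 250_000}
    raw_tf = {"holiday": 7, "season": 5}
    max_tf = 7   # holiday is the most frequent term in the document

    for term in raw_tf:
        tf_raw = raw_tf[term]
        tf_norm = tf_raw / max_tf
        idf = math.log2(N / df[term])
        print(f"{term:<8} tf_raw={tf_raw}  tf_norm={tf_norm:.2f}  idf={idf:.3f}  "
              f"tfidf_raw={tf_raw * idf:.3f}  tfidf_norm={tf_norm * idf:.3f}")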
End of Chapter Two