Introduction to Information Retrieval System - ITDLO7024
Prof.V.E.Pawar
BVCOENM
1
Department of Information Technology
Vision:
To create professionally competent engineers in
the field of Information Technology to achieve
social transformation through technology.
Mission:
• To impart quality education by keeping pace
with rapidly changing technology.
• To encourage young minds for research and
entrepreneurship.
Program Educational
Objectives(PEO):
• To train learners with a solid foundation in
Information Technology to analyze and design
solutions for real-life problems.
• To inculcate the qualities of effective
communication skill, teamwork and ethical
responsibility to gain excellence in professional
career.
• To motivate learners for higher studies, research
activities and lifelong learning.
3
Program Outcomes(PO):
PO1: Engineering knowledge: Apply the knowledge of
mathematics, science, engineering fundamentals, and
an engineering specialization to the solution of
complex engineering problems.
PO2: Problem analysis: Identify, formulate, review
research literature, and analyze complex engineering
problems reaching substantiated conclusions using
first principles of mathematics, natural sciences, and
engineering sciences.
PO3: Design/development of solutions: Design
solutions for complex engineering problems and
design system components or processes that meet the
specified needs with appropriate consideration for the
public health and safety, and the cultural, societal,
and environmental considerations.
Program Outcomes(PO):
PO4: Conduct investigations of complex problems: Use research-
based knowledge and research methods including design of
experiments, analysis and interpretation of data, and synthesis
of the information to provide valid conclusions.
PO5:Modern tool usage: Create, select, and apply appropriate
techniques, resources, and modern engineering and IT tools
including prediction and modeling to complex engineering
activities with an understanding of the limitations.
PO9: Individual and team work: Function effectively as an
individual, and as a member or leader in diverse teams, and in
multidisciplinary settings.
Information retrieval (IR)
• in computing and information science is the process
of obtaining information system resources that are
relevant to an information need from a collection of
those resources. Searches can be based on full-text or
other content-based indexing. Information retrieval is
the science of searching for information in a
document, searching for documents themselves, and
also searching for the metadata that describes data,
and for databases of texts, images or sounds.
7
IRS
• An information retrieval (IR) system is a set of
algorithms that identify and rank the documents
most relevant to a search query.
• In simple words, it sorts and ranks documents
based on the queries of a user.
8
Outline
• Motivation
• Retrieval process
• Information system
• Definition
• Objectives
• Information vs. data retrieval
• Search engine, browser
9
Motivation
• Information retrieval (IR) deals with the
representation, storage, organization of, and
access to information items.
• The representation and organization of the
information items should provide the user
with easy access to the information in which
he is interested.
10
Definitions
• An information retrieval system is part and parcel
of a communication system.
• The main objective of information retrieval is to
supply the right information into the hands of the
right user at the right time.
• Various materials and methods are used for
retrieving our desired information.
• The term information retrieval was first introduced
by Calvin Mooers in 1951.
11
• Information retrieval is the activity of obtaining
information resources relevant to an information
need from a collection of information resources.
• It is a part of information science, which studies
activities relating to the retrieval of
information.
• Searches can be based on metadata or on full-text
indexing. We may define this process as:
12
Definitions
• Various authors and books define information
retrieval in different ways. Some of them are:
• According to Calvin Mooers (1951), “Searching
and retrieval of information from storage
according to specification by subject.”
13
Definitions
• According to J.H. Shera,
• “Information Retrieval is the process of
locating and selecting data relevant to a given
requirement.”
14
Definition
• The Oxford Dictionary defines Information Retrieval
as,
• “The tracing and recovery of specific
information from stored data.”
15
Elements of Information retrieval:
• a. Information carrier.
• b. Descriptor.
• c. Document address.
• d. Transmission of information.
16
Elements
• Information carrier:
• A carrier is something that carries, holds, or
conveys something. An information carrier,
accordingly, is something that carries or stores
information. For example: film, magnetic tape,
CD, DVD, etc.
17
• Descriptor: A term or terminology used to
search for information in storage is known as a
descriptor. Descriptors are the keywords that we
use to search for information on a storage
device.
18
• Document address: Every document must
have an address that identifies the location of
that document. Here, a document address
means an ISBN, ISSN, code number, call number,
shelf number, or file number that helps us
retrieve the information.
19
• Transmission of Information: Transmission of
information means supplying a document into
the hands of the users when needed. An
information retrieval system uses various
communication channels or networking tools
for doing this.
20
21
Basic concept
• The effective retrieval of relevant information
is directly affected both by the user task and
by the logical view of the documents adopted
by the retrieval system, as we now discuss.
22
Process of IR
23
Outline
• What is the IR problem
• How to organize an IR system? (Or the main
processes in IR)
• Indexing
• Retrieval
• System evaluation
• Some current research topics
24
The problem of IR
• Goal = find documents relevant to an information
need from a large document set
25
(Diagram: an information need is expressed as a query; the IR system matches the query against the document collection, and retrieval returns an answer list.)
Example
26
(Example: Google searching the Web, as an instance of an IR system.)
IR problem
• First applications: in libraries (1950s)
ISBN: 0-201-12227-8
Author: Salton, Gerard
Title: Automatic text processing: the transformation,
analysis, and retrieval of information by computer
Editor: Addison-Wesley
Date: 1989
Content: <Text>
• external attributes and internal attributes (content)
• Search by external attributes = Search in DB
• IR: search by content
27
Possible approaches
1.String matching (linear search in documents)
- Slow
- Difficult to improve
2.Indexing (*)
- Fast
- Flexible to further improvement
28
Indexing-based IR
(Diagram: both the document and the query are indexed (query analysis for the query) into keyword representations, which are then compared during query evaluation.)
29
Main problems in IR
• Document and query indexing
– How to best represent their contents?
• Query evaluation (or retrieval process)
– To what extent does a document
correspond to a query?
• System evaluation
– How good is a system?
– Are the retrieved documents relevant?
(precision)
– Are all the relevant documents retrieved?
(recall)
30
Document indexing
• Goal = Find the important meanings and create an internal
representation
• Factors to consider:
– Accuracy to represent meanings (semantics)
– Exhaustiveness (cover all the contents)
– Facility for computer to manipulate
• What is the best representation of contents?
– Char. string (char trigrams): not precise enough
– Word: good coverage, not precise
– Phrase: poor coverage, more precise
– Concept: poor coverage, precise
31
(Chart: coverage (recall) decreases and accuracy (precision) increases along the scale string → word → phrase → concept.)
Keyword selection and weighting
• How to select important keywords?
– Simple method: using middle-frequency words
32
(Chart: word frequency and informativity plotted against word rank, between Max. and Min.; middle-frequency words balance the two.)
• tf = term frequency
– frequency of a term/keyword in a document
The higher the tf, the higher the importance (weight) for the doc.
• df = document frequency
– no. of documents containing the term
– distribution of the term
• idf = inverse document frequency
– the unevenness of term distribution in the corpus
– the specificity of term to a document
The more the term is distributed evenly, the less it is specific to a
document
weight(t,D) = tf(t,D) * idf(t)
33
tf*idf weighting schema
Some common tf*idf schemes
• tf(t, D) = freq(t,D)
• tf(t, D) = log[freq(t,D)]
• tf(t, D) = log[freq(t,D)] + 1
• tf(t, D) = freq(t,D) / Max[freq(t,d)]
idf(t) = log(N/n), where n = #docs containing t and N = #docs in the corpus
weight(t,D) = tf(t,D) * idf(t)
• Normalization: Cosine normalization, /max, …
34
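As an illustration of the weighting scheme above, here is a minimal Python sketch of plain tf*idf with idf(t) = log(N/n); the toy corpus and the pre-stemmed terms are assumptions made only for the example.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of (already stemmed) terms.
docs = {
    "D1": ["comput", "architect", "comput"],
    "D2": ["comput", "network", "network", "network"],
}

N = len(docs)           # number of documents in the corpus
df = Counter()          # df(t) = number of documents containing t
for terms in docs.values():
    df.update(set(terms))

def tf_idf(doc_id):
    """weight(t, D) = tf(t, D) * idf(t), with idf(t) = log(N / n)."""
    tf = Counter(docs[doc_id])
    return {t: freq * math.log(N / df[t]) for t, freq in tf.items()}

print(tf_idf("D1"))   # 'comput' occurs in both docs, so its idf = log(2/2) = 0
print(tf_idf("D2"))   # 'network' is specific to D2 and gets a positive weight
```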
Document Length Normalization
• Sometimes, additional normalizations e.g.
length:
pivoted_weight(t, D) = weight(t, D) / [(1 − slope) * pivot + slope * normalized_weight(t, D)]
35
(Chart: probability of relevance and probability of retrieval plotted against document length; the two curves cross at the pivot, and the slope tilts the retrieval curve toward the relevance curve.)
Stopwords / Stoplist
• function words do not bear useful information for IR
of, in, about, with, I, although, …
• Stoplist: contains stopwords, not to be used as index terms
– Prepositions
– Articles
– Pronouns
– Some adverbs and adjectives
– Some frequent words (e.g. document)
• The removal of stopwords usually improves IR
effectiveness
• A few “standard” stoplists are commonly used.
36
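A minimal sketch of stoplist filtering; the stoplist below is a tiny illustrative sample, not one of the standard lists mentioned above.

```python
# Tiny illustrative stoplist; real systems use larger "standard" lists.
STOPLIST = {"of", "in", "about", "with", "i", "although", "the", "a", "is"}

def remove_stopwords(tokens):
    """Drop function words before indexing."""
    return [t for t in tokens if t.lower() not in STOPLIST]

print(remove_stopwords("a document about retrieval of information".split()))
# ['document', 'retrieval', 'information']
```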
Stemming
• Reason:
– Different word forms may bear similar meaning (e.g. search,
searching): create a “standard” representation for them
• Stemming:
– Removing some endings of words
computer
compute
computes
computing
computed
computation
37
comput
Porter algorithm
(Porter, M.F., 1980, An algorithm for suffix stripping, Program,
14(3) :130-137)
• Step 1: plurals and past participles
– SSES -> SS caresses -> caress
– (*v*) ING -> motoring -> motor
• Step 2: adj->n, n->v, n->adj, …
– (m>0) OUSNESS -> OUS callousness -> callous
– (m>0) ATIONAL -> ATE relational -> relate
• Step 3:
– (m>0) ICATE -> IC triplicate -> triplic
• Step 4:
– (m>1) AL -> revival -> reviv
– (m>1) ANCE -> allowance -> allow
• Step 5:
– (m>1) E -> probate -> probat
– (m > 1 and *d and *L) -> single letter controll -> control
38
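The full Porter algorithm is available in off-the-shelf toolkits; the sketch below assumes the NLTK package is installed and simply applies its PorterStemmer to the word forms from the previous slide.

```python
# Assumes NLTK is installed (pip install nltk).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["computer", "compute", "computes", "computing", "computed", "computation"]
print({w: stemmer.stem(w) for w in words})
# As on the slide, the different forms collapse to (roughly) the same stem "comput".
```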
Lemmatization
• transform to standard form according to syntactic
category.
E.g. verb + ing  verb
noun + s  noun
– Need POS tagging
– More accurate than stemming, but needs more resources
• It is crucial to choose stemming/lemmatization rules carefully:
a trade-off between noise and recognition rate,
• i.e. a compromise between precision and recall:
light/no stemming: −recall, +precision; severe stemming: +recall, −precision
39
Result of indexing
• Each document is represented by a set of weighted
keywords (terms):
D1  {(t1, w1), (t2,w2), …}
e.g. D1  {(comput, 0.2), (architect, 0.3), …}
D2  {(comput, 0.1), (network, 0.5), …}
• Inverted file:
comput  {(D1,0.2), (D2,0.1), …}
Inverted file is used during retrieval for higher efficiency.
40
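A small sketch of how the inverted file above can be built from the per-document weighted-keyword representations; the weights are the toy values from the slide.

```python
from collections import defaultdict

# Document representations produced by indexing: doc -> {term: weight}.
doc_index = {
    "D1": {"comput": 0.2, "architect": 0.3},
    "D2": {"comput": 0.1, "network": 0.5},
}

def build_inverted_file(doc_index):
    """Invert the document index: term -> [(doc, weight), ...]."""
    inverted = defaultdict(list)
    for doc_id, terms in doc_index.items():
        for term, weight in terms.items():
            inverted[term].append((doc_id, weight))
    return inverted

inverted = build_inverted_file(doc_index)
print(inverted["comput"])   # [('D1', 0.2), ('D2', 0.1)]
```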
Retrieval
• The problems underlying retrieval
– Retrieval model
• How is a document represented with the selected
keywords?
• How are document and query representations
compared to calculate a score?
– Implementation
41
Cases
• 1-word query:
The documents to be retrieved are those that include the
word
- Retrieve the inverted list for the word
- Sort in decreasing order of the weight of the word
• Multi-word query?
- Combining several lists
- How to interpret the weight?
(IR model)
42
IR models
• Matching score model
– Document D = a set of weighted keywords
– Query Q = a set of non-weighted keywords
– R(D, Q) = Σi w(ti, D), where ti is in Q.
43
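A minimal sketch of the matching score model evaluated through an inverted file: the score of a document is the sum of its weights for the (unweighted) query terms. The posting values are illustrative.

```python
from collections import defaultdict

# Inverted file: term -> list of (doc, weight) postings.
inverted = {
    "comput":  [("D1", 0.2), ("D2", 0.1)],
    "network": [("D2", 0.5)],
}

def matching_score(query_terms, inverted):
    """R(D, Q) = sum of w(t, D) over the query terms t found in D."""
    scores = defaultdict(float)
    for t in query_terms:
        for doc_id, weight in inverted.get(t, []):
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

print(matching_score(["comput", "network"], inverted))
# [('D2', 0.6), ('D1', 0.2)]
```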
Boolean model
– Document = logical conjunction of keywords
– Query = Boolean expression of keywords
– R(D, Q) = D → Q
e.g. D = t1 ∧ t2 ∧ … ∧ tn
Q = (t1 ∧ t2) ∨ (t3 ∧ t4)
D → Q, thus R(D, Q) = 1.
Problems:
– R is either 1 or 0 (unordered set of documents)
– many documents or few documents
– End-users cannot manipulate Boolean operators correctly
E.g. documents about kangaroos and koalas
44
Extensions to Boolean model
(for document ordering)
• D = {…, (ti, wi), …}: weighted keywords
• Interpretation:
– D is a member of class ti to degree wi.
– In terms of fuzzy sets: μti(D) = wi
A possible evaluation:
R(D, ti) = μti(D);
R(D, Q1 ∧ Q2) = min(R(D, Q1), R(D, Q2));
R(D, Q1 ∨ Q2) = max(R(D, Q1), R(D, Q2));
R(D, ¬Q1) = 1 − R(D, Q1).
45
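A small sketch of the fuzzy-set evaluation above, with queries written as nested tuples (an assumed representation chosen only for the example): AND maps to min, OR to max, and NOT to 1 − R.

```python
# A document as term -> membership degree (weight).
D = {"t1": 0.8, "t2": 0.3, "t3": 0.6}

def R(doc, query):
    """Evaluate ("term", t), ("and", q1, q2), ("or", q1, q2) or ("not", q)."""
    op = query[0]
    if op == "term":
        return doc.get(query[1], 0.0)
    if op == "and":
        return min(R(doc, query[1]), R(doc, query[2]))
    if op == "or":
        return max(R(doc, query[1]), R(doc, query[2]))
    if op == "not":
        return 1.0 - R(doc, query[1])
    raise ValueError(op)

# Q = (t1 AND t2) OR (t3 AND NOT t4)
Q = ("or", ("and", ("term", "t1"), ("term", "t2")),
           ("and", ("term", "t3"), ("not", ("term", "t4"))))
print(R(D, Q))   # max(min(0.8, 0.3), min(0.6, 1 - 0.0)) = 0.6
```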
Vector space model
• Vector space = all the keywords encountered:
<t1, t2, t3, …, tn>
• Document:
D = <a1, a2, a3, …, an>, where ai = weight of ti in D
• Query:
Q = <b1, b2, b3, …, bn>, where bi = weight of ti in Q
• R(D, Q) = Sim(D, Q)
46
Matrix representation
t1 t2 t3 … tn
D1 a11 a12 a13 … a1n
D2 a21 a22 a23 … a2n
D3 a31 a32 a33 … a3n
…
Dm am1 am2 am3 … amn
Q b1 b2 b3 … bn
47
(In this matrix, the columns span the term vector space and the rows span the document space.)
Some formulas for Sim
• Dot product: Sim(D, Q) = Σi (ai * bi)
• Cosine: Sim(D, Q) = Σi (ai * bi) / sqrt(Σi ai² * Σi bi²)
• Dice: Sim(D, Q) = 2 * Σi (ai * bi) / (Σi ai² + Σi bi²)
• Jaccard: Sim(D, Q) = Σi (ai * bi) / (Σi ai² + Σi bi² − Σi (ai * bi))
48
(Diagram: D and Q drawn as vectors over axes t1 and t2; the cosine measure corresponds to the angle between them.)
Implementation (space)
• Matrix is very sparse: a few 100s terms for a
document, and a few terms for a query, while
the term space is large (~100k)
• Stored as:
D1  {(t1, a1), (t2,a2), …}
t1  {(D1,a1), …}
49
Implementation (time)
• The implementation of VSM with dot product:
– Naïve implementation: O(m*n)
– Implementation using inverted file:
Given a query = {(t1,b1), (t2,b2)}:
1. find the sets of related documents through inverted file for t1 and t2
2. calculate the score of the documents to each weighted term
(t1,b1)  {(D1,a1 *b1), …}
3. combine the sets and sum the weights ()
– O(|Q|*n)
50
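A sketch combining the two implementation ideas above: dot products are accumulated through the inverted file, then divided by precomputed document norms and the query norm to obtain the cosine. The weights and norms are toy values.

```python
import math
from collections import defaultdict

# Inverted file with tf*idf weights: term -> [(doc, weight), ...]
inverted = {
    "t1": [("D1", 0.2), ("D2", 0.4)],
    "t2": [("D1", 0.5)],
}
# Precomputed document norms sqrt(sum a_i^2), as suggested on the next slide.
doc_norm = {"D1": math.sqrt(0.2**2 + 0.5**2), "D2": math.sqrt(0.4**2)}

def cosine_retrieve(query, inverted, doc_norm):
    """Accumulate dot products via the inverted file, then normalize (cosine)."""
    scores = defaultdict(float)
    for term, b in query.items():                 # query = {term: weight b_i}
        for doc_id, a in inverted.get(term, []):
            scores[doc_id] += a * b               # partial sums of a_i * b_i
    q_norm = math.sqrt(sum(b * b for b in query.values()))
    return sorted(((d, s / (doc_norm[d] * q_norm)) for d, s in scores.items()),
                  key=lambda x: x[1], reverse=True)

print(cosine_retrieve({"t1": 1.0, "t2": 0.5}, inverted, doc_norm))
```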
Other similarities
• Cosine:
– use sqrt(Σj aj²) and sqrt(Σj bj²) to normalize the weights
after indexing
– the cosine then reduces to a dot product:
Sim(D, Q) = Σi (ai * bi) / (sqrt(Σj aj²) * sqrt(Σj bj²))
(Similar operations do not apply to Dice and
Jaccard)
51
Probabilistic model
• Given D, estimate P(R|D) and P(NR|D)
• P(R|D)=P(D|R)*P(R)/P(D) (P(D), P(R) constant)
∝ P(D|R)
D = {t1 = x1, t2 = x2, …}, where xi = 1 if ti is present in D and xi = 0 if it is absent
P(D|R) = Πi P(ti = xi | R)
       = Πi P(ti = 1 | R)^xi * P(ti = 0 | R)^(1 − xi)
       = Πi pi^xi * (1 − pi)^(1 − xi)
P(D|NR) = Πi P(ti = 1 | NR)^xi * P(ti = 0 | NR)^(1 − xi)
        = Πi qi^xi * (1 − qi)^(1 − xi)
52
Prob. model (cont’d)
Odd(D) = log [ P(D|R) / P(D|NR) ]
       = log Πi [ pi^xi (1 − pi)^(1 − xi) ] / [ qi^xi (1 − qi)^(1 − xi) ]
       = Σi xi * log (pi / qi) + Σi (1 − xi) * log [ (1 − pi) / (1 − qi) ]
       = Σi xi * log [ pi (1 − qi) / (qi (1 − pi)) ] + Σi log [ (1 − pi) / (1 − qi) ]
53
For document ranking
Prob. model (cont’d)
• How to estimate pi and qi?
• A set of N relevant and
irrelevant samples:
                 Rel. doc.     Irrel. doc.           Total
with ti          ri            ni − ri               ni
without ti       Ri − ri       N − Ri − ni + ri      N − ni
Total            Ri            N − Ri                N

pi = ri / Ri
qi = (ni − ri) / (N − Ri)
54
Prob. model (cont’d)
• Smoothing (Robertson-Sparck-Jones formula)
• When no sample is available:
pi = 0.5,
qi = (ni + 0.5) / (N + 0.5) ≈ ni / N
• May be implemented as VSM
Odd(D) = Σi xi * log [ pi (1 − qi) / (qi (1 − pi)) ]
       = Σi xi * log [ ri (N − Ri − ni + ri) / ((Ri − ri)(ni − ri)) ]
With the smoothed estimates:
Odd(D) = Σi xi * log [ (ri + 0.5)(N − Ri − ni + ri + 0.5) / ((Ri − ri + 0.5)(ni − ri + 0.5)) ] = Σ_{ti ∈ D} wi
55
BM25
• k1, k2, k3, b: parameters
• qtf: query term frequency
• dl: document length
• avdl: average document length
Score(D, Q) = Σ_{t ∈ Q} w * [ (k1 + 1) * tf / (K + tf) ] * [ (k3 + 1) * qtf / (k3 + qtf) ]
              + k2 * |Q| * (avdl − dl) / (avdl + dl)
where K = k1 * [ (1 − b) + b * dl / avdl ]
56
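A hedged sketch of the BM25 score above; idf(t) is used as the term weight w, and the parameter values (k1 = 1.2, b = 0.75, k3 = 8, k2 = 0) are common defaults assumed for illustration.

```python
def bm25(query_tf, doc_tf, idf, dl, avdl, k1=1.2, k2=0.0, k3=8.0, b=0.75):
    """Score(D, Q) following the formula above; idf(t) plays the role of w."""
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for t, qtf in query_tf.items():
        tf = doc_tf.get(t, 0)
        score += (idf.get(t, 0.0)
                  * ((k1 + 1) * tf / (K + tf))
                  * ((k3 + 1) * qtf / (k3 + qtf)))
    # Document-length correction, added once per query (k2 is often 0, so it vanishes).
    score += k2 * len(query_tf) * (avdl - dl) / (avdl + dl)
    return score

# Toy example with assumed idf values and document lengths.
idf = {"information": 1.2, "retrieval": 2.3}
print(bm25({"information": 1, "retrieval": 1},
           {"information": 3, "retrieval": 2}, idf, dl=120, avdl=100))
```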
(Classic) Presentation of results
• Query evaluation result is a list of documents, sorted
by their similarity to the query.
• E.g.
doc1 0.67
doc2 0.65
doc3 0.54
…
57
System evaluation
• Efficiency: time, space
• Effectiveness:
– How is a system capable of retrieving relevant documents?
– Is a system better than another one?
• Metrics often used (together):
– Precision = retrieved relevant docs / retrieved docs
– Recall = retrieved relevant docs / relevant docs
(Diagram: Venn diagram of the relevant and retrieved document sets; precision and recall measure their overlap.)
58
General form of precision/recall
(Chart: precision plotted against recall, both on a 0–1.0 scale.)
59
- Precision changes w.r.t. recall (it is not a fixed point)
- Systems cannot be compared at a single precision/recall point
- Average precision (on 11 points of recall: 0.0, 0.1, …, 1.0)
An illustration of P/R calculation
60
List Rel?
Doc1 Y
Doc2
Doc3 Y
Doc4 Y
Doc5
…
Precision/recall points computed after each rank (recall, precision):
(0.2, 1.0), (0.2, 0.5), (0.4, 0.67), (0.6, 0.75), (0.6, 0.6), …
Assume: 5 relevant docs.
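A small sketch reproducing the calculation above: precision and recall are recomputed after each rank of the result list.

```python
def precision_recall_points(ranked_rel_flags, total_relevant):
    """Compute (recall, precision) after each rank of the result list."""
    points, rel_so_far = [], 0
    for rank, is_rel in enumerate(ranked_rel_flags, start=1):
        rel_so_far += is_rel
        points.append((rel_so_far / total_relevant, rel_so_far / rank))
    return points

# Ranked list from the slide: Doc1 Y, Doc2, Doc3 Y, Doc4 Y, Doc5 (5 relevant in total).
print(precision_recall_points([True, False, True, True, False], total_relevant=5))
# [(0.2, 1.0), (0.2, 0.5), (0.4, 0.666...), (0.6, 0.75), (0.6, 0.6)]
```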
MAP (Mean Average Precision)
• rij = rank of the j-th relevant document for Qi
• |Ri| = #rel. doc. for Qi
• n = # test queries
• E.g. two test queries; the relevant documents appear at these ranks:
Q1: ranks 1, 5, 10 (1st, 2nd, 3rd rel. doc.)
Q2: ranks 4, 8 (1st, 2nd rel. doc.)
MAP = (1/n) * Σi [ (1/|Ri|) * Σ_{Dij ∈ Ri} (j / rij) ]
MAP = 1/2 * [ 1/3 * (1/1 + 2/5 + 3/10) + 1/2 * (1/4 + 2/8) ]
61
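A minimal sketch of the MAP formula, applied to the two-query example above.

```python
def average_precision(rel_ranks, num_relevant):
    """AP for one query: mean of j / r_ij over its relevant documents."""
    return sum(j / r for j, r in enumerate(sorted(rel_ranks), start=1)) / num_relevant

def mean_average_precision(queries):
    """MAP = mean of the per-query average precisions."""
    return sum(average_precision(ranks, n_rel) for ranks, n_rel in queries) / len(queries)

# The slide's example: Q1 has relevant docs at ranks 1, 5, 10; Q2 at ranks 4, 8.
print(mean_average_precision([([1, 5, 10], 3), ([4, 8], 2)]))
# 1/2 * [1/3*(1/1 + 2/5 + 3/10) + 1/2*(1/4 + 2/8)] ≈ 0.408
```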
Some other measures
• Noise = retrieved irrelevant docs / retrieved docs
• Silence = non-retrieved relevant docs / relevant docs
– Noise = 1 – Precision; Silence = 1 – Recall
• Fallout = retrieved irrel. docs / irrel. docs
• Single value measures:
– F-measure = 2 P * R / (P + R)
– Average precision = average at 11 points of recall
– Precision at n documents (often used for Web IR)
– Expected search length (no. irrelevant documents to read before
obtaining n relevant doc.)
62
Test corpus
• Compare different IR systems on the same test
corpus
• A test corpus contains:
–A set of documents
–A set of queries
–Relevance judgment for every document-query pair
(desired answers for each query)
• The results of a system are compared with the
desired answers.
63
An evaluation example
(SMART)
Run number: 1 2
Num_queries: 52 52
Total number of documents over all
queries
Retrieved: 780 780
Relevant: 796 796
Rel_ret: 246 229
Recall - Precision Averages:
at 0.00 0.7695 0.7894
at 0.10 0.6618 0.6449
at 0.20 0.5019 0.5090
at 0.30 0.3745 0.3702
at 0.40 0.2249 0.3070
at 0.50 0.1797 0.2104
at 0.60 0.1143 0.1654
at 0.70 0.0891 0.1144
at 0.80 0.0891 0.1096
at 0.90 0.0699 0.0904
at 1.00 0.0699 0.0904
Average precision for all points
11-pt Avg: 0.2859 0.3092
% Change: 8.2
Recall:
Exact: 0.4139 0.4166
at 5 docs: 0.2373 0.2726
at 10 docs: 0.3254 0.3572
at 15 docs: 0.4139 0.4166
at 30 docs: 0.4139 0.4166
Precision:
Exact: 0.3154 0.2936
At 5 docs: 0.4308 0.4192
At 10 docs: 0.3538 0.3327
At 15 docs: 0.3154 0.2936
At 30 docs: 0.1577 0.1468
64
The TREC experiments
• Once per year
• A set of documents and queries are distributed to
the participants (the standard answers are
unknown) (April)
• Participants work (very hard) to construct, fine-tune
their systems, and submit the answers (1000/query)
at the deadline (July)
• NIST people manually evaluate the answers and
provide correct answers (and classification of IR
systems) (July – August)
• TREC conference (November)
65
TREC evaluation methodology
• Known document collection (>100K) and query set (50)
• Submission of 1000 documents for each query by each
participant
• Merge 100 first documents of each participant -> global pool
• Human relevance judgment of the global pool
• The other documents are assumed to be irrelevant
• Evaluation of each system (with 1000 answers)
– Partial relevance judgments
– But stable for system ranking
66
Tracks (tasks)
• Ad Hoc track: given document collection, different topics
• Routing (filtering): stable interests (user profile), incoming
document flow
• CLIR: Ad Hoc, but with queries in a different language
• Web: a large set of Web pages
• Question-Answering: When did Nixon visit China?
• Interactive: put users into action with system
• Spoken document retrieval
• Image and video retrieval
• Information tracking: new topic / follow up
67
CLEF and NTCIR
• CLEF = Cross-Language Evaluation Forum
– for European languages
– organized by Europeans
– Once per year (March – Oct.)
• NTCIR:
– Organized by NII (Japan)
– For Asian languages
– cycle of 1.5 year
68
Impact of TREC
• Provide large collections for further experiments
• Compare different systems/techniques on
realistic data
• Develop new methodology for system evaluation
• Similar experiments are organized in other areas
(NLP, Machine translation, Summarization, …)
69
Some techniques to improve
IR effectiveness
• Interaction with user (relevance feedback)
- Keywords only cover part of the contents
- User can help by indicating relevant/irrelevant
document
• The use of relevance feedback
– To improve query expression:
Qnew = *Qold + *Rel_d - *Nrel_d
where Rel_d = centroid of relevant documents
NRel_d = centroid of non-relevant documents
70
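A sketch of this feedback formula (Rocchio-style reweighting); the α, β, γ values and the dictionary-based vectors are assumptions for the example, and negative weights are clipped as is commonly done.

```python
from collections import defaultdict

def rocchio(q_old, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Qnew = alpha*Qold + beta*centroid(rel) - gamma*centroid(nonrel); vectors are dicts."""
    def centroid(docs):
        c = defaultdict(float)
        if not docs:
            return c
        for d in docs:
            for t, w in d.items():
                c[t] += w / len(docs)
        return c

    q_new = defaultdict(float)
    for t, w in q_old.items():
        q_new[t] += alpha * w
    for t, w in centroid(rel_docs).items():
        q_new[t] += beta * w
    for t, w in centroid(nonrel_docs).items():
        q_new[t] -= gamma * w
    # Negative weights are usually clipped to zero.
    return {t: w for t, w in q_new.items() if w > 0}

q = {"information": 1.0, "retrieval": 1.0}
rel = [{"information": 0.5, "retrieval": 0.4, "index": 0.3}]
nonrel = [{"information": 0.2, "music": 0.9}]
print(rocchio(q, rel, nonrel))
```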
Effect of RF
71
(Diagram: relevant (*) and non-relevant (x) documents scattered around the original query Q; the modified query Qnew moves toward the relevant region, so the 2nd retrieval captures more relevant documents than the 1st retrieval.)
Modified relevance feedback
• Users usually do not cooperate (e.g.
AltaVista in early years)
• Pseudo-relevance feedback (Blind RF)
– Using the top-ranked documents as if they are
relevant:
• Select m terms from n top-ranked documents
– One can usually obtain about 10% improvement
72
Query expansion
• A query contains part of the important words
• Add new (related) terms into the query
– Manually constructed knowledge base/thesaurus (e.g.
Wordnet)
• Q = information retrieval
• Q’ = (information + data + knowledge + …)
(retrieval + search + seeking + …)
– Corpus analysis:
• two terms that often co-occur are related (Mutual
information)
• Two terms that co-occur with the same words are related
(e.g. T-shirt and coat with wear, …) 73
Global vs. local context analysis
• Global analysis: use the whole document collection
to calculate term relationships
• Local analysis: use the query to retrieve a subset of
documents, then calculate term relationships
– Combine pseudo-relevance feedback and term co-
occurrences
– More effective than global analysis
74
Some current research topics:
Go beyond keywords
• Keywords are not perfect representatives of concepts
– Ambiguity:
table = data structure, furniture?
– Lack of precision:
“operating”, “system” less precise than “operating_system”
• Suggested solution
– Sense disambiguation (difficult due to the lack of contextual
information)
– Using compound terms (no complete dictionary of compound
terms, variation in form)
– Using noun phrases (syntactic patterns + statistics)
• Still a long way to go
75
Theory …
• Bayesian networks
–P(Q|D)
(Diagram: an inference network with document nodes D1 … Dm, term nodes t1 … tn, and concept nodes c1 … cl; the query Q is inferred and revised through the network.)
• Language models
76
Logical models
• How to describe the relevance relation
as a logical relation?
D => Q
• What are the properties of this relation?
• How to combine uncertainty with a
logical framework?
• The problem: What is relevance?
77
Related applications:
Information filtering
• IR: changing queries on stable document collection
• IF: incoming document flow with stable interests (queries)
– yes/no decision (instead of ordering documents)
– Advantage: the description of user’s interest may be improved
using relevance feedback (the user is more willing to cooperate)
– Difficulty: adjust threshold to keep/ignore document
– The basic techniques used for IF are the same as those for IR –
“Two sides of the same coin”
78
(Diagram: in IF, an incoming document stream … doc3, doc2, doc1 is matched against a user profile, and each document is either kept or ignored.)
IR for (semi-)structured documents
• Using structural information to assign weights to
keywords (Introduction, Conclusion, …)
– Hierarchical indexing
• Querying within some structure (search in title,
etc.)
– INEX experiments
• Using hyperlinks in indexing and retrieval (e.g.
Google)
• …
79
PageRank in Google
• Assign a numeric value to each page
• The more a page is referred to by important pages, the more this page
is important
• d: damping factor (0.85)
• Many other criteria: e.g. proximity of query words
– “…information retrieval …” better than “… information … retrieval …”
PR(A) = (1 − d) + d * Σi PR(Ii) / C(Ii)
where the Ii are the pages pointing to A and C(Ii) is the number of outgoing links of Ii.
80
(Diagram: pages I1 and I2 both link to page A, which links to page B; A's PageRank is derived from the PageRank of I1 and I2.)
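A minimal power-iteration sketch of the formula above, using the non-normalized form PR(A) = (1 − d) + d * Σ PR(Ii)/C(Ii); the link graph reproduces the small example in the diagram.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively apply PR(A) = (1 - d) + d * sum(PR(I)/C(I)) over pages I linking to A."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = sum(pr[src] / len(targets)
                           for src, targets in links.items() if page in targets)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# The slide's small example: I1 and I2 both link to A, and A links to B.
print(pagerank({"I1": ["A"], "I2": ["A"], "A": ["B"]}))
```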
IR on the Web
• No stable document collection (spider,
crawler)
• Invalid document, duplication, etc.
• Huge number of documents (partial
collection)
• Multimedia documents
• Great variation of document quality
• Multilingual problem
• …
81
Final remarks on IR
• IR is related to many areas:
– NLP, AI, database, machine learning, user modeling…
– library, Web, multimedia search, …
• Relatively weak theories
• Very strong tradition of experiments
• Many remaining (and exciting) problems
• Difficult area: Intuitive methods do not necessarily
improve effectiveness in practice
82
Why is IR difficult
• Vocabulary mismatch
– Synonymy: e.g. car vs. automobile
– Polysemy: table
• Queries are ambiguous; they are only partial specifications of the user’s
need
• Content representation may be inadequate and incomplete
• The user is the ultimate judge, but we don’t know how the
judge judges…
– The notion of relevance is imprecise, context- and user-dependent
• But how rewarding it is to gain a 10% improvement!
83
