1
Information Retrieval
Acknowledgements:
Dr Mounia Lalmas (QMW)
Dr Joemon Jose (Glasgow)
2
Course Text
 Modern Information
Retrieval,
 R. Baeza-Yates and B.
Ribeiro-Neto,
 Addison-Wesley and ACM
Press, 1999,
 ISBN: 0-201-39829-X
3
Introduction
 Example of information need in the context of the world
wide web:
 “Find all documents containing information on computer
courses which:
(1) are offered by universities in South England, and
(2) are accredited by the BCS/IEE bodies. To be
relevant, the document must include information on
admission requirements, and an e-mail address and phone
number for contact purposes.”
 Information Retrieval
4
Information Retrieval
[Diagram: Information Need → Query → Retrieval System (Search Engine) over Documents → Set of retrieved documents → Useful or relevant information to the user]
Primary goal of an IR system
“Retrieve all the documents which are relevant to a user query,
while retrieving as few non-relevant documents as possible.”
 Representation, storage, organisation, and access to
information items
 (Usually) keyword-based representation
5
User tasks
 Pull technology
 User requests
information in an
interactive manner
 3 retrieval tasks
– Browsing (hypertext)
– Retrieval (classical IR
systems)
– Browsing and retrieval
(modern digital libraries
and web systems)
 Push technology
– automatic and
permanent pushing of
information to user
– software agents
– example: news service
– filtering (retrieval
task) relevant
information for later
inspection by user
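The filtering task under push technology can be sketched as a standing user profile matched against incoming items. This is a minimal illustration only: the function name `filter_stream`, the `threshold` parameter, and the sample news items are assumptions made for the example, not part of the course material.

```python
def filter_stream(items, profile, threshold=1):
    """Yield incoming items that share at least `threshold` terms with the profile."""
    profile_terms = {t.lower() for t in profile}
    for item in items:
        overlap = profile_terms & set(item.lower().split())
        if len(overlap) >= threshold:
            yield item

# Hypothetical news feed and user profile.
news = [
    "Election results announced",
    "New retrieval engine released",
    "Football scores",
]
kept = list(filter_stream(news, profile=["retrieval", "indexing"]))
print(kept)
```

The profile stays fixed while the items change, which is the reverse of ad hoc retrieval, where the collection stays fixed and the queries change.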
6
Documents
 Unit of retrieval
 A passage of free text
– composed of text, strings of characters
from an alphabet
– composed of natural language
 newspaper article, a journal paper, a
dictionary definition, email messages
– size of documents
 arbitrary
 newspaper article vs. journal paper vs. email
7
What is a document?
8
Representation of documents
 Set of index terms or keywords
– extracted directly from text
– specified by human subjects (information science)  metadata
 Most concise representation
 Poor quality of retrieval
 Full text representation
– Most complete representation
– High computational cost
 Large collections
– Reduce set of representative keywords
 Elimination of stop words
 Stemming
 Identification of noun phrases
 Further compression
 Structure representation
– Chapter, section, sub-section, etc
Document term descriptors to access texts
Generation of descriptors for text
• By hand
• By analysing the text
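Generating descriptors by analysing the text can be sketched as tokenising, eliminating stop words, and stemming. The stop-word list and the suffix list below are tiny hypothetical stand-ins chosen for the example; the stemmer is a crude suffix stripper, not the Porter algorithm or any other published stemmer.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "by", "for", "in", "is", "of", "on", "the", "to"}

# Hypothetical suffix list for a crude stemmer.
SUFFIXES = ("ing", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lower-case, tokenise, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(t) for t in tokens if t not in STOP_WORDS}

terms = index_terms("Indexing the documents of large collections")
print(sorted(terms))
```

The resulting set is a reduced keyword representation of the text: smaller than the full text, at the cost of losing word order and some distinctions between word forms.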
9
The retrieval process
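[Diagram: Information need → (Formulation) → Query; Documents → (Indexing) → Document representation; Query and Document representation → Retrieval functions → Retrieved documents; Relevance feedback from the retrieved documents back to the query]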
10
Queries
 Information Need
 Simple queries
– composed of two or three, perhaps even
dozens, of keywords
– e.g., as in web retrieval
 Boolean queries
– “neural networks AND speech recognition”
 Context Queries
– Proximity search, phrase queries
User term descriptors characterising the user need
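The Boolean and context query types can be sketched over a toy collection. The documents and the helper names `contains_phrase`, `boolean_and`, and `within_distance` are assumptions made for illustration; a real system would answer these queries from an index rather than by scanning the text.

```python
# Hypothetical three-document collection.
docs = {
    1: "neural networks for speech recognition",
    2: "speech recognition with hidden markov models",
    3: "neural networks in image processing",
}

def contains_phrase(text, phrase):
    """Phrase query: the words of `phrase` occur consecutively in `text`."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

def boolean_and(docs, phrases):
    """Boolean AND: ids of documents that contain every phrase."""
    return sorted(doc_id for doc_id, text in docs.items()
                  if all(contains_phrase(text, p) for p in phrases))

def within_distance(text, first, second, k):
    """Proximity search: the two words occur at most k positions apart."""
    words = text.lower().split()
    pos_first = [i for i, w in enumerate(words) if w == first]
    pos_second = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= k for i in pos_first for j in pos_second)

hits = boolean_and(docs, ["neural networks", "speech recognition"])
print(hits)
```

Here the query “neural networks AND speech recognition” is treated as the conjunction of two phrase queries, which is one possible reading of the slide's example.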
11
Best-Match Retrieval
 Compare the terms in a document
and query
 Compute similarity between each
document in the collection and the
query based on the terms that they
have in common
 Sort the documents in order of
decreasing similarity with the query
 The output is a ranked list
displayed to the user; the top documents
are the most relevant as judged by the
system
Document term descriptors to access texts
User term descriptors characterising the user need
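The best-match steps above can be sketched as follows. The similarity used here is simply the number of terms a document shares with the query, which is an assumption made to keep the example short; practical systems usually weight terms rather than count them. The function names and sample documents are hypothetical.

```python
def terms(text):
    """Represent a text as the set of its lower-cased words."""
    return set(text.lower().split())

def rank(docs, query):
    """Return (doc_id, score) pairs sorted by decreasing similarity."""
    q = terms(query)
    scored = [(doc_id, len(q & terms(text))) for doc_id, text in docs.items()]
    # Keep only documents sharing at least one term with the query.
    scored = [(d, s) for d, s in scored if s > 0]
    # Highest score first; ties are broken by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical collection.
docs = {
    "d1": "information retrieval systems",
    "d2": "database systems",
    "d3": "retrieval of information from text",
}
ranking = rank(docs, "text information retrieval")
print(ranking)
```

Unlike a Boolean query, a document does not need to contain every query term to be retrieved; it only ranks lower when it shares fewer of them.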
12
Conceptual View of Text
Retrieval
[Diagram: Queries and Documents → Similarity Computation → Retrieved Documents]
13
Expanded view of text retrieval
system
[Diagram: Documents → Indexing → Indexed Documents; Queries and Indexed Documents → Similarity Computation → Retrieved Documents → Ranked Documents]
14
Process of retrieving info
[Diagram: components are User Interface, Text Operations, Query Operations, Indexing, Similarity Computation, Ranking, Document Repository Manager, Index, and Text repository. Flows between them are labelled: user need, text, logical view, inverted file, query, retrieved docs, ranked docs, user feedback.]
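The inverted file produced by the indexing step can be sketched as a mapping from each term to the documents that contain it. This is a minimal in-memory version under simplifying assumptions (whitespace tokenisation, no stop words or stemming); the function name and sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical collection.
docs = {1: "text retrieval", 2: "text indexing", 3: "ranking retrieved text"}
index = build_inverted_file(docs)
print(index["text"])
```

With such a structure, similarity computation only needs to visit the postings of the query terms instead of scanning every document in the repository.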
15
Key Topics
 Indexing text documents
 Retrieving text documents
 Evaluation
 Query reformulations
Search Engines
=
IR + Link Structure + Name Interpretation
16
Information Retrieval
vs Information Extraction
 Information Retrieval
– Given a set of query terms and a set of document
terms, select only the most relevant documents
[precision], and preferably all the relevant ones [recall].
 Information Extraction
– Extract from the text what the document means.
 IR systems can FIND documents but need not
“understand” them
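The two measures named in brackets above can be computed from a retrieved set and a set of documents judged relevant. The sketch below uses the standard set-based definitions; the function name and the sample document ids are assumptions for illustration.

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result set and relevance judgements.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(p, r)
```

Two of the four retrieved documents are relevant, and two of the three relevant documents were retrieved, so retrieving more documents could raise recall while lowering precision.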
