Lecture 01
Information Retrieval
About the Course
 Book:
 An Introduction to Information Retrieval, Christopher D.
Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge
University Press, 2009.
 Other materials may be considered depending on the subject.
 Principal objective of this course:
 To introduce students to Information Retrieval concepts,
paradigms and techniques, with an emphasis on string-based and
semantics-based IR techniques.
About the Course
 Grading & Assessment:
 First Exam …………………….. 20%
 Second Exam ………………….. 20%
 Final Exam …………………….. 35%
 Other Activities ………………. 10%
 Major Assignment ……………. 15%
“You are to build a prototype for a search engine that employs
both text-based and semantics-based techniques for retrieving the
most relevant results to users’ queries. The search space will be a
collection of documents, in addition to a collection of images
associated with some textual descriptions”.
Course Topics
 Part 01 – Introduction
 What is IR?
 Examples of IR Systems.
 Other topics related to IR.
 Models of IR
 Part 02 – Boolean Retrieval
 What is Boolean IR?
 Term-Document Incidence Matrices
 Terminology and Notations
Course Topics
 Part 03 – Indexing
 Building Indexes
 Semantic Networks
 Part 04 – Retrieval
 Scoring, Ranking
 Relevance Feedback
 Precision/Recall
Course Topics
 Part 05 – Exploiting Ontologies in IR
 Ontologies
 Traditional vs. Semantics-based IR techniques
Introduction
What is IR
 Information Retrieval:
“Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).”
 Unstructured Data:
“refers to data which does not have clear, semantically overt,
easy-for-a-computer structure.”
e.g.  Textual information in web pages.
 Semistructured Data:
“refers to data which have a partially clear, semantically overt,
easy-for-a-computer structure.”
e.g.  finding a document where the title contains Java and the body
contains threading.
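The semistructured example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-document collection and the helper names `field_contains` and `fielded_search` are hypothetical and do not come from the course or the textbook.

```python
import re

# Hypothetical toy collection: each document has a title field and a body field.
DOCS = [
    {"id": 1, "title": "Java Concurrency in Practice",
     "body": "Covers threading, locks, and executors."},
    {"id": 2, "title": "Learning Python",
     "body": "An introduction; threading is covered briefly."},
    {"id": 3, "title": "Java for Beginners",
     "body": "Variables, loops, and classes."},
]


def field_contains(doc, field, term):
    """Case-insensitive whole-word test of `term` against one field."""
    words = re.findall(r"\w+", doc[field].lower())
    return term.lower() in words


def fielded_search(docs, title_term, body_term):
    """Return ids of documents whose title and body both match."""
    return [
        d["id"]
        for d in docs
        if field_contains(d, "title", title_term)
        and field_contains(d, "body", body_term)
    ]


result = fielded_search(DOCS, "Java", "threading")
print(result)
```

Only the document whose title mentions Java and whose body mentions threading is returned. A purely unstructured search could not tell the two fields apart.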
Introduction
What is IR
 Structured Data:
“refers to data which have a clear, semantically overt,
easy-for-a-computer structure.”
e.g.  Relational Databases.
 A look back: 1990s
 Studies showed that most people preferred getting
information from other people rather than from information
retrieval systems.
 Online booking systems?
 Following this period, and after relentless optimization of
IR:
 The field of information retrieval has moved from being a
primarily academic discipline to being the basis underlying
most people’s preferred means of information access.
Introduction
What is IR
 Information retrieval did not begin with the Web.
 The field began with scientific publications and library
records, but soon spread to other forms of content, particularly
those of information professionals, such as journalists, lawyers,
and doctors.
Introduction
Other Topics Related to IR
 Cross-language IR
 Multimedia IR
 Speech retrieval
 User interfaces for IR
 Ontology and Semantics-based IR
 Natural Language Processing (NLP) techniques
 Dynamic IR
 Online Advertising !?
Introduction
Other Topics Related to IR
 The field of information retrieval also covers supporting users in
browsing or filtering document collections or further processing
a set of retrieved documents.
 Given a set of documents, clustering is the task of coming up
with a good grouping of the documents based on their contents.
 Given a set of topics, standing information needs, or other
categories (such as suitability of texts for different age groups),
classification is the task of deciding which class(es), if any,
each of a set of documents belongs to. It is often approached by
first manually classifying some documents and then hoping to
be able to classify new documents automatically.
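The classification workflow described above (label a few documents by hand, then classify new ones automatically) can be sketched as follows. This is a deliberately crude term-count classifier, not a method prescribed by the course; the training texts, the labels "children" and "adult", and the function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical manually classified documents (text, class label).
TRAINING = [
    ("the cat sat on the mat", "children"),
    ("a dog and a cat play in the garden", "children"),
    ("the court ruled on the tax appeal", "adult"),
    ("tax law and court procedure explained", "adult"),
]


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def train(labeled_docs):
    """Build one term-frequency profile per class."""
    profiles = defaultdict(Counter)
    for text, label in labeled_docs:
        profiles[label].update(tokenize(text))
    return profiles


def classify(profiles, text):
    """Pick the class whose profile shares the most term occurrences
    with the new document; return None if no term is known."""
    tokens = tokenize(text)
    scores = {
        label: sum(counts[t] for t in tokens)
        for label, counts in profiles.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


profiles = train(TRAINING)
label_a = classify(profiles, "the dog sat in the garden")
label_b = classify(profiles, "court appeal on tax law")
label_c = classify(profiles, "quantum chromodynamics")
print(label_a, label_b, label_c)
```

Returning `None` corresponds to the "if any" in the definition: a document may belong to none of the classes. A real classifier would at least down-weight very common words such as "the", which dominate the scores in this sketch.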
Introduction
Classification of IR systems
 Scale-based Classification of IR systems: Distinguishing
between Information retrieval systems according to the scale at
which they operate.
1. Web search: The search is conducted over billions of
documents stored on millions of computers.
 Issues to consider:
1. Needing to gather documents for indexing.
2. Being able to build systems that work efficiently at this
enormous scale.
3. Handling particular aspects of the web, such as the
exploitation of hypertext and page ranking given the
commercial importance of the web.
2. Personal Information Retrieval: Integrating information
retrieval into consumer operating systems.
 Issues to consider:
1. Handling the broad range of document types on a typical
personal computer.
2. Making the search system maintenance-free and
sufficiently lightweight in terms of startup, processing, and
disk space usage that it can run on one machine without
annoying its owner.
Introduction
Classification of IR systems
3. Enterprise, Institutional, and Domain-specific Search:
A corporation’s documents will typically be stored on
centralized file systems and one or a handful of
dedicated machines will provide search over the
collection.
 Issues to consider:
1. Handling the broad range of document types on a
centralized computer.
2. Scale and Efficiency of the IR system.
3. Maintenance of the search system.
Introduction
Classification of IR systems
 Technique-based Classification of IR systems:
Distinguishing between Information retrieval systems
according to the search technique that they employ.
1. Keyword-based search: String matching algorithms are
employed to find documents relevant to the user’s query.
 Issues to consider:
1. Precision and Recall of the search algorithm.
2. Gap between the textual information contained in the
document collections and the user’s information need.
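Both issues can be made concrete with a small sketch. Precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The collection, the relevance judgments, and the function names below are hypothetical and chosen only for illustration.

```python
# Hypothetical toy collection: document id -> text.
DOCS = {
    1: "cheap flights to paris",
    2: "paris hotel booking",
    3: "budget air travel to france",
    4: "history of the paris commune",
}

# Hypothetical relevance judgments for the information need
# "inexpensive air travel to Paris or France".
RELEVANT = {1, 3}


def keyword_search(docs, query):
    """Return ids of documents containing every query term as a word."""
    terms = query.lower().split()
    return {
        doc_id
        for doc_id, text in docs.items()
        if all(t in text.lower().split() for t in terms)
    }


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = keyword_search(DOCS, "paris")
precision, recall = precision_recall(retrieved, RELEVANT)
print(retrieved, precision, recall)
```

The query "paris" retrieves documents 1, 2, and 4, of which only document 1 is relevant. Document 3 is relevant but never mentions the keyword, which is the gap between the text of the collection and the user's information need.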
Introduction
Classification of IR systems
2. Semantics-based search: Semantic aspects of the
user’s query are derived in an attempt to find documents
relevant to the user’s query.
 Issues to consider:
1. Precision and Recall of the search algorithm.
2. Lack of Semantic Resources.
3. Incompleteness of Background Knowledge
represented in existing Semantic Resources.
4. Semantic Heterogeneity problem between existing
Semantic Resources.
5. Lack of Multi-lingual Semantic Resources.
Introduction
Classification of IR systems
3. Hybrid Approaches: Keyword-based search is enriched with
Semantics-based search to retrieve results that are more relevant to
the user’s information needs.
 Issues to consider:
1. Precision and Recall of the search algorithm.
2. Lack of Semantic Resources.
3. Priority of the employed techniques.
4. Incompleteness of Background Knowledge represented in
existing Semantic Resources.
5. Types of queries that the system can handle (Single-term vs.
Verbose queries).
6. Lack of Multi-lingual Semantic Resources.
 Research is very active in this area.
 Example: DBpedia-based search engine (June 2015)
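One simple way to picture a hybrid approach is keyword matching with query expansion: each query term is matched either literally or through a related term taken from a semantic resource. The sketch below assumes a tiny hand-made dictionary, `RELATED`, standing in for such a resource; it is hypothetical and is not DBpedia or any real ontology API.

```python
# Hypothetical toy collection: document id -> text.
DOCS = {
    1: "cheap flights to paris",
    2: "budget air travel to france",
    3: "car repair manual",
}

# Hypothetical hand-made semantic resource: term -> related terms.
RELATED = {
    "cheap": {"budget", "inexpensive"},
    "flights": {"air", "flight"},
}


def expand(term, related):
    """Return the term together with its related terms, if any."""
    return {term} | related.get(term, set())


def hybrid_search(docs, query, related):
    """A document matches when, for every query term, it contains
    the term itself or one of its related terms."""
    results = set()
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if all(words & expand(t, related) for t in query.lower().split()):
            results.add(doc_id)
    return results


keyword_only = hybrid_search(DOCS, "cheap flights", {})
hybrid = hybrid_search(DOCS, "cheap flights", RELATED)
print(keyword_only, hybrid)
```

With an empty resource the search degrades to plain keyword matching and finds only document 1. With the related terms it also finds document 2. If `RELATED` lacked an entry for "flights", document 2 would be missed again, which illustrates the incompleteness issue listed above.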
