Ir 01

Lecture 01
Information Retrieval

About the Course
 Book:
 An Introduction to Information Retrieval, Christopher D.
Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge
University Press, 2009.
 Other materials may be considered depending on the subject.
 Principal objective of this course:
 To introduce students to Information Retrieval concepts,
paradigms, and techniques, with an emphasis on string-based and
semantics-based IR techniques.

About the Course
 Grading & Assessment:
 First Exam …………………….. 20%
 Second Exam ………………….. 20%
 Final Exam …………………….. 35%
 Other Activities ………………. 10%
 Major Assignment ……………. 15%
“You are to build a prototype for a search engine that employs
both text-based and semantics-based techniques for retrieving the
most relevant results to users’ queries. The search space will be a
collection of documents, in addition to a collection of images
associated with some textual descriptions”.

Course Topics
 Part 01 – Introduction
 What is IR?
 Examples of IR Systems.
 Other topics related to IR.
 Models of IR
 Part 02 – Boolean Retrieval
 What is Boolean IR?
 Term-Document Incidence Matrices
 Terminology and Notations

Course Topics
 Part 03 – Indexing
 Building Indexes
 Semantic Networks
 Part 04 – Retrieval
 Scoring, Ranking
 Relevance Feedback
 Precision/Recall

Course Topics
 Part 05 – Exploiting Ontologies in IR
 Ontologies
 Traditional vs. Semantics-based IR techniques

Introduction
What is IR
 Information Retrieval:
“Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).”
 Unstructured Data:
“refers to data which does not have clear, semantically overt, easy-for-a-
computer structure.”
e.g.  Textual information in web pages.
 Semistructured Data:
“refers to data which have a partially clear, semantically overt, easy-for-a-
computer structure.”
e.g.  finding a document where the title contains Java and the body
contains threading.
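
The semistructured query above can be sketched in Python. This is a minimal illustration, not a method from the lecture: the three sample documents, the field names "title" and "body", and the helper names are all assumptions made for the example.

```python
# Hypothetical mini-collection: each document has a "title" and a "body" field.
docs = [
    {"id": 1, "title": "Java Concurrency Basics",
     "body": "An overview of threading and locks."},
    {"id": 2, "title": "Python Tips",
     "body": "Notes on threading in Python."},
    {"id": 3, "title": "Java Collections",
     "body": "Lists, maps, and sets."},
]


def field_contains(doc, field, term):
    """Case-insensitive whole-word test on one field of a document."""
    words = doc[field].lower().replace(".", " ").replace(",", " ").split()
    return term.lower() in words


def search(docs, title_term, body_term):
    """Return ids of documents whose title and body contain the given terms."""
    return [d["id"] for d in docs
            if field_contains(d, "title", title_term)
            and field_contains(d, "body", body_term)]


result = search(docs, "Java", "threading")
print(result)
```

Only document 1 satisfies both field constraints: document 2 mentions threading but not Java in its title, and document 3 has Java in its title but no threading in its body.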

Introduction
What is IR
 Structured Data:
“refers to data which have a clear, semantically overt, easy-
for-a-computer structure.”
e.g.  Relational Databases.

 A look back: 1990s
 Studies showed that most people preferred getting
information from other people rather than from information
retrieval systems.
 Online booking systems?
 Following this period, and after relentless optimization of
IR:
 The field of information retrieval has moved from being a
primarily academic discipline to being the basis underlying
most people’s preferred means of information access.

Introduction
What is IR
 Information retrieval did not begin with the Web.
 The field began with scientific publications and library
records, but soon spread to other forms of content, particularly
those of information professionals, such as journalists, lawyers,
and doctors.

Introduction
Other Topics Related to IR
 Cross-language IR
 Multimedia IR
 Speech retrieval
 User interfaces for IR
 Ontology and Semantics-based IR
 Natural Language Processing (NLP) techniques
 Dynamic IR
 Online Advertising !?

Introduction
Other Topics Related to IR
 The field of information retrieval also covers supporting users in
browsing or filtering document collections or further processing
a set of retrieved documents.
 Given a set of documents, clustering is the task of coming up
with a good grouping of the documents based on their contents.
 Given a set of topics, standing information needs, or other
categories (such as suitability of texts for different age groups),
classification is the task of deciding which class(es), if any,
each of a set of documents belongs to. It is often approached by
first manually classifying some documents and then hoping to
be able to classify new documents automatically.
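
The classification workflow described above (label a few documents by hand, then classify new ones automatically) can be sketched as follows. This is only an illustration under stated assumptions: the training sentences, the two class names, and the simple word-overlap scoring are invented for the example and are not a technique prescribed by the lecture.

```python
from collections import Counter

# Hypothetical hand-labelled training documents (the "manually classified" step).
training = [
    ("the dragon and the princess lived in a castle", "children"),
    ("a bunny hopped to the castle garden", "children"),
    ("quarterly revenue and tax liabilities increased", "adult"),
    ("the tax report covers revenue and mortgage rates", "adult"),
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train(examples):
    """Build one bag-of-words profile (term counts) per class."""
    profiles = {}
    for text, label in examples:
        profiles.setdefault(label, Counter()).update(tokenize(text))
    return profiles


def classify(text, profiles):
    """Pick the class whose profile has the highest summed counts for the text's terms."""
    tokens = tokenize(text)
    scores = {label: sum(profile[t] for t in tokens)
              for label, profile in profiles.items()}
    return max(scores, key=scores.get)


profiles = train(training)
label = classify("the princess met a bunny", profiles)
print(label)
```

With these sample documents the new text is assigned to the "children" class, because its words occur far more often in that class's training documents. Real classifiers weight terms more carefully; raw counts are used here only to keep the sketch short.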

Introduction
Classification of IR systems
 Scale-based classification of IR systems: distinguishing
between information retrieval systems according to the scale at
which they operate.
1. Web search: The search is conducted over billions of
documents stored on millions of computers.
    Issues to consider:
      1. Needing to gather documents for indexing.
      2. Being able to build systems that work efficiently at this
         enormous scale.
      3. Handling particular aspects of the web, such as the
         exploitation of hypertext and page ranking, given the
         commercial importance of the web.

2. Personal information retrieval: Integrating information
retrieval into consumer operating systems.
    Issues to consider:
      1. Handling the broad range of document types on a typical
         personal computer.
      2. Making the search system maintenance-free and
         sufficiently lightweight in terms of startup, processing, and
         disk space usage that it can run on one machine without
         annoying its owner.

Introduction
3. Enterprise, institutional, and domain-specific search:
A corporation’s documents will typically be stored on
centralized file systems, and one or a handful of
dedicated machines will provide search over the
collection.
    Issues to consider:
      1. Handling the broad range of document types on a
         centralized computer.
      2. Scale and efficiency of the IR system.
      3. Maintenance of the search system.

Introduction
 Technique-based classification of IR systems:
distinguishing between information retrieval systems
according to the search technique that they employ.
1. Keyword-based search: String matching algorithms are
employed to find documents relevant to the user’s query.
    Issues to consider:
      1. Precision and recall of the search algorithm.
      2. Gap between the textual information contained in the
         document collections and the user’s information need.
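
The two issues listed for keyword-based search can be made concrete with a small sketch. Precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The four sample documents, the relevance judgements, and the function names below are assumptions made for illustration only.

```python
def keyword_search(query, docs):
    """Return ids of documents containing every query term as a whole word."""
    terms = query.lower().split()
    return {doc_id for doc_id, text in docs.items()
            if all(t in text.lower().split() for t in terms)}


def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from sets of retrieved and relevant ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection and relevance judgements for the need "rent a car".
docs = {
    1: "cheap car rental in the city",
    2: "automobile hire at low prices",
    3: "car engine repair manual",
    4: "history of the city",
}
relevant = {1, 2}

retrieved = keyword_search("car", docs)
precision, recall = precision_recall(retrieved, relevant)
print(retrieved, precision, recall)
```

The query retrieves documents 1 and 3, so precision and recall are both 0.5. Document 3 matches the string but not the information need, while document 2 satisfies the need without containing the keyword, which is the gap described in the second issue.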

Introduction
2. Semantics-based search: Semantic aspects of the
user’s query are derived in an attempt to find documents
relevant to the user’s query.
    Issues to consider:
      1. Lack of semantic resources.
      2. Incompleteness of the background knowledge
         represented in existing semantic resources.
      3. Semantic heterogeneity between existing
         semantic resources.
      4. Lack of multilingual semantic resources.
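
One simple way to bring semantics into search is to expand each query term with related terms taken from a semantic resource. The sketch below assumes a tiny hand-made dictionary, SYNONYMS, standing in for such a resource; the dictionary, the sample documents, and the function names are hypothetical and not part of the lecture.

```python
# Hypothetical hand-made semantic resource: term -> related terms.
SYNONYMS = {
    "car": {"automobile"},
    "rental": {"hire"},
}


def expand(query):
    """Map each query term to a set holding the term and its known related terms."""
    return [{t} | SYNONYMS.get(t, set()) for t in query.lower().split()]


def semantic_search(query, docs):
    """Return ids of documents matching, for every query term, the term or a related term."""
    groups = expand(query)
    result = set()
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if all(group & words for group in groups):
            result.add(doc_id)
    return result


docs = {
    1: "cheap car rental in the city",
    2: "automobile hire at low prices",
    3: "car engine repair manual",
}

found = semantic_search("car rental", docs)
missed = semantic_search("vehicle rental", docs)
print(found, missed)
```

The query "car rental" now finds documents 1 and 2, including the one that never uses the word "car". The query "vehicle rental" finds nothing, because "vehicle" is absent from the resource: a small instance of the incompleteness issue listed above.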

Introduction
3. Hybrid approaches: Keyword-based search is enriched with
semantics-based search to retrieve results more relevant to the
user’s information needs.
    Issues to consider:
      1. Lack of semantic resources.
      2. Priority of the employed techniques.
      3. Incompleteness of the background knowledge represented in
         existing semantic resources.
      4. Types of queries that the system can handle (single-term vs.
         verbose queries).
      5. Lack of multilingual semantic resources.
 Research is very active in this area.
 Example: DBpedia-based search engine (June 2015)
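
A hybrid approach has to decide how much priority each technique gets. The sketch below is one possible arrangement, assuming that an exact keyword match earns full credit and a match through a related term earns half credit. The weights, the RELATED dictionary, and the sample documents are illustrative assumptions, not values taken from the lecture or from any particular system.

```python
# Hypothetical semantic resource and weights; both are illustrative assumptions.
RELATED = {"car": {"automobile"}, "rental": {"hire"}}
KEYWORD_WEIGHT = 1.0   # credit for an exact keyword match
SEMANTIC_WEIGHT = 0.5  # lower credit for a match through a related term


def hybrid_score(query, text):
    """Average per-term credit: exact match first, related-term match as fallback."""
    words = set(text.lower().split())
    terms = query.lower().split()
    total = 0.0
    for term in terms:
        if term in words:
            total += KEYWORD_WEIGHT
        elif RELATED.get(term, set()) & words:
            total += SEMANTIC_WEIGHT
    return total / len(terms) if terms else 0.0


def rank(query, docs):
    """Return (doc_id, score) pairs with a non-zero score, best first."""
    scored = [(doc_id, hybrid_score(query, text)) for doc_id, text in docs.items()]
    return sorted([p for p in scored if p[1] > 0], key=lambda p: p[1], reverse=True)


docs = {
    1: "cheap car rental in the city",
    2: "automobile hire at low prices",
    3: "car engine repair manual",
}

ranking = rank("car rental", docs)
print(ranking)
```

For the query "car rental", document 1 scores 1.0 and documents 2 and 3 both score 0.5. Changing the two weights changes which technique dominates the ranking, which is the priority question raised in the list above.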
