Wollo University
Kombolcha Institute of Technology
College of Informatics
Department of Information Technology
Course Outline
Course Title: Information Storage and Retrieval
Course Code: ITec3081
ECTS Credits (CP): 5
Target Group: B.Sc. 3rd year Information Technology Students (Regular Program)
Year/Semester: Year: III, Semester: II
Status of the Course: Core
Contact Hours (per week):
Section A
Instructor: Habtamu Abate (M.Sc.)
Email: habate999@gmail.com
Course Description
This course covers introductory concepts of information storage and retrieval; automatic text operations, including automatic indexing; data and file structures for information retrieval; retrieval models; evaluation of information retrieval systems and techniques for enhancing retrieval effectiveness; query languages, query operations, string manipulation and search algorithms; and current issues in IR.
Course Objective
At the end of the course students will be able to:
• Understand the various information retrieval systems and processes
• Know the retrieval models and the evaluation of information retrieval systems
• Understand the processes of information storage and retrieval
• Design, develop and evaluate information retrieval models
• Understand evaluation issues in IR
• Understand current issues in IR
Course Syllabus
Chapter One: Introduction to ISR
• IR and IR systems
• Data versus information retrieval
• IR and the retrieval process
• Basic structure of an IR system
Chapter Two: Text/Document Operations and Automatic Indexing
• Index term selection (Luhn’s selection and Zipf’s law in IR)
• Document pre-processing (lexical analysis, stop word elimination, stemming)
• Term extraction (term weighting and similarity measures)
Chapter Three: Indexing Structures
• Inverted files
• Tries, suffix trees and suffix arrays
• Signature files
Chapter Four: IR Models
• Introduction of IR models
• Boolean model
• Vector space model
• Probabilistic model
Chapter Five: Retrieval Evaluation
• Evaluation of IR systems
• Relevance judgment
• Performance measures (recall, precision, etc.)
Chapter Six: Query Languages
• Keyword-based queries
• Pattern matching
• Structural queries
Chapter Seven: Query Operations
• Relevance feedback
• Query expansion
Chapter Eight: Current Issues in IR
• Research in IR (multimedia retrieval, Web retrieval, question answering, etc.)
Assessment
Assignments = 10%; Test = 10% / Lab Exam = 10%; Project work = 20%; Mid Exam = 20%; Final examination = 40%
Text Book
Ricardo A. Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, ACM Press, 2008.
Other Reference Books:
G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
Robert R. Korfhage, Information Storage and Retrieval, John Wiley and Sons, 1997.
W. B. Frakes and R. Baeza-Yates (eds.), Information Retrieval: Data Structures and Algorithms, Prentice-Hall, 1992, ISBN 0-13-463837-9.
K. Spärck Jones and P. Willett (eds.), Readings in Information Retrieval, Morgan Kaufmann, San Francisco, 1997.
Chapter 1: Introduction to Information Storage and Retrieval
Outline
• IR and IR systems
• Data versus information retrieval
• IR and the retrieval process
• Basic structure of an IR system
Information Retrieval
Information retrieval (IR) is the process of finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Information is organized into a large number of documents:
• Large collections of documents come from various sources: news articles, research papers, books, digital libraries, Web pages, etc.
• Example: Web search engines such as Google claim to index over 1 trillion pages.
Information Retrieval
• Document: any object that can be stored and retrieved
• Information need: the kind of information you need to find
  – what you have in mind
• Query: a statement or representation of an information need
• The user has an information need (IN) in his/her mind.
• The retrieval system can’t understand the IN directly.
• The IN has to be abstracted into a form that matches the information system.
• This abstraction is a query.
• Example: a good textbook for information retrieval
General Goal of Information Retrieval
To help users find useful information based on their information needs (with minimum effort) despite:
• the increasing complexity of information
• the changing needs of users
A further goal is to provide immediate random access to the document collection.
Information Retrieval Systems?
• Document (Web page) retrieval in response to a query
  – Quite effective (at some things)
  – Commercially successful (some of them)
• But what goes on behind the scenes?
  – How do they work?
  – What happens beyond the Web?
Examples of IR systems
• Conventional (library catalog): search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, FAST): search by keywords; limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, …).
• Question answering systems (AskJeeves, Answerbus): search in (restricted) natural language.
• Other: cross-language information retrieval, music retrieval
• Web search systems: Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma, HotBot, Baidu, …
Information Retrieval vs. Data Retrieval
The emphasis of IR is on the retrieval of information rather than on the retrieval of data.
■ Data retrieval
• Consists mainly of determining which documents contain a set of keywords in the user query (which is not enough to satisfy the user’s information need)
• Aims at retrieving all objects that satisfy well-defined semantics
• A single erroneous object among a thousand retrieved objects implies failure
■ Information retrieval
• Is concerned with retrieving information about a subject or topic rather than retrieving data that satisfies a given query
• Semantics are frequently loose: the retrieved objects might be inaccurate
• Small errors are tolerated
Information Retrieval vs. Data Retrieval
• An example of a data retrieval system is a relational database.

                      Data Retrieval                          Information Retrieval
Data organization     Structured                              Unstructured
Fields                Clear semantics (ID, Name, age, …)      No fields (other than text)
Query language        Artificial (defined, SQL)               Free text (“natural language”), Boolean
Matching              Exact (results are always “correct”)    Partial match, best match
Query specification   Complete                                Incomplete
Items wanted          Matching                                Relevant
Accuracy              100%                                    < 50%
Error response        Sensitive                               Insensitive
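The "Matching" row of the comparison can be illustrated with a small sketch. It assumes a toy three-document collection and two hypothetical helper functions, `exact_match` and `best_match`, which are not part of the slides: the first behaves like data retrieval (a document either satisfies the whole query or is excluded), the second behaves like IR (documents are ranked by partial overlap with the query).

```python
# Toy collection (illustrative only, not from the course material).
docs = {
    "d1": "information retrieval systems rank documents",
    "d2": "relational database systems store structured data",
    "d3": "retrieval of structured data with sql",
}

def exact_match(query, docs):
    """Data-retrieval style: keep only documents containing every query term."""
    terms = set(query.lower().split())
    return sorted(d for d, text in docs.items() if terms <= set(text.lower().split()))

def best_match(query, docs):
    """IR style: rank documents by the number of query terms they share."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.lower().split())), d) for d, text in docs.items()]
    scored = [(score, d) for score, d in scored if score > 0]
    # Highest overlap first; ties broken by document ID for a stable order.
    return [d for score, d in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]

print(exact_match("structured data retrieval", docs))  # ['d3']
print(best_match("structured data retrieval", docs))   # ['d3', 'd2', 'd1']
```

Exact matching returns a single "correct" answer, while best matching also surfaces partially matching documents, which may or may not be relevant to the user.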
Basic Concepts in Information Retrieval: (i) User Task and (ii) Logical View of Documents
The User Task:
There are two user tasks: retrieval and browsing.
[Figure: the USER interacts with the document database (DB) through retrieval or browsing]
The User Task
Retrieval
• It is the process of retrieving information in which the main objective is clearly defined from the onset of the search.
• The user of a retrieval system has to translate his/her information need into a query in the language provided by the system.
• In this context (i.e. by specifying a set of words), the user searches for useful information by executing a retrieval task.
• English language statement: I want a book by Hadis Alemayehu titled Fikir Esk Mekaber
Browsing
• It is the process of retrieving information in which the main objective is not clearly defined at the beginning and may change during the interaction with the system.
• E.g. a user might search for documents about ‘car racing’. Meanwhile s/he might find interesting documents about ‘car manufacturers’. While reading about car manufacturers in Addis, s/he might turn his/her attention to a document providing ‘directions to Addis’, and from this to documents which cover ‘Tourism in Ethiopia’.
• In this context, the user is said to be browsing the collection rather than searching, since the user may simply be interested in glancing around.
Logical View of Documents
■ Documents in a collection are frequently represented by a set of index terms or keywords
■ Such keywords are mostly extracted directly from the text of the document
■ These representative keywords provide a logical view of the document
■ Document representation is viewed as a continuum, in which the logical view of documents might shift from full text to index terms
[Figure: Docs pass through tokenization, stop word removal, stemming and indexing; the logical view shifts from full text to index terms]
Logical view of documents
• If full text:
  – Each word in the text is a keyword
  – Most complex form
  – Expensive
• If the full text is too large, the set of representative keywords can be reduced through a transformation process called text operations
• Text operations reduce the complexity of the document representation and allow moving the logical view from that of full text to a set of index terms
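A minimal sketch of such text operations is given here, under stated assumptions: the stop list is a tiny illustrative set, and `crude_stem` is a deliberately rough suffix stripper written for this example (it is not the Porter stemmer or any other published algorithm). The function `index_terms` shows how full text shrinks to a small set of index terms.

```python
import re

# Tiny illustrative stop list (an assumption; real systems use much longer lists).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lexical analysis: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(token):
    """Very rough suffix stripping, for illustration only."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Full text -> tokens -> non-stop-word tokens -> stemmed, de-duplicated terms."""
    tokens = tokenize(text)
    kept = [t for t in tokens if t not in STOP_WORDS]
    return sorted({crude_stem(t) for t in kept})

print(index_terms("Indexing the documents is the indexing of documents"))
# ['document', 'index']
```

Eight words of full text are reduced to two index terms, which is the shift in logical view described above.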
Structure of an IR System
• An information retrieval system serves as a bridge between the world of authors and the world of readers/users.
• That is, writers present a set of ideas in a document using a set of concepts. Users then query the IR system for relevant documents that satisfy their information need.
[Figure: User ↔ black box ↔ Documents]
• The black box is the information retrieval system.
• To be effective in its attempt to satisfy the information need of users, the IR system must ‘interpret’ the contents of the documents in a collection and rank them according to their degree of relevance to the user query.
• Thus the notion of relevance is at the centre of IR.
• The primary goal of an IR system is to retrieve all the documents which are relevant to a user query while retrieving as few non-relevant documents as possible.
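The primary goal stated above has two sides, and they are usually measured by recall (how many of the relevant documents were retrieved) and precision (how many of the retrieved documents are relevant). The sketch below assumes that relevance judgments are available as a set of document IDs; the function name `precision_recall` and the sample IDs are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# The system returned four documents; three documents are actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67).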
Typical IR Task
Given:
• A corpus of textual natural-language documents.
• A user query in the form of a textual string.
Find:
• A ranked set of documents that are relevant to the query.
Typical IR System Architecture
[Figure: a query string and a document corpus are the inputs to the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, …)]
Web Search System
[Figure: a Web spider collects pages into the document corpus; the IR system takes a query string and the corpus and outputs ranked documents (1. Page1, 2. Page2, 3. Page3, …)]
What is Information Retrieval?
• A good formal definition of information retrieval is given in Baeza-Yates & Ribeiro-Neto (1999, p. 1):
  “Information retrieval deals with representation, storage, organization of, and access to information items. The organization and access of information items should provide the user with easy access to the information in which S/he is interested”
• The definition incorporates all important features of a good information retrieval system
  – Representation
  – Storage
  – Organization
  – Access
• The focus is on the user information need
Overview of the Retrieval Process
The Retrieval Process
• It is necessary to define the text database before any of the retrieval processes are initiated
• This is usually done by the manager of the database and includes specifying the following:
  – The documents to be used
  – The operations to be performed on the text
  – The text model to be used (the text structure and what elements can be retrieved)
• The text operations transform the original documents and the information needs and generate a logical view of them
The Retrieval Process …
• Once the logical view of the documents is defined, the database module builds an index of the text
  – An index is a critical data structure
  – It allows fast searching over large volumes of data
• Different index structures might be used, but the most popular one is the inverted file.
• Given that the document database is indexed, the retrieval process can be initiated
• The user first specifies a user need, which is then parsed and transformed by the same text operations applied to the text
• Next, query operations are applied before the actual query, which provides a system representation for the user need, is generated
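The inverted file mentioned above can be sketched as a mapping from each term to the list of documents that contain it. The example below is a minimal illustration under simplifying assumptions: the only text operation is lowercasing and whitespace splitting, the same operation is applied to documents and queries, and the function names and the three sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_all_terms(index, query):
    """Return IDs of documents that contain every query term (Boolean AND)."""
    terms = query.lower().split()  # same text operation as used for the documents
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

docs = {1: "inverted file index", 2: "signature file", 3: "suffix tree index"}
index = build_inverted_index(docs)
print(index["file"])                          # [1, 2]
print(search_all_terms(index, "file index"))  # [1]
```

Answering the query only touches the postings of the query terms, which is why an index allows fast searching over large volumes of data.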
The Retrieval Process …
• The query is then processed to obtain the retrieved documents
• Before the retrieved documents are sent to the user, they are ranked according to their likelihood of relevance
• The user then examines the set of ranked documents in search of useful information. There are two choices for the user:
  (i) reformulate the query and run it on the entire collection, or
  (ii) reformulate the query and run it on the result set
• At this point, s/he might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle
• In such a cycle, the system uses the documents selected by the user to change the query formulation.
• Hopefully, this modified query is a better representation of the real user need
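One simple way to picture the feedback cycle is to add terms from the user-selected documents to the query. The sketch below is only an illustration of that idea: it uses plain term frequency to pick the new terms, which is an assumption made for this example and not a method specified in the slides. The function name `expand_query` and its `extra_terms` parameter are hypothetical.

```python
from collections import Counter

def expand_query(query, selected_docs, extra_terms=2):
    """Append the most frequent terms of the user-selected documents to the query."""
    query_terms = query.lower().split()
    counts = Counter(
        term
        for text in selected_docs
        for term in text.lower().split()
        if term not in query_terms
    )
    # Sort by descending frequency, then alphabetically, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return query_terms + [term for term, _ in ranked[:extra_terms]]

selected = [
    "car racing formula circuit",
    "formula circuit racing teams",
    "formula racing drivers",
]
print(expand_query("car racing", selected))
# ['car', 'racing', 'formula', 'circuit']
```

The reformulated query can then be run again, either on the entire collection or on the current result set.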
Detail View of the Retrieval Process
[Figure: detail view of the retrieval process. Components: User Interface, Text Operations, Query Language & Operations, Indexing, Searching, Ranking, Index (inverted file), DB Manager Module, Text Database. Flows between components: user need, text, logical view, query, user feedback, retrieved docs, ranked docs]
Subsystems of an IR system
• The two subsystems of an IR system:
  – Searching: an online process of finding relevant documents in the index according to the user’s query
  – Indexing: an offline process of organizing documents using keywords extracted from the collection
• Indexing and searching are unavoidably connected
  – you cannot search what was not first indexed in some manner or other
  – documents or objects are indexed in order to be searchable
  – there are many ways to do indexing
  – to index one needs an indexing language
  – there are many indexing languages
  – even taking every word in a document is an indexing language
• Knowing searching is knowing indexing
Indexing Subsystem
[Figure: documents → assign document identifier (document IDs); text → tokenize (tokens) → stop list (non-stoplist tokens) → stemming & normalize (stemmed terms) → term weighting (terms with weights) → index]
Searching Subsystem
[Figure: query → parse query (query tokens) → stop list (non-stoplist tokens) → stemming & normalize (stemmed terms) → query term weighting (query terms) → similarity measure against the index terms from the index (relevant document set) → ranking (ranked document set)]
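The term weighting, similarity measure and ranking steps of the two subsystems can be sketched together. The slides do not fix a particular scheme, so this example assumes one common choice: raw term frequency multiplied by log(N/df) as the weight, and cosine similarity as the measure. Stop word removal and stemming are omitted to keep the sketch short, and all function names and sample documents are illustrative.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return (per-document term weights, idf table) using tf * log(N / df)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for doc_id, tokens in tokenized.items()
    }
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Weight the query like a document, score every document, return IDs by score."""
    vectors, idf = tf_idf_vectors(docs)
    query_tf = Counter(query.lower().split())
    query_vec = {t: tf * idf[t] for t, tf in query_tf.items() if t in idf}
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]

docs = {
    "d1": "car racing news",
    "d2": "car manufacturers in addis",
    "d3": "tourism in ethiopia",
}
print(rank("car racing", docs))  # ['d1', 'd2']
```

Document d1 shares both query terms and ranks first, d2 shares only the more common term "car", and d3 shares nothing and is not returned.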
Issues that arise in IR
• Text representation
  – What makes a “good” representation?
  – How is a representation generated from text?
  – What are retrievable objects and how are they organized?
• Information needs representation
  – What is an appropriate query language?
  – How can interactive query formulation and refinement be supported?
• Comparing representations (to identify relevant documents)
  – What weighting scheme and similarity measure should be used?
  – What is a “good” model of retrieval?
• Evaluating effectiveness of retrieval
  – What are good metrics?
  – What constitutes a good experimental test bed?
Focus in IR System Design
Our focus during IR system design is:
• Improving the effectiveness of the system
  – Effectiveness is measured in terms of precision, recall, …
  – Means: stemming, stop words, weighting schemes, matching algorithms
• Improving the efficiency of the system
  – The concern here is storage space usage, access time, searching time, data transfer time, …
  – Space-time tradeoffs are a concern
  – Means: compression techniques, data/file structures, etc.
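As one illustration of a space-time tradeoff, the sorted document IDs in a postings list can be stored as gaps between neighbours instead of absolute numbers. The slides do not name a specific compression technique, so gap encoding is chosen here only as an assumed example; the function names are hypothetical.

```python
def to_gaps(postings):
    """Replace sorted document IDs by the differences between neighbours."""
    gaps, previous = [], 0
    for doc_id in postings:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def from_gaps(gaps):
    """Rebuild the original document IDs by accumulating the gaps."""
    postings, total = [], 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings

print(to_gaps([3, 7, 8, 20]))    # [3, 4, 1, 12]
print(from_gaps([3, 4, 1, 12]))  # [3, 7, 8, 20]
```

The gaps are smaller numbers than the original IDs, so they can be stored in fewer bits, at the price of extra decoding work at search time.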
Thank you

More Related Content

Similar to Chapter 1.pptx

CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
Anandh Arumugakan
 
Chap 1 general introduction of information retrieval
Chap 1  general introduction of information retrievalChap 1  general introduction of information retrieval
Chap 1 general introduction of information retrieval
Malobe Lottin Cyrille Marcel
 
Hci encyclopedia irshortefords
Hci encyclopedia irshortefordsHci encyclopedia irshortefords
Hci encyclopedia irshortefordsapollobgslibrary
 
Hci encyclopedia irshortefords
Hci encyclopedia irshortefordsHci encyclopedia irshortefords
Hci encyclopedia irshortefordsapollobgslibrary
 
INFORMATION RETRIEVAL Anandraj.L
INFORMATION RETRIEVAL Anandraj.LINFORMATION RETRIEVAL Anandraj.L
INFORMATION RETRIEVAL Anandraj.Lanujessy
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_Habib
El Habib NFAOUI
 
Ibrahim ramadan paper
Ibrahim ramadan paperIbrahim ramadan paper
Ibrahim ramadan paper
Ibrahim Ramadan Abd-Elhamid
 
Web Mining
Web MiningWeb Mining
Web Mining
Shobha Rani
 
DBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support EngineDBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support Engine
Yi Zeng
 
chapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.pptchapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.ppt
SamuelKetema1
 
Information storage and retrieval PPT.pdf
Information storage and retrieval PPT.pdfInformation storage and retrieval PPT.pdf
Information storage and retrieval PPT.pdf
SURAJDHIKAR1
 
The design and implementation of database on library
The design and implementation of database on libraryThe design and implementation of database on library
The design and implementation of database on library
Phouvanh Korngchansavath
 
Hci
HciHci
Introduction.pptx
Introduction.pptxIntroduction.pptx
Introduction.pptx
Mahsadelavari
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_Habib
El Habib NFAOUI
 
Algorithm for calculating relevance of documents in information retrieval sys...
Algorithm for calculating relevance of documents in information retrieval sys...Algorithm for calculating relevance of documents in information retrieval sys...
Algorithm for calculating relevance of documents in information retrieval sys...
IRJET Journal
 
I1802055259
I1802055259I1802055259
I1802055259
IOSR Journals
 

Similar to Chapter 1.pptx (20)

Mam assign
Mam assignMam assign
Mam assign
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
Chap 1 general introduction of information retrieval
Chap 1  general introduction of information retrievalChap 1  general introduction of information retrieval
Chap 1 general introduction of information retrieval
 
Hci encyclopedia irshortefords
Hci encyclopedia irshortefordsHci encyclopedia irshortefords
Hci encyclopedia irshortefords
 
Hci encyclopedia irshortefords
Hci encyclopedia irshortefordsHci encyclopedia irshortefords
Hci encyclopedia irshortefords
 
INFORMATION RETRIEVAL Anandraj.L
INFORMATION RETRIEVAL Anandraj.LINFORMATION RETRIEVAL Anandraj.L
INFORMATION RETRIEVAL Anandraj.L
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_Habib
 
Ibrahim ramadan paper
Ibrahim ramadan paperIbrahim ramadan paper
Ibrahim ramadan paper
 
Web Mining
Web MiningWeb Mining
Web Mining
 
DBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support EngineDBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support Engine
 
chapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.pptchapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.ppt
 
Information storage and retrieval PPT.pdf
Information storage and retrieval PPT.pdfInformation storage and retrieval PPT.pdf
Information storage and retrieval PPT.pdf
 
The design and implementation of database on library
The design and implementation of database on libraryThe design and implementation of database on library
The design and implementation of database on library
 
Hci
HciHci
Hci
 
Introduction.pptx
Introduction.pptxIntroduction.pptx
Introduction.pptx
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_Habib
 
Algorithm for calculating relevance of documents in information retrieval sys...
Algorithm for calculating relevance of documents in information retrieval sys...Algorithm for calculating relevance of documents in information retrieval sys...
Algorithm for calculating relevance of documents in information retrieval sys...
 
I1802055259
I1802055259I1802055259
I1802055259
 
Lec1,2
Lec1,2Lec1,2
Lec1,2
 
Lec1
Lec1Lec1
Lec1
 

More from Habtamu100

qury.pdf
qury.pdfqury.pdf
qury.pdf
Habtamu100
 
Chapter 4 IR Models.pdf
Chapter 4 IR Models.pdfChapter 4 IR Models.pdf
Chapter 4 IR Models.pdf
Habtamu100
 
Chapter 2 Text Operation.pdf
Chapter 2 Text Operation.pdfChapter 2 Text Operation.pdf
Chapter 2 Text Operation.pdf
Habtamu100
 
Chapter 5 Query Evaluation.pdf
Chapter 5 Query Evaluation.pdfChapter 5 Query Evaluation.pdf
Chapter 5 Query Evaluation.pdf
Habtamu100
 
Chapter 3 Indexing.pdf
Chapter 3 Indexing.pdfChapter 3 Indexing.pdf
Chapter 3 Indexing.pdf
Habtamu100
 
Chapter 7.pdf
Chapter 7.pdfChapter 7.pdf
Chapter 7.pdf
Habtamu100
 
Chapter 6 Query Language .pdf
Chapter 6 Query Language .pdfChapter 6 Query Language .pdf
Chapter 6 Query Language .pdf
Habtamu100
 

More from Habtamu100 (7)

qury.pdf
qury.pdfqury.pdf
qury.pdf
 
Chapter 4 IR Models.pdf
Chapter 4 IR Models.pdfChapter 4 IR Models.pdf
Chapter 4 IR Models.pdf
 
Chapter 2 Text Operation.pdf
Chapter 2 Text Operation.pdfChapter 2 Text Operation.pdf
Chapter 2 Text Operation.pdf
 
Chapter 5 Query Evaluation.pdf
Chapter 5 Query Evaluation.pdfChapter 5 Query Evaluation.pdf
Chapter 5 Query Evaluation.pdf
 
Chapter 3 Indexing.pdf
Chapter 3 Indexing.pdfChapter 3 Indexing.pdf
Chapter 3 Indexing.pdf
 
Chapter 7.pdf
Chapter 7.pdfChapter 7.pdf
Chapter 7.pdf
 
Chapter 6 Query Language .pdf
Chapter 6 Query Language .pdfChapter 6 Query Language .pdf
Chapter 6 Query Language .pdf
 

Recently uploaded

Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)
rosedainty
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
bennyroshan06
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
Celine George
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
MysoreMuleSoftMeetup
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
Atul Kumar Singh
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
BhavyaRajput3
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
joachimlavalley1
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
MIRIAMSALINAS13
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
Excellence Foundation for South Sudan
 
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
AzmatAli747758
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
Nguyen Thanh Tu Collection
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
Delapenabediema
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 

Recently uploaded (20)

Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
 
Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
 
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 

Chapter 1.pptx

  • 1. Wollo University KombolchaInstitute of T echnology College of Informatics Department of Information T echnology 1
  • 2. Course Outline Course Title: Course Code: Information Storage and Retrieval ITec3081 ECTS Credits (CP): 5 Target Group: B.Sc. 3rd year Information Technology Students (Regular Program) Year /Semester: Year: III, Semester: II Status of the Course: Core Contact Hours (per week) Section A S e c t i Instructor: Habtamu Abate (M.Sc.) Email: habate999@gmail.com 2
  • 3. Course Outline Course Description : This course will cover introductory concepts of Information Storage and Retrieval; automatic text operation including automatic indexing; data and file structure for information retrieval; retrieval models; evaluation of information retrieval systems and techniques for enhancing retrieval effectiveness; query languages, query operations, string manipulation and search algorithms; and Current Issues in IR etc. Course Objective At the end of the course students will be able to: Understand the various Information Retrieval Systems and processes Know the retrieval model and evaluation of Information Retrieval Systems Understand the processes of information storage and retrieval Design ,develop and evaluate information retrieval models Understand evaluation issues in IR Understand current issues in IR 3
  • 4. Course Syllabus
      Chapter One: Introduction to ISR
      • IR and IR systems
      • Data versus information retrieval
      • IR and the retrieval process
      • Basic structure of an IR system
      Chapter Two: Text/Document Operations and Automatic Indexing
      • Index term selection (Luhn’s selection and Zipf’s law in IR)
      • Document pre-processing (lexical analysis, stop-word elimination, stemming)
      • Term extraction (term weighting and similarity measures)
      Chapter Three: Indexing Structures
      • Inverted files
      • Tries, suffix trees and suffix arrays
      • Signature files
  • 5. Course Syllabus
      Chapter Four: IR Models
      • Introduction of IR models
      • Boolean model
      • Vector space model
      • Probabilistic model
      Chapter Five: Retrieval Evaluation
      • Evaluation of IR systems
      • Relevance judgment
      • Performance measures (recall, precision, etc.)
      Chapter Six: Query Languages
      • Keyword-based queries
      • Pattern matching
      • Structural queries
      Chapter Seven: Query Operations
      • Relevance feedback
      • Query expansion
      Chapter Eight: Current Issues in IR
      • Research in IR (multimedia retrieval, Web retrieval, question answering, etc.)
  • 6. Assessment
      Assignments = 10%, Test = 10% / Lab Exam = 10%, Project Work = 20%, Mid Exam = 20%, Final Examination = 40%
      Textbook
      • Baeza-Yates, R. A. and Ribeiro-Neto, B. Modern Information Retrieval. ACM Press, 2008.
      Other Reference Books
      • Salton, G. and McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
      • Korfhage, R. R. Information Storage and Retrieval. John Wiley and Sons, 1997.
      • Frakes, W. B. and Baeza-Yates, R. (eds.). Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992. ISBN 0-13-463837-9.
      • Spärck Jones, K. and Willett, P. (eds.). Readings in Information Retrieval. San Francisco: Morgan Kaufmann, 1997.
  • 7. Chapter 1: Introduction to Information Storage and Retrieval
      Outline
      • IR and IR systems
      • Data versus information retrieval
      • IR and the retrieval process
      • Basic structure of an IR system
  • 8. Information Retrieval
      Information retrieval (IR) is the process of finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
      Information is organized into a large number of documents:
      • Large collections of documents come from various sources: news articles, research papers, books, digital libraries, Web pages, etc.
      • Example: Web search engines such as Google claim to index over 1 trillion pages.
  • 9. Information Retrieval
      • Document: any object that can be stored and retrieved.
      • Information need: the kind of information you need to find, i.e. what you have in mind.
      • Query: a statement or representation of an information need.
      • The user has an information need (IN) in his/her mind. The retrieval system cannot understand the IN directly.
      • The IN has to be abstracted into a form that matches the information system.
      • This abstraction is the query, for example: good textbook information retrieval
  • 10. General Goal of Information Retrieval
      To help users find useful information based on their information needs, with minimum effort, despite:
      • The increasing complexity of information
      • The changing needs of users
      A further goal is to provide immediate random access to the document collection.
  • 11. Information Retrieval Systems?
      • Document (Web page) retrieval in response to a query
        – Quite effective (at some things)
        – Commercially successful (some of them)
      • But what goes on behind the scenes?
        – How do they work?
        – What happens beyond the Web?
      Web search systems: Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma, HotBot, Baidu, …
  • 12. Examples of IR Systems
      • Conventional (library catalog): search by keyword, title, author, etc.
      • Text-based (Lexis-Nexis, Google, FAST): search by keywords; limited search using queries in natural language.
      • Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, …).
      • Question answering systems (AskJeeves, Answerbus): search in (restricted) natural language.
      • Other: cross-language information retrieval, music retrieval.
  • 13. Information Retrieval vs. Data Retrieval
      The emphasis of IR is on the retrieval of information rather than on the retrieval of data.
      • Data retrieval
        – Consists mainly of determining which documents contain a set of keywords in the user query (which is not enough to satisfy the user's information need)
        – Aims at retrieving all objects that satisfy well-defined semantics
        – A single erroneous object among a thousand retrieved objects implies failure
      • Information retrieval
        – Is concerned with retrieving information about a subject or topic rather than retrieving data that satisfies a given query
        – Semantics is frequently loose: the retrieved objects might be inaccurate
        – Small errors are tolerated
  • 14. Information Retrieval vs. Data Retrieval
      An example of a data retrieval system is a relational database.

      Aspect              | Data Retrieval                        | Info Retrieval
      Data organization   | Structured                            | Unstructured
      Fields              | Clear semantics (ID, Name, age, …)    | No fields (other than text)
      Query language      | Artificial (defined, SQL)             | Free text (“natural language”), Boolean
      Matching            | Exact (results are always “correct”)  | Partial match, best match
      Query specification | Complete                              | Incomplete
      Items wanted        | Matching                              | Relevant
      Accuracy            | 100%                                  | < 50%
      Error response      | Sensitive                             | Insensitive
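
The "Matching" row of the comparison (exact match versus partial or best match) can be illustrated with a small sketch. It is not taken from the slides: the toy documents, the function names `exact_match` and `best_match`, and the simple keyword-overlap score are all assumptions chosen for illustration.

```python
def exact_match(docs, keywords):
    """Data-retrieval style: ids of documents that contain every keyword."""
    wanted = set(keywords)
    return [doc_id for doc_id, text in docs.items()
            if wanted <= set(text.lower().split())]


def best_match(docs, keywords):
    """IR style: rank documents by how many query keywords they share."""
    wanted = set(keywords)
    scored = []
    for doc_id, text in docs.items():
        overlap = len(wanted & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Hypothetical three-document collection.
DOCS = {
    "d1": "information retrieval textbook",
    "d2": "database systems textbook",
    "d3": "retrieval of information from the web",
}
QUERY = ["information", "retrieval", "textbook"]

print(exact_match(DOCS, QUERY))
print(best_match(DOCS, QUERY))
```

With this data the exact matcher accepts only the document that contains all three keywords, while the best matcher also returns partially matching documents in ranked order.
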
  • 15. Basic Concepts in Information Retrieval: (i) User Task and (ii) Logical View of Documents
      The user task: there are two user tasks, retrieval and browsing.
      [Figure: the user interacts with the database (DB) through retrieval and browsing.]
  • 16. The User Task: Retrieval
      Retrieval is the process of retrieving information where the main objective is clearly defined from the outset of the search.
      The user of a retrieval system has to translate his/her information need into a query in the language provided by the system.
      In this context (i.e. by specifying a set of words), the user searches for useful information by executing a retrieval task.
      English-language statement: "I want a book by Hadis Alemayehu titled Fikir Esk Mekaber."
  • 17. Browsing
      Browsing is the process of retrieving information where the main objective is not clearly defined from the beginning and may change during the interaction with the system.
      E.g. a user might search for documents about ‘car racing’. Meanwhile, s/he might find interesting documents about ‘car manufacturers’. While reading about car manufacturers in Addis, s/he might turn his/her attention to a document providing ‘directions to Addis’, and from this to documents that cover ‘Tourism in Ethiopia’.
      In this context, the user is said to be browsing the collection rather than searching, since s/he may simply be interested in glancing around.
  • 18. Logical View of Documents
      • Documents in a collection are frequently represented by a set of index terms or keywords.
      • Such keywords are mostly extracted directly from the text of the document.
      • These representative keywords provide a logical view of the document.
      • Document representation is viewed as a continuum, in which the logical view of documents might shift from full text to index terms.
      [Figure: documents pass through tokenization, stop-word removal, stemming and indexing, moving the logical view from full text to index terms.]
  • 19. Logical View of Documents
      • If the logical view is the full text:
        – Each word in the text is a keyword
        – This is the most complex form
        – It is expensive
      • If the full text is too large, the set of representative keywords can be reduced through a transformation process called text operations.
      • Text operations reduce the complexity of the document representation and move the logical view from the full text to a set of index terms.
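
The text operations that move the logical view from full text to index terms can be sketched as follows. This is a minimal illustration under stated assumptions: the stop list is a tiny made-up one, and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemming algorithm (for example Porter's), not an implementation of one.

```python
import re

# Assumed toy stop list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lexical analysis: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop very common words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip a few common suffixes, keeping at least three characters."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Full text in, sorted set of index terms out."""
    return sorted({crude_stem(t) for t in remove_stop_words(tokenize(text))})


print(index_terms("The indexing of documents is the indexing of texts"))
```

The nine-word sentence is reduced to three index terms, which is the kind of reduction in representation complexity the slide describes.
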
  • 20. Structure of an IR System
      An information retrieval system serves as a bridge between the world of authors and the world of readers/users. Writers present a set of ideas in a document using a set of concepts; users then turn to the IR system for relevant documents that satisfy their information need.
      [Figure: User, black box, Documents. The black box is the information retrieval system.]
      To be effective in satisfying the information need of users, the IR system must ‘interpret’ the contents of the documents in a collection and rank them according to their degree of relevance to the user query. Thus the notion of relevance is at the centre of IR.
      The primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible.
  • 21. Typical IR Task
      Given:
      • A corpus of textual natural-language documents
      • A user query in the form of a textual string
      Find:
      • A ranked set of documents that are relevant to the query
  • 22. Typical IR System Architecture
      [Figure: a query string and a document corpus feed the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, …).]
  • 23. Web Search System
      [Figure: a Web spider builds the document corpus; a query string and the corpus feed the IR system, which outputs ranked documents (1. Page1, 2. Page2, 3. Page3, …).]
  • 24. What is Information Retrieval?
      • A good formal definition of information retrieval is given in Baeza-Yates & Ribeiro-Neto (1999, p. 1): “Information retrieval deals with representation, storage, organization of, and access to information items. The organization and access of information items should provide the user with easy access to the information in which s/he is interested.”
      • The definition incorporates all the important features of a good information retrieval system:
        – Representation
        – Storage
        – Organization
        – Access
      • The focus is on the user's information need.
  • 25. Overview of the Retrieval Process
  • 26. The Retrieval Process
      • It is necessary to define the text database before any of the retrieval processes are initiated.
      • This is usually done by the manager of the database and includes specifying the following:
        – The documents to be used
        – The operations to be performed on the text
        – The text model to be used (the text structure and what elements can be retrieved)
      • The text operations transform the original documents and the information needs and generate a logical view of them.
  • 27. The Retrieval Process (continued)
      • Once the logical view of the documents is defined, the database module builds an index of the text.
        – An index is a critical data structure
        – It allows fast searching over large volumes of data
      • Different index structures might be used, but the most popular one is the inverted file.
      • Once the document database is indexed, the retrieval process can be initiated.
        – The user first specifies a user need, which is then parsed and transformed by the same text operations applied to the text.
        – Next, query operations are applied before the actual query, which provides a system representation of the user need, is generated.
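
As a rough sketch of the inverted file mentioned above: each term maps to the list of documents that contain it, so a query is answered by looking up its terms rather than scanning every document. The toy collection, the whitespace tokenization and the function names `build_inverted_index` and `lookup` are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def lookup(index, query):
    """Return ids of documents that contain all the query terms."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical three-document collection.
DOCS = {1: "inverted file index", 2: "signature file", 3: "suffix tree index"}
INDEX = build_inverted_index(DOCS)

print(INDEX["file"])
print(lookup(INDEX, "file index"))
```

The index is built once, offline, and every later query only touches the posting lists of its own terms.
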
  • 28. The Retrieval Process (continued)
      • The query is then processed to obtain the retrieved documents.
        – Before the retrieved documents are sent to the user, they are ranked according to their likelihood of relevance.
      • The user then examines the set of ranked documents in search of useful information. The user has two choices:
        – (i) reformulate the query and run it on the entire collection, or
        – (ii) reformulate the query and run it on the result set.
      • At this point, s/he might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle.
        – In such a cycle, the system uses the documents selected by the user to change the query formulation.
        – Hopefully, this modified query is a better representation of the real user need.
  • 29. Detailed View of the Retrieval Process
      [Figure: the components are the User Interface, Text Operations, Query Language & Operations, Indexing, Searching, Ranking, the Index (inverted file), the DB Manager Module and the Text Database; the labelled flows are user need, text, logical view, query, retrieved docs, ranked docs and user feedback.]
  • 30. Subsystems of an IR System
      • The two subsystems of an IR system:
        – Searching: an online process of finding relevant documents in the index according to the user's query
        – Indexing: an offline process of organizing documents using keywords extracted from the collection
      • Indexing and searching are unavoidably connected:
        – You cannot search what was not first indexed in some manner
        – Documents or objects are indexed in order to make them searchable
        – There are many ways to do indexing
        – To index, one needs an indexing language
        – There are many indexing languages
        – Even taking every word in a document is an indexing language
      • Knowing searching is knowing indexing.
  • 31. Indexing Subsystem
      [Figure: indexing pipeline.] Documents → assign document identifier (document IDs) → tokenize (text to tokens) → stop list (non-stoplist tokens) → stemming & normalize (stemmed terms) → term weighting (terms with weights) → index
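
The slide does not say which term-weighting scheme is used. As one common possibility, the sketch below assumes tf-idf, where a term's weight in a document is its frequency in that document multiplied by log(N / df), with N the number of documents and df the number of documents containing the term. The input is assumed to be already tokenized, stop-listed and stemmed; the function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(tokenized_docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for terms in tokenized_docs.values():
        df.update(set(terms))
    weights = {}
    for doc_id, terms in tokenized_docs.items():
        tf = Counter(terms)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


# Hypothetical output of the earlier pipeline stages.
DOCS = {
    "d1": ["index", "term", "index"],
    "d2": ["term", "weight"],
}
WEIGHTS = tf_idf_weights(DOCS)

print(WEIGHTS["d1"])
```

A term that occurs in every document gets weight zero under this scheme, reflecting that it does not help distinguish one document from another.
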
  • 32. Searching Subsystem
      [Figure: searching pipeline.] Query → parse query (query tokens) → stop list (non-stoplist tokens) → stemming & normalize (stemmed terms) → term weighting (query terms) → similarity measure against the index terms from the index (relevant document set) → ranking (ranked document set)
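
The similarity-measure and ranking stages can be sketched with cosine similarity, which is one widely used measure; the slide itself does not name a specific one, so this choice is an assumption. Documents and the query are represented as sparse term-weight dictionaries, and the names `cosine`, `rank`, `DOC_VECS` and `QUERY` are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_vec, doc_vecs):
    """Return (doc_id, score) pairs with a non-zero score, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical weighted document vectors and query vector.
DOC_VECS = {
    "d1": {"index": 2.0, "file": 1.0},
    "d2": {"file": 1.0, "signature": 1.0},
    "d3": {"suffix": 1.0, "tree": 1.0},
}
QUERY = {"index": 1.0, "file": 1.0}

for doc_id, score in rank(QUERY, DOC_VECS):
    print(doc_id, round(score, 3))
```

Documents that share no term with the query score zero and are left out of the ranked set.
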
  • 33. Issues that Arise in IR
      • Text representation
        – What makes a “good” representation?
        – How is a representation generated from text?
        – What are retrievable objects and how are they organized?
      • Information needs representation
        – What is an appropriate query language?
        – How can interactive query formulation and refinement be supported?
      • Comparing representations (to identify relevant documents)
        – What weighting scheme and similarity measure should be used?
        – What is a “good” model of retrieval?
      • Evaluating effectiveness of retrieval
        – What are good metrics?
        – What constitutes a good experimental test bed?
  • 34. Focus in IR System Design
      Our focus during IR system design is:
      • Improving the effectiveness of the system
        – Effectiveness is measured in terms of precision, recall, …
        – Relevant techniques: stemming, stop words, weighting schemes, matching algorithms
      • Improving the efficiency of the system
        – The concerns here are storage space usage, access time, searching time, data transfer time, …
        – Space-time tradeoffs are a key concern
        – Relevant techniques: compression, data/file structures, etc.
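
The two effectiveness measures named above have standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The sketch below computes both for a single query; the function name and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved and relevant| / |retrieved|
    recall    = |retrieved and relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 documents retrieved, 3 documents actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found, which shows how a system can trade one measure against the other.
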