Chapter 1 - Introduction to IR Information retrieval ch1 Information retrieval ch1

ETHIOPIAN POLICE UNIVERSITY
DEPARTMENT OF INFORMATION TECHNOLOGY AND CYBER SECURITY
Information Storage and Retrieval
Chapter One: Introduction to Information Storage and Retrieval

Chapter Discussion Points
• IR and IR systems
• Data versus information retrieval
• IR and the retrieval process
• Basic structure of an IR system

Chapter Objectives
• By the end of this chapter, you should have a comprehensive
understanding of:
 Information retrieval
 The differences between data retrieval and information retrieval
 The details of the retrieval process
 The fundamental structure of IR systems

Brainstorming
• Consider the Google search engine as a use case and discuss:
 How does Google decide which websites to show when you search
for something?
• What do you think makes a website more likely to appear at the
top?
 What do you think happens when you type a word into Google?
• Can you describe the steps from your search to the results you
see?
 What kinds of problems do you think Google might face when trying
to find and show the right information from millions of websites?

Brainstorming
• How does Google decide which websites to show when you search for
something? What do you think makes a website more likely to appear at the
top?
 Google uses ranking algorithms to decide the order of websites.
 Factors that determine the rank include relevance to the search term, the quality
of the content, the number of other sites linking to the page, and how often it is updated.
 Websites that provide valuable, trustworthy information are often ranked higher.
• What do you think happens when you type a word into Google? Can you
describe the steps from your search to the results you see?
 Google quickly searches its massive index of web pages.
 It looks for pages that match your query, ranks them by relevance, and
then displays a list of results on the search results page.
 This process takes just seconds!

Brainstorming
• What kinds of problems do you think Google might face when trying to find
and show the right information from millions of websites?
 Google faces challenges in providing comprehensive search results for
languages that lack extensive online content or digital resources.

Introduction
• Nowadays, enormous amounts of data are being generated
continuously from various sources such as social media platforms,
sensors and more.
 Data has little value if we cannot access and search through it
effectively, which would be extremely challenging without
information retrieval systems.
• Information retrieval (IR) is the process of finding material (usually
documents) of an unstructured nature (usually text) that satisfies an
information need from large collections (usually stored on computers).
• Information retrieval deals with the representation, storage, and organization
of, and access to, information items.
 The organization of and access to information items should give users
easy access to the information in which they are interested.

General Goal of IR Systems
• To help users find useful information based on their information
needs, with minimum effort, despite:
 The increasing complexity of information
 The changing needs of users

Typical IR Task
[Figure: a query string and a document corpus are input to the IR system,
which outputs ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]
Given:
 A corpus of textual natural-language documents.
 A user query in the form of a textual string.
Find:
 A ranked set of documents that are relevant to the query.
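
The task above can be sketched in a few lines of Python. This is a minimal illustration, not how a production search engine works: the scoring rule (counting query-term occurrences), the function names `tokenize` and `rank_documents`, and the sample corpus are all assumptions made for the example.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()

def rank_documents(query, corpus):
    """Return (doc_id, score) pairs sorted by descending score.

    The score is the number of query-term occurrences in the document,
    an illustrative choice rather than a standard ranking formula.
    """
    query_terms = set(tokenize(query))
    scored = []
    for doc_id, text in corpus.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:  # documents with no query term are not retrieved
            scored.append((doc_id, score))
    # Highest score first; ties are broken by doc_id for stable output.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical three-document corpus.
corpus = {
    "Doc1": "information retrieval finds relevant documents",
    "Doc2": "databases store structured data",
    "Doc3": "retrieval of information from text documents and more information",
}
ranking = rank_documents("information retrieval", corpus)
print(ranking)
```

Doc2 shares no term with the query, so it is left out of the ranked list entirely.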

Data versus Information Retrieval
• The emphasis of IR is on the retrieval of information rather than on the
retrieval of data.

Data retrieval
 Consists mainly of determining which documents contain the set of keywords
in the user query (which is not enough to satisfy the user's information need)
 Aims at retrieving all objects that satisfy well-defined semantics
 A single erroneous object among a thousand retrieved objects implies failure

Information retrieval
 Is concerned with retrieving information about a subject or topic rather than
retrieving data that satisfies a given query
 Semantics are frequently loose: the retrieved objects might be inaccurate
 Small errors are tolerated

Data versus Information Retrieval (cont’d…)
• An example of a data retrieval system is a relational database.

Criteria          Data retrieval                 Information retrieval
Data              Structured data                Free text, unstructured
Result            Exact matches                  Partial/approximate matches
Accessibility     Knowledgeable users            Non-expert humans
Sensitivity       Single error, total failure    Small errors go unnoticed
Query language    SQL (artificial)               Natural
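
The contrast in the comparison above can be made concrete with a small sketch. The first half uses Python's built-in `sqlite3` module for exact-match data retrieval. The second half scores free-text documents by the fraction of query terms they contain. The table, names, documents, and the `partial_match` function are hypothetical and exist only for illustration.

```python
import sqlite3

# Data retrieval: a structured table queried with SQL.
# A row either satisfies the condition exactly or is not returned at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Abebe", "IT"), (2, "Sara", "Law"), (3, "Kebede", "IT")],
)
exact = conn.execute(
    "SELECT name FROM students WHERE dept = ? ORDER BY id", ("IT",)
).fetchall()
conn.close()

# Information retrieval: free text and a natural-language query.
# Documents matching only part of the query are still returned, with a lower score.
docs = {
    "d1": "police training in cyber security",
    "d2": "cyber crime investigation methods",
    "d3": "football results",
}

def partial_match(query, docs):
    """Score each document by the fraction of query terms it contains."""
    terms = set(query.lower().split())
    results = []
    for doc_id, text in docs.items():
        overlap = terms & set(text.lower().split())
        if overlap:
            results.append((doc_id, len(overlap) / len(terms)))
    return sorted(results, key=lambda r: (-r[1], r[0]))

approx = partial_match("cyber security training", docs)
print(exact)
print(approx)
```

The SQL query returns only rows whose `dept` is exactly "IT", while the text query still returns `d2` even though it contains just one of the three query terms.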

Examples of IR Systems
• Document-retrieval systems:
 Store entire documents
 Usually retrieve stored documents by title or by keywords
associated with the document.
• Reference retrieval systems:
 Store references to documents rather than the documents
themselves.
 Usually provide the titles of relevant documents and
frequently their physical locations.
 Extremely effective in libraries

Examples of IR Systems (cont’d…)
• Cross-language information retrieval: designed to retrieve
information in one language based on queries formulated in
another language.
 Accepts queries in the user's preferred language.
 Translates the query into the target language of the
document collection.
 Searches the documents for matches to the translated query.
 Ranks the retrieved documents by relevance, considering
factors like keyword matching and context.

• Question-answering IR system: designed to provide specific answers to
user queries instead of just returning a list of documents.
 Query processing: analyzes the query to identify key concepts and intent.
 Retrieval: searches a structured or unstructured data source to find
relevant information.
 Ranking: orders the retrieved documents by their relevance to the question,
using algorithms that assess factors like keyword matching,
context, and semantic meaning.
 Answer extraction: extracts potential answers from the ranked
documents, focusing on sentences or phrases that directly respond to
the query.
 Response generation: formats the final answer to ensure clarity and
conciseness.

• Image Retrieval: designed to search and retrieve images from a database or the
internet based on specific queries, often using visual content or metadata.
 Text-Based Image Retrieval: relies on metadata (titles, descriptions, tags)
associated with images.
 Searches for images that match the keywords or phrases provided by the
user.
 Content-Based Image Retrieval (CBIR): analyzes the visual content of images to
find matches.
 Utilizes features such as color, texture and shapes extracted from the
images.
 Retrieval process:
 Indexes both visual features and associated metadata.
 Compares the user’s input (text or visual) against the indexed images.
 Ranks the retrieved images by relevance to the query, considering both
visual similarity and textual metadata matches.
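
As a rough illustration of content-based image retrieval using color features, the sketch below represents each image as a list of RGB pixels, builds a coarse color histogram, and ranks images by histogram intersection with the query image. The pixel data, the bin count, and the names `color_histogram` and `histogram_intersection` are assumptions for this example. Real CBIR systems use richer features such as texture and shape.

```python
from collections import Counter

def color_histogram(pixels, bins=4):
    """Quantize each RGB channel into `bins` levels and return normalized bin frequencies."""
    step = 256 // bins
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_id: n / total for bin_id, n in counts.items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: the sum of the smaller frequency in each bin."""
    return sum(min(h1.get(b, 0.0), h2.get(b, 0.0)) for b in set(h1) | set(h2))

# Hypothetical 4-pixel "images".
red = [(250, 10, 10)] * 4
mostly_red = [(240, 20, 20)] * 3 + [(10, 10, 250)]
blue = [(10, 10, 250)] * 4

database = {"mostly_red": mostly_red, "blue": blue}
query_hist = color_histogram(red)

# Rank database images by visual similarity to the query image.
similarity = {
    name: histogram_intersection(query_hist, color_histogram(pixels))
    for name, pixels in database.items()
}
ranking = sorted(similarity, key=lambda name: -similarity[name])
print(ranking, similarity)
```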

What makes IR hard?
• Query evaluation (or retrieval process)
– To what extent does a document correspond to a query?
– Simply matching on words is a very brittle approach, because one
word can have several different meanings.
• System evaluation
– How good is a system?
– Are the retrieved documents relevant? (precision)
– Are all the relevant documents retrieved? (recall)
• Intelligent IR
– Takes into account the meaning of the words used.
– Takes into account the order of words in the query.
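
The two system-evaluation questions above correspond to the standard precision and recall measures. Precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A minimal sketch, with hypothetical document IDs:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from retrieved and relevant document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# The system returned four documents; three documents are truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 2/4), and two of the three relevant documents were found (recall 2/3).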

IR and the Retrieval Process (cont’d…)
• It is necessary to define the text database before any of the
retrieval processes are initiated.
• This is usually done by the manager of the database and includes
specifying the following
– The documents to be used
– The operations to be performed on the text
– The text model to be used (the text structure and what
elements can be retrieved)
• The text operations transform the original documents and the
information needs and generate a logical view of them

• Once the logical view of the documents is defined, the database
module builds an index of the text
– An index is a critical data structure
– It allows fast searching over large volumes of data
• Different index structures might be used, but the most popular one
is the inverted file.
• Given that the document database is indexed, the retrieval process
can be initiated.
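
An inverted file maps each term to the documents that contain it, so a lookup does not need to scan every document. The following is a minimal sketch, assuming whitespace tokenization and a hypothetical three-document collection:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "police use information retrieval",
    2: "information storage",
    3: "retrieval of stored information",
}
index = build_inverted_index(docs)
print(index["retrieval"])
```

Looking up "retrieval" returns documents 1 and 3 directly from the index.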

• The user first specifies an information need via the user interface; it is
then parsed and transformed by the same text operations applied
to the documents.
• Next, query operations are applied to generate the actual query,
which provides a system representation of the user need.
• The query is then processed to obtain the retrieved documents
(searching).
• Before the retrieved documents are sent to the user, they
are ranked according to their likelihood of relevance.
• The user then examines the set of ranked documents in search
of useful information. The user has two choices:
– reformulate the query and run it on the entire collection, or
– reformulate the query and run it on the result set.
• At this point, the user might identify a subset of the documents as
definitely of interest and initiate a user feedback cycle.
• In such a cycle, the system uses the documents selected by the
user to change the query formulation.
• The modified query is assumed to be a better representation of the real
user need than the previous one.
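
One simple way to picture the feedback cycle is query expansion: the system adds to the query the terms that occur most often in the documents the user selected. This is only one possible approach, chosen here for illustration. The function `expand_query`, its `extra_terms` parameter, and the sample texts are hypothetical.

```python
from collections import Counter

def expand_query(query, selected_docs, extra_terms=2):
    """Add the most frequent new terms from user-selected documents to the query."""
    query_terms = query.lower().split()
    counts = Counter()
    for text in selected_docs:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    # Most frequent new terms first; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return query_terms + [term for term, _ in ranked[:extra_terms]]

# The user searched for "jaguar" and marked two animal-related documents as relevant.
selected = ["jaguar habitat rainforest", "jaguar cat habitat prey"]
new_query = expand_query("jaguar", selected)
print(new_query)
```

The expanded query now leans toward the animal sense of the word, which reflects the user's real need better than the single ambiguous term.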

Basic Structure of an IR System
• An Information Retrieval System serves as a bridge between the world of
authors and the world of readers/users.
• An IR system typically consists of three main subsystems:
 Document representation
 Representation of users' requirements (queries)
 The algorithms used to match user requirements (queries) with
document representations.
As IT professionals, we should not treat the IR system as a black box; we need to open it and see how it works.
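
The three subsystems can be seen in miniature in the sketch below: documents and the query are both represented as sets of terms, and a matching function compares the two representations. Jaccard similarity is used here only as one simple example of a matching algorithm. The function names and sample documents are assumptions.

```python
def represent(text):
    """Represent a document or a query as a set of lowercase terms."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# 1. Document representation
documents = {
    "d1": "storage and retrieval of information",
    "d2": "network security basics",
}
doc_reprs = {doc_id: represent(text) for doc_id, text in documents.items()}

# 2. Query representation
query_repr = represent("information retrieval")

# 3. Matching the query representation against each document representation
scores = {doc_id: jaccard(query_repr, rep) for doc_id, rep in doc_reprs.items()}
print(scores)
```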

Pros and cons of IR System
• Pros
– Fast answers: IR systems are fast and efficient at finding and returning the
information needed from huge amounts of data.
– 24/7 availability: retrieval systems never take breaks.
• They are always active, ready to retrieve information
whenever we need it, day or night.
• Cons
– Garbage in, garbage out: meaningful results depend heavily on the accuracy and
cleanliness of the data provided.
– Overreliance on keywords: if search terms don’t match exactly,
crucial information may be missed.
– Information overload risk: the system may retrieve too much information.
