Information storage and retrieval

INFORMATION STORAGE
AND RETRIEVAL SYSTEM
Dr. Utpal Das
Dibrugarh University,
Dibrugarh, Assam
utpalishaan@gmail.com

Break up of Terminology
INFORMATION / STORAGE / RETRIEVAL / SYSTEM

INFORMATION
Media: Books, Journals, Articles, Audio, Video, Cartographs;
Text, Sound, Image, Data

STORAGE
Databases: Bibliographic, Full Text; stand-alone databases,
hypertext networked databases
System: DBMS, Classification Schemes, Indexes

RETRIEVAL
Recall, Searching, Recovering, Interpreting, Query Analysis

SYSTEM
Mechanism, Framework, Mode of Arrangement,
Interconnected Network, A Set of Principles or Procedures,
Organized Scheme or Method, Modus Operandi

Genesis
The term “Information Retrieval System” was coined by
Calvin Mooers in 1952.
IRS gained popularity in the research community only in the
early sixties, when computers were being introduced
into information handling and management.
These early information retrieval systems were essentially
document retrieval systems, since they were designed
to retrieve bibliographic information about documents stored
in databases in response to users' search requests.

Though the basics of IRS are still the same, the application
of advanced techniques has greatly widened the role and
scope of IRS. The connotation of information retrieval
has therefore changed, and it has been variously
termed by information professionals and researchers:
Information Storage and Retrieval System,
Information Organization and Retrieval System,
Information Processing and Retrieval System,
Text Retrieval System,
Information Representation and Retrieval
System,
Information Access System.

The modern connotation implies that IRS presently
deals not only with textual information but also with
multimedia information comprising text, audio, images
and video.
While many features of conventional text retrieval
systems are equally applicable to multimedia information
retrieval, the specific nature of audio, image and video
information has called for the development of many
new tools and techniques for information retrieval.
Thus, modern information retrieval systems deal with
storage, organization and access to text, as well as
multimedia information resources.

Meaning, Definition and Concept of ISRS
• ISRS is a selective, systematic recall of logically stored
information.
• ISRS is the science of searching for information in
documents, searching for documents themselves,
searching for metadata which describe documents, or
searching within databases, whether relational stand-alone
databases or hypertext networked databases such
as the Internet, the World Wide Web or intranets, for text,
sound, images or data.

• An ISRS is an information system, that is, a system used to
store items of information that need to be processed,
searched, retrieved, and disseminated to various user
populations.
• It is a process of searching some collection of documents,
using the term document in its widest sense, in order to
identify those documents which deal with a particular
subject. Any system that is designed to facilitate this
literature searching may legitimately be called an
information retrieval system.

ISRS is the study of systems for indexing, searching, and
recalling data, particularly text or other unstructured
forms.
Information retrieval may be defined as the technique
and process of searching, recovering, and interpreting
information from large amounts of stored data.
It is the recovery of information, especially from a database
stored in a computer.

IR is essentially concerned with the structure and operation
of devices that select documentary information in
response to a search query.
An IRS does not inform the user on (i.e. change the knowledge
of) the subject of his enquiry; it merely informs him of the
existence or non-existence, and the whereabouts, of
documents relating to his request.

An information retrieval system is designed to
analyse, process and store sources of information
and retrieve those that match a particular user’s
requirements
[Chowdhury, G.G. (2004). Introduction to modern
information retrieval. 2nd ed. London: Facet
Publishing].

Basic aspects of ISRS:
Information Storage and Retrieval (ISAR) system deals with
three basic aspects:
Information representation
Information storage and organisation
Information access.

BROAD OUTLINE
[Figure: Functional View of Standard IR System. Boxes shown:
Information sources; Analysis & Representation; Organised
Information; Users Query; Analysis; Analysed Queries;
Matching; Retrieved Information.]

CHARACTERISTICS OF ISAR SYSTEMS
Information Facilitator
The ISAR system should act as a facilitator between the
information (contained in documents) and the users. If a
user approaches it with a subject term, the name of a
contributor, the title of a document and so on, the system
should help him find the desired information. The result
could be the exact information or a reference to a document
which contains the information.

Non-Ambiguous
The system should be so organized that ambiguity of
information is avoided so that search result is free from
any kind of ambiguity. This requires identification of
terms, setting their context and their proper indexing.
For example, search for a term ‘screw driver’ should not
bring results like ‘truck driver’, ‘hardware driver’ and so
on.

OBJECTIVES OF ISAR SYSTEMS
Minimum Time
The system should be so designed that minimum effort and
time are spent interrogating it. Searching
through the system should take minimum time, meaning
that the ISAR system should be capable of performing
fast searches. Not only that, it is best to have an online
ISAR system so that users do not need to walk to the library.
They should get whatever they want at their workplace.

User Friendliness
Ease of use is an important consideration for any ISAR system.
Any ISAR system should have a user-friendly interface. The
important aspects of the system should be highlighted. Before
using the system, a user should be properly introduced to it
and all its features, i.e., informed about the scope of the
system, the available search options, and most importantly how
to perform a search. It is only through this interface
that a user operates an ISAR system. Take the
example of a Library OPAC. It should have the following features:
Introduction to library
Scope of collection
Instructions for performing search

The search interface should facilitate framing the search
like:
Keyword search
Author and title search
Combination search (using Boolean operators)
Proximity search, etc.

Others
The desirability of making systems as readily usable as
possible for their clienteles
The need to recognise basic features of retrieval system
To incorporate coordinating features such as vocabulary
control, search strategies, user-interface, information
modelling aspects in general, etc.

The competence and compatibility for consolidated
searching and retrieval of information from any client
terminal from any database within the system.
It should be able to narrowcast or broadcast or relate the
information need in a variety of associations to get
optimum retrieval performance.
It should have access facilities at multi-points.
It should have common command language facility to
retrieve information from several databases of the
system

It should be able to handle information access from entity-
related or object-oriented approaches. It may also
provide all other associations for accessing information.
In a bibliographic or full-text database, the surrogates
chosen should have indicative as well as informative
features that are sufficient to select or reject the
retrieved information based on end-users’ needs.
It should have the ability to select, classify, process and
consolidate the analysed information into a cohesive text
ready for assimilation by the end-users.

It should be able to orient the information to the specialist
needs of the users from time to time. This calls for
understanding the processing of user profiles.
It should be able to retrieve maximum information with
a minimum number of clues.
It should help clarify the fuzzy (vaguely expressed) approaches
of end-users, and the final result should satisfy
the searcher.
It should have the capacity to interchange the information
available in one database with another for purposes of
retrieval, relevance and end use.

It should have bibliographic data interchange capacity
(using Z39.50 or similar standard) to meet consolidation
to a chosen format for networking and other purposes.
Compatibility with standards at all levels must be the goal.
It should be able to search simple information quickly
and easily, and also be able to multi-track
complex questions and present the answers in a
simple manner. User-friendly presentations are very
important.

FUNCTIONS
To identify the information (sources) relevant to the areas
of interest of the target user community; this is a
challenging job, especially in the web environment,
where virtually everybody in the world can be a
potential user of a web-based information retrieval
system.
To analyse the contents of the sources (documents); this is
becoming increasingly challenging as the size, volume
and variety of information sources (documents) are
increasing rapidly; in web information retrieval it is carried
out automatically using specially designed programs
called spiders.

To represent the contents of analysed sources in a way that
matches users’ queries; this is done by automatically
creating one or more index files, and is becoming an
increasingly complex task due to the volume and variety
of content and increasing user demands.
To analyse users’ queries and represent them in a form that
will be suitable for matching the database; this is done in
a number of ways, through the design of sophisticated
search interfaces including those that can provide some
help to users for selection of appropriate search terms by
using dictionary and thesauri, automatic spell checkers, a
predefined set of search statements and so forth.

To match the search statement with the stored database; a
number of complex information retrieval models have
been developed over the years that are used to determine
the similarity of the query and stored documents.
To retrieve relevant information; a variety of tools and
techniques are used to determine the relevance of
retrieved items and their ranking.
To make continuous changes in all aspects of the system,
keeping in mind the rapid developments in information
and communication technologies (ICTs) relating to
changing patterns of society, users and their information
needs and expectations.

Design of Information Retrieval System
To design and develop an ISAR system one needs to
recognize the needs of the users, as all the
subsequent activities depend upon them.
The design of an ISAR system should follow the system
development life cycle (SDLC) for greater
efficiency and effectiveness.

System Development Life Cycle Phases:

1. System Planning:
i. Defining the problems,
ii. Objectives and need
iii. Resources (such as personnel
and costs).
After analyzing the data for planning, one has three
choices:
Develop a new system,
Improve the current system, or
Leave the system as it is.

2. System Analysis:
i. Determining end-user’s requirements,
ii. Their expectations from the system,
iii. Performance of the System
iv. Feasibility study
3. System Design:
i. Elements of a system,
ii. Components,
iii. Security level,
iv. Modules,
v. Architecture
vi. Interfaces
vii. Type of data
(system design meets all functional and technical requirements,
logically and physically)

4. Implementation and Deployment
i. It is the actual construction process
ii. In the Software Development Life Cycle, the
actual code is written here
iii. In the Hardware Development Life Cycle, the
implementation phase consists of
configuration and fine-tuning
iv. The system becomes ready to run,
live and productive

5. System Testing and Integration
i. Introducing the system to different inputs
ii. Obtaining its outputs and analysing its behaviour
iii. Observing the way it functions
(Testing is important to ensure customer satisfaction;
this kind of black-box testing requires no knowledge of
coding, hardware configuration or design)
6. System Maintenance
i. periodic maintenance to prevent redundancy
ii. Replacing the old hardware
iii. Periodical evaluation of system’s performance,
iv. latest updates for certain components with latest
technologies to face current security threats.

Steps for Design of Information Retrieval System
Steps for designing an Information Retrieval System:
i. Recognizing the need for development of ISAR system
ii. Recognizing the information needs of the users
iii. Identification of users need
iv. Type(s) of databases to be incorporated into the system
v. Features to be incorporated in the databases
vi. Preparation of structured queries
vii. Design and development of various components of the
system such as user interface, search agent, etc.
viii. Evaluation of the system
ix. Re-designing/Modification of ISAR system, if needed.

Need & Purpose
The basic purpose of ISRS is to satisfy the information needs
of various classes of users:
a) Current Information Need,
b) Exhaustive Information Need,
c) Everyday Information Need, and
d) Catching-up or Brushing-up Information Need

An IRS is designed to retrieve the documents or information
required by the user community.
It should make the right information available to the right
user. Thus, an information retrieval system aims to collect
and organize information in one or more subject areas in
order to provide it to users as soon as they ask for it.
A writer presents a set of ideas in a document using a set of
concepts.

Somewhere there are users who require these ideas but
may not be able to identify them; in other words,
some people lack the ideas put forward by the
author in the work.
An IRS matches the writer’s ideas expressed in the
document with the users’ requirements for them.
Thus, an IRS serves as a bridge between the world of
creators or generators of information and the users
of that information.

Components for Design of ISRS
An ISAR system has 3 basic components:
I. User Interface
II. Knowledge Base
III. Search Agent

I. User Interface:
User interface is the front page or the front-end or (User’s)
operational area of the system which enables user to
put a query and displays results.
It is of two types:
i. Query Interface
ii. Result Interface

i. Query Interface:
This is the end from which users enter their search
terms and initiate communication with the system. The
Query Interface generally needs to have the following
features:
a) Understanding the user input statement
This front-end interface needs to understand the
keywords given by the users and capture them to pass
on to the search program. The front-end should have an
understandable look and feel, distinguishable colour
combinations, and search instructions.

b) Refining the problem statement
The interface should be flexible enough to refine any query
or statement further: narrowing down from a broader to a more
specific search, or modifying the search within the displayed
results, with some kind of arrangement among topical
terms that facilitates browsing through the system.
c) Search statement to search strategy translation
The system front-end should be able to translate a
search statement and formulate a search strategy in the
language understood by the Search Agent.
For example, an interface built in a Relational Database
Management System (RDBMS) environment accepts the search
statement in Structured Query Language (SQL) format and
formulates the search strategy with the help of the Search Agent
(using Boolean operators or other algorithms).
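As a minimal sketch of this translation step, the Python snippet below turns a keyword-style search statement into a parameterized SQL query and runs it with the standard-library sqlite3 module. The table `books`, its columns, the sample rows and the helper `build_query` are hypothetical, chosen only for illustration; a real system would use its own schema.

```python
import sqlite3

def build_query(terms, operator="AND"):
    """Translate a list of search terms into SQL text plus bound parameters.

    Hypothetical helper: assumes a table `books` with a `title` column.
    """
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    if not terms:
        raise ValueError("at least one search term is required")
    # One LIKE condition per term; values are bound, never pasted into the SQL.
    conditions = f" {operator} ".join("title LIKE ?" for _ in terms)
    sql = f"SELECT title FROM books WHERE {conditions} ORDER BY title"
    params = [f"%{term}%" for term in terms]
    return sql, params

# Tiny in-memory database standing in for the Knowledge Base (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?)",
    [
        ("Introduction to Modern Information Retrieval",),
        ("Database Systems",),
        ("Information Storage",),
    ],
)

sql, params = build_query(["information", "retrieval"], operator="AND")
results = [row[0] for row in conn.execute(sql, params)]
print(sql)
print(results)
```

Binding the terms as parameters, rather than concatenating them into the SQL string, keeps user input from being interpreted as SQL.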

d) Modification of search strategy
If the desired output is not obtained from the database, the
ISAR system should have a procedure for further modification
of the search strategy. The modification should be interactive.
Vocabulary control devices can also be added as an aid
for users to locate the terms of interest to them.
For example: modifying the search with the help of other
options like ‘Contains’, ‘Exact’, ‘Begins with’, ‘Ends with’,
etc.
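One way to picture these options is as interchangeable match predicates. The Python sketch below is illustrative only: the function names `matches` and `search`, the sample titles and the case-insensitive comparison are assumptions, not features of any particular system.

```python
def matches(value, term, mode="contains"):
    """Return True if `value` satisfies `term` under the chosen match mode.

    Mode names mirror the options in the text; comparison is
    case-insensitive (an assumption made for this sketch).
    """
    value, term = value.lower(), term.lower()
    if mode == "contains":
        return term in value
    if mode == "exact":
        return value == term
    if mode == "begins with":
        return value.startswith(term)
    if mode == "ends with":
        return value.endswith(term)
    raise ValueError(f"unknown match mode: {mode}")

# Made-up titles for illustration.
titles = ["Information Retrieval", "Retrieval Systems", "Modern Information Retrieval"]

def search(term, mode):
    """Re-run the same term under a different mode to modify the strategy."""
    return [t for t in titles if matches(t, term, mode)]

for mode in ("contains", "exact", "begins with", "ends with"):
    print(mode, "->", search("retrieval", mode))
```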

ii. Result Interface
In the Result Interface, the display of search results
should be user friendly.
The results should not only cater to the needs of
individual users; the display should also be
customizable (as in e-resource publishers' interfaces).
Search results should also show a rating (relevance score)
for each item in the light of the search terms. Statistical
techniques can be used for this purpose.

II. Knowledge Base
The store house of any ISAR system is its Knowledge Base. It
contains list of facts or related facts (information). Any kind of
query is answered based on the facts stored in the Knowledge
Base. A Knowledge Base could be a Database Management
System (DBMS).
A knowledge base (KB) is a technology used
to store complex structured and unstructured information used
by a computer system.
A knowledge-based system consists of a knowledge base that
represents facts about the world and an inference engine that
can reason about those facts and use rules and other forms of
logic to deduce new facts or highlight inconsistencies.

Retrieval of information from storage depends
on two important aspects of Knowledge Base:
A. Knowledge Representation
B. Indexing and Clustering

A. Knowledge Representation:
The first and foremost objective in constructing an
ISAR system is representation of facts within the
Knowledge Base.
There are different ways of representation of
knowledge:
a) Semantic Network Knowledge Representation
b) Frame Based Knowledge Representation
c) Rule-Based Knowledge Representation

a) Semantic Network Knowledge Representation
A semantic network is a method of knowledge representation
based on a network structure. A semantic network
contains points called nodes connected by links called
arcs. The nodes represent objects, concepts or
events; in other words, documents or information. The
arcs represent the relations between the
nodes and build a kind of hierarchy in the Knowledge
Base. Arcs usually represent relations like is_a or
has_part.
Semantic networks are useful for representing
sentences of natural language.

Semantics is the linguistic and philosophical study
of meaning, in language, programming languages,
formal logics, and semiotics.
It is concerned with the relationship between signifiers—
like words, phrases, signs, and symbols—and what they
stand for in reality, their denotation.

In LISP Programming Language:
;; Each entry is a node followed by its (arc target) pairs.
(setq *database*
      '((canary  (is-a bird)
                 (color yellow)
                 (size small))
        (penguin (is-a bird)
                 (movement swim))
        (bird    (is-a vertebrate)
                 (has-part wings)
                 (reproduction egg-laying))))

Also, setq can be used to assign different values to different
variables. The first argument is bound to the value of the
second argument, the third argument is bound to the
value of the fourth argument, and so on. For example,
you could use the following to assign a list of trees to the
symbol trees and a list of herbivores to the
symbol herbivores:
(setq trees '(pine fir oak maple)
      herbivores '(gazelle antelope zebra))

To set the value of the variable carnivores to the
list '(lion tiger leopard) using setq, the following
expression is used:
(setq carnivores '(lion tiger leopard))
This is exactly the same as using set except the first
argument is automatically quoted by setq. (The ‘q’
in setq means quote.)
With set, the expression would look like this:
(set 'carnivores '(lion tiger leopard))
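To show how such a network is actually used, the canary/penguin/bird facts from the Lisp *database* above can be queried by following is-a arcs upward. The Python sketch below is only illustrative: the dictionary layout and the function name `lookup` are assumptions made for this example.

```python
# Same facts as the Lisp *database*: node -> {arc label: target}.
network = {
    "canary": {"is-a": "bird", "color": "yellow", "size": "small"},
    "penguin": {"is-a": "bird", "movement": "swim"},
    "bird": {"is-a": "vertebrate", "has-part": "wings",
             "reproduction": "egg-laying"},
}

def lookup(node, arc):
    """Follow is-a arcs upward until a node defines `arc`; None if not found."""
    seen = set()  # guards against accidental cycles in the is-a chain
    while node in network and node not in seen:
        seen.add(node)
        properties = network[node]
        if arc in properties:
            return properties[arc]
        node = properties.get("is-a")
    return None

print(lookup("canary", "color"))     # stored directly on the canary node
print(lookup("canary", "has-part"))  # inherited from bird via is-a
print(lookup("penguin", "color"))    # not recorded anywhere in the network
```

The second call illustrates inheritance: the canary node says nothing about wings, but the answer is found one is-a step up.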

Complexity in Semantic Network Knowledge Representation
The idea of semantic networks started out as a natural way to
represent labelled connections between entities. But, as the
representations are expected to support increasingly large
ranges of problem solving tasks, the representation schemes
necessarily become increasingly complex
In particular, it becomes necessary to assign more structure to
nodes, as well as to links. For example, in many cases we need
node labels that can be computed, rather than being fixed in
advance. It is natural to use database ideas to keep track of
everything, and the nodes and their relations begin to look
more like frames.

b) Frame Based Knowledge Representation
The original idea of frames was developed by Minsky
(1975) who defined them as “data structures for
representing stereotyped situations”, such as going into
a class room.
It is an object-oriented approach. A frame represents an
object (document or information), a class of objects
(collection of documents or information), or several facts.
When frames represent a class of objects, they generalize
a group by identifying the overall properties that the
members of the group share.

The pointers where properties are stored are known as
slots. Similarly, if frame represents an object, slots
represent the properties or attributes of the object.
Slots contain value for that particular attribute.
For example, a book in a library is an object, therefore it
can be represented as frame. The properties of book,
i.e., Title, Author, Place, Publisher and so on are stored
as slots and each slot would have corresponding value.

Frame: Book

Slot         Value
Title        Information Storage & Retrieval
Author       G. G. Chaudhury
Publisher    Ess Ess Publication
Place        New Delhi
Size         18 X 14 cm

The simplest type of frame is just a data structure with
similar properties and possibilities for knowledge
representation as a semantic network, with the same
ideas of inheritance and default values
Frames become much more powerful when their slots can
also contain instructions (procedures) for computing
things from information in other slots or in other frames

Basic Idea: A frame consists of a selection of slots which
can be filled by values, or procedures for calculating
values, or pointers to other frames. For example:

Frame: Class Room
is-a: Room
Location: Department
Contains: {Desk, Bench, Black Board, Table, Chairs, ...}
...

Frame: Class Room Chair
is-a: Chair
Location: Class Room
Height: 20-40 cm
Legs: 4
Comfortable: Yes
Use: Sitting
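The three kinds of slot filler (values, procedures, pointers to other frames) can be sketched in Python as follows. This is a simplified illustration, not a standard frame library: the `Frame` class, the `description` slot, and the choice to store Legs and Use on a parent Chair frame (to demonstrate inheritance) are all assumptions made for this example.

```python
class Frame:
    """A frame: named slots holding values, procedures, or pointers to frames."""

    def __init__(self, name, is_a=None, **slots):
        self.name = name
        self.is_a = is_a      # pointer to a parent frame (or None)
        self.slots = slots

    def get(self, slot):
        frame = self
        while frame is not None:
            if slot in frame.slots:
                value = frame.slots[slot]
                # A callable slot is a procedure: compute the value on demand.
                return value(self) if callable(value) else value
            frame = frame.is_a  # not found here: inherit from the parent
        raise KeyError(slot)

chair = Frame("Chair", use="Sitting", legs=4)
class_room = Frame("Class Room", location="Department")
class_room_chair = Frame(
    "Class Room Chair",
    is_a=chair,
    location=class_room,   # pointer to another frame
    height_cm=(20, 40),    # plain value
    # Procedure slot (hypothetical): derived from other slots and frames.
    description=lambda f: f"{f.get('legs')}-legged chair in {f.get('location').name}",
)

print(class_room_chair.get("use"))          # inherited from Chair
print(class_room_chair.get("description"))  # computed by the procedure slot
```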

Frames of this type are now generally referred to as Scripts.
Attached to each frame will then be several kinds of
information. Some information can be about how to use
the frame. Some can be about what one can expect to
happen next, or what one should do next. Some can be
about what to do if our expectations are not confirmed.
Then, when one encounters a new situation, one can
select from memory an appropriate frame and this can be
adapted to fit reality by changing particular details as
necessary
A complete frame based representation will consist of a
whole hierarchy or network of frames connected
together by appropriate links/pointers

c) Rule-Based Knowledge Representation
Rule based representation is a popular approach. Rules are
employed to state the way in which the inference has to
be done.
Rules provide a formal way of representing recommendations,
directives, or strategies. Rules are appropriate when the
domain knowledge results from empirical associations
developed through years of experience in solving problems
in a given area.

Rules are expressed in the form of IF-THEN statements.
For example:
If search is in collection of BOOKS THEN display Title,
Author, Place, Publisher, Year, Physical Description, ISBN
If search is in collection of ARTICLES THEN display Title,
Author, Name of Journal, Volume, Issue, Year, ISSN
Formalism: a rule relates an antecedent clause (condition)
to a consequent clause (action) by implication, e.g.
IF (A AND B) THEN S1

The syntax structure is
IF <premise> THEN <action>
<premise>: a Boolean expression. The AND, and to a lesser
degree OR and NOT, logical connectives are
possible.
<action>: a series of statements

In a rule based expert system, the domain knowledge is
represented as a set of rules that are checked against a
collection of facts or knowledge about the current
situation.
When the IF portion of the rule is satisfied by the facts, the
action specified by the THEN portion is performed. When
the condition is satisfied the rule is said to ‘fire’ or
‘execute’. A rule interpreter is used to compare the IF
portions of rules with the facts and execute the rule
whose IF portion matches the facts.
This is a real success story of AI – tens of thousands of
working systems deployed into many aspects of life
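The firing of rules by a rule interpreter can be illustrated with the BOOKS and ARTICLES display rules given earlier. The Python sketch below is a deliberately small assumption-laden example: the names `RULES` and `run_rules` and the representation of facts as a dictionary are choices made here, not part of any specific expert-system shell.

```python
# Each rule: (IF part: condition on the facts, THEN part: action).
RULES = [
    (
        lambda facts: facts.get("collection") == "BOOKS",
        lambda facts: ["Title", "Author", "Place", "Publisher", "Year",
                       "Physical Description", "ISBN"],
    ),
    (
        lambda facts: facts.get("collection") == "ARTICLES",
        lambda facts: ["Title", "Author", "Name of Journal", "Volume",
                       "Issue", "Year", "ISSN"],
    ),
]

def run_rules(facts, rules=RULES):
    """Fire every rule whose IF part is satisfied; collect the THEN results."""
    fired = []
    for condition, action in rules:
        if condition(facts):              # compare the IF part with the facts
            fired.append(action(facts))   # the rule 'fires'
    return fired

print(run_rules({"collection": "ARTICLES"}))
print(run_rules({"collection": "MAPS"}))   # no rule matches, nothing fires
```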

Normally, the term 'rule-based system' is applied to systems
involving human-crafted or curated rule sets. Rule-based
systems constructed using automatic rule inference, such
as rule-based machine learning, are normally excluded from
this system type
Rule-based systems are used as a way to store and manipulate
knowledge to interpret information in a useful way. They are
often used in artificial intelligence applications and research.
A rule-based system (RBS, or production system) is a
knowledge-based system (KBS) in which the knowledge is stored
as rules; an expert system is an RBS in which the rules come
from human experts in a particular domain.

B. Indexing and Clustering
Indexing
An index or database index is a data structure which is used
to quickly locate and access the data in a database table.
Indexing is a way to optimize performance of a database by
minimizing the number of disk accesses required when a
query is processed.

Indexes are created using some database columns:
• The first column is the Search key that contains a copy of
the primary key or candidate key of the table. These values
are stored in sorted order so that the corresponding data
can be accessed quickly (Note that the data may or may
not be stored in sorted order).
• The second column is the Data Reference which contains a
set of pointers holding the address of the disk block where
that particular key value can be found.
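The two-column structure described above (sorted search keys plus data references) can be sketched in a few lines of Python using the standard-library bisect module. The records, the use of list positions as stand-ins for disk block addresses, and the function name `find` are assumptions made for illustration.

```python
import bisect

# Data file: records stored in arbitrary (unsorted) order. The list position
# stands in for a disk block address. Values are made up for illustration.
data_file = [
    {"id": 42, "title": "Record C"},
    {"id": 7, "title": "Record A"},
    {"id": 19, "title": "Record B"},
]

# Index: (search key, data reference) pairs kept in sorted key order.
index = sorted((record["id"], position)
               for position, record in enumerate(data_file))
search_keys = [key for key, _ in index]

def find(key):
    """Binary-search the sorted keys, then follow the pointer to the record."""
    i = bisect.bisect_left(search_keys, key)
    if i < len(search_keys) and search_keys[i] == key:
        return data_file[index[i][1]]
    return None

print(index)      # sorted keys, each with its pointer into the data file
print(find(19))
print(find(8))    # key not present
```

Because the keys are sorted, the lookup needs only a logarithmic number of comparisons even though the data file itself is unsorted.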

Clustered Indexing
• A clustering index is defined on an ordered data file. The data
file is ordered on a non-key field. In some cases, the index is
created on non-primary key columns which may not be
unique for each record. In such cases, in order to identify
the records faster, two or more columns are grouped
together to get unique values and an index is created out of
them. This method is known as clustering index.
• Basically, records with similar characteristics are grouped
together and indexes are created for these groups.
• For example, students studying in each semester are
grouped together, i.e. 1st semester students, 2nd semester
students, 3rd semester students, etc.

III. Search Agent
Search Agents are vital components of any ISAR system.
These are basically programs which take input from the
Search Interface and search the Knowledge Base
using the existing index. A good ISAR system means efficient
retrieval. Thus, a good search agent must be equipped
with the following features:
facility of using Boolean operators
context setting to search terms
use of clustering algorithms
use of phonetic algorithms
(soundex and metaphone algorithms)
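Phonetic algorithms let a search agent match names that sound alike but are spelled differently. The Python sketch below implements the commonly described American Soundex rules (first letter kept, consonants mapped to digits, adjacent equal codes collapsed, H and W ignored as separators, result padded to four characters). It is offered as an illustration, not as the exact variant any given database uses.

```python
# Consonant groups and their Soundex digits; vowels, H, W and Y get no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Return the four-character Soundex code of `name` ('' if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]                      # the first letter is kept as is
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = CODES.get(c, "")
        if digit and digit != previous:    # skip repeats of the same code
            code += digit
        if c not in "HW":                  # H and W do not separate equal codes
            previous = digit
    return (code + "000")[:4]              # pad or truncate to four characters

for name in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
    print(name, soundex(name))
```

Names with the same code (here Robert and Rupert) would be retrieved together by a phonetic search.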

Boolean Operators
Boolean Operators are simple words (AND, OR, NOT or AND
NOT) used as conjunctions to combine or exclude keywords
in a search, resulting in more focused and productive
results.
AND and NOT operators increase precision whereas OR
increases recall of search results. The shaded area in the
diagram represents retrieved records in the following
example.

Using these operators can greatly reduce or expand the
number of records returned.
Boolean operators save time by focusing
searches on more 'on-target' results that are more
appropriate to your needs, eliminating unsuitable or
inappropriate ones.
Each search engine or database collection uses Boolean
operators in a slightly different way, or may require the
operator to be typed in capitals or with special punctuation.
The specific phrasing will be found in either the guide to
the specific database found in Research Resources or the
search engine's help screens.

AND: requires both terms to be in each item returned. If
one term is contained in the document and the other is
not, the item is not included in the resulting list.
(Narrows the search)
Example: A search on stock market AND trading returns
results containing: stock market trading; trading on the
stock market; and trading on the late afternoon stock
market.

OR: either term (or both) will be in the returned
document. (Broadens the search)
Example: A search on ecology OR pollution returns
documents containing the word ecology (but
not pollution), documents containing the word
pollution (but not ecology), and documents with both
ecology and pollution, in any order and any number of uses.

NOT or AND NOT (depending on the coding of the
database's search engine): the first term is searched,
then any records containing the term after the operator
are subtracted from the results. (Use it with care, as
the attempt to narrow the search may be too exclusive
and eliminate good records.) If you need to search for the
word not itself, that can usually be done by placing double
quotes (" ") around it.
Example: A search on Mexico AND NOT city returns results
containing: New Mexico; the nation of Mexico; US-Mexico
trade; but does not return Mexico City or "This city's
trade relationships with Mexico".

Using Parentheses: using ( ) to enclose search
strategies will customize your results to reflect your
topic more accurately. Search engines deal with search
statements within the parentheses first, then apply any
statements that are not enclosed.
Example: A search on (smoking OR tobacco) AND cancer
returns articles containing: smoking and cancer; tobacco
and cancer; smoking, cancer, and tobacco; but does not
return articles on smoking or tobacco in which cancer is
not mentioned.
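Boolean operators correspond directly to set operations on the lists of documents that contain each term. The Python sketch below builds a tiny inverted index and evaluates two queries; the four documents and the helper name `docs` are made up for illustration, and no stop-word removal or stemming is attempted.

```python
# Made-up document collection: id -> text.
documents = {
    1: "smoking and lung cancer",
    2: "tobacco use and cancer risk",
    3: "smoking cessation programmes",
    4: "cancer screening",
}

# Inverted index: term -> set of ids of the documents containing it.
inverted = {}
for doc_id, text in documents.items():
    for term in text.split():
        inverted.setdefault(term, set()).add(doc_id)

def docs(term):
    """Return the set of document ids containing `term` (empty if none)."""
    return inverted.get(term, set())

# (smoking OR tobacco) AND cancer: union first, then intersection.
result_and = (docs("smoking") | docs("tobacco")) & docs("cancer")

# cancer NOT smoking: set difference.
result_not = docs("cancer") - docs("smoking")

print(sorted(result_and))
print(sorted(result_not))
```

OR (union) can only enlarge the result set and AND (intersection) or NOT (difference) can only shrink it, which is why the former favours recall and the latter precision.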

Context Setting
Context Setting requires content analysis of document.
Here one analyses document manually or automatically
in order to preserve the context of each term in the
index.
It can be done in two ways:
i. Conceptual Analysis
ii. Relational Analysis.

Conceptual analysis
Conceptual analysis can be thought of as establishing the
frequency of concepts. Concepts can be represented by text as
well as pictures. To analyze a concept, one looks for the
appearance of words in the text. The same word need not
always appear; synonymous terms may be present
instead.
For example, if one is analyzing whether a certain document
is about freedom, then one should also look for related
words like liberation, independence, etc.
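Counting a concept through its synonyms can be sketched as follows in Python. The concept vocabulary `CONCEPTS`, the function name `concept_frequency` and the sample sentence are hypothetical; a real system would draw its synonym sets from a thesaurus.

```python
import re
from collections import Counter

# Hypothetical concept vocabulary: concept -> words that signal it.
CONCEPTS = {"freedom": {"freedom", "liberation", "independence"}}

def concept_frequency(text, concepts=CONCEPTS):
    """Count how often each concept occurs, via any of its synonyms."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for concept, synonyms in concepts.items():
        counts[concept] = sum(1 for w in words if w in synonyms)
    return counts

# Made-up sample text.
sample = ("The struggle for independence was a struggle for freedom; "
          "liberation came later.")
print(concept_frequency(sample))
```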

Relational analysis
Relational analysis goes one step further by examining the
relationships among concepts in a text. In relational
analysis we look for the related words that
appear next to the word in question.
For example, one can see which words appear next to
freedom and then determine the related concepts.
Freedom:
i. Freedom of speech and expression: Article 19 (1) (a) of the
Constitution of India, Fundamental Rights & Duties, ….
ii. Freedom of opinion and expression: Article 19 of the UN
Universal Declaration of Human Rights, Citizen’s
responsibility, ….
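A simple way to approximate this is to count the words that occur within a short window after the word in question. In the Python sketch below, the function name `neighbours`, the window size of two words and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter

def neighbours(text, target, window=2):
    """Count the words appearing within `window` positions after `target`."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word == target:
            counts.update(words[i + 1 : i + 1 + window])
    return counts

# Made-up sample text.
sample = ("Freedom of speech is a fundamental right. "
          "Freedom of opinion and freedom of expression are protected.")
related = neighbours(sample, "freedom")
print(related.most_common())
```

The words that co-occur with freedom (speech, opinion, expression) point to the related concepts, while distant words such as right are ignored.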

Clustering Algorithms
Clustering is one of the most common exploratory data analysis
techniques, used to get an intuition about the structure of the
data. It can be defined as the task of identifying subgroups in
the data such that data points in the same subgroup (cluster)
are very similar while data points in different clusters are
very different.
Clustering is a method by which a large set of data is divided
into groups or clusters of smaller sets of similar data based
on some characteristics.
A cluster refers to a collection of data points aggregated
together because of certain similarities.
For example, in a group of players one can cluster the players
according to the game they specialise in, like those who play
cricket, those who play hockey and so on.

A clustering algorithm attempts to identify natural groups
of components or data based on some similarity in a
given population. In other words, it is a method to
create subclasses in a given class. The first step in many
such algorithms is the identification of a core entity, which
is also known as the centroid.
A centroid is the imaginary or real location representing
the center of the cluster. Similar entities are identified
around the centroid.
In a clustering algorithm, the final goal is to represent the
unordered data in an organized way by dividing it into
clusters.

K-means Algorithm
The K-means algorithm tries to partition the dataset into
K pre-defined, distinct, non-overlapping subgroups (clusters),
where each data point belongs to only one group. It tries to
make the intra-cluster data points as similar as possible while
also keeping the clusters as different (far apart) as possible.
It assigns data points to a cluster such that the sum of the
squared distance between the data points and the cluster’s
centroid (arithmetic mean of all the data points that belong
to that cluster) is at the minimum.
The less variation we have within clusters, the more
homogeneous (similar) the data points are within the same
cluster.

K-Means Clustering
The K-means algorithm identifies k centroids and then allocates
every data point to the cluster of its nearest centroid, while keeping
the distances between the points and their centroid as small as possible.
The ‘means’ in K-means refers to averaging of the data; that is, finding
the centroid.
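The assign-then-average loop described above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names, the random choice of initial centroids, the fixed seed and the six sample points are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Arithmetic mean of a non-empty list of points (the centroid)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters

# Two visually obvious groups of illustrative 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

On well-separated data such as this the loop settles after a few passes; on real data the result depends on the initial centroids, which is why implementations usually try several starting configurations.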

Mean Shift Clustering Algorithm
Mean Shift is an unsupervised clustering algorithm that groups
data directly without being trained on labelled data. It is a
mode-seeking algorithm: step by step, it moves candidate cluster
centres towards regions of higher data density, and the number
of clusters does not have to be fixed in advance.
Mean Shift essentially starts off with a kernel, which is basically
a circular sliding window. The bandwidth, i.e. the radius of
this sliding window, is decided in advance by the user.

A very high-level view of the algorithm is as follows:
STEP 1: Pick any random point, and place the window on that
data point.
STEP 2: Calculate the mean of all the points lying inside this
window.
STEP 3: Shift the window, such that it is lying on the location of
the mean.
STEP 4: Repeat till convergence
Mean shift clustering aims to discover “blobs” in a
smooth density of samples. It is a centroid-based
algorithm, which works by updating candidates for
centroids to be the mean of the points within a given
region. These candidates are then filtered in a
post-processing stage to eliminate near-duplicates to form
the final set of centroids.
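Steps 1 to 4 for a single window can be sketched as follows. The sketch assumes a flat circular window (every point inside the radius counts equally) and a starting position chosen by the caller; the function name `shift_window`, the tolerance and the sample points are illustrative. A full Mean Shift run would repeat this from many starting points and then merge near-duplicate centres.

```python
import math

def shift_window(points, start, bandwidth, tolerance=1e-6, max_steps=100):
    """Move one circular window of radius `bandwidth` until it stops shifting."""
    centre = start
    for _ in range(max_steps):
        # STEP 2: mean of all points lying inside the window.
        inside = [p for p in points if math.dist(p, centre) <= bandwidth]
        mean = tuple(sum(coords) / len(inside) for coords in zip(*inside))
        # STEP 4: stop once the window no longer moves noticeably.
        if math.dist(mean, centre) < tolerance:
            break
        # STEP 3: shift the window so that it sits on the mean.
        centre = mean
    return centre

# Illustrative data: a dense group near (1, 1) and two points near (5, 5).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.2),
          (5.0, 5.0), (5.2, 4.9)]
# STEP 1: place the window on one of the data points.
centre = shift_window(points, start=(1.0, 1.0), bandwidth=1.0)
print(centre)
```

Starting on a data point guarantees the first window is not empty, and the window then settles on the centre of the dense group around (1, 1).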

Mean-Shift Clustering: in a single window
What we are trying to achieve here is to keep shifting the
window to a region of higher density. This is why we keep
shifting the window towards the centroid of all the points in
the window. This feature is what makes the Mean Shift algorithm
a hill-climbing algorithm.

Mean-Shift Clustering: entire process

Density-Based Spatial Clustering

Expectation–Maximization (EM) Clustering using Gaussian
Mixture Models (GMM)

Agglomerative Hierarchical Clustering

Phonetic algorithm
• A phonetic algorithm is an
algorithm for indexing of words by their pronunciation.
Most phonetic algorithms were developed for use with
the English language; consequently, applying the rules to
words in other languages might not give a meaningful
result.
• They are necessarily complex algorithms with many rules
and exceptions, because English spelling and
pronunciation are complicated by historical changes in
pronunciation and by words borrowed from
many languages.

Best Known phonetic Algorithms:
i. Metaphone Algorithm (Metaphone, Double
Metaphone, and Metaphone 3)
ii. Soundex
iii. Daitch–Mokotoff Soundex
iv. Cologne phonetics
v. New York State Identification and Intelligence
System (NYSIIS)
vi. Match Rating Approach
vii. Caverphone

Metaphone is an algorithm that encodes the pronunciation of
a word. Rather than working strictly on a letter-by-letter
basis, it encodes groups of letters, and so embodies the
rules of English pronunciation more accurately than Soundex.
Such algorithms are well established for English. Both
Soundex and Metaphone return all the words that exactly
match the desired word as well as all similar-sounding names.
Metaphone has gone through several versions in its
development, such as Double Metaphone and Metaphone 3,
each improving the accuracy of the encoding.

Soundex is a phonetic algorithm for indexing names by
sound, as pronounced in English. The goal is
for homophones to be encoded to the same
representation so that they can be matched despite
minor differences in spelling.
Soundex and Metaphone are almost the same kind of
algorithm. Both are based on the way a word is
pronounced. In the Soundex algorithm, the first letter of a
word is kept and a numeric code is assigned to the
consonants that follow it; when a search is performed, words
with the same code are also brought out in the search result.

Soundex is the most widely known of all phonetic
algorithms (in part because it is a standard feature of
popular database software such as DB2, PostgreSQL, MySQL,
SQLite, Ingres, MS SQL Server and Oracle) and is often
used (incorrectly) as a synonym for "phonetic
algorithm".

Common uses
• Spell checkers can often contain phonetic algorithms.
The Metaphone algorithm, for example, can take an incorrectly
spelled word and create a code. The code is then looked up in a
directory for words with the same or a similar Metaphone code.
Words that have the same or a similar code become possible
alternative spellings.
• Search functionality will often use phonetic algorithms to find
results that don't match exactly the term(s) used in the search.
Searching for names can be difficult as there are often multiple
alternative spellings for names.
An example is the name Claire. It has two alternatives, Clare/Clair,
which are both pronounced the same. Searching for one spelling
wouldn't show results for the two others. Using Soundex all
three variations produce the same Soundex code, C460. By
searching names based on the Soundex code all three variations
will be returned.
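The Claire/Clare/Clair example can be checked with a short sketch of the standard (American) Soundex rules: keep the first letter, code the following consonants as digits, skip vowels, collapse adjacent letters with the same code, and pad the result to four characters. The function name `soundex` and the table name `CODES` are chosen for this illustration; database products ship their own implementations, which may differ in details.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the four-character Soundex code of a name."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    previous = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels and y get no code
        if code and code != previous:
            result += code
        previous = code
    # Pad with zeros and cut to one letter plus three digits.
    return (result + "000")[:4]

for name in ("Claire", "Clare", "Clair"):
    print(name, soundex(name))  # each prints C460
```

Because all three spellings share the code C460, an index keyed on the Soundex code returns all of them for a search on any one spelling.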

Evaluation of ISAR systems
Evaluation is a systematic determination of a subject's
merit, worth and significance, using criteria governed by
a set of standards.
It can assist an organization, program, project or any
other intervention or initiative to assess any aim,
realisable concept/proposal, or any alternative, to help
in decision making; or to ascertain the degree of
achievement or value in regard to the aim and objectives
and results of any such action that has been completed.

Evaluation is the structured interpretation and giving of
meaning to predicted or actual impacts of proposals or
results. It looks at original objectives, and at what is
either predicted or what was accomplished and how it
was accomplished.
So evaluation can be formative, that is, taking place during
the development of a concept or proposal, project or
organization, with the intention of improving the value or
effectiveness of the proposal, project, or organization. It
can also be summative, drawing lessons from a
completed action or project or an organization at a later
point in time or circumstance.

Evaluation is inherently a theoretically informed approach,
and consequently any particular definition of evaluation
would have to be tailored to its context: the theory,
approach, needs, purpose, and methodology of the
evaluation process itself.
Evaluation is a systematic, rigorous, and meticulous
application of scientific methods to assess the design,
implementation, improvement, or outcomes of a program.
It is a resource-intensive process, frequently requiring
resources such as evaluator expertise, labour, time, and a
sizeable budget.

Evaluation of information retrieval systems measures
which of two existing systems performs better
and tries to assess how the level of performance of
a given system can be improved, i.e. it measures two
parameters:
i. Effectiveness
ii. Efficiency

Effectiveness means the level up to which the given
system has attained its objectives.
Thus, in an information retrieval system, effectiveness may be
a measure of how far it can retrieve relevant information
accurately while withholding non-relevant information.
A search engine that is extremely fast is of no use unless it
produces good results.

Efficiency means how economically the system is
achieving its objectives.
In an information retrieval system, efficiency can be
measured by factors such as cost. The cost factors
have to be calculated indirectly. They include
response time, i.e. the time taken by the system to
provide an answer, and user effort, i.e. the amount of
time and effort needed by a user to interact with the
system and analyse the output retrieved in order to
get the correct information.

Lancaster states that evaluation of an information
retrieval system can be justified by the following
three issues:
1. How well the system is satisfying its objectives,
2. How efficiently it is satisfying its objectives, and
3. Whether the system justifies its existence.

PURPOSE OF EVALUATION
Swanson states seven purposes for evaluation:
1. To assess a set of goals, a programme plan, or a design prior to
implementation.
2. To determine whether and how well goals or performance
expectations are being fulfilled.
3. To determine specific reasons for success and failure.
4. To uncover principles underlying a successful programme.
5. To explore techniques for increasing programme effectiveness.
6. To establish a foundation for further research on the reasons
for the relative success of alternative techniques, and
7. To improve the means employed for attaining objectives or to
redefine sub-goals or goals in view of research findings.

Keen gives three major purposes of evaluation for an
information retrieval system:
1. The need for measures with which to make merit
comparisons within a single test situation. In other
words, evaluation studies are conducted to compare
the merits or demerits of two or more systems.
2. The need for measures with which to make comparisons
between results obtained in different test situations.
3. The need for assessing the merit of a real-life system.

EVALUATION CRITERIA FOR ISRS
Evaluation of information retrieval is conducted from
two different viewpoints.
1. Managerial view: when evaluation is conducted
from the managerial point of view, it is called
managerial-oriented evaluation.
2. User view: when evaluation is conducted from
the user's point of view, it is called user-oriented
evaluation.

Criteria for evaluation of ISRS (Managerial view)
Lancaster in 1971 proposed five evaluation criteria:
1. Coverage of the system
2. Ability of the system to retrieve wanted items
(i.e. recall)
3. Ability of the system to avoid retrieval of
unwanted items (i.e. precision)
4. The response time of the system, and
5. The amount of effort required by the user

Vickery advocates six criteria for evaluation of ISRS,
grouped into two sets as follows:
Set 1
1. Coverage- the proportion of the total potentially useful
literature that has been analyzed.
2. Recall- the proportion of such references that are
retrieved in a search, and
3. Response time- the average time needed to obtain a
response from the system.

Set 2
4. Precision- the ability of the system to screen out
irrelevant references
5. Usability- the value of the references retrieved, in terms
of such factors as their reliability, comprehensibility,
currency and
6. Presentation- the form in which search results are
presented to the user.

Cleverdon (1966) identified six criteria for the evaluation of
ISRS:
1. Recall- the ability of the system to present all the
relevant items.
2. Precision- the ability of the system to present only those
items that are relevant.
3. Time lag- the average interval between the time the
search request is made and the time an answer is
provided.
4. Effort- the intellectual as well as physical effort required
from the user in obtaining answers to the search request.
5. Form of presentation- the form of the search output, which
affects the user's ability to make use of the relevant items, and
6. Coverage of the collection- the extent to which the
system includes relevant matter.

Criteria for evaluation of ISRS (User-Centred Evaluation)
User-based evaluation is the most common
evaluation approach advocated by many
information scientists. The criteria for evaluation
of an information retrieval system include:
1. Recall
2. Precision
3. Fallout
4. Generality

The user centred approach examines the information
seeking task in the context of human behaviour in
order to understand more completely the nature of
user interaction with an information system.
User centred evaluation is based on the premise that
understanding user behaviour facilitates more effective
system design.
These studies examine the user from a behavioural
science perspective using methods common to
psychology, sociology, and anthropology.

While examining user-centred approaches, two
methods can be applied:
Qualitative method of evaluation
Quantitative method of evaluation

Qualitative method of evaluation
Qualitative methods of evaluation such as case studies,
focus groups or in-depth interviews can be combined
with objective measures to produce more effective
information retrieval research and evaluation.
Quantitative method of evaluation
In the quantitative method of evaluation, empirical methods
such as experimentation are frequently employed to
observe and probe subjective and affective factors
quantitatively.

According to Saracevic & Kantor (1988), the key to the
future of information systems and searching processes
lies not in increased sophistication of technology, but
in increased understanding of human involvement
with information.
Therefore, there has been an increased interest in
qualitative methods that capture the complexity and
diversity of human experience in information storage
and retrieval system and its process.

Recall
The term recall refers to a measure of whether a particular
item is retrieved, or of the extent to which the retrieval of
wanted items occurs.
Recall is defined as the proportion of relevant documents
retrieved out of the total number of relevant documents
stored in the collection.

Whenever a user puts his/her query, it is the responsibility
of the system to retrieve all those items that are relevant to
the given query. When the collection is large, it is not
possible to retrieve all the relevant items. Thus, a system
is able to retrieve only a proportion of the total relevant
documents in response to a given query.
The performance of a system is often measured by the recall
ratio, which denotes the percentage of relevant items
retrieved in a given situation.

The general formula for calculation of recall may be stated
as:
\[ \text{Recall} = \frac{\text{Number of relevant items retrieved}}{\text{Total number of relevant items in the collection}} \times 100 \]

For example, if there are 100 documents in a collection that
are relevant to a given query and 60 of these items
are retrieved in a given search, then the recall is
said to be 60%.
\[ \text{Recall} = \frac{60}{100} \times 100 = 60\% \]
In other words, the system has been able to retrieve 60%
of the relevant items.

Precision
By precision we mean how precisely a particular system
functions. Precision is defined as the proportion of
retrieved documents that are relevant out of the total
number of retrieved documents.
The non-relevant items that are retrieved have to be
discarded by the user.
The general formula for calculation of precision may be
stated as:
\[ \text{Precision} = \frac{\text{Number of relevant items retrieved}}{\text{Total number of items retrieved}} \times 100 \]

For example, if in a given search the system retrieves
80 items, out of which 60 are relevant and 20 are
non-relevant, the precision is 75%.
\[ \text{Precision} = \frac{60}{80} \times 100 = 75\% \]
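The two worked examples above can be reproduced together in a small sketch. It assumes that documents are identified by integer IDs and that the sets of relevant and retrieved documents are both known, which in practice is only true for test collections; the function names and the ID ranges are illustrative.

```python
def recall(retrieved, relevant):
    """Percentage of the relevant items that were retrieved."""
    return 100 * len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Percentage of the retrieved items that are relevant."""
    return 100 * len(retrieved & relevant) / len(retrieved)

# Hypothetical document IDs chosen to match the figures in the examples:
# 100 relevant documents in the collection (IDs 1..100) and
# 80 retrieved documents (IDs 41..120), of which 60 are relevant.
relevant = set(range(1, 101))
retrieved = set(range(41, 121))

print(recall(retrieved, relevant))     # 60.0
print(precision(retrieved, relevant))  # 75.0
```

The same search therefore scores 60% on recall and 75% on precision: the two ratios share a numerator (relevant items retrieved) but divide it by different totals.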

Recall-precision matrix
Recall relates to the ability of the system to retrieve
relevant documents, and precision relates to its ability
not to retrieve non-relevant documents.
The ideal system attempts to achieve 100% recall and
100% precision, but this is not possible in practice because,
as the level of recall increases, precision tends to decrease.
According to Lancaster, recall and precision tend to vary
inversely.

The following example shows the relationship between recall
and precision in a given search:
In a given situation a system:
i. retrieved a+b documents, out of which,
ii. a documents are relevant, and
iii. b documents are non-relevant (but retrieved).
iv. c+d documents are left in the collection after
the search has been conducted.
v. Out of the c+d documents, c documents are relevant
to the query but could not be retrieved, and
vi. d documents are not relevant (and not retrieved)
and thus have been correctly rejected.

Recall-precision matrix
Lancaster suggests that these statistics can be represented
in a 2 x 2 matrix, as shown below:
                Relevant      Not-Relevant    Total
Retrieved       a (Hits)      b (Noise)       a + b
Not-Retrieved   c (Misses)    d (Rejected)    c + d
Total           a + c         b + d           a + b + c + d

The system retrieves a relevant documents along with b
non-relevant documents.
Thus, following Lancaster, it can be stated that a denotes
the hits and b denotes the noise. Now, out of the remaining
c+d documents, the system misses c documents that
should have been retrieved, but it correctly rejects d
documents that are not relevant to the given query. The
recall and precision ratios in this case can be calculated as:
R = [a / (a + c)] x 100
P = [a / (a + b)] x 100

The value of recall can be increased by increasing the
value of a, that is, by retrieving a greater number of
relevant items. This can be achieved by increasing the
number of retrieved documents, but as the number
of items retrieved increases, so does the
likelihood of retrieving non-relevant items, that is b,
which decreases the value of precision. Lancaster
therefore states that recall and precision tend to vary
inversely. In a retrieval environment, when we want to
retrieve more relevant items, we generally broaden our
search.

The relationship between recall and precision can be
examined by considering searches conducted at different
levels with the same set of documents and requests.
Beginning with very general search terms, high recall and
low precision can be achieved, and as the search terms
become more and more specific, recall tends to go
down and precision tends to go up.
In real-life situations, users normally do not want very
high recall. In general, most users want a few documents
in response to a query, meaning a moderate level of
recall.
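The inverse tendency can be illustrated by treating a broader search as accepting more items from the top of a ranked result list. The ranking, the set of relevant documents and the function name below are invented for illustration; real systems will show the trend less neatly.

```python
def recall_precision_at(ranking, relevant, k):
    """Recall and precision (in %) when only the top k results are accepted."""
    top = ranking[:k]
    hits = sum(1 for doc in top if doc in relevant)
    return 100 * hits / len(relevant), 100 * hits / len(top)

# Hypothetical ranked output of one search and hypothetical relevance judgments.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d2", "d4", "d7", "d10"}

for k in (2, 5, 10):
    r, p = recall_precision_at(ranking, relevant, k)
    print(f"top {k:2d}: recall {r:5.1f}%  precision {p:5.1f}%")
```

With this data, recall rises from 40% to 60% to 100% as the cut-off widens from 2 to 5 to 10 items, while precision falls from 100% to 60% to 50%.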

Limitations of recall and precision
i. Difference in the level of precision and accuracy:
Different users may want different levels of recall. A
person preparing a state-of-the-art report on a
topic would like to have all the items available on the
topic and will therefore go for high recall, whereas a
user who just wants to know about a given topic will
prefer to have a few items and thus will not require
high recall.

ii. Difference in judgment on degree of relevance:
Another drawback of recall is that it assumes that
all relevant items have the same value, which is
not true. The retrieved items may have different
degrees of relevance, and this may vary from user to
user, and even from time to time for the same user.
Both recall and precision depend largely on the
relevance judgment of the user.

iii. Measures for system performance not for relevance
judgment
Despite their apparent simplicity, these are slippery
concepts, depending for their definition on relevance
judgments which are subjective at best. Because these
criteria are document-based, they measure only the
performance of the system in retrieving items judged relevant
to the information need. They do not consider how the
information will be used, or whether, in the judgment of the
user, the documents fulfill the information need.
These limitations of precision and recall have been
acknowledged and the need for additional measures and
different criteria for effectiveness has been identified.

Fallout
Fallout ratio is the proportion of non-relevant
items that have been retrieved out of all
non-relevant documents available in a given search.
\[ \text{Fallout} = \frac{\text{No. of retrieved non-relevant documents}}{\text{Total no. of non-relevant documents}} \times 100 \]

Generality
Generality ratio is the proportion of relevant items
(retrieved and non-retrieved) in a given search.
\[ \text{Generality} = \frac{\text{No. of relevant documents}}{\text{Total no. of documents}} \times 100 \]

Retrieval Measure
SYMBOL  EVALUATION MEASURE  FORMULA                    EXPLANATION
R       RECALL              a / (a + c)                Proportion of relevant items retrieved
P       PRECISION           a / (a + b)                Proportion of retrieved items that are relevant
F       FALLOUT             b / (b + d)                Proportion of non-relevant items retrieved
G       GENERALITY          (a + c) / (a + b + c + d)  Proportion of relevant items per query
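All four measures follow directly from the four cells a, b, c and d of the recall-precision matrix, which the sketch below makes explicit. The class name `Contingency` and the cell values are assumptions made for illustration (a collection of 1,000 documents in which 100 are relevant and 80 are retrieved); the results are expressed as percentages, as in the worked examples earlier.

```python
from dataclasses import dataclass

@dataclass
class Contingency:
    a: int  # relevant and retrieved (hits)
    b: int  # non-relevant but retrieved (noise)
    c: int  # relevant but not retrieved (misses)
    d: int  # non-relevant and not retrieved (correctly rejected)

    def recall(self):
        return 100 * self.a / (self.a + self.c)

    def precision(self):
        return 100 * self.a / (self.a + self.b)

    def fallout(self):
        return 100 * self.b / (self.b + self.d)

    def generality(self):
        return 100 * (self.a + self.c) / (self.a + self.b + self.c + self.d)

# Hypothetical search: 60 hits, 20 noise, 40 misses, 880 correctly rejected.
search = Contingency(a=60, b=20, c=40, d=880)
print(search.recall(), search.precision())
print(round(search.fallout(), 2), search.generality())
```

Note that generality depends only on the collection and the query (a + c out of the total), not on how well the system performed, whereas the other three measures change with what is retrieved.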

Assessment of Evaluation criteria
Different stakeholders, such as information professionals,
systems designers and users, may have different needs and
expectations of an IR system, and the objectives, decisions,
processes, design or actions of an IR system are set accordingly.
Evaluation is a process whose main purpose is to assess
whether the IR system is doing what it is expected to
do. These assessments are made by measuring features
such as the recall, precision, fallout and generality ratios. The
analysis of the results for these features determines the
performance level of the IR system with respect to the
following:
Effectiveness
Usability
Satisfaction
Cost

Effectiveness
Effectiveness is the system’s ability or success to retrieve
relevant information which meet the needs of the user.
The two most commonly used measures of system
performance are the recall ratio and the precision ratio
Relevant Not relevant
Retrieved A B
Not retrieved C D
Totals A + C B + D

A search, as shown in the table above, may have four possible outcomes:
1. Relevant documents successfully retrieved - A (hits)
2. Non-relevant documents retrieved - B (noise)
3. Relevant documents not retrieved - C (misses)
4. Non-relevant documents not retrieved, i.e. successfully rejected - D
\[ \text{Recall} = \frac{\text{total relevant retrieved}}{\text{total relevant in system}} \times 100 = \frac{A}{A + C} \times 100 \]
Recall indicates the system’s ability to retrieve relevant
information/documents.
\[ \text{Precision} = \frac{\text{total relevant retrieved}}{\text{total retrieved}} \times 100 = \frac{A}{A + B} \times 100 \]
Precision indicates the system’s ability to suppress irrelevant
information/documents.
\[ \text{Fallout} = \frac{\text{total irrelevant retrieved}}{\text{total irrelevant}} \times 100 = \frac{B}{B + D} \times 100 \]
Fallout also reflects the system’s ability to suppress irrelevant
information/documents: the lower the fallout, the better.
Thus, assessment of all the above factors, i.e. Recall, Precision &
Fallout, actually measures the effectiveness of an IR system. Indexing
systems and search software should be designed to
maximize both recall and precision, that is, in other words, to
minimize noise and misses.
It may be difficult to measure the total number of relevant
documents in an IRS, because it involves examining every
document in the system for its potential relevance to a specific
search query. For web search engines such as Google this is
clearly impossible.

Usability
Usability is part of the broader term “user experience”
and refers to the ease of access and/or use of a product
or website.
A design is not usable or unusable per-se; its features,
together with the context of the user (what the user
wants to do with it and the user’s environment),
determine its level of usability.
The official ISO 9241-11 definition of usability is: “the
extent to which a product can be used by specified users
to achieve specified goals with effectiveness, efficiency
and satisfaction in a specified context of use.”

A usable interface has three main outcomes:
• It should be easy for the user to become familiar with and
competent in using the user interface during the first contact
with the website.
• It should be easy for users to achieve their objective through
using the website. If a user has the goal of booking a flight, a
good design will guide him/her through the easiest process
to purchase that ticket.
• It should be easy to recall the user interface and how to use
it on subsequent visits. So, a good design on the travel
agent’s site means the user should learn from the first time
and book a second ticket just as easily.
Usability is what determines whether a design’s existing
attributes make it stand or fall

Satisfaction
There is no agreed definition of user satisfaction within the information
science and information system communities. User satisfaction is a
subjective variable, which can be influenced by several factors such
as system effectiveness, user effectiveness, user effort, and user
characteristics and expectations. Therefore, information retrieval
evaluators should consider all these factors in obtaining user
satisfaction and in using it as a criterion of system effectiveness.
Applegate outlines three different models of searcher satisfaction
namely:
The material satisfaction model
The emotional satisfaction- simple path model
The emotional satisfaction- multiple path model

Search result would be an appropriate measure of the
material satisfaction model. Both emotional satisfaction
models are based upon subjective impressions and
assessments which may be affected by factors such as:
Search task
Search setting
The searcher’s ability, quality & judgment in
digital environments
Service quality
website quality
Literatures used

Cost
Users may experience costs in terms of any payment that
they need to make for system or document access, but
the most significant cost is associated with the time that
they expend in searching a system.
The search algorithm, the options for the display of hits, the
seamlessness of the stages in individual systems and
interoperability between systems are important factors
in satisfying users regardless of material cost.

Information Retrieval Models
An Information Retrieval Model is a framework, process
or method for matching an information need with, and
retrieving information from, databases, knowledge bases
and information systems.
The goal of information retrieval (IR) is to provide users
with those documents that will satisfy their information
need. We use the word "document" as a general term
that could also include non-textual information, such as
multimedia objects.

According to Marcus (1994) and Marchionini (1992), information
seeking is a form of problem solving. It proceeds
according to the interaction among eight sub-processes:
i. problem recognition and acceptance,
ii. problem definition,
iii. search system selection,
iv. query formulation,
v. query execution,
vi. examination of results (including relevance feedback),
vii. information extraction, and
viii. reflection/iteration/termination.
To be able to perform effective searches, users also have to
develop the following expertise:
i. knowledge about various sources of information,
ii. skills in defining search problems and applying search strategies,
iii. competence in using electronic search tools.

[Figure: A general overview of the information retrieval process,
adapted from Lancaster and Warner (1993).]

The Figure above represents a general model of the
information retrieval process, where both the user's
information need and the document collection have
to be translated into the form of surrogates to enable
the matching process to be performed. This figure
has been adapted from Lancaster and Warner
(1993).

How a general IR Model works
1. Users have to formulate their information need in a
form that can be understood by the retrieval
mechanism
2. Likewise, the contents of large document collections
need to be described in a form that allows the retrieval
mechanism to identify the potentially relevant
documents quickly.
(In both cases, information may be lost in the
transformation process leading to a computer-usable
representation. Hence, the matching process is
inherently imperfect)

3. Once the specified query has been executed by the IR system, a
user is presented with the retrieved document surrogates.
4. Either the user is satisfied by the retrieved information or he
will evaluate the retrieved documents and modify the query
to initiate a further search. The process of query
modification based on user evaluation of the retrieved
documents is known as relevance feedback.
(Information retrieval is an inherently interactive process, and the
users can change direction by modifying the query surrogate, the
conceptual query or their understanding of their information
need)
5. Studies investigating the information-seeking process
describe information retrieval in terms of the cognitive and
affective symptoms commonly experienced by a library user.

How a general IR Model works
1. Users have to formulate their information need in a form that can be
understood by the retrieval mechanism.
The information need can be understood as forming a pyramid, where only
its peak is made visible by users in the form of a conceptual query. The
conceptual query captures the key concepts and the relationships
among them. It is the result of a conceptual analysis that operates on
the information need, which may be well or vaguely defined in the
user's mind. This analysis can be challenging, because users are faced
with the general "vocabulary problem" as they are trying to translate
their information need into a conceptual query. This problem refers to
the fact that a single word can have more than one meaning, and,
conversely, the same concept can be described by surprisingly many
different words. Further, the concepts used to represent the documents
can be different from the concepts used by the user. The conceptual
query can take the form of a natural language statement, a list of
concepts that can have degrees of importance assigned to them, or it
can be statement that coordinates the concepts using Boolean
operators. Finally, the conceptual query has to be translated into a query
surrogate that can be understood by the retrieval system.

2. Likewise, the contents of large document collections
need to be described in a form that allows the retrieval
mechanism to identify the potentially relevant
documents quickly.
As in point No. 1, the meanings of documents
need to be represented in the form of text surrogates
that can be processed by computer. A typical surrogate
can consist of a set of index terms or descriptors. The
text surrogate can consist of multiple fields, such as the
title, abstract, descriptor fields to capture the meaning
of a document at different levels of resolution or
focusing on different characteristic aspects of a
document.

3. Once the specified query has been executed by the IR
system, a user is presented with the retrieved document
surrogates.
(i.e. a typical document surrogate can consist of a set of
index terms or descriptors. The text surrogate can
consist of multiple fields, such as the title, abstract,
descriptor fields to capture the meaning of a document)

4. Either the user is satisfied by the retrieved
information or he will evaluate the retrieved
documents and modify the query to initiate a further
search. The process of query modification based on
user evaluation of the retrieved documents is known
as relevance feedback.
Information retrieval is an inherently interactive
process, and the users can change direction by
modifying the query surrogate, the conceptual query
or their understanding of their information need

5. Studies investigating the information-seeking process describe
information retrieval in terms of the cognitive and affective
symptoms commonly experienced by a library user.
Symptoms such as uncertainty, confusion, and frustration are
nearly universal experiences in the early stages of the search
process, and they decrease as the search process progresses and
feelings of being confident, satisfied, sure and relieved increase.
The studies also indicate that cognitive attributes may affect the
search process. User's expectations of the information system and
the search process may influence the way they approach
searching and therefore affect the intellectual access to
information.
The findings by Kuhlthau et al. (1990) indicate that thoughts about
the information need become clearer and more focused as users
move through the search process.

Search or Browsing?
The conceptual query can take the form of a natural
language statement, a list of concepts that can have
degrees of importance assigned to them, or it can be a
statement that coordinates the concepts using Boolean
operators. Finally, the conceptual query has to be
translated into a query surrogate that can be understood
by the retrieval system.
Analytical search strategies require the formulation of
specific, well-structured queries and a systematic,
iterative search for information.
Browsing involves the generation of broad query terms and
a scanning of much larger sets of information in a
relatively unstructured fashion.

Campagnoni et al. (1989) have found in information
retrieval studies in hypertext systems that the
predominant search strategy is "browsing" rather than
"analytical search".
Many users, especially novices, are unwilling or unable to
precisely formulate their search objectives.
Browsing places less cognitive load on them. Furthermore,
research has shown that search strategy is only one
dimension of effective information retrieval.

Irrespective of the retrieval environment, the following four
main system components must be taken into account in
formulating the retrieval problem:
a) The objects, documents, or records themselves (which in
the aggregate constitute the information files to be
processed);
b) The information identifiers, terms, index terms, keywords,
attributes, etc. (which characterise the records or
documents and represent the information content in each
case);
c) The information requests (which enter into the system and
are to be compared with the stored records for retrieval);
and
d) The relevance information (often supplied by the users of
the system connecting the information requests to the
stored information items).

MODELS BASED ON INPUT/OUTPUT
On the basis of input and the output, Information
Retrieval Models can be grouped into three basic
categories:
i) Data Retrieval Model
ii) Information Retrieval Model
iii) Knowledge Retrieval Model.

i) Data Retrieval Model
The data retrieval model essentially handles data, which may be
regarded as unprocessed information or a preliminary phase
of information.
Data are unbiased facts which can be used to form
information. Here, the expression of the information need
should be very precise. Examples are population data,
day-to-day temperature, daily rainfall, transaction
status at an ATM, etc.
The data retrieval model is a simple model of information
retrieval needing specific matching techniques.

ii) Information Retrieval Model
The Information Retrieval Model combines several data
into a relational structure of information. It is therefore
a more complex model than the
Data Retrieval Model, because it has to comprehend
multi-dimensional relationships amongst data.
It is not easily amenable to a taxonomic structure. The
representation of information is to be based on a
relational database structure using some associative
mathematics.
The expression of the information need is also complex and
time consuming. It calls for a long conversational or
browsing process, and the information retrieval model
must incorporate such facilities and interfaces.

iii) Knowledge Retrieval Model
Knowledge is a kind of integration of general types of
information. It normally occurs in the human mind. The
human mind infers and integrates several coordinates
with the information received by it.
So, knowledge is assimilated information. In order to
facilitate decision-making and problem solving,
intelligent knowledge-based information retrieval
models are coming up. Such systems comprise three
basic aspects:
a) knowledge base, b) inference engine, c) user interface

a) The knowledge base is a store of an accumulated
set of rules for converting information into knowledge. It
also incorporates a knowledge acquisition system.
b) The second aspect of the system is the inference engine. An
inference engine is capable of deriving appropriate
information from the combination of rules in order to derive
synthesized knowledge. This process of deriving is based
on inferential logic using quantitative and non-quantitative
techniques.
c) The user interface is the conversational component of the model,
which is capable of receiving information in
conversation mode and converting it into database signals
for interaction purposes. Thus, a knowledge retrieval
model is a sophisticated model of information processing,
organization and retrieval.

MAJOR IR MODELS
(BASED ON THEORIES AND TOOLS)
1. Boolean Retrieval
1.1 Standard Boolean
1.2 Narrowing and Broadening Techniques
1.3 Smart Boolean Models
1.4 Extended Boolean Models
2. Statistical Model
2.1 Vector Space Model
2.2 Probabilistic Model
2.3 Latent Semantic Indexing
3. Linguistic and Knowledge-based Approaches
3.1 DR-LINK Retrieval System

1.1 Standard Boolean
Boolean logic allows a user to logically relate multiple
concepts together to define what information is needed.
The typical Boolean operators are AND, OR, and NOT.
These operations are implemented using set
intersection, set union and set difference procedures.
A few systems introduced the concept of ‘Exclusive OR’ but
it is not generally useful to users since most users do not
understand it.
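
The mapping of Boolean operators onto set operations can be illustrated with a minimal sketch. The tiny index below and the helper name postings are invented for illustration; a real system would build the index from its own document collection.

```python
# Toy inverted index: term -> set of document ids (illustrative data only).
index = {
    "information": {1, 2, 3, 5},
    "retrieval": {2, 3, 4},
    "database": {3, 5, 6},
}

def postings(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return index.get(term, set())

# information AND retrieval -> set intersection
and_result = postings("information") & postings("retrieval")
# information OR retrieval -> set union
or_result = postings("information") | postings("retrieval")
# information NOT database (i.e. AND NOT) -> set difference
not_result = postings("information") - postings("database")
# information XOR retrieval -> symmetric difference (Exclusive OR)
xor_result = postings("information") ^ postings("retrieval")

print(and_result, or_result, not_result, xor_result)
```

The Exclusive OR result contains documents that hold exactly one of the two terms, which may help explain why few users ask for it.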

1.1 Standard Boolean
It has the following strengths:
1. It is easy to implement and it is computationally efficient [Frakes and
Baeza-Yates 1992]. Hence, it is the standard model for the current
large-scale, operational retrieval systems and many of the major on-line
information services use it.
2. It enables users to express structural and conceptual constraints to
describe important linguistic features [Marcus 1991]. Users find that
synonym specifications (reflected by OR-clauses) and phrases
(represented by proximity relations) are useful in the formulation of
queries [Cooper 1988, Marcus 1991].
3. The Boolean approach possesses a great expressive power and clarity.
Boolean retrieval is very effective if a query requires an exhaustive and
unambiguous selection.
4. The Boolean method offers a multitude of techniques to broaden or
narrow a query.
5. The Boolean approach can be especially effective in the later stages of the
search process, because of the clarity and exactness with which
relationships between concepts can be represented.

The standard Boolean approach has the following shortcomings:
1. Users find it difficult to construct effective Boolean queries for several reasons
[Cooper 1988, Fox and Koll 1988, Belkin and Croft 1992]. The operators AND, OR and
NOT are also natural language words, but they have a different meaning when
used in a query. Thus, users make errors when they form a Boolean query,
because they fall back on their knowledge of English.
2. Only documents that satisfy a query exactly are retrieved. The AND operator is
too severe because it does not distinguish between the case when none of
the concepts are satisfied and the case where all except one are satisfied.
Hence, no or very few documents are retrieved when more than three or
four criteria are combined with the Boolean operator AND (referred to as the
Null Output problem). On the other hand, the OR operator does not reflect
how many concepts have been satisfied. Hence, often too many documents
are retrieved (the Output Overload problem).
3. It is difficult to control the number of retrieved documents. Users are often
faced with the null output or the output overload problem and they are
at a loss as to how to modify the query to retrieve a reasonable number of
documents.
4. The traditional Boolean approach does not provide a relevance ranking of the
retrieved documents, although modern Boolean approaches can make use of
the degree of coordination, field level and degree of stemming present to
rank them [Marcus 1991].
5. It does not represent the degree of uncertainty or error due to the vocabulary
problem [Belkin and Croft 1992].

1.2 Narrowing and Broadening Techniques
A Boolean query can be described in terms of the following
four operations:
i. degree and type of coordination,
ii. proximity constraints,
iii. field specifications and
iv. degree of stemming as expressed in terms of
word/string specifications.
If users want to (re)formulate a Boolean query then they
need to make informed choices along these four
dimensions to create a query that is sufficiently broad or
narrow depending on their information needs.

Most narrowing techniques lower recall as well as raise
precision, and most broadening techniques lower
precision as well as raise recall.
Any query can be reformulated to achieve the desired
precision or recall characteristics, but generally it is
difficult to achieve both.
Each of the four kinds of operations in the query
formulation has particular operators, some of which
tend to have a narrowing or broadening effect. For each
operator with a narrowing effect, there is one or more
inverse operators with a broadening effect [Marcus
1991].
Hence, users require help to gain an understanding of how
changes along these four dimensions will affect the
broadness or narrowness of a query.
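
The trade-off between precision and recall can be made concrete with a small sketch. It assumes the usual definitions (precision is the share of retrieved documents that are relevant, recall is the share of relevant documents that are retrieved); the document ids and the two result sets are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {1, 2, 3, 4}            # documents the user actually needs
broad = {1, 2, 3, 5, 6, 7, 8, 9}   # hypothetical result of a broadened query
narrow = {1, 2}                    # hypothetical result of a narrowed query

print("broad :", precision_recall(broad, relevant))
print("narrow:", precision_recall(narrow, relevant))
```

In this toy case the broad query finds more of the relevant documents but buries them among irrelevant ones, while the narrow query returns only relevant documents but misses half of them.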

The four dimensions affect the broadness or
narrowness of a query as follows:
1) Coordination: the different Boolean operators AND, OR
and NOT have the following effects when used to add a
further concept to a query: a) the AND operator narrows
a query; b) the OR broadens it; c) the effect of the NOT
depends on whether it is combined with an AND or OR
operator. Typically, in searching textual databases, the
NOT is connected to the AND, in which case it has a
narrowing effect like the AND operator.
2) Proximity: The closer together two terms have to appear
in a document, the more narrow and precise the query.
The most stringent proximity constraint requires the two
terms to be adjacent.

3) Field level: current document records have fields
associated with them, such as the "Title", "Index",
"Abstract" or "Full-text" field: a) the more fields that are
searched, the broader the query; b) the individual fields
have varying degrees of precision associated with them,
where the "title" field is the most specific and the "full-
text" field is the most general.
4) Stemming: The shorter the prefix that is used in
truncation-based searching, the broader the query. By
reducing a term to its morphological stem and using it
as a prefix, users can retrieve many terms that are
conceptually related to the original term [Marcus 1991].
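
The broadening effect of truncation can be sketched as simple prefix matching. The vocabulary list and the function name expand are assumptions made for this example; real systems match prefixes against their own index vocabulary.

```python
# Illustrative vocabulary of index terms.
vocabulary = ["compute", "computer", "computers", "computing", "company", "commerce"]

def expand(prefix, vocab):
    """Return every vocabulary term that starts with the given prefix."""
    return [word for word in vocab if word.startswith(prefix)]

# The shorter the prefix, the more terms (and hence documents) the query covers.
for prefix in ("comput", "comp", "com"):
    print(prefix + "*", "->", expand(prefix, vocabulary))
```

Shortening the prefix from comput* to com* pulls in terms such as "company" and "commerce" that are no longer conceptually related, which is how truncation lowers precision while raising recall.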

1.3 Smart Boolean
There have been attempts to help users overcome some of
the disadvantages of the traditional Boolean approach discussed
above. We will now describe one such method,
called Smart Boolean, developed by Marcus [1991, 1994]
that tries to help users construct and modify a Boolean
query as well as make better choices along the four
dimensions that characterize a Boolean query.
We are not attempting to provide an in-depth description
of the Smart Boolean method, but to use it as a good
example that illustrates some of the possible ways to
make Boolean retrieval more user-friendly and effective.
Table 2.2 provides a summary of the key features of the
Smart Boolean approach.

Users start by specifying a natural language statement that is
automatically translated into a Boolean Topic representation.
If the statement consists of a list of factors or concepts,
these are automatically coordinated
using the AND operator. If the user can or
wants to include synonyms at this initial stage, these are coordinated using
the OR operator.
Hence, the Boolean Topic representation
connects the different factors using the AND operator, where
each factor consists of a single term or of several synonyms
connected by the OR operator.
One of the goals of the Smart Boolean approach is to make use
of the structural knowledge contained in the text surrogates,
where the different fields represent contexts of useful
information. Further, the Smart Boolean approach exploits
the fact that related concepts can share a common
stem. For example, the concepts "computers" and
"computing" have the common stem comput*.

The initial strategy of the Smart Boolean approach is to start out
with the broadest possible query within the constraints of how
the factors and their synonyms have been coordinated. Hence,
it modifies the Boolean Topic representation into the query
surrogate by using only the stems of the concepts and searches
for them over all the fields. Once the query surrogate has been
executed, users are guided in the process of evaluating the
retrieved document surrogates and giving feedback.
They choose from a list of reasons to indicate why they consider
certain documents relevant. Similarly, they can indicate why
other documents are not relevant by interacting with a list of
possible reasons. This user feedback is used by the Smart
Boolean system to automatically modify the Boolean Topic
representation or the query surrogate, whichever is more
appropriate. The Smart Boolean approach offers a rich set of
strategies for modifying a query based on the received
relevance feedback or the expressed need to narrow or
broaden the query.
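
The broadest initial query described above (factors joined by AND, synonyms joined by OR, stems searched over all fields) can be sketched as follows. This is only an illustration of the idea, not Marcus's implementation; the data layout, the function name matches and the sample records are all assumptions.

```python
def matches(topic, record):
    """topic: list of factors, each factor a list of synonym stems.
    record: dict mapping field names to text.
    A record matches when every factor (AND) has at least one stem (OR)
    that is a prefix of some word in any field."""
    words = [w for text in record.values() for w in text.lower().split()]
    return all(
        any(w.startswith(stem) for stem in factor for w in words)
        for factor in topic
    )

# Boolean Topic: comput* AND (retriev* OR search*)
topic = [["comput"], ["retriev", "search"]]

records = {
    "r1": {"title": "Computing methods", "abstract": "Searching large text files"},
    "r2": {"title": "Computers in libraries", "abstract": "Cataloguing practice"},
    "r3": {"title": "Document retrieval", "abstract": "Manual card indexes"},
}

hits = [rid for rid, rec in records.items() if matches(topic, rec)]
print(hits)
```

Narrowing would then proceed by restricting the fields searched or by replacing stems with longer word forms, in line with the four dimensions discussed earlier.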

Visualizing Boolean Queries through InfoCrystal:
How can Boolean queries be visualized without
limiting their expressive power?
InfoCrystal can be used to make Boolean retrieval more
transparent and easy to use. InfoCrystal makes it much
easier for users to formulate and modify Boolean queries
and to achieve the desired retrieval results.
An InfoCrystal is a visual representation of a specified
Boolean query. Each interior icon of the InfoCrystal
represents a distinct Boolean relationship among the
input criteria; hence, users can specify Boolean queries
by interacting with a direct manipulation interface.

The InfoCrystal acts as a Boolean calculator. Users do not
have to use logical operators and parentheses explicitly
to formulate queries. Hence, users do not have to
concern themselves with the coordination problem.
Instead they need to recognize the relationships of
interest and select them. If an interior icon is selected,
it changes its visual appearance. In the figures of
this (manipulation) interface, the center areas of selected
interior icons are displayed in black and those of unselected
ones in white.

1.4 Extended (or Weighted) Boolean Models
The P-norm and Fuzzy Logic approaches, which extend the
Boolean model, are generally used to address the
following issues:
1) The Boolean operators are too strict and ways need to
be found to soften them.
2) The standard Boolean approach has no provision for
ranking. The Smart Boolean approach and the methods
described in this section provide users with relevance
ranking [Fox and Koll 1988, Marcus 1991].
3) The Boolean model does not support the assignment of
weights to the query or document terms.
The approaches that address these issues are briefly
discussed below.

The P-norm method developed by Fox (1983) allows query
and document terms to have weights, which have been
computed by using term frequency statistics with the
proper normalization procedures. These normalized
weights can be used to rank the documents in the order
of decreasing distance from the point (0, 0, ... , 0) for an
OR query, and in order of increasing distance from the
point (1, 1, ... , 1) for an AND query. Further, the Boolean
operators have a coefficient P associated with them to
indicate the degree of strictness of the operator (from 1
for least strict to infinity for most strict, i.e., the Boolean
case). The P-norm uses a distance-based measure and
the coefficient P determines the degree of
exponentiation to be used. The exponentiation is an
expensive computation, especially for P-values greater
than one.
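
The distance-based ranking can be sketched in a few lines. The formulas below are the commonly cited P-norm form for the simplified case in which all query terms carry equal weight; that simplification, the function names and the sample document weights are assumptions of this example, not details given in the slides.

```python
def pnorm_or(weights, p):
    """OR query: normalised distance of the document from the point (0, ..., 0)."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)

def pnorm_and(weights, p):
    """AND query: 1 minus the normalised distance from the point (1, ..., 1)."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)

doc = [0.8, 0.2]   # normalised weights of two query terms in one document

for p in (1, 2, 100):
    print(p, round(pnorm_or(doc, p), 4), round(pnorm_and(doc, p), 4))
```

With P = 1 both operators reduce to the plain average of the weights, so AND and OR are indistinguishable. As P grows, the OR score approaches the largest weight and the AND score the smallest, which is the strict Boolean behaviour.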

In Fuzzy Set theory, an element has a varying degree of
membership to a set instead of the traditional binary
membership choice. The weight of an index term for a
given document reflects the degree to which this term
describes the content of a document. Hence, this weight
reflects the degree of membership of the document in
the fuzzy set associated with the term in question. The
degree of membership for union and intersection of two
fuzzy sets is equal to the maximum and minimum,
respectively, of the degrees of membership of the
elements of the two sets. In the "Mixed Min and Max"
model developed by Fox and Sharat (1986) the Boolean
operators are softened by considering the query-document
similarity to be a linear combination of the
min and max weights of the documents.
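
A minimal sketch of the fuzzy-set operators and of a Mixed Min and Max style softening is given below. The coefficient value 0.7 and the function names are assumptions chosen for illustration; in practice the coefficients are tuned for the collection.

```python
def fuzzy_and(weights):
    """Pure fuzzy intersection: minimum degree of membership."""
    return min(weights)

def fuzzy_or(weights):
    """Pure fuzzy union: maximum degree of membership."""
    return max(weights)

def mmm_or(weights, c_or=0.7):
    """Softened OR: mostly the maximum, with some contribution from the minimum."""
    return c_or * max(weights) + (1.0 - c_or) * min(weights)

def mmm_and(weights, c_and=0.7):
    """Softened AND: mostly the minimum, with some contribution from the maximum."""
    return c_and * min(weights) + (1.0 - c_and) * max(weights)

doc = [0.9, 0.3]   # index-term weights of the two query terms in one document

print(fuzzy_and(doc), fuzzy_or(doc))
print(round(mmm_and(doc), 2), round(mmm_or(doc), 2))
```

Setting both coefficients to 1.0 recovers the pure fuzzy min and max operators, so the softened model includes the strict fuzzy interpretation as a special case.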

Weighting is the process of assigning an importance to an
index term’s use in an item. The weight should represent
the degree to which the concept associated with the
index term is represented in the item. The weight should
help in discriminating the extent to which the concept is
described in items of the database.
The manual process of assigning weights adds additional
overhead on the indexer and requires a more complex
data structure to store the weights.
In a weighted indexing system, an attempt is made to place
a value on the index term’s representation of its
associated concept in the document. An index term’s
weight is based upon a function associated with the
frequency of occurrence of the term in the item.

Typically, values for the index terms are normalised between
zero and one. The higher the weight, the more the term
represents a concept discussed in the item. The weight
can be adjusted to account for other information such as
the number of items in the database that contain the
same concept.
The query process uses the weights along with any weights
assigned to terms in the query to determine a scalar value
(rank value) used in predicting the likelihood that an item
satisfies the query. The results are presented to the user
in order of the rank value from highest number to lowest
number.

The table above summarizes the defining characteristics of the
Extended Boolean approach and lists its key advantages
and disadvantages.

If weights between 0.0 and 1.0 are assigned to the query terms,
they may be interpreted as the significance
that users are placing on each term. In the notation below, the
digit after a term is its weight: A1 is term A with weight 1.0
and B0 is term B with weight 0.0. The value 1.0 is
assumed to be the strict interpretation of a Boolean
query. The value 0.0 is interpreted to mean that the user
places little value on the term. Under these
assumptions, a term assigned a value of 0.0 should have
no effect on the retrieved set. Thus:
“A1 OR B0” should return the set of items that
contain A as a term.
“A1 AND B0” will also return the set of items that
contain term A.
“A1 NOT B0” will also return set A.

Under the strict interpretation, “A1 OR B1” would include all
items that are in any of the areas in the Venn diagram. “A1 OR
B0” would be only those items in A (i.e., the green and
blue shaded areas), which is everything except the items in “B
NOT A” (the blue area).
Thus, as the value of query term B goes from 0.0 to 1.0,
items from “B NOT A” are proportionally added until at 1.0
all of the items have been added.
Similarly, under the strict interpretation, “A1 AND B1” would
include all of the items that are in the green and blue
shaded areas. “A1 AND B0” will be all of the items in A as
described above. Thus, as the value of query term B goes
from 1.0 to 0.0, items will be proportionally added from “A
NOT B” (the green area) until at 0.0 all of the items have been
added.

Finally, the strict interpretation of “A1 NOT B1” is the green
area, while “A1 NOT B0” is all of A. Thus, as the value of B
goes from 1.0 to 0.0, items are proportionally added
from “A AND B” (the green and blue shaded area) until at
0.0 all of the items have been added.
The final issue here is the determination of which items
are to be added or dropped in interpreting the weighted
values.

2. Statistical Model
The vector space and probabilistic models are the two
major examples of the statistical retrieval approach. Both
models use statistical information in the form of term
frequencies to determine the relevance of documents
with respect to a query. Although they differ in the way
they use the term frequencies, both produce as their
output a list of documents ranked by their estimated
relevance. The statistical retrieval models address some
of the problems of Boolean retrieval methods, but they
have disadvantages of their own.

Statistical Model
1. Vector Space Model
2. Probabilistic Model
3. Latent Semantic Indexing

2.1 Vector Space Model
• Vector space model or term vector model is an
algebraic/statistical model for representing text
documents (and any objects, in general) as vectors of
identifiers, such as, for example, index terms. It is used
in information filtering, information retrieval, indexing and
relevancy rankings.
• The Vector Space Model (VSM) is a way of representing
documents through the words that they contain.
• The VSM allows decisions to be made about which
documents are similar to each other and to keyword queries.

The Vector Space Model uses
weights as the foundation for information detection and
stores these weights in vector form.
In systems based upon a vector model, the semantics of
every item are represented as a vector.
What is a Vector?
A vector is a one-dimensional set of values, where the
order/position of each value in the set is fixed and
represents a particular domain. Each vector represents a
document, and each position in a vector represents a
different unique word (processing token) used in the
database.

There are two approaches to the domain of values in the
vector: binary and weighted.
Binary: each position holds 1 or 0,
1 representing the existence of the processing
token in the item,
0 representing the non-existence of the processing
token in the item.
Weighted: each position holds a non-negative real
number. The value assigned to
each position is the weight of that term in the
document. A value of zero indicates that the word
is not in the document.
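
The difference between the two approaches can be shown on a single document. The vocabulary, the token list and the choice of dividing raw counts by the largest count to obtain weights between zero and one are assumptions made for this sketch.

```python
# Fixed vocabulary: each position in the vector stands for one processing token.
vocabulary = ["boolean", "query", "retrieval", "vector"]

# Processing tokens extracted from one hypothetical document.
tokens = ["vector", "retrieval", "vector", "query"]

# Binary vector: 1 if the token occurs in the document, 0 otherwise.
binary = [1 if term in tokens else 0 for term in vocabulary]

# Weighted vector: raw counts scaled by the largest count (an assumed scheme).
counts = [tokens.count(term) for term in vocabulary]
weighted = [c / max(counts) for c in counts]

print(binary)
print(weighted)
```

Both vectors have a zero in the "boolean" position because that word does not occur in the document; only the weighted vector shows that "vector" is more prominent than "query".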

Queries can be translated into the vector form. Search is
accomplished by calculating the distance between the
query vector and the document vector. The use of
weights also provides a basis for determining the rank of
an item.
The vector approach allows for a mathematical and a
physical representation using a vector space model.

If a query (q) is considered to be a line in an imaginary
space and the document (d) is also considered to be a
line in the imaginary space, the geometrically
determined angle between the two lines can be
understood as measuring the degree to which the
document is similar to the query. While in the case of
a large angle the document is presumed to be dissimilar
to the query, in the case of a very small angle the
document is presumed to be highly similar to the
query.

How does the Vector Space Model indexing procedure work?
The Vector Space Model procedure can be divided into
three stages:
The first stage is document indexing, where the
content-bearing terms are extracted from the document text.
Many of the words in a document do not
describe its content, for example the, is, are, in, to, of.
These are called non-significant words or stop words. In
automatic indexing, these words are removed
from the document vector, so the document is
represented only by the content-bearing terms. In general,
40-50% of the total number of words in a document are
stop words. These can be removed with the help of a
stop word list.

The second stage is the weighting of the indexed terms to
enhance retrieval of documents relevant to the user.
The last stage ranks the documents with respect to the
query according to a similarity measure.
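
The first stage, document indexing with a stop word list, can be sketched as follows. The stop word list is a short illustrative sample and the tokenisation (splitting on whitespace and stripping punctuation) is a simplifying assumption.

```python
from collections import Counter

# A very small stop word list; operational lists are much longer.
STOP_WORDS = {"the", "is", "are", "in", "to", "of", "a", "and"}

def index_terms(text):
    """Return the content-bearing terms of a text with their frequencies."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return Counter(t for t in tokens if t and t not in STOP_WORDS)

doc = "The retrieval of documents is the task of a retrieval system."
terms = index_terms(doc)
print(terms)
```

The resulting term frequencies are the raw material for the second stage, where they are turned into weights.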

Documents and queries are represented as vectors:
d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})
q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})
Each dimension corresponds to a separate term, so both vectors have
t components. If a term occurs in
the document, its value in the vector is non-zero.
Several different ways of computing these values, also known as
(term) weights, have been developed. One of the best known
schemes is tf-idf (term frequency–inverse document frequency)
weighting.
The definition of a term depends on the application (i.e. whether
the items are articles, books, etc.). Typically terms are single words, keywords, or
longer phrases. If words are chosen to be the terms, the
dimensionality of the vector is the number of words in the
vocabulary (the number of distinct words occurring in
the corpus).
Vector operations can be used to compare documents with queries.
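
A minimal tf-idf sketch is given below. Many variants of the scheme exist; this one multiplies the raw term frequency by the natural logarithm of (number of documents / number of documents containing the term), which is an assumption of the example, as are the three toy documents.

```python
import math

# Toy corpus: document id -> list of index terms (illustrative data).
docs = {
    "d1": ["boolean", "query", "query"],
    "d2": ["vector", "query"],
    "d3": ["vector", "space", "model"],
}

def tf_idf(term, doc_id):
    """Weight of a term in a document: term frequency times inverse document frequency."""
    tf = docs[doc_id].count(term)
    df = sum(1 for tokens in docs.values() if term in tokens)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

print(round(tf_idf("query", "d1"), 4))
print(round(tf_idf("boolean", "d1"), 4))
print(tf_idf("query", "d3"))
```

In d1 the word "boolean" occurs once and "query" twice, yet "boolean" receives the higher weight because it appears in only one document of the collection.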

Relevance rankings of documents in a keyword search can
be calculated, using the assumptions of document
similarities theory, by comparing the angle
between each document vector and the original query
vector, where the query is represented as a vector with the
same dimension as the vectors that represent the
documents.
In practice, it is easier to calculate the cosine of the angle
between the vectors, instead of the angle itself:
\cos\theta_j = \frac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert}

The VSM contrasts with the Boolean Retrieval Model, in which
retrieval is based on an exact (hundred percent) match. The VSM
allows retrieval of the documents most similar to the query without
requiring an exact match. The VSM can be explained in terms of a
keyword-by-document matrix (A), in which the rows correspond to
keywords (W) in the database and the columns correspond to
documents (D):
           D1    D2    D3    D4    ...   Dn
     W1    A11   A12   A13   A14   ...   A1n
     W2    A21   A22   A23   A24   ...   A2n
A =  W3    A31   A32   A33   A34   ...   A3n
     W4    A41   A42   A43   A44   ...   A4n
     ...   ...   ...   ...   ...   ...   ...
     Wm    Am1   Am2   Am3   Am4   ...   Amn

Let us take a hypothetical example: an information seeker
searches for information on “Education information retrieval
system”.
He uses four keywords: W1, W2, W3, and W4.
After searching the database,
he gets six articles: A1, A2, A3, A4, A5, and A6.
After analysis, it is found that
Article A1 deals only with W1;
Article A2 devotes 33% to W2 and 67% to W4;
Article A3 devotes 20% to W1, 30% to W3 and 50% to W4;
Article A4 devotes 60% to W1, 10% to W2 and 30% to W4;
Article A5 devotes 80% to W2 and 20% to W3;
Article A6 deals only with W4.
This can be denoted in the form of a 4 × 6 matrix as below:

           A1     A2     A3     A4     A5     A6
     W1    1.00   0.00   0.20   0.60   0.00   0.00
     W2    0.00   0.33   0.00   0.10   0.80   0.00
A =  W3    0.00   0.00   0.30   0.00   0.20   0.00
     W4    0.00   0.67   0.50   0.30   0.00   1.00
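
The matrix of the example can be used to rank the six articles against a query by cosine similarity. The weights are taken from the example; the query vector, which weights W1 and W4 equally and ignores W2 and W3, is a hypothetical choice made only for this sketch.

```python
import math

# Columns of the 4 x 6 keyword-by-document matrix: weights for W1..W4 per article.
articles = {
    "A1": [1.00, 0.00, 0.00, 0.00],
    "A2": [0.00, 0.33, 0.00, 0.67],
    "A3": [0.20, 0.00, 0.30, 0.50],
    "A4": [0.60, 0.10, 0.00, 0.30],
    "A5": [0.00, 0.80, 0.20, 0.00],
    "A6": [0.00, 0.00, 0.00, 1.00],
}

def cosine(d, q):
    """Cosine of the angle between a document vector and a query vector."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0

# Hypothetical query: interested in W1 and W4 only.
query = [1.0, 0.0, 0.0, 1.0]

scores = {name: cosine(vec, query) for name, vec in articles.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 4))
```

Article A4 comes first because it covers both query keywords, whereas A1 and A6 each match only one of them exactly; A5 shares no keyword with the query and scores zero. No article would be retrieved by a strict Boolean "W1 AND W4 AND NOT W2 AND NOT W3" match except through this kind of partial matching.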

The VSM is a retrieval model which constitutes a fairly large class of retrieval
methods, each consisting of an indexing method and a retrieval function. The
indexing method generates description vectors, and the retrieval function
generates retrieval status values by comparing the query description vector
with the document description vectors.
The information seeker is assumed to have an information need, which he formulates
as a query. The query q and the document dj are indexed in two steps.
First, appropriate indexing features are identified in the query q and in the document
dj.
Secondly, these features are assigned weights, so that the query description and
the document descriptions are sets of weighted indexing features. These are
called the query description vector and the document description vectors. The query
description and document descriptions are matched and a score is generated for every
query-document pair. These scores are called Retrieval Status Values (RSVs). For every
query, the documents are presented to the information seeker in descending
order of these RSVs.

Each document in a collection is represented by a document vector
whose i-th component records the single or multiple occurrences of
term i in document d.
Similarly, a query is represented by a query vector which denotes
the number of occurrences of terms in the query.
Both the document vector and the query vector provide the locations
of the objects in the term-document space. Every vector has two
common scalar measures:
length and angle with respect to a fixed reference. The angle
between two vectors refers to the measure in degrees between
those two vectors. The document vector whose direction is closest
to that of the query vector is the best choice, yielding the
document most closely related to the query. Closeness is measured in
terms of the cosine of the angle between the two vectors. If the cosine of
the angle is 1, then the angle between the document vector and
the query vector measures 0 degrees, meaning the document
vector and the query vector point in the same direction. A
cosine measure of 0 would mean the document is unrelated to
the query (the two vectors are orthogonal). Thus, a cosine measure close to 1 means that
the document is closely related to the query.

\cos\theta = \frac{d_2 \cdot q}{\lVert d_2 \rVert \, \lVert q \rVert}
