INFORMATION STORAGE
AND RETRIEVAL SYSTEM
Dr. Utpal Das
Dibrugarh University,
Dibrugarh, Assam
utpalishaan@gmail.com
Break up of Terminology
INFORMATION / STORAGE / RETRIEVAL / SYSTEM
INFORMATION
  Knowledge, Information, Data
  Media: Books, Journals, Articles, Audio, Video, Cartographs
  Text, Sound, Image, Data
STORAGE
  Databases: Bibliographic, Full Text
  Stand-alone databases, hypertext networked databases
  System: DBMS, Classification Schemes, Indexes
RETRIEVAL
  Recall, Searching, Recovering, Interpreting, Query Analysis
SYSTEM
  Mechanism, Framework, Mode of Arrangement, Interconnected Network
  A set of Principles or Procedures, Organized Scheme or Method, Modus Operandi
Genesis
The term “Information Retrieval System” was coined by
Calvin Mooers in 1952.
IRS gained popularity in the research community in the
early sixties only when computers were being introduced
in information handling and management.
These information retrieval systems were basically
document retrieval systems, since they were designed
to retrieve bibliographic information about documents stored
in databases in response to a search request by the users.
Though the basics of IRS are still the same, the application
of present advanced techniques has much widened the role and
scope of IRS. Therefore the connotation of
information retrieval has changed and it has been variously
termed by information professionals and researchers, like:
Information Storage and Retrieval System,
Information Organization and Retrieval System,
Information Processing and Retrieval System,
Text Retrieval System,
Information Representation and Retrieval
System,
Information Access System.
The modern connotation implies that IRS presently
deals not only with textual information but also with
multimedia information comprising text, audio, images
and video.
While many features of conventional text retrieval
systems are equally applicable to multimedia information
retrieval, the specific nature of audio, image and video
information has called for the development of many
new tools and techniques for information retrieval.
Thus, modern information retrieval systems deal with
storage, organization and access to text, as well as
multimedia information resources.
Meaning, Definition and Concept of ISRS
• ISRS is a selective, systematic recall of logically stored
information.
• ISRS is the science of searching for information in
documents, searching for documents themselves,
searching for metadata which describe documents, or
searching within databases, whether relational stand-alone
databases or hypertext networked databases such
as the Internet or World Wide Web or intranets, for text,
sound, images or data.
• An ISRS is an information system, that is, a system used to
store items of information that need to be processed,
searched, retrieved, and disseminated to various user
populations.
• It is a process of searching some collection of documents,
using the term document in its widest sense, in order to
identify those documents which deal with a particular
subject. Any system that is designed to facilitate this
literature searching may legitimately be called an
information retrieval system.
ISRS is the study of systems for indexing, searching, and
recalling data, particularly text or other unstructured
forms
Information retrieval may be defined as the technique
and process of searching, recovering, and interpreting
information from large amounts of stored data.
It is recovery of information, especially in a database
stored in a computer
IR is essentially concerned with the structure and operation
of devices that select documentary information in
response to a search query.
An IRS does not inform the user on (i.e., change the knowledge
of) the subject of his enquiry; it merely informs him of the
existence or non-existence and whereabouts of
documents relating to his request.
An information retrieval system is designed to
analyse, process and store sources of information
and retrieve those that match a particular user’s
requirements
[Chowdhury, G.G. (2004). Introduction to modern
information retrieval. 2nd ed. London: Facet
Publishing.]
Basic aspects of ISRS:
Information Storage and Retrieval (ISAR) system deals with
three basic aspects:
Information representation
Information storage and organisation
Information access.
BROAD OUTLINE
Information sources -> Analysis & Representation -> Organised Information -> Matching
Users' Query -> Analysis -> Analysed Queries -> Matching
Matching -> Retrieved Information
Functional View of Standard IR System
CHARACTERISTICS OF ISAR SYSTEMS
Information Facilitator
The ISAR system should act as a facilitator between the
information (contained in documents) and the users. If a
user approaches with a subject term, the name of a contributor,
the title of a document and so on, the system should
help to give him the desired information. This
could be the exact information or a reference to a document
which contains it.
Non-Ambiguous
The system should be so organized that ambiguity of
information is avoided so that search result is free from
any kind of ambiguity. This requires identification of
terms, setting their context and their proper indexing.
For example, search for a term ‘screw driver’ should not
bring results like ‘truck driver’, ‘hardware driver’ and so
on.
OBJECTIVES OF ISAR SYSTEMS
Minimum Time
The system should be so designed that minimum effort and
time are spent to interrogate it. Searching
through the system should take minimum time, meaning
that the ISAR should be capable of performing
fast searches. Moreover, it is best to have an online
ISAR so that users do not need to walk to the library. They
should get whatever they want at their workplace.
User Friendliness
Ease of use is an important consideration for any ISAR system.
Any ISAR should have a user-friendly interface. The important
aspects of the ISAR should be highlighted. Before a user uses the
system he/she should be properly introduced to the system
with all its features, i.e., informed about the scope of the
system, the available search options, and most importantly how
to perform a search with the system. It is only through this
interface that a user operates an ISAR system. Take the
example of a library OPAC. It should have the following features:
Introduction to library
Scope of collection
Instructions for performing search
The search interface should facilitate framing the search
like:
Keyword search
Author and title search
Combination search (using Boolean operators)
Proximity search, etc.
CHARACTERISTICS OF ISAR SYSTEMS
Others
The desirability of making systems as readily usable as
possible for their clienteles
The need to recognise basic features of retrieval system
To incorporate coordinating features such as vocabulary
control, search strategies, user-interface, information
modelling aspects in general, etc.
The competence and compatibility for consolidated
searching and retrieval of information from any client
terminal from any database within the system.
It should be able to narrowcast or broadcast or relate the
information need in a variety of associations to get
optimum retrieval performance.
It should have access facilities at multi-points.
It should have common command language facility to
retrieve information from several databases of the
system
It should be able to handle information access from entity-
related or object-oriented approaches. It may also
provide all other associations for accessing information.
In a bibliographic or full-text database, the surrogates
chosen should have indicative as well as informative
features that are sufficient enough to select or reject the
retrieving information based on end-users’ needs.
It should have the ability to select, classify, process and
consolidate the analysed information into a cohesive text
ready for assimilation by the end-users.
It should have the ability to orient the information to the specialist
needs of the users from time to time. This calls for
understanding and processing user profiles.
It should be able to retrieve maximum information with a
minimum number of clues.
It should help clarify the fuzzy requests of end-users, and the
final result should satisfy the searcher.
It should have the capacity to interchange the information
available in one database with another for purposes of
retrieval, relevance and usage.
It should have bibliographic data interchange capacity
(using Z39.50 or similar standard) to meet consolidation
to a chosen format for networking and other purposes.
Compatibility with standards at all levels must be the goal.
It should have ability to search simple information quickly
in an easy manner and also have the ability to multi-
track the complex questions and present them in a
simple easy manner. User-friendly presentations are very
important.
FUNCTIONS
To identify the information (sources) relevant to the areas
of interest of the target user’s community; this is a
challenging job especially in the web environment
where virtually everybody in the world can be the
potential user of a web based information retrieval
system.
To analyse the contents of the sources (documents); this is
becoming increasingly challenging as the size, volume
and variety of information sources (documents) is
increasing rapidly; web information retrieval is carried
out automatically using specially designed programs
called spiders.
To represent the contents of analysed sources in a way that
matches users’ queries; this is done by automatically
creating one or more index files, and is becoming an
increasingly complex task due to the volume and variety
of content and increasing user demands.
To analyse users’ queries and represent them in a form that
will be suitable for matching the database; this is done in
a number of ways, through the design of sophisticated
search interfaces including those that can provide some
help to users for selection of appropriate search terms by
using dictionary and thesauri, automatic spell checkers, a
predefined set of search statements and so forth.
To match the search statement with the stored database; a
number of complex information retrieval models have
been developed over the years that are used to determine
the similarity of the query and stored documents.
To retrieve relevant information; a variety of tools and
techniques are used to determine the relevance of
retrieved items and their ranking.
To make continuous changes in all aspects of the system,
keeping in mind the rapid developments in information
and communication technologies (ICTs) relating to
changing patterns of society, users and their information
needs and expectations.
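The matching and ranking functions listed above can be illustrated with a minimal sketch. It assumes one particular retrieval model (a TF-IDF weighted vector space with cosine similarity); the slides do not name a specific model, and the function names `tokenize` and `rank` are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; real systems also stem and remove stop words.
    return text.lower().split()

def rank(query, documents):
    """Return (doc_index, score) pairs sorted by descending cosine similarity."""
    doc_tokens = [tokenize(d) for d in documents]
    n = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))

    def vector(tokens):
        # Smoothed TF-IDF weight so unseen query terms do not divide by zero.
        tf = Counter(tokens)
        return {t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in tf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q = vector(tokenize(query))
    scores = [(i, cosine(q, vector(tokens))) for i, tokens in enumerate(doc_tokens)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

documents = [
    "information retrieval systems",
    "database indexing methods",
    "retrieval of information from databases",
]
ranked = rank("information retrieval", documents)
```

Documents sharing no term with the query receive a score of zero, so they fall to the bottom of the ranked list.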
Design of Information Retrieval System
To design and develop an ISAR system, one needs to
recognize the needs of the users, as all
subsequent activities depend upon these.
When designing ISAR systems, one should follow the system
development life cycle (SDLC) for greater
efficiency and effectiveness of the systems.
System Development Life Cycle Phases:
1. System Planning:
i. Defining the problems,
ii. Objectives and need
iii. Resources (such as personnel
and costs).
After analyzing data for planning one will have three
choices:
Develop a new system,
Improve the current system or
leave the system as it is.
2. System Analysis:
i. Determining end-user’s requirements,
ii. Their expectations from the system,
iii. Performance of the System
iv. Feasibility study
3. System Design:
i. Elements of a system,
ii. Components,
iii. Security level,
iv. Modules,
v. Architecture
vi. Interfaces
vii. Type of data
(system design meets all functional and technical requirements,
logically and physically)
4. Implementation and Deployment
i. It is the actual construction process
ii. In the Software Development Life Cycle, the
actual code is written here
iii. In the Hardware Development Life Cycle, the
implementation phase involves
configuration and fine-tuning
iv. The system becomes ready to run,
live and productive
5. System Testing and Integration
i. Introducing the system to different inputs
ii. obtaining its outputs and analyze behavior
iii. Observing the way it functions
(Testing is important to ensure customer’s satisfaction,
and it requires no knowledge in coding, hardware
configuration or design)
6. System Maintenance
i. periodic maintenance to prevent redundancy
ii. Replacing the old hardware
iii. Periodical evaluation of system’s performance,
iv. latest updates for certain components with latest
technologies to face current security threats.
Steps for Design of Information Retrieval System
Steps for designing an Information Retrieval System:
i. Recognizing the need for development of ISAR system
ii. Recognizing the information needs of the users
iii. Identification of users need
iv. Type(s) of databases to be incorporated into the system
v. Features to be incorporated in the databases
vi. Preparation of structured queries
vii. Design and development of various components of the
system such as user interface, search agent, etc.
viii. Evaluation of the system
ix. Re-designing/Modification of ISAR system, if needed.
Need & Purpose
The basic purpose of ISRS is to satisfy the information needs
of various classes of users:
a) Current Information Need,
b) Exhaustive Information Need,
c) Every day Information Need, and
d) Catching-up or Brushing-up Information
Need
An IRS is designed to retrieve the documents or information
required by the user community.
It should make the right information available to the right
user. Thus, an information retrieval system aims to collect
and organize information in one or more subject areas in
order to provide it to users as soon as they ask for it.
A writer presents a set of ideas in a document using a set of
concepts.
Somewhere there are users who require the ideas but
may not be able to identify them; in other words,
some people lack the ideas put forward by the
author in their work.
An IRS matches the writer’s ideas expressed in the
document with the users’ requirements for them.
Thus, an IRS serves as a bridge between the world of
creators or generators of information and the users
of that information.
Components for Design of ISRS
An ISAR system has 3 basic components:
I. User Interface
II. Knowledge Base
III. Search Agent
I. User Interface:
The user interface is the front page, front-end or (user’s)
operational area of the system, which enables the user to
put a query and displays results.
It is of two types:
i. Query Interface
ii. Result Interface
i. Query Interface:
This is the end from where users enter their search
terms and initiate communication with the system. The
Query Interface generally needs to have the following
features:
a) Understanding the user input statement
This front-end interface needs to understand the
keywords given by the users and capture them to pass
on to the search program. The front-end should have
understandable look and feel, distinguishable colour
combinations, and search instructions.
b) Refining the problem statement
The interface should have ability or flexibility for further
refining any query or statement, narrow down from broader
to specific search or further modification within the displayed
search results with some kind of arrangement among topical
terms which further facilitate browsing through the system.
c) Search statement to search strategy translation
The system front-end should be able to translate a
search statement and formulate a search strategy in the
language understood by the Search Agent.
For example, interfaces built in a Relational Database
Management System (RDBMS) environment accept the search
statement in Structured Query Language (SQL) format and
formulate the search strategy with the help of the Search Agent
(using Boolean operators or other algorithms).
d) Modification of search strategy
If one does not get the desired output from the database, the ISAR
system should have a procedure for further modification of the
search strategy. The modification should be interactive.
Vocabulary control devices can also be added as an aid
for users to locate the terms of their interest.
For Example: Modifying search with the help of other
options like ‘Contains’, ‘Exact’, ‘Begins with’, ‘Ends with’,
etc.
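The translation of a search statement into a search strategy, and its modification through options such as ‘Contains’ or ‘Begins with’, can be sketched as follows. This is only an illustration under stated assumptions: the `books` table, the `PATTERNS` mapping and the function names are hypothetical, SQLite stands in for the RDBMS, and the wildcard characters `%` and `_` inside the user's term are not escaped.

```python
import sqlite3

# Hypothetical mapping from interface match options to SQL LIKE patterns.
PATTERNS = {
    "Contains": "%{}%",
    "Exact": "{}",
    "Begins with": "{}%",
    "Ends with": "%{}",
}

def demo_connection():
    # Small in-memory table standing in for the stored database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE books (title TEXT)")
    conn.executemany(
        "INSERT INTO books VALUES (?)",
        [("Information Retrieval",), ("Modern Information Systems",), ("Retrieval",)],
    )
    return conn

def search_titles(conn, term, mode="Contains"):
    # Translate the user's term and match option into a parameterised SQL query.
    pattern = PATTERNS[mode].format(term)
    rows = conn.execute(
        "SELECT title FROM books WHERE title LIKE ? ORDER BY title", (pattern,)
    )
    return [row[0] for row in rows]

conn = demo_connection()
broad = search_titles(conn, "Retrieval", "Contains")
narrow = search_titles(conn, "Retrieval", "Exact")
```

Switching the option from ‘Contains’ to ‘Exact’ narrows the result without the user having to rewrite the query, which is the kind of interactive modification described above.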
ii. Result Interface
In the Result Interface, the display of search results
should be user friendly.
Not only should the results cater to the needs of
individual users, but the display should also be
customizable (as in e-resource publishers' interfaces).
Search results should also display ratings in the
light of the search terms. For this purpose statistical
techniques can be used.
II. Knowledge Base
The store house of any ISAR system is its Knowledge Base. It
contains list of facts or related facts (information). Any kind of
query is answered based on the facts stored in the Knowledge
Base. A Knowledge Base could be a Database Management
System (DBMS).
A knowledge base (KB) is a technology used
to store complex structured and unstructured information used
by a computer system.
A knowledge-based system consists of a knowledge base that
represents facts about the world and an inference engine that
can reason about those facts and use rules and other forms of
logic to deduce new facts or highlight inconsistencies.
Retrieval of information from storage depends
on two important aspects of Knowledge Base:
A. Knowledge Representation
B. Indexing and Clustering
A. Knowledge Representation:
The first and foremost objective in constructing an
ISAR system is representation of facts within the
Knowledge Base.
There are different ways of representation of
knowledge:
a) Semantic Network Knowledge Representation
b) Frame Based Knowledge Representation
c) Rule-Based Knowledge Representation
a) Semantic Network Knowledge Representation
Semantic network is a method of knowledge representation
based on a network structure. A semantic network
contains points called nodes connected by links called
as arcs. The nodes represent objects, concepts or
events - in other words documents or information. The
arcs are used to represent the relations between the
nodes. Arcs build a kind of hierarchies in the Knowledge
Base. Arcs usually represent relations like is_a or
has_part.
Semantic networks are useful in representation of
sentences of natural language.
Semantics is the linguistic and philosophical study
of meaning, in language, programming languages,
formal logics, and semiotics.
It is concerned with the relationship between signifiers—
like words, phrases, signs, and symbols—and what they
stand for in reality, their denotation.
In LISP Programming Language:
;; Each entry has the form (node (relation value) ...)
(setq *database*
      '((canary  (is-a bird)
                 (color yellow)
                 (size small))
        (penguin (is-a bird)
                 (movement swim))
        (bird    (is-a vertebrate)
                 (has-part wings)
                 (reproduction egg-laying))))
Also, setq can be used to assign different values to different
variables. The first argument is bound to the value of the
second argument, the third argument is bound to the
value of the fourth argument, and so on. For example,
you could use the following to assign a list of trees to the
symbol trees and a list of herbivores to the
symbol herbivores:
(setq trees '(pine fir oak maple)
herbivores '(gazelle antelope zebra))
To set the value of the variable carnivores to the
list '(lion tiger leopard) using setq, the following
expression is used:
(setq carnivores '(lion tiger leopard))
This is exactly the same as using set except the first
argument is automatically quoted by setq. (The ‘q’
in setq means quote.)
With set, the expression would look like this:
(set 'carnivores '(lion tiger leopard))
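To show how such a network is actually used for retrieval, the canary/penguin/bird database from the LISP listing is restated below in Python, with a lookup that follows is-a arcs so that a node inherits properties from its parent. This is a minimal sketch; the dictionary layout and the `lookup` function are assumptions made for illustration and are not part of the original slides.

```python
# The same nodes and arcs as the LISP *database*, as nested dictionaries.
NETWORK = {
    "canary": {"is-a": "bird", "color": "yellow", "size": "small"},
    "penguin": {"is-a": "bird", "movement": "swim"},
    "bird": {"is-a": "vertebrate", "has-part": "wings", "reproduction": "egg-laying"},
}

def lookup(node, relation):
    """Follow is-a arcs upward until the relation is found; None if absent."""
    seen = set()  # guards against accidental cycles in the network
    while node in NETWORK and node not in seen:
        seen.add(node)
        properties = NETWORK[node]
        if relation in properties:
            return properties[relation]
        node = properties.get("is-a")
    return None

canary_part = lookup("canary", "has-part")   # inherited from bird
canary_move = lookup("canary", "movement")   # not stated anywhere on the path
```

The canary node stores no has-part arc of its own; the value is found one level up, which is the hierarchy-building role of is_a arcs described earlier.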
Complexity in Semantic Network Knowledge Representation
The idea of semantic networks started out as a natural way to
represent labelled connections between entities. But, as the
representations are expected to support increasingly large
ranges of problem solving tasks, the representation schemes
necessarily become increasingly complex
In particular, it becomes necessary to assign more structure to
nodes, as well as to links. For example, in many cases we need
node labels that can be computed, rather than being fixed in
advance. It is natural to use database ideas to keep track of
everything, and the nodes and their relations begin to look
more like frames.
b) Frame Based Knowledge Representation
The original idea of frames was developed by Minsky
(1975) who defined them as “data structures for
representing stereotyped situations”, such as going into
a class room.
It is an object-oriented approach. A frame represents an
object (document or information), a class of objects
(collection of documents or information) or several facts.
When frames represent a class of objects, they generalize
certain groups by identifying the overall properties
those groups share.
The pointers where properties are stored are known as
slots. Similarly, if a frame represents an object, slots
represent the properties or attributes of the object.
Slots contain the value for that particular attribute.
For example, a book in a library is an object, therefore it
can be represented as frame. The properties of book,
i.e., Title, Author, Place, Publisher and so on are stored
as slots and each slot would have corresponding value.
Frame: Book
Slot        Value
Title       Information Storage & Retrieval
Author      G. G. Chaudhury
Publisher   Ess Ess Publication
Place       New Delhi
Size        18 X 14 cm
The simplest type of frame is just a data structure with
similar properties and possibilities for knowledge
representation as a semantic network, with the same
ideas of inheritance and default values
Frames become much more powerful when their slots can
also contain instructions (procedures) for computing
things from information in other slots or in other frames
Basic Idea: A frame consists of a selection of slots which
can be filled by values, or procedures for calculating
values, or pointers to other frames. For example:
Class Room
  is-a: Room
  Location: Department
  Contains: {Desk, Bench, Black Board, Table, Chairs, ...}
  ...
Class Room Chair
  is-a: Chair
  Location: Class Room
  Height: 20-40 cm
  Legs: 4
  Comfortable: Yes
  Use: Sitting
Attached to each frame will then be several kinds of
information. Some information can be about how to use
the frame. Some can be about what one can expect to
happen next, or what one should do next. Some can be
about what to do if our expectations are not confirmed.
Then, when one encounters a new situation, one can
select from memory an appropriate frame and this can be
adapted to fit reality by changing particular details as
necessary. Frames of this type are now generally referred to as Scripts.
A complete frame-based representation will consist of a
whole hierarchy or network of frames connected
together by appropriate links/pointers.
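The three kinds of slot filler (values, procedures and pointers to other frames) can be sketched in a few lines of Python. The `Frame` class below is a hypothetical illustration and not a standard library; the `description` slot is an invented example of a procedural slot.

```python
class Frame:
    """A frame with named slots and an optional is-a pointer to a parent frame."""

    def __init__(self, name, is_a=None, **slots):
        self.name = name
        self.is_a = is_a      # pointer to another frame (or None)
        self.slots = slots    # slot name -> value or procedure

    def get(self, slot):
        # Look in this frame first, then inherit along the is-a chain.
        frame = self
        while frame is not None:
            if slot in frame.slots:
                value = frame.slots[slot]
                # A callable filler is a procedure computed for the asking frame.
                return value(self) if callable(value) else value
            frame = frame.is_a
        raise KeyError(slot)

chair = Frame(
    "Chair",
    legs=4,
    use="Sitting",
    description=lambda f: f"{f.name} with {f.get('legs')} legs",
)
class_room_chair = Frame("Class Room Chair", is_a=chair, location="Class Room")
```

Here `class_room_chair.get("legs")` is answered by the parent frame, and the procedural `description` slot is computed from other slots at the time it is requested.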
c) Rule-Based Knowledge Representation
Rule based representation is a popular approach. Rules are
employed to state the way in which the inference has to
be done.
Rules provide a formal way of representing recommendations,
directives, or strategies. Rules are appropriate when the
domain knowledge results from empirical associations
developed through years of experience in solving problems
in a given area.
Rules are expressed in the form of IF-THEN statements.
For example:
If search is in collection of BOOKS THEN display Title,
Author, Place, Publisher, Year, Physical Description, ISBN
If search is in collection of ARTICLES THEN display Title,
Author, Name of Journal, Volume, Issue, Year, ISSN
A rule relates an antecedent clause (condition) to a
consequent clause (action) by implication, for example:
IF (A AND B) THEN S1
The syntax structure is
IF <premise> THEN <action>
<premise> is a Boolean expression. The AND, and to a lesser
degree OR and NOT, logical connectives are
possible.
<action> is a series of statements.
In a rule based expert system, the domain knowledge is
represented as a set of rules that are checked against a
collection of facts or knowledge about the current
situation.
When the IF portion of the rule is satisfied by the facts, the
action specified by the THEN portion is performed. When
the condition is satisfied the rule is said to ‘fire’ or
‘execute’. A rule interpreter is used to compare the IF
portions of rules with the facts and execute the rule
whose IF portion matches the facts.
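A rule interpreter of this kind can be sketched with the BOOKS and ARTICLES display rules given earlier. The representation of a rule as a (premise, action) pair and the function name `fire` are assumptions for illustration; the premise here only supports AND over equality tests.

```python
# Each rule is (premise, action): the premise is a dict of facts that must
# all hold (logical AND); the action is the list of fields to display.
RULES = [
    ({"collection": "BOOKS"},
     ["Title", "Author", "Place", "Publisher", "Year",
      "Physical Description", "ISBN"]),
    ({"collection": "ARTICLES"},
     ["Title", "Author", "Name of Journal", "Volume", "Issue", "Year", "ISSN"]),
]

def fire(facts):
    """Return the action of the first rule whose IF portion matches the facts."""
    for premise, action in RULES:
        if all(facts.get(key) == value for key, value in premise.items()):
            return action  # the rule 'fires'
    return []  # no rule matched

book_fields = fire({"collection": "BOOKS"})
```

If no premise is satisfied by the facts, no rule fires and the sketch returns an empty list.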
This is a real success story of AI – tens of thousands of
working systems deployed into many aspects of life
Normally, the term 'rule-based system' is applied to systems
involving human-crafted or curated rule sets. Rule-based
systems constructed using automatic rule inference, such
as rule-based machine learning, are normally excluded from
this system type
Rule-based systems are used as a way to store and manipulate
knowledge to interpret information in a useful way. They are
often used in artificial intelligence applications and research.
A rule-based system (or production system) is a knowledge-based
system (KBS) in which the knowledge is stored as rules; an expert
system is a rule-based system (RBS) in which the rules come from
human experts in a particular domain.
B. Indexing and Clustering
Indexing
An index or database index is a data structure which is used
to quickly locate and access the data in a database table.
Indexing is a way to optimize performance of a database by
minimizing the number of disk accesses required when a
query is processed.
Indexes are created using some database columns:
• The first column is the Search key that contains a copy of
the primary key or candidate key of the table. These values
are stored in sorted order so that the corresponding data
can be accessed quickly (Note that the data may or may
not be stored in sorted order).
• The second column is the Data Reference which contains a
set of pointers holding the address of the disk block where
that particular key value can be found.
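The two-column structure described above (sorted search keys plus data references) can be sketched as follows. The block layout, the sample records and the function names are hypothetical; a Python dictionary stands in for disk blocks, and binary search over the sorted keys stands in for the index lookup.

```python
import bisect

# Hypothetical data file: disk block number -> records (key, payload).
# The records themselves are not stored in key order.
BLOCKS = {
    0: [(42, "doc-42"), (7, "doc-7")],
    1: [(19, "doc-19"), (3, "doc-3")],
}

def build_index(blocks):
    # Column 1: search keys in sorted order. Column 2: block holding each key.
    entries = sorted(
        (key, block_no)
        for block_no, records in blocks.items()
        for key, _ in records
    )
    keys = [key for key, _ in entries]
    refs = [block_no for _, block_no in entries]
    return keys, refs

def find(keys, refs, blocks, key):
    # Binary search on the sorted keys, then read only the referenced block.
    pos = bisect.bisect_left(keys, key)
    if pos == len(keys) or keys[pos] != key:
        return None
    for record_key, payload in blocks[refs[pos]]:
        if record_key == key:
            return payload
    return None

keys, refs = build_index(BLOCKS)
hit = find(keys, refs, BLOCKS, 19)
```

Only one block is read per lookup, which is the reduction in disk accesses that indexing aims for.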
Clustered Indexing
• A clustering index is defined on an ordered data file. The data
file is ordered on a non-key field. In some cases, the index is
created on non-primary key columns which may not be
unique for each record. In such cases, in order to identify
the records faster, two or more columns are grouped
together to get unique values and an index is created out of
them. This method is known as clustering index.
• Basically, records with similar characteristics are grouped
together and indexes are created for these groups.
• For example, students studying in each semester are
grouped together, i.e., 1st semester students, 2nd semester
students, 3rd semester students, etc.
III. Search Agent
Search Agents are vital components of any ISAR system.
These are basically programs which take input from the
Search Interface and search the Knowledge Base
using the existing index. A good ISAR system means efficient
retrieval. Thus, a good search agent must be equipped
with the following features:
facility of using Boolean operators
context setting to search terms
use of clustering algorithms
use of phonetic algorithms
(soundex and metaphone algorithms)
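As an illustration of a phonetic algorithm, the following is a minimal sketch of the American Soundex coding, which maps names that sound alike to the same four-character code. It assumes plain ASCII names; letters outside the coding table are treated like vowels.

```python
# Soundex digit for each coded consonant.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()          # the first letter is kept as-is
    prev = CODES.get(letters[0], "")     # but its code still blocks repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue                     # h and w do not separate equal codes
        code = CODES.get(ch, "")         # vowels have no code
        if code and code != prev:
            result += code
        prev = code                      # a vowel resets prev, allowing repeats
        if len(result) == 4:
            break
    return result.ljust(4, "0")          # pad short codes with zeros

same_code = soundex("Robert") == soundex("Rupert")
```

A search agent can store the code alongside each name in the index, so that a query for one spelling also retrieves the others.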
Boolean Operators
Boolean Operators are simple words (AND, OR, NOT or AND
NOT) used as conjunctions to combine or exclude keywords
in a search, resulting in more focused and productive
results.
AND and NOT operators increase precision whereas OR
increases recall of search results. The shaded area in the
diagram represents retrieved records in the following
example.
Using these operators can greatly reduce or expand the
number of records returned.
Boolean operators save time by focusing
searches on more 'on-target' results that are more
appropriate to your needs, eliminating unsuitable or
inappropriate ones.
Each search engine or database collection uses Boolean
operators in a slightly different way or may require the
operator be typed in capitals or have special punctuation.
The specific phrasing will be found in either the guide to
the specific database found in Research Resources or the
search engine's help screens.
AND—requires both terms to be in each item returned. If
one term is contained in the document and the other is
not, the item is not included in the resulting list.
(Narrows the search)
Example: A search on stock market AND trading returns
results containing: stock market trading; trading on the
stock market; and trading on the late afternoon stock
market.
OR—either term (or both) will be in the returned
document. (Broadens the search)
Example: A search on ecology OR pollution returns
documents containing the word ecology (but
not pollution), other documents containing the word
pollution (but not ecology), as well as documents with
ecology and pollution in either order or number of uses.
NOT or AND NOT ( dependent upon the coding of the
database's search engine)—the first term is searched,
then any records containing the term after the operators
are subtracted from the results. (Be careful with use as
the attempt to narrow the search may be too exclusive
and eliminate good records). If you need to search the
word not, that can usually be done by placing double
quotes (" ") around it.
Example: A search on Mexico AND NOT city returns results
containing: New Mexico; the nation of Mexico; US-Mexico
trade; but does not return Mexico City or This city's
trade relationships with Mexico.
Using Parentheses—Using the ( ) to enclose search
strategies will customize your results to more accurately
reflect your topic. Search engines deal with search
statements within the parentheses first, then apply any
statements that are not enclosed.
Example: A search on (smoking OR tobacco) AND cancer
returns articles containing: smoking and cancer; tobacco
and cancer; smoking, cancer, and tobacco; but does not
return articles on smoking or tobacco when cancer is not
mentioned.
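The behaviour of the three operators and of parentheses can be sketched with set operations over an inverted index. The sample documents and the names `DOCS`, `INDEX` and `term` are invented for illustration; AND maps to set intersection, OR to union and NOT to difference.

```python
from collections import defaultdict

# Hypothetical document collection: id -> text.
DOCS = {
    1: "smoking and lung cancer",
    2: "tobacco advertising",
    3: "tobacco use and cancer risk",
    4: "cancer screening",
}

def build_inverted_index(docs):
    # Map each word to the set of documents that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

INDEX = build_inverted_index(DOCS)

def term(word):
    # Documents containing the word; empty set if the word is not indexed.
    return INDEX.get(word.lower(), set())

# (smoking OR tobacco) AND cancer: the parenthesised part is evaluated first.
focused = (term("smoking") | term("tobacco")) & term("cancer")
# smoking OR tobacco: broadens the search.
broad = term("smoking") | term("tobacco")
# tobacco AND NOT cancer: subtracts records containing the second term.
excluded = term("tobacco") - term("cancer")
```

In this sample, `focused` holds fewer documents than `broad`, which mirrors the precision and recall effect of AND versus OR.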
Context Setting
Context Setting requires content analysis of document.
Here one analyses document manually or automatically
in order to preserve the context of each term in the
index.
It can be done in two ways:
i. Conceptual Analysis
ii. Relational Analysis.
Conceptual analysis
Conceptual analysis can be thought of as frequency of
concepts. Concept can be represented by texts as well
as pictures. To analyze the concept one looks for the
appearance of words in the text. It is not necessary that
same word appears always, there may be synonymous
terms present.
For example, if one is analyzing whether a certain document is
about freedom, then one should also look for related
words like liberation, independence, etc.
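Counting a concept rather than a single word can be sketched as below. The `CONCEPTS` table, which lists the terms treated as expressing one concept, is an assumption built from the freedom example; in practice it would come from a thesaurus or other vocabulary control device.

```python
import re
from collections import Counter

# Hypothetical concept table: concept -> words taken to express it.
CONCEPTS = {"freedom": {"freedom", "liberation", "independence"}}

def concept_frequency(text, concepts=CONCEPTS):
    # Count every occurrence of any term belonging to each concept.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for word in words:
        for concept, terms in concepts.items():
            if word in terms:
                counts[concept] += 1
    return counts

sample = "Independence brought freedom; liberation movements wanted freedom."
counts = concept_frequency(sample)
```

The word freedom occurs twice in the sample, but the concept is counted four times because the synonymous terms are included.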
Relational analysis
Relational analysis goes one step further by examining the
relationships among concepts in a text. In relational
analysis we look at the related words
appearing next to the word in question.
For example, we see which words appear next to
freedom and then determine the related concepts.
Freedom:
i. Freedom of speech and expression: Article 19 (1) (a) of
Constitution of India, Fundamental Rights & duties, ….
ii. Freedom of opinion and Expression: article 19 of UN
Universal declaration of Human Rights, Citizen’s
responsibility,….
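Looking for the words that appear next to a term can be sketched as a co-occurrence count within a small window. The window size, the stop-word list and the function name `neighbours` are assumptions chosen for illustration.

```python
import re
from collections import Counter

STOPWORDS = frozenset({"of", "and", "the", "to", "is", "in", "a"})

def neighbours(text, target, window=2):
    """Count non-stop words within `window` positions of each target occurrence."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word != target:
            continue
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        for j in range(start, end):
            if j != i and words[j] not in STOPWORDS:
                counts[words[j]] += 1
    return counts

sample = "Freedom of speech and expression. Freedom of opinion and expression."
related = neighbours(sample, "freedom")
```

The neighbouring words found this way (speech, opinion, expression) point to the related concepts under which the term should be indexed.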
Clustering Algorithms
Clustering is one of the most common exploratory data analysis
techniques used to get an intuition about the structure of the
data. It can be defined as the task of identifying subgroups in
the data such that data points in the same subgroup (cluster)
are very similar while data points in different clusters are
very different.
Clustering is a method by which large sets of data are grouped
into clusters of smaller sets of similar data based
on some characteristics.
A cluster refers to a collection of data points aggregated
together because of certain similarities.
For example, in a group of players one can cluster players
according to their specialisation of game, like those who play
cricket, those who play hockey and so on.
A clustering algorithm attempts to identify natural groups
of components or data based on some similarity in a
given population. In other words, it is a method to
create subclasses in a given class. The first step in such
algorithms is the identification of a core entity, which is also
known as the centroid.
A centroid is the imaginary or real location representing
the center of the cluster. Similar entities are identified
around the centroid.
In a clustering algorithm, the final goal is to represent
unordered data in an organized way by dividing it into
clusters.
K-means Algorithm
The K-means algorithm tries to partition the
dataset into K pre-defined, distinct, non-overlapping
subgroups (clusters), where each data point belongs to only
one group. It tries to make the intra-cluster data points as
similar as possible while also keeping the clusters as different
(far apart) as possible.
It assigns data points to a cluster such that the sum of the
squared distance between the data points and the cluster’s
centroid (arithmetic mean of all the data points that belong
to that cluster) is at the minimum.
The less variation we have within clusters, the more
homogeneous (similar) the data points are within the same
cluster.
K-Means Clustering
The K-means algorithm identifies k centroids and then
allocates every data point to the nearest cluster, while keeping the
distances between the points and their centroids as small as possible.
The ‘means’ in K-means refers to
averaging of the data; that is, finding the centroid.
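The assign-then-average loop described above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function names, the random choice of starting centroids and the six sample points are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Arithmetic mean of a list of points, i.e. their centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialisation: k random data points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

On this toy data the two visibly separate groups are recovered. In general the result depends on the starting centroids, so practical implementations run several initialisations and keep the best one.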
Mean Shift Clustering Algorithm
Mean Shift is an unsupervised clustering
algorithm that groups data directly without being trained on
labelled data. It is iterative rather than hierarchical: it moves
candidate centroids, step by step, towards regions of higher
density, and the number of clusters need not be fixed in advance.
Mean Shift essentially starts off with a kernel, which is basically
a circular sliding window. The bandwidth, i.e. the radius of
this sliding window, is decided in advance by the user.
A very high-level view of the algorithm is:
STEP 1: Pick any random point, and place the window on that
data point.
STEP 2: Calculate the mean of all the points lying inside this
window.
STEP 3: Shift the window, such that it is lying on the location of
the mean.
STEP 4: Repeat till convergence
Mean shift clustering aims to discover “blobs” in a
smooth density of samples. It is a centroid-based
algorithm, which works by updating candidates for
centroids to be the mean of the points within a given
region. These candidates are then filtered in a post-
processing stage to eliminate near-duplicates to form
the final set of centroids
Mean-Shift Clustering: in a single window
What we are trying to achieve here is to keep shifting the
window to a region of higher density. This is why we keep
shifting the window towards the centroid of all the points in
the window. This feature gives the Mean Shift algorithm the
property of a hill-climbing algorithm.
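The four steps and the final filtering of near-duplicate centroids can be sketched as follows. This is a simplified version with a flat circular window; the function names, the merge threshold of half the bandwidth and the sample points are assumptions made for illustration.

```python
import math

def mean(points):
    """Arithmetic mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def shift_window(start, points, bandwidth, tol=1e-6, max_iter=100):
    """Slide a circular window from `start` until its centre stops moving."""
    centre = start
    for _ in range(max_iter):
        inside = [p for p in points if math.dist(p, centre) <= bandwidth]
        if not inside:
            break
        new_centre = mean(inside)             # STEP 2: mean of points in the window
        moved = math.dist(new_centre, centre)
        centre = new_centre                   # STEP 3: shift the window to the mean
        if moved < tol:                       # STEP 4: stop at convergence
            break
    return centre

def mean_shift(points, bandwidth):
    """Run a window from every data point and merge near-duplicate centres."""
    modes = []
    for p in points:                          # STEP 1: place a window on a data point
        mode = shift_window(p, points, bandwidth)
        # Assumed merge rule: centres closer than bandwidth / 2 count as the same one.
        if all(math.dist(mode, m) > bandwidth / 2 for m in modes):
            modes.append(mode)
    return modes

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
modes = mean_shift(points, bandwidth=2.0)
print(len(modes))  # 2
```

Unlike K-means, the number of clusters is not given in advance; it follows from the bandwidth chosen by the user.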
Mean-Shift Clustering: entire process
Density-Based Spatial Clustering
Expectation–Maximization (EM) Clustering using Gaussian
Mixture Models (GMM)
Agglomerative Hierarchical Clustering
Phonetic algorithm
• A phonetic algorithm is an
algorithm for indexing of words by their pronunciation.
Most phonetic algorithms were developed for use with
the English language; consequently, applying the rules to
words in other languages might not give a meaningful
result.
• They are necessarily complex algorithms with many rules
and exceptions, because English spelling and
pronunciation are complicated by historical changes in
pronunciation and by words borrowed from
many languages.
Best Known phonetic Algorithms:
i. Metaphone Algorithm (Metaphone, Double
Metaphone, and Metaphone 3)
ii. Soundex
iii. Daitch–Mokotoff Soundex
iv. Cologne phonetics
v. New York State Identification and Intelligence
System (NYSIIS)
vi. Match Rating Approach
vii. Caverphone
Metaphone is an algorithm that encodes the pronunciation of
a word. Rather than working strictly letter by letter, it encodes
groups of letters, and so embodies the rules of pronunciation
more accurately. Such algorithms are
well established for English. Both Metaphone and Soundex
make it possible to return all the words that exactly match the
desired word as well as all similar-sounding names.
Metaphone has gone through different versions in its
development, such as Double Metaphone and Metaphone 3,
which differ in the accuracy of their encoding.
Soundex is a phonetic algorithm for indexing names by
sound, as pronounced in English. The goal is
for homophones to be encoded to the same
representation so that they can be matched despite
minor differences in spelling.
Soundex and Metaphone are almost the same
kind of algorithm. Both are based on the
way a word is pronounced. In the Soundex
algorithm, each word is given a code made of its first letter
followed by digits for the consonants, and when a search is performed, words
with the same code are also brought out in the search result.
Soundex is the most widely known of all phonetic
algorithms (in part because it is a standard feature of popular database
software such as DB2, PostgreSQL, MySQL,
SQLite, Ingres, MS SQL Server and Oracle) and is often
used (incorrectly) as a synonym for "phonetic
algorithm".
Common uses
• Spell checkers can often contain phonetic algorithms.
The Metaphone algorithm, for example, can take an incorrectly
spelled word and create a code. The code is then looked up in a
directory for words with the same or similar Metaphone. Words
that have the same or similar Metaphone become possible
alternative spellings.
• Search functionality will often use phonetic algorithms to find
results that don't match exactly the term(s) used in the search.
Searching for names can be difficult as there are often multiple
alternative spellings for names.
An example is the name Claire. It has two alternatives, Clare/Clair,
which are both pronounced the same. Searching for one spelling
wouldn't show results for the two others. Using Soundex all
three variations produce the same Soundex code, C460. By
searching names based on the Soundex code all three variations
will be returned.
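The Claire/Clare/Clair example can be checked with a short sketch of the commonly described American Soundex rules: keep the first letter, code the following consonants as digits, skip vowels, collapse repeated codes and pad to four characters. It is a simplified version for plain ASCII names, not the exact routine shipped by any particular database product.

```python
# Digit assigned to each consonant group in American Soundex.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the four-character Soundex code of a name (simplified sketch)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code:
            if code != previous:   # adjacent letters with the same code count once
                result += code
            previous = code
        elif ch not in "hw":
            previous = ""          # a vowel separates repeated codes; h and w do not
    return (result + "000")[:4]    # pad with zeros and cut to four characters

for name in ("Claire", "Clare", "Clair"):
    print(name, soundex(name))     # each prints C460
```

Because all three spellings share the code C460, an index keyed on the Soundex code returns all of them for a search on any one.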
Evaluation of ISAR systems
Evaluation is a systematic determination of a subject's
merit, worth and significance, using criteria governed by
a set of standards.
It can assist an organization, program, project or any
other intervention or initiative to assess any aim,
realisable concept/proposal, or any alternative, to help
in decision making; or to ascertain the degree of
achievement or value in regard to the aim and objectives
and results of any such action that has been completed.
Evaluation is the structured interpretation and giving of
meaning to predicted or actual impacts of proposals or
results. It looks at original objectives, and at what is
either predicted or what was accomplished and how it
was accomplished.
So evaluation can be formative, that is, taking place during
the development of a concept or proposal, project or
organization, with the intention of improving the value or
effectiveness of the proposal, project, or organization. It
can also be summative, drawing lessons from a
completed action or project or an organization at a later
point in time or circumstance.
Evaluation is inherently a theoretically informed approach
and consequently any particular definition of evaluation
would have been tailored to its context - the theory,
approach, needs, purpose, and methodology of the
evaluation process itself.
One such definition is: a systematic, rigorous, and meticulous
application of scientific methods to assess the design, implementation,
improvement, or outcomes of a program. It is a resource-intensive
process, frequently requiring resources such
as evaluator expertise, labour, time, and a sizeable
budget.
Evaluation of information retrieval systems measures
which of two existing systems performs better
and tries to assess how the level of performance of
a given system can be improved, i.e. it measures two
parameters:
i. Effectiveness
ii. Efficiency
Effectiveness means the level up to which the given
system attains its objectives.
Thus, in an information retrieval system, effectiveness may be a
measure of how far it can retrieve relevant information
accurately while withholding non-relevant information.
A search engine that is extremely fast is of no use unless it
produces good results.
Efficiency means how economically the system is
achieving its objectives.
In an information retrieval system, efficiency can be
measured by factors such as cost. The cost factors are
to be calculated indirectly. They include
response time, i.e. the time taken by the system to
provide an answer, and user effort, i.e. the amount of time
and effort needed by a user to interact with the
system and analyse the output retrieved in order to
get the correct information.
Lancaster states that evaluation of an information
retrieval system can be justified by the following
three issues:
1. How well the system is satisfying its objectives,
2. How efficiently it is satisfying its objectives, and
3. Whether the system justifies its existence.
PURPOSE OF EVALUATION
Swanson states seven purposes of evaluation:
1. To assess a set of goals, a programme plan, or a design prior to
implementation.
2. To determine whether and how well goals or performance
expectations are being fulfilled.
3. To determine specific reasons for success and failure.
4. To uncover principles underlying a successful programme.
5. To explore techniques for increasing programme effectiveness.
6. To establish a foundation for further research on the reasons
for the relative success of alternative techniques, and
7. To improve the means employed for attaining objectives or to
redefine sub-goals or goals in view of research findings.
Keen gives three major purposes of evaluation for an
information retrieval system:
1. The need for measures with which to make merit
comparisons within a single test situation. In other
words, evaluation studies are conducted to compare
the merits or demerits of two or more systems.
2. The need for measures with which to make comparisons
between results obtained in different test situations.
3. The need for assessing the merit of a real-life system.
EVALUATION CRITERIA FOR ISRS
Evaluation of information retrieval is conducted from
two different viewpoints.
1. Managerial view: when evaluation is conducted
from the managerial point of view it is called
managerial-oriented evaluation.
2. User view: when evaluation is conducted from
the user's point of view it is called user-oriented
evaluation.
Criteria for evaluation of ISRS (Managerial view)
Lancaster in 1971 proposed five evaluation criteria:
1. Coverage of the system
2. Ability of the system to retrieve wanted items
(i.e. recall)
3. Ability of the system to avoid retrieval of
unwanted items (i.e. precision)
4. The response time of the system, and
5. The amount of effort required by the user
Vickery advocates six criteria for the evaluation of ISRS,
grouped into two sets as follows:
Set 1
1. Coverage- the proportion of the total potentially useful
literature that has been analyzed.
2. Recall- the proportion of such references that are
retrieved in a search, and
3. Response time- the average time needed to obtain a
response from the system.
Set 2
4. Precision- the ability of the system to screen out
irrelevant references
5. Usability- the value of the references retrieved, in terms
of such factors as their reliability, comprehensibility
and currency, and
6. Presentation- the form in which search results are
presented to the user.
Cleverdon (1966) identified six criteria for the evaluation of
ISRS:
1. Recall- the ability of the system to present all the
relevant items.
2. Precision- the ability of the system to present only those
items that are relevant.
3. Time lag- the average interval between the time the
search request is made and the time an answer is
provided.
4. Effort- the intellectual as well as physical effort required from the
user in obtaining an answer to the search request.
5. Form of presentation- the form of the search output, which affects the
user's ability to make use of the relevant items, and
6. Coverage of the collection- the extent to which the
system includes relevant matter.
Criteria for evaluation of ISRS (User-Centred Evaluation)
User-based evaluation is the most common
form of evaluation advocated by many
information scientists. The criteria for evaluation
of an information retrieval system include:
1. Recall
2. Precision
3. Fallout
4. Generality
The user centred approach examines the information
seeking task in the context of human behaviour in
order to understand more completely the nature of
user interaction with an information system.
User centred evaluation is based on the premise that
understanding user behaviour facilitates more effective
system design.
These studies examine the user from a behavioural
science perspective using methods common to
psychology, sociology, and anthropology.
While examining user-centred approaches, two
methods can be applied:
Qualitative method of evaluation
Quantitative method of evaluation
Qualitative method of evaluation
Qualitative methods of evaluation such as case studies,
focus groups or in-depth interviews can be combined
with objective measures to produce more effective
information retrieval research and evaluation.
Quantitative method of evaluation
In the quantitative method of evaluation, empirical methods
such as experimentation are frequently employed to
observe and probe subjective and affective factors
quantitatively.
According to Saracevic & Kantor (1988), the key to the
future of information systems and searching processes
lies not in increased sophistication of technology, but
in increased understanding of human involvement
with information.
Therefore, there has been an increased interest in
qualitative methods that capture the complexity and
diversity of human experience in information storage
and retrieval system and its process.
Recall
The term recall refers to a measure of whether a particular
item is retrieved or the extent to which the retrieval of
wanted items occurs.
Recall is defined as the proportion of the relevant
documents stored in the collection that is actually
retrieved.
Whenever a user puts a query, it is the responsibility
of the system to retrieve all those items that are relevant to
the given query. When the collection is large it is usually not
possible to retrieve all the relevant items. Thus, a system
is able to retrieve only a proportion of the total relevant
documents in response to a given query.
The performance of a system is often measured by the recall
ratio, which denotes the percentage of relevant items
retrieved in a given situation.
The general formula for the calculation of recall may be stated
as:
Recall = (Number of relevant items retrieved / Total number of relevant items in the collection) x 100
Example: if there are 100 documents in a collection that
are relevant to a given query and 60 of these items
are retrieved in a given search, then the recall is
stated to be 60%.
Recall = (60 / 100) x 100 = 60%
In other words, the system has been able to retrieve 60%
of the relevant items.
Precision
By precision we mean how precisely a particular system
functions. Precision is defined as the proportion of
retrieved documents that are relevant, out of the total number
of documents retrieved.
The non-relevant items that are retrieved have to be discarded by the user.
The general formula for the calculation of precision may be
stated as:
Precision = (Number of relevant items retrieved / Total number of items retrieved) x 100
Example: if in a given search the system retrieves
80 items, out of which 60 are relevant and 20 are
non-relevant, the precision is 75%.
Precision = (60 / 80) x 100 = 75%
Recall-precision matrix
Recall relates to the ability of the system to retrieve
relevant documents, and precision relates to its ability
not to retrieve non-relevant documents.
The ideal system attempts to achieve 100% recall and
100% precision, but this is not possible in practice, because as
the level of recall increases precision tends to decrease.
According to Lancaster, recall and precision tend to vary
inversely.
The following example shows the relationship between recall
and precision of a given search:
In a given situation a system:
i. retrieves a+b documents, out of which
ii. a documents are relevant, and
iii. b documents are non-relevant (but retrieved).
iv. c+d documents are left in the collection after
the search has been conducted.
v. Out of the c+d documents, c documents are relevant
to the query but could not be retrieved, and
vi. d documents are not relevant (and not retrieved)
and thus have been correctly rejected.
Recall-precision matrix
Lancaster suggests that these statistics can be represented
in a 2 x 2 matrix, as shown below:
               Relevant      Not-Relevant    Total
Retrieved      a (Hits)      b (Noise)       a + b
Not-Retrieved  c (Misses)    d (Rejected)    c + d
Total          a + c         b + d           a + b + c + d
The system retrieves a relevant documents along with b
non-relevant documents.
Thus, following Lancaster, it can be stated that a denotes
hits and b denotes noise. Out of the remaining
c+d documents, the system misses c documents that
should have been retrieved, but it correctly rejects d
documents that are not relevant to the given query. The recall
and precision ratios in this case can be calculated as:
R = [a / (a + c)] x 100
P = [a / (a + b)] x 100
The value of recall can be increased by increasing the
value of a, that is, by retrieving a greater number of
relevant items. This can be achieved by increasing the
number of retrieved documents, but as the number
of items retrieved increases, so does the
likelihood of retrieving non-relevant items, that is b,
which decreases the value of precision. Lancaster
therefore states that recall and precision tend to vary
inversely. In a retrieval environment, when we want to
retrieve more relevant items, we generally broaden our
search.
The relationship between recall and precision can be
examined by considering searches held at different
levels with the same set of documents and the same request.
Beginning with very general search terms, high recall and
low precision can be achieved, and as the search terms
become more and more specific, recall tends to go
down and precision tends to go up.
In a real-life situation, a user normally does not want very
high recall. In general most users want a few documents
in response to a query, meaning a moderate level of
recall.
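The inverse tendency can be illustrated by taking one ranked output and cutting it off at different depths, which plays the role of a narrower or broader search. The ranked list and the figure of five relevant items in the collection are hypothetical values chosen for the example.

```python
# Hypothetical ranked output of one search: True marks a relevant item.
ranked = [True, True, False, True, False, False, True, False, False, False]
total_relevant = 5  # assumed number of relevant items in the whole collection

def recall_precision_at(ranked, total_relevant, cutoff):
    """Recall and precision (as percentages) if only the top `cutoff` items are retrieved."""
    retrieved = ranked[:cutoff]
    hits = sum(retrieved)
    recall = 100 * hits / total_relevant
    precision = 100 * hits / len(retrieved)
    return recall, precision

for cutoff in (2, 5, 10):
    recall, precision = recall_precision_at(ranked, total_relevant, cutoff)
    print(f"top {cutoff:2d}: recall {recall:.0f}%, precision {precision:.0f}%")
# top  2: recall 40%, precision 100%
# top  5: recall 60%, precision 60%
# top 10: recall 80%, precision 40%
```

Retrieving more items raises recall from 40% to 80% while precision falls from 100% to 40%, which is the pattern Lancaster describes.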
Limitations of recall and precision
i. Difference in the level of recall required:
Different users may want different levels of recall. A
person preparing a state-of-the-art report on a
topic would like to have all the items available on the
topic and therefore will go for high recall, whereas a
user wanting to know about a given topic will prefer to
have a few items and thus will not require high
recall.
ii. Difference in judgment on degree of relevance:
Another drawback of recall is that it assumes that
all relevant items have the same value, which is
not true. The retrieved items may have different
degrees of relevance, and this may vary from user to
user, and even from time to time for the same user.
Both recall and precision depend largely on the
relevance judgment of the user.
iii. Measures of system performance, not of relevance
judgment:
Despite their apparent simplicity, these are slippery
concepts, depending for their definition on relevance
judgments which are subjective at best. Because these
criteria are document-based, they measure only the
performance of the system in retrieving items judged relevant to the
information need. They do not consider how the information
will be used, or whether, in the judgment of the user, the
documents fulfill the information need.
These limitations of precision and recall have been
acknowledged and the need for additional measures and
different criteria for effectiveness has been identified.
Fallout
Fallout ratio is the proportion of non-relevant
items that have been retrieved out of all
non-relevant documents available in a given search.
Fallout = (No. of retrieved non-relevant documents / Total no. of non-relevant documents) x 100
Generality
Generality ratio is the proportion of relevant items
(retrieved and non-retrieved) in a given search.
Generality = (No. of relevant documents / Total no. of documents) x 100
Retrieval Measure
SYMBOL  EVALUATION MEASURE  FORMULA                EXPLANATION
R       RECALL              a / (a + c)            Proportion of relevant items retrieved
P       PRECISION           a / (a + b)            Proportion of retrieved items that are relevant
F       FALLOUT             b / (b + d)            Proportion of non-relevant items retrieved
G       GENERALITY          (a + c) / (a+b+c+d)    Proportion of relevant items per query
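The four measures can be computed directly from the cells a, b, c and d of the recall-precision matrix. In the sketch below the function name is illustrative, the values a = 60, b = 20 and c = 40 follow the recall and precision examples given earlier, and d = 880 is an assumed value added so that fallout and generality can be shown. The results are expressed as percentages.

```python
def retrieval_measures(a, b, c, d):
    """Recall, precision, fallout and generality (percentages) from the 2 x 2 matrix.

    a: relevant and retrieved (hits)       b: non-relevant but retrieved (noise)
    c: relevant but not retrieved (misses) d: non-relevant and not retrieved (rejected)
    """
    return {
        "recall": 100 * a / (a + c),
        "precision": 100 * a / (a + b),
        "fallout": 100 * b / (b + d),
        "generality": 100 * (a + c) / (a + b + c + d),
    }

# d = 880 is an assumption made for illustration only.
measures = retrieval_measures(a=60, b=20, c=40, d=880)
for name, value in measures.items():
    print(f"{name}: {value:.1f}%")
# recall: 60.0%
# precision: 75.0%
# fallout: 2.2%
# generality: 10.0%
```

A low fallout alongside a moderate recall shows that the system rejects most non-relevant items while still missing some relevant ones.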
Assessment of Evaluation criteria
Different stakeholders, such as information professionals,
systems designers and users, may have different needs and
expectations of an IR system, and the objectives,
decisions, processes, design or actions of an IR system are set accordingly.
Evaluation is a process whose main purpose is to assess
whether the IR system is doing what it is expected to
do. This assessment is done by measuring features
such as the Recall, Precision, Fallout and Generality ratios. The
analysis of these results determines the
performance level of the IR system with respect to the following:
Effectiveness
Usability
Satisfaction
Cost
Effectiveness
Effectiveness is the system’s ability to retrieve
relevant information which meets the needs of the user.
The two most commonly used measures of system
performance are the recall ratio and the precision ratio.
               Relevant    Not relevant
Retrieved      A           B
Not retrieved  C           D
Totals         A + C       B + D
The search results in the table above may have four possible outcomes:
1. Relevant documents successfully retrieved - A (hits)
2. Non-relevant documents retrieved - B (noise)
3. Relevant documents not retrieved - C (misses)
4. Non-relevant documents not retrieved, i.e. successfully rejected - D
Recall = (total relevant retrieved / total relevant in system) x 100 = [A / (A + C)] x 100
= system’s ability to retrieve relevant information/documents
Precision = (total relevant retrieved / total retrieved) x 100 = [A / (A + B)] x 100
= system’s ability to suppress irrelevant information/documents
Fallout = (total irrelevant retrieved / total irrelevant) x 100 = [B / (B + D)] x 100
= a further measure of the system’s ability to suppress irrelevant
information/documents (the lower the value, the better)
Thus, assessment of all the above factors, i.e. Recall, Precision and
Fallout, actually measures the effectiveness of an IR system. Indexing
systems and search software should be designed to
maximize both recall and precision, in other words to
minimize noise and misses.
It may be difficult to measure the total number of relevant
documents in an IRS, because this involves examining every
document in the system for its potential relevance to a specific
search query. For web search engines such as Google this is
clearly impossible.
Usability
Usability is part of the broader term “user experience”
and refers to the ease of access and/or use of a product
or website.
A design is not usable or unusable per se; its features,
together with the context of the user (what the user
wants to do with it and the user’s environment),
determine its level of usability.
The official ISO 9241-11 definition of usability is: “the
extent to which a product can be used by specified users
to achieve specified goals with effectiveness, efficiency
and satisfaction in a specified context of use.”
A usable interface has three main outcomes:
• It should be easy for the user to become familiar with and
competent in using the user interface during the first contact
with the website.
• It should be easy for users to achieve their objective through
using the website. If a user has the goal of booking a flight, a
good design will guide him/her through the easiest process
to purchase that ticket.
• It should be easy to recall the user interface and how to use
it on subsequent visits. So, a good design on the travel
agent’s site means the user should learn from the first time
and book a second ticket just as easily.
Usability is what determines whether a design’s existing
attributes make it stand or fall
Usability denotes
Satisfaction
There is no agreed definition of user satisfaction within the information
science and information system communities. User satisfaction is a
subjective variable, which can be influenced by several factors such
as system effectiveness, user effectiveness, user effort, and user
characteristics and expectations. Therefore, information retrieval
evaluators should consider all these factors in obtaining user
satisfaction and in using it as a criterion of system effectiveness.
Applegate outlines three different models of searcher satisfaction
namely:
The material satisfaction model
The emotional satisfaction- simple path model
The emotional satisfaction- multiple path model
Search results would be an appropriate measure for the
material satisfaction model. Both emotional satisfaction
models are based upon subjective impressions and
assessments which may be affected by factors such as:
Search task
Search setting
The searcher’s ability, quality & judgment in
digital environments
Service quality
website quality
Literatures used
Cost
Users may experience costs in terms of any payment that
they need to make for system or document access, but
the most significant cost is associated with the time that
they expend in searching a system.
The search algorithm, the options for the display of hits, the
seamlessness of the stages in individual systems and
interoperability between systems are important factors
in satisfying users, regardless of monetary cost.
Information Retrieval Models
An Information Retrieval Model is a framework,
process or method for matching an information
need against, and retrieving information from, databases,
knowledge bases and information systems.
The goal of information retrieval (IR) is to provide users
with those documents that will satisfy their information
need. We use the word "document" as a general term
that could also include non-textual information, such as
multimedia objects.
According to Marcus (1994) & Marchionini (1992), information
seeking is a form of problem solving. It proceeds
according to the interaction among eight sub-processes:
i. problem recognition and acceptance,
ii. problem definition,
iii. search system selection,
iv. query formulation,
v. query execution,
vi. examination of results (including relevance feedback),
vii. information extraction, and
viii. reflection/iteration/termination.
Again, to be able to perform effective searches, users have to
develop the following expertise:
i. knowledge about various sources of information,
ii. skills in defining search problems and applying search strategies,
iii. competence in using electronic search tools.
Figure: A general overview of the information retrieval process,
adapted from Lancaster and Warner (1993).
The Figure above represents a general model of the
information retrieval process, where both the user's
information need and the document collection have
to be translated into the form of surrogates to enable
the matching process to be performed. This figure
has been adapted from Lancaster and Warner
(1993).
How a general IR Model works
1. Users have to formulate their information need in a
form that can be understood by the retrieval
mechanism
2. Likewise, the contents of large document collections
need to be described in a form that allows the retrieval
mechanism to identify the potentially relevant
documents quickly.
(In both cases, information may be lost in the
transformation process leading to a computer-usable
representation. Hence, the matching process is
inherently imperfect)
3. Once the specified query has been executed by the IR system, the
user is presented with the retrieved document surrogates.
4. Either the user is satisfied by the retrieved information or he/she
will evaluate the retrieved documents and modify the query
to initiate a further search. The process of query
modification based on user evaluation of the retrieved
documents is known as relevance feedback.
(Information retrieval is an inherently interactive process, and the
users can change direction by modifying the query surrogate, the
conceptual query or their understanding of their information
need.)
5. Studies investigating the information-seeking process
describe information retrieval in terms of the cognitive and affective
symptoms commonly experienced by a library user.
How a general IR Model works
1. Users have to formulate their information need in a form that can be
understood by the retrieval mechanism.
The information need can be understood as forming a pyramid, where only
its peak is made visible by users in the form of a conceptual query. The
conceptual query captures the key concepts and the relationships
among them. It is the result of a conceptual analysis that operates on
the information need, which may be well or vaguely defined in the
user's mind. This analysis can be challenging, because users are faced
with the general "vocabulary problem" as they are trying to translate
their information need into a conceptual query. This problem refers to
the fact that a single word can have more than one meaning, and,
conversely, the same concept can be described by surprisingly many
different words. Further, the concepts used to represent the documents
can be different from the concepts used by the user. The conceptual
query can take the form of a natural language statement, a list of
concepts that can have degrees of importance assigned to them, or it
can be a statement that coordinates the concepts using Boolean
operators. Finally, the conceptual query has to be translated into a query
surrogate that can be understood by the retrieval system.
2. Likewise, the contents of large document collections
need to be described in a form that allows the retrieval
mechanism to identify the potentially relevant
documents quickly.
As in point 1, the meanings of documents
need to be represented in the form of text surrogates
that can be processed by computer. A typical surrogate
can consist of a set of index terms or descriptors. The
text surrogate can consist of multiple fields, such as the
title, abstract and descriptor fields, to capture the meaning
of a document at different levels of resolution or to
focus on different characteristic aspects of a
document.
3. Once the specified query has been executed by the IR
system, the user is presented with the retrieved document
surrogates.
(A typical document surrogate can consist of a set of
index terms or descriptors. The text surrogate can
consist of multiple fields, such as the title, abstract and
descriptor fields, to capture the meaning of a document.)
4. Either the user is satisfied by the retrieved
information or he will evaluate the retrieved
documents and modify the query to initiate a further
search. The process of query modification based on
user evaluation of the retrieved documents is known
as relevance feedback.
Information retrieval is an inherently interactive
process, and the users can change direction by
modifying the query surrogate, the conceptual query
or their understanding of their information need
5. Studies investigating the information-seeking process describe
information retrieval in terms of the cognitive and affective
symptoms commonly experienced by a library user.
Symptoms such as uncertainty, confusion, and frustration are
nearly universal experiences in the early stages of the search
process; they decrease as the search process progresses, while
feelings of being confident, satisfied, sure and relieved increase.
The studies also indicate that cognitive attributes may affect the
search process. Users' expectations of the information system and
the search process may influence the way they approach
searching and therefore affect the intellectual access to
information.
The findings by Kuhlthau et al. (1990) indicate that thoughts about
the information need become clearer and more focused as users
move through the search process.
Search or Browsing?
Analytical search strategies require the formulation of
specific, well-structured queries and a systematic,
iterative search for information.
Browsing involves the generation of broad query terms and
a scanning of much larger sets of information in a
relatively unstructured fashion.
Campagnoni et al. (1989) have found in information
retrieval studies in hypertext systems that the
predominant search strategy is "browsing" rather than
"analytical search".
Many users, especially novices, are unwilling or unable to
precisely formulate their search objectives.
Browsing places less cognitive load on them. Furthermore,
research showed that search strategy is only one
dimension of effective information retrieval.
Irrespective of the retrieval environment, the following four
main system components must be taken into account in
formulating the retrieval problem.
a) The objects, documents, or records themselves (which in
the aggregate constitute the information files to be
processed);
b) The information identifiers, terms, index terms, keywords,
attributes, etc. (which characterise the records or
documents and represent the information content in each
case);
c) The information requests (which enter into the system and
are to be compared with the stored records for retrieval);
and
d) The relevance information (often supplied by the users of
the system connecting the information requests to the
stored information items).
MODELS BASED ON INPUT/OUTPUT
On the basis of input and the output, Information
Retrieval Models can be grouped into three basic
categories:
i) Data Retrieval Model
ii) Information Retrieval Model
iii) Knowledge Retrieval Model.
i) Data Retrieval Model
The data retrieval model essentially handles data, which may be
taken as unprocessed information or a preliminary phase
of information.
Data is an unbiased fact which can be used to form
information. Here, the expression of information need
should be very precise. For example, population data,
day to day temperature, daily rainfall, transaction
status at ATM, etc.
The data retrieval model is a simple model of information
retrieval needing specific matching techniques.
ii) Information Retrieval Model
The information retrieval model combines several data
into a relational structure of information. It is therefore
a more complex model than the
data retrieval model, because it has to comprehend
multi-dimensional relationships amongst data.
It is not easily amenable to a taxonomic structure. The
representation of information is to be based on a
relational database structure using some associative
mathematics.
The expression of information need is also complex and
time consuming. It calls for a long conversational or
browsing process, and the information retrieval model
must incorporate such facilities and interfaces.
iii) Knowledge Retrieval Model
Knowledge is a kind of integration of general types of
information. It normally occurs in the human mind. The
human mind infers and integrates several coordinates
with the information received by it.
So, knowledge is assimilated information. In order to
facilitate decision-making and problem solving,
intelligent knowledge based information retrieval
models are coming up. Such systems comprise three
basic aspects:
i. knowledge base, ii. inference engine, iii. user interface
a) The knowledge base, a store of accumulated
rules for converting information into knowledge. It
also incorporates a knowledge acquisition system.
b) The second aspect of the system is the inference engine. An
inference engine is capable of deriving appropriate
information from a combination of rules in order to produce
synthesized knowledge. This process of deriving is based
on inferential logic using quantitative and non-quantitative
techniques.
c) A user interface, i.e., conversational process in the model
which is capable of receiving information in the
conversation mode and converting it into database signals
for interaction purposes. Thus, a knowledge retrieval
model is a sophisticated model of information processing,
organization and retrieval.
MAJOR IR MODELS
(BASED ON THEORIES AND TOOLS)
1. Boolean Retrieval
1.1 Standard Boolean
1.2 Narrowing and Broadening Techniques
1.3 Smart Boolean Models
1.4 Extended Boolean Models
2. Statistical Model
2.1 Vector Space Model
2.2 Probabilistic Model
2.3 Latent Semantic Indexing
3. Linguistic and Knowledge-based Approaches
3.1 DR-LINK Retrieval System
1.1 Standard Boolean
Boolean logic allows a user to logically relate multiple
concepts together to define what information is needed.
The typical Boolean operators are AND, OR, and NOT.
These operations are implemented using set
intersection, set union and set difference procedures.
A few systems introduced the concept of ‘Exclusive OR’ but
it is not generally useful to users since most users do not
understand it.
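The mapping from Boolean operators to set operations can be sketched as follows. This is a minimal illustration, not a description of any particular system; the toy collection `docs` and the helper `postings` are hypothetical names introduced here.

```python
# Hypothetical toy collection: document id -> set of index terms.
docs = {
    1: {"information", "retrieval", "boolean"},
    2: {"information", "storage"},
    3: {"retrieval", "vector"},
}

def postings(term):
    """Return the set of document ids whose index terms include `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

a = postings("information")
b = postings("retrieval")

and_result = a & b   # AND -> set intersection
or_result = a | b    # OR  -> set union
not_result = a - b   # a NOT b -> set difference
xor_result = a ^ b   # Exclusive OR -> symmetric difference
```

Each operator is a single set operation on the two posting sets, which is one reason the model is cheap to implement.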
1.1 Standard Boolean
It has the following strengths:
1. It is easy to implement and it is computationally efficient [Frakes and
Baeza-Yates 1992]. Hence, it is the standard model for the current
large-scale, operational retrieval systems and many of the major on-line
information services use it.
2. It enables users to express structural and conceptual constraints to
describe important linguistic features [Marcus 1991]. Users find that
synonym specifications (reflected by OR-clauses) and phrases
(represented by proximity relations) are useful in the formulation of
queries [Cooper 1988, Marcus 1991].
3. The Boolean approach possesses a great expressive power and clarity.
Boolean retrieval is very effective if a query requires an exhaustive and
unambiguous selection.
4. The Boolean method offers a multitude of techniques to broaden or
narrow a query.
5. The Boolean approach can be especially effective in the later stages of the
search process, because of the clarity and exactness with which
relationships between concepts can be represented.
The standard Boolean approach has the following shortcomings:
1. Users find it difficult to construct effective Boolean queries for several reasons
[Cooper 1988, Fox and Koll 1988, Belkin and Croft 1992]. The operators AND, OR and
NOT are also natural language words, but they have a different meaning when
used in a query. Thus, users make errors when they form a Boolean query,
because they resort to their knowledge of English.
2. Only documents that satisfy a query exactly are retrieved. The AND operator is
too severe because it does not distinguish between the case where none of
the concepts are satisfied and the case where all except one are satisfied.
Hence, no or very few documents are retrieved when more than three or
four criteria are combined with the Boolean operator AND (referred to as the
Null Output problem). On the other hand, the OR operator does not reflect
how many concepts have been satisfied. Hence, often too many documents
are retrieved (the Output Overload problem).
3. It is difficult to control the number of retrieved documents. Users are often
faced with the null-output or the information overload problem and they are
at a loss as to how to modify the query to retrieve a reasonable number of
documents.
4. The traditional Boolean approach does not provide a relevance ranking of the
retrieved documents, although modern Boolean approaches can make use of
the degree of coordination, field level and degree of stemming present to
rank them [Marcus 1991].
5. It does not represent the degree of uncertainty or error due to the vocabulary
problem [Belkin and Croft 1992].
1.2 Narrowing and Broadening Techniques
A Boolean query can be described in terms of the following
four operations:
i. degree and type of coordination,
ii. proximity constraints,
iii. field specifications and
iv. degree of stemming as expressed in terms of
word/string specifications.
If users want to (re)formulate a Boolean query then they
need to make informed choices along these four
dimensions to create a query that is sufficiently broad or
narrow depending on their information needs.
Most narrowing techniques lower recall as well as raise
precision, and most broadening techniques lower
precision as well as raise recall.
Any query can be reformulated to achieve the desired
precision or recall characteristics, but generally it is
difficult to achieve both.
Each of the four kinds of operations in the query
formulation has particular operators, some of which
tend to have a narrowing or broadening effect. For each
operator with a narrowing effect, there is one or more
inverse operators with a broadening effect [Marcus
1991].
Hence, users require help to gain an understanding of how
changes along these four dimensions will affect the
broadness or narrowness of a query.
The four dimensions affect the broadness or
narrowness of a query as follows:
1) Coordination: the different Boolean operators AND, OR
and NOT have the following effects when used to add a
further concept to a query: a) the AND operator narrows
a query; b) the OR broadens it; c) the effect of the NOT
depends on whether it is combined with an AND or OR
operator. Typically, in searching textual databases, the
NOT is connected to the AND, in which case it has a
narrowing effect like the AND operator.
2) Proximity: The closer together two terms have to appear
in a document, the more narrow and precise the query.
The most stringent proximity constraint requires the two
terms to be adjacent.
3) Field level: current document records have fields
associated with them, such as the "Title", "Index",
"Abstract" or "Full-text" field: a) the more fields that are
searched, the broader the query; b) the individual fields
have varying degrees of precision associated with them,
where the "title" field is the most specific and the "full-
text" field is the most general.
4) Stemming: The shorter the prefix that is used in
truncation-based searching, the broader the query. By
reducing a term to its morphological stem and using it
as a prefix, users can retrieve many terms that are
conceptually related to the original term [Marcus 1991].
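The effect of the field-level and stemming dimensions can be illustrated with a small sketch. The records, field names and the `search` helper below are hypothetical, and truncation is modelled simply as a prefix test on each word.

```python
# Hypothetical records with two fields each; not taken from the source.
records = {
    1: {"title": "computing machinery", "abstract": "early computers and logic"},
    2: {"title": "library science", "abstract": "computer use in cataloguing"},
    3: {"title": "compost handbook", "abstract": "gardening and soil"},
}

def search(prefix, fields):
    """Return ids of records where any word in the given fields starts with prefix."""
    hits = set()
    for rec_id, rec in records.items():
        for field in fields:
            if any(word.startswith(prefix) for word in rec[field].split()):
                hits.add(rec_id)
    return hits

narrow = search("comput", ["title"])             # long stem, one field
wider = search("comput", ["title", "abstract"])  # same stem, more fields
widest = search("comp", ["title", "abstract"])   # shorter prefix, more fields
```

Searching more fields or shortening the prefix can only add records, which is why both act as broadening operations. The last query also retrieves the unrelated "compost" record, showing the usual loss of precision.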
1.3 Smart Boolean
There have been attempts to help users overcome some of
the disadvantages of the traditional Boolean discussed
above. We will now describe such a method,
called Smart Boolean, developed by Marcus [1991, 1994]
that tries to help users construct and modify a Boolean
query as well as make better choices along the four
dimensions that characterize a Boolean query.
We are not attempting to provide an in-depth description
of the Smart Boolean method, but to use it as a good
example that illustrates some of the possible ways to
make Boolean retrieval more user-friendly and effective.
Table 2.2 provides a summary of the key features of the
Smart Boolean approach.
Users start by specifying a natural language statement that is
automatically translated into a Boolean Topic representation.
If the statement consists of a list of factors or concepts,
then they (factors or concepts) are automatically coordinated
using the AND operator. If the user at the initial stage can or
wants to include synonyms, then they are coordinated using
the OR operator.
Hence, we understand that the Boolean Topic representation
connects the different factors using the AND operator where
the factors can consist of single terms; or several synonyms
connected by the OR operator.
One of the goals of the Smart Boolean approach is to make use
of the structural knowledge contained in the text surrogates,
where the different fields represent contexts of useful
information. Further, the Smart Boolean approach wants to
use the fact that “related concepts can share a common
stem”. For example, the concepts "computers" and
"computing" have the common stem comput*.
The initial strategy of the Smart Boolean approach is to start out
with the broadest possible query within the constraints of how
the factors and their synonyms have been coordinated. Hence,
it modifies the Boolean Topic representation into the query
surrogate by using only the stems of the concepts and searches
for them over all the fields. Once the query surrogate has been
executed, users are guided in the process of evaluating the
retrieved document surrogates, and their feedback is collected
through lists of reasons.
They choose from a list of reasons to indicate why they consider
certain documents as relevant. Similarly, they can indicate why
other documents are not relevant by interacting with a list of
possible reasons. This user feedback is used by the Smart
Boolean system to automatically modify the Boolean Topic
representation or the query surrogate, whichever is more
appropriate. The Smart Boolean approach offers a rich set of
strategies for modifying a query based on the received
relevance feedback or the expressed need to narrow or
broaden the query.
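The Boolean Topic representation and the initial "broadest possible" query can be sketched roughly as below. This is not Marcus's implementation. The data, the function names and the `crude_stem` helper (which only keeps the first six characters) are assumptions made for illustration.

```python
# Hypothetical Boolean Topic: a list of factors (ANDed), each a list of synonyms (ORed).
topic = [["computers", "machines"], ["retrieval"]]

# Hypothetical document surrogates with several fields.
surrogates = {
    1: {"title": "computing methods", "abstract": "retrieving documents"},
    2: {"title": "machine design", "abstract": "mechanical parts"},
    3: {"title": "retrieval systems", "abstract": "computer based search"},
}

def crude_stem(term, length=6):
    """Very rough stand-in for stemming: keep the first `length` characters."""
    return term[:length]

def matches_factor(surrogate, synonyms):
    """A factor matches if any synonym's stem prefixes a word in any field (OR)."""
    words = " ".join(surrogate.values()).split()
    return any(w.startswith(crude_stem(s)) for s in synonyms for w in words)

def broadest_query(topic, surrogates):
    """Every factor must match (AND); stems are searched over all fields."""
    return {doc_id for doc_id, s in surrogates.items()
            if all(matches_factor(s, synonyms) for synonyms in topic)}

result = broadest_query(topic, surrogates)
```

Surrogate 2 matches the first factor through "machine" but has no word for the second factor, so the AND across factors excludes it. Narrowing would then proceed by restricting fields or lengthening stems in response to feedback.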
Visualizing Boolean Queries through InfoCrystal:
How can Boolean queries be visualized without
limiting their expressive power?
InfoCrystal can be used to make Boolean retrieval more
transparent and easy to use. InfoCrystal makes it much
easier for users to formulate and modify Boolean queries
and to achieve the desired retrieval results.
InfoCrystal is nothing but a representation of a specified
Boolean query. Each interior icon of the InfoCrystal
represents a distinct Boolean relationship among the
input criteria; hence, users can specify Boolean queries
by interacting with a direct manipulation interface.
The InfoCrystal acts as a Boolean calculator. Users do not
have to use logical operators and parentheses explicitly
to formulate queries. Hence, users do not have to
concern themselves with the coordination problem.
Instead they need to recognize the relationships of
interest and select them. If an interior icon is selected,
then it changes its visual appearance. In the figures of
this (manipulation) interface, the center area of each selected
interior icon is displayed in black and that of each unselected
one in white.
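The idea of one interior icon per distinct Boolean relationship can be sketched as a partition of the documents. The sketch assumes one icon for every non-empty combination of input criteria, holding the documents that satisfy exactly that combination. The criteria and document ids are hypothetical.

```python
from itertools import combinations

# Hypothetical input criteria and the documents satisfying each one.
criteria = {
    "A": {1, 2, 3, 5},
    "B": {2, 3, 4},
    "C": {3, 5, 6},
}

def interior_icons(criteria):
    """Map each non-empty combination of criteria to the documents that
    satisfy exactly those criteria and none of the others."""
    names = sorted(criteria)
    icons = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            inside = set.intersection(*(criteria[n] for n in combo))
            others = [criteria[n] for n in names if n not in combo]
            outside = set().union(*others)
            icons[combo] = inside - outside
    return icons

icons = interior_icons(criteria)

# Selecting two icons and taking the union of their contents is equivalent
# to "(A AND B AND NOT C) OR (A AND B AND C)", i.e. "A AND B".
selected = icons[("A", "B")] | icons[("A", "B", "C")]
```

With three criteria there are seven icons. Selecting icons replaces typing operators and parentheses, which is the sense in which the display acts as a Boolean calculator.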
1.4 Extended (or Weighted) Boolean Models
The P-norm and Fuzzy Logic approaches, which extend the
Boolean model, are generally used to address the
following issues:
1) The Boolean operators are too strict and ways need to
be found to soften them.
2) The standard Boolean approach has no provision for
ranking. The Smart Boolean approach and the methods
described in this section provide users with relevance
ranking [Fox and Koll 1988, Marcus 1991].
3) The Boolean model does not support the assignment of
weights to the query or document terms.
The approaches that address these issues are discussed briefly below.
The P-norm method developed by Fox (1983) allows query
and document terms to have weights, which have been
computed by using term frequency statistics with the
proper normalization procedures. These normalized
weights can be used to rank the documents in the order
of decreasing distance from the point (0, 0, ... , 0) for an
OR query, and in order of increasing distance from the
point (1, 1, ... , 1) for an AND query. Further, the Boolean
operators have a coefficient P associated with them to
indicate the degree of strictness of the operator (from 1
for least strict to infinity for most strict, i.e., the Boolean
case). The P-norm uses a distance-based measure and
the coefficient P determines the degree of
exponentiation to be used. The exponentiation is an
expensive computation, especially for P-values greater
than one.
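A commonly cited form of the P-norm similarity is sketched below for a flat OR or AND query. It is offered as an illustration under the assumption that query weights a_i and document weights d_i lie in [0, 1]. The function names and sample weights are hypothetical.

```python
def pnorm_or(doc_weights, query_weights, p):
    """Similarity for an OR query: normalised distance from (0, ..., 0)."""
    num = sum((a ** p) * (d ** p) for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return (num / den) ** (1.0 / p)

def pnorm_and(doc_weights, query_weights, p):
    """Similarity for an AND query: one minus normalised distance from (1, ..., 1)."""
    num = sum((a ** p) * ((1 - d) ** p) for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return 1 - (num / den) ** (1.0 / p)

doc = [0.8, 0.2]      # hypothetical document term weights
query = [1.0, 1.0]    # hypothetical query term weights

soft_or = pnorm_or(doc, query, p=1)       # p = 1: OR and AND both give the mean
soft_and = pnorm_and(doc, query, p=1)
strict_or = pnorm_or(doc, query, p=50)    # large p approaches max(doc)
strict_and = pnorm_and(doc, query, p=50)  # large p approaches min(doc)
```

At P = 1 the two operators are indistinguishable. As P grows, OR tends towards the maximum weight and AND towards the minimum, which is the strict Boolean behaviour.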
In Fuzzy Set theory, an element has a varying degree of
membership to a set instead of the traditional binary
membership choice. The weight of an index term for a
given document reflects the degree to which this term
describes the content of a document. Hence, this weight
reflects the degree of membership of the document in
the fuzzy set associated with the term in question. The
degree of membership for union and intersection of two
fuzzy sets is equal to the maximum and minimum,
respectively, of the degrees of membership of the
elements of the two sets. In the "Mixed Min and Max"
model developed by Fox and Sharat (1986) the Boolean
operators are softened by considering the query-
document similarity to be a linear combination of the
min and max weights of the documents.
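The linear combination used by the Mixed Min and Max model can be sketched as follows. The coefficient values 0.7 and 0.3 are assumptions chosen only for illustration. Setting them to 1 and 0 recovers the plain fuzzy-set max and min.

```python
def mmm_or(weights, c_or1=0.7, c_or2=0.3):
    """OR query: weight the max more heavily than the min (coefficients assumed)."""
    return c_or1 * max(weights) + c_or2 * min(weights)

def mmm_and(weights, c_and1=0.7, c_and2=0.3):
    """AND query: weight the min more heavily than the max (coefficients assumed)."""
    return c_and1 * min(weights) + c_and2 * max(weights)

doc_weights = [0.9, 0.1]          # hypothetical term weights for one document
or_score = mmm_or(doc_weights)    # 0.7 * 0.9 + 0.3 * 0.1
and_score = mmm_and(doc_weights)  # 0.7 * 0.1 + 0.3 * 0.9
```

A document that satisfies only one of two ANDed terms still receives a non-zero score, which is how the operator is softened.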
Weighting is the process of assigning an importance to an
index term’s use in an item. The weight should represent
the degree to which the concept associated with the
index term is represented in the item. The weight should
help in discriminating the extent to which the concept is
described in items of the database.
The manual process of assigning weights adds additional
overhead on the indexer and requires a more complex
data structure to store the weights.
In a weighted indexing system, an attempt is made to place
a value on the index term’s representation of its
associated concept in the document. An index term’s
weight is based upon a function associated with the
frequency of occurrence of the term in the item.
Typically, values for the index terms are normalised between
zero and one. The higher the weight, the more the term
represents a concept discussed in the item. The weight
can be adjusted to account for other information such as
the number of items in the database that contain the
same concept.
The query process uses the weights along with any weights
assigned to terms in the query to determine a scalar value
(rank value) used in predicting the likelihood that an item
satisfies the query. The results are presented to the user
in order of the rank value from highest number to lowest
number.
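One simple way to turn term frequencies into normalised weights and a scalar rank value is sketched below. Both the normalisation (dividing by the largest count in the item) and the scoring function (a weighted sum) are assumptions, since the text does not fix a particular function.

```python
# Hypothetical raw term frequencies per item.
term_counts = {
    "item1": {"retrieval": 4, "boolean": 1},
    "item2": {"retrieval": 1, "vector": 2},
    "item3": {"boolean": 3},
}

def normalised_weights(counts):
    """Scale raw frequencies into (0, 1] by dividing by the item's largest count."""
    top = max(counts.values())
    return {term: n / top for term, n in counts.items()}

def rank_value(item_weights, query_weights):
    """Scalar rank value: sum of query weight times item weight over query terms."""
    return sum(w * item_weights.get(term, 0.0) for term, w in query_weights.items())

query = {"retrieval": 1.0, "boolean": 0.4}   # hypothetical query term weights
scores = {item: rank_value(normalised_weights(c), query)
          for item, c in term_counts.items()}
ranking = sorted(scores, key=scores.get, reverse=True)   # highest rank value first
```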
The table above summarizes the defining characteristics of the
Extended Boolean approach and lists its key advantages
and disadvantages.
If weights between 0.0 and 1.0 are assigned to the query
terms, they may be interpreted as the significance
that users are placing on each term. The value 1.0 is
assumed to be the strict interpretation of a Boolean
query. The value 0.0 is interpreted to mean that the user
places little value on the term. Under these
assumptions, a term assigned a value of 0.0 should have
no effect on the retrieved set. Writing A1 for term A with
weight 1.0 and B0 for term B with weight 0.0:
“A1 OR B0” should return the set of items that
contain A as a term.
“A1 AND B0” will also return the set of items that
contain term A.
“A1 NOT B0” also returns set A.
Venn Diagram
Under the strict interpretation “A1 OR B1” would include all
items that are in all the areas in the Venn diagram. “A1 OR
B0” would be only those items in A (i.e., the green and
Blue shaded areas) which is everything except items in “B
NOT A” (the Blue area).
Thus, as the value of query term B goes from 0.0 to 1.0,
items from “B NOT A” are proportionally added until at 1.0
all of the items will be added.
Similarly, under the strict interpretation “A1 AND B1” would
include all of the items that are in the green and Blue
shaded areas. “A1 AND B0” will be all of the items in A as
described above. Thus, as the value of query term B goes
from 1.0 to 0.0 items will be proportionally added from “A
NOT B” (Green area) until at 0.0 all of the items will be
added.
Finally, the strict interpretation of “A1 NOT B1” is the Green
area, while “A1 NOT B0” is all of A. Thus, as the value of B
goes from 1.0 to 0.0, items are proportionally added
from “A AND B” (green and Blue shaded area) until at
0.0 all of the items have been added.
The final issue here is the determination of which items
are to be added or dropped in interpreting the weighted
values.
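The proportional behaviour described for a weighted OR can be sketched as below. Which items are added first is the open issue just mentioned, so the sketch assumes that a ranking of the items in “B NOT A” is supplied from elsewhere. All names and data are hypothetical.

```python
import math

def weighted_or(a_items, b_items, b_weight, ranked_b_only):
    """Return A plus a share of 'B NOT A' proportional to b_weight.

    ranked_b_only lists the items of 'B NOT A' in the order they should be
    added; how to produce that order is the open issue and is assumed given.
    """
    b_only = [item for item in ranked_b_only
              if item in b_items and item not in a_items]
    keep = math.floor(b_weight * len(b_only) + 0.5)   # round half up
    return set(a_items) | set(b_only[:keep])

A = {1, 2, 3}
B = {3, 4, 5, 6, 7}
order = [4, 5, 6, 7]   # hypothetical ranking of the items in B NOT A

at_zero = weighted_or(A, B, 0.0, order)   # all of A only
at_half = weighted_or(A, B, 0.5, order)   # A plus half of B NOT A
at_one = weighted_or(A, B, 1.0, order)    # strict A OR B
```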
2. Statistical Model
The vector space and probabilistic models are the two
major examples of the statistical retrieval approach. Both
models use statistical information in the form of term
frequencies to determine the relevance of documents
with respect to a query. Although they differ in the way
they use the term frequencies, both produce as their
output a list of documents ranked by their estimated
relevance. The statistical retrieval models address some
of the problems of Boolean retrieval methods, but they
have disadvantages of their own.
Statistical Model
1. Vector Space Model
2. Probabilistic Model
3. Latent Semantic Indexing
2.1 Vector Space Model
• Vector space model or term vector model is an
algebraic/statistical model for representing text
documents (and any objects, in general) as vectors of
identifiers, such as, for example, index terms. It is used
in information filtering, information retrieval, indexing and
relevancy rankings.
• The Vector Space Model (VSM) is a way of representing
documents through the words that they contain.
• The VSM allows decisions to be made about which
documents are similar to each other and to keyword queries.
In the Vector Space Model, emphasis is placed on
term weights as the foundation for information detection, and
these weights are stored in vector form.
In systems based upon a vector model, the semantics of
every item are represented as a vector.
What is a Vector?
A vector is a one-dimensional set of values, where the
order/position of each value in the set is fixed and
represents a particular domain. Each vector represents a
document and each position in a vector represents a
different unique word to represent the document in the
database.
There are two approaches to the domain of values in the
vector – binary and weighted
Binary: represents document (processing token) by 1 or 0
1 representing the existence of the processing
token in the item.
0 representing the non-existence of the processing
token in the item
Weighted: represents document by keywords with set of
all real positive numbers. The value assigned to
each position is the weight of that term in the
document. A value of zero indicates that the word
is not in the document
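The two kinds of vector can be sketched for a single item as follows. The vocabulary and tokens are hypothetical, and relative frequency is used as the weight only as one possible choice.

```python
# Hypothetical vocabulary: one fixed vector position per unique word.
vocabulary = ["boolean", "information", "retrieval", "vector"]

def binary_vector(tokens):
    """1 if the processing token occurs in the item, otherwise 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def weighted_vector(tokens):
    """Relative frequency of each word; 0.0 means the word is absent."""
    total = len(tokens)
    return [tokens.count(word) / total for word in vocabulary]

tokens = ["information", "retrieval", "retrieval", "vector"]
b = binary_vector(tokens)     # [0, 1, 1, 1]
w = weighted_vector(tokens)   # [0.0, 0.25, 0.5, 0.25]
```

The position of each value is fixed by the vocabulary order, so vectors for different items can be compared position by position.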
Queries can be translated into the vector form. Search is
accomplished by calculating the distance between the
query vector and the document vector. The use of
weights also provides a basis for determining the rank of
an item.
The vector approach allows for a mathematical and a
physical representation using a vector space model.
2.1 Vector Space Model
If a query (q) is considered to be a line in an imaginary
space and the document (d) is also considered to be a
line in the imaginary space, the geometrically
determined angle between the two lines can be
understood as measuring the degree to which the
documents are similar to the query. While in the case of
a large angle the document is presumed to be dissimilar
to the query, in the case of a very small angle the
document is presumed to be highly similar to the
query.
How does the Vector Space Model indexing procedure work?
The Vector Space Model procedure can be divided into
three stages:
The first stage is the document indexing where the content
bearing terms are extracted from the document text. It is
obvious that many of the words in a document do not
describe the content, like, the, is, are, in, to, of, etc.
These are called non-significant words or stop words. In
case of automatic indexing, these terms are removed
from the document vector, so the document will only be
represented by the content-bearing terms. In general,
40-50% of the total number of words, in a document, are
stop words. These can be removed with the help of a
stop word list.
The second stage is the weighting of the indexed terms to
enhance retrieval of document relevant to the user.
The last stage ranks the document with respect to the
query according to a similarity measure.
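The first stage, removing stop words so that only content-bearing terms remain, can be sketched as below. The stop word list is a tiny illustrative sample and the tokenisation is deliberately crude.

```python
# A tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "is", "are", "in", "to", "of", "a", "and"}

def index_terms(text):
    """Lowercase, strip simple punctuation, and drop stop words."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_WORDS]

text = "The retrieval of documents is central to the library."
terms = index_terms(text)
```

In this sentence five of the nine words are stop words, so only four terms go forward to the weighting stage.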
Documents and queries are represented as vectors over the same t terms:
$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
$q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
Each dimension corresponds to a separate term. If a term occurs in
the document, its value in the vector is non-zero.
Several different ways of computing these values, also known as
(term) weights, have been developed. One of the best known
schemes is tf-idf (term frequency–inverse document frequency)
weighting.
The definition of term depends on the application (i.e. whether
article, books, etc). Typically terms are single words, keywords, or
longer phrases. If words are chosen to be the terms, the
dimensionality of the vector is the number of words in the
vocabulary (the number of distinct words occurring in
the corpus).
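One common variant of tf-idf weighting is sketched below: raw term frequency multiplied by the logarithm of the number of documents over the document frequency. Many other variants exist, and the corpus and names are hypothetical.

```python
import math

# Hypothetical corpus, already reduced to content-bearing terms.
corpus = {
    "d1": ["information", "retrieval", "retrieval"],
    "d2": ["information", "storage"],
    "d3": ["vector", "retrieval"],
}

def tf_idf(term, doc_id):
    """Raw term frequency times log(N / document frequency)."""
    tf = corpus[doc_id].count(term)
    df = sum(1 for terms in corpus.values() if term in terms)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

# One dimension per distinct word in the corpus.
vocabulary = sorted({t for terms in corpus.values() for t in terms})
d1_vector = [tf_idf(t, "d1") for t in vocabulary]
```

The vector has one position per distinct word in the corpus, and positions for words absent from the document are zero.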
Vector operations can be used to compare documents with queries.
Relevance rankings of documents in a keyword search can
be calculated, using the assumptions of document
similarities theory, by comparing the deviation of angles
between each document vector and the original query
vector where the query is represented as a vector with
same dimension as the vectors that represent the other
documents.
In practice, it is easier to calculate the cosine of the angle
between the vectors, instead of the angle itself:
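For a document vector d_j and a query vector q as defined above, and assuming both are indexed over the same t terms, the cosine can be written as:

```latex
\cos\theta
  = \frac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert}
  = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}
         {\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}
```

Because all weights are non-negative, the value lies between 0 (no shared terms) and 1 (same direction).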
The VSM contrasts with the Boolean retrieval model, in which
retrieval is based on an exact (hundred percent) match. The VSM
allows retrieval of the documents most similar to the query without
requiring an exact match. The VSM can be explained in terms of a
keyword-by-document matrix (A), in which the rows correspond to
keywords (W) in the database and the columns correspond to
documents (D). The matrix then looks like:
A =
       D1    D2    D3    D4    …..   Dn
W1     A11   A12   A13   A14   …..   A1n
W2     A21   A22   A23   A24   …..   A2n
W3     A31   A32   A33   A34   …..   A3n
W4     A41   A42   A43   A44   …..   A4n
.....  ….    ….    ….    ….    …..   ….
Wm     Am1   Am2   Am3   Am4   …..   Amn
Let us take a hypothetical example, like, an information seeker
searches information on “Education information retrieval
system”.
He uses four keywords: W1, W2, W3, and W4.
After searching the database,
he gets six articles: A1, A2, A3, A4, A5, and A6.
After analysis, it is found that the
Article A1 talks only about W1;
Article A2 discusses 33% topic of W2 and 67% of W4;
Article A3 deals with 20% of W1, 30% of W3 and 50% of W4;
Article A4 deals with 60% of W1, 10% of W2 and 30% of W4;
Article A5 talks 80% about W2 and 20% about W3;
Article A6 discusses only about W4.
Now this can be denoted in the form of a 4 × 6 matrix as below:
A =
      A1     A2     A3     A4     A5     A6
W1    1.00   0.00   0.20   0.60   0.00   0.00
W2    0.00   0.33   0.00   0.10   0.80   0.00
W3    0.00   0.00   0.30   0.00   0.20   0.00
W4    0.00   0.67   0.50   0.30   0.00   1.00
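As a worked illustration, the columns of the 4 × 6 matrix above can be ranked against a query by cosine similarity. The query vector (keyword W4 only) is a hypothetical choice and is not part of the original example.

```python
import math

# Columns of the 4 x 6 matrix above: one weight per keyword W1..W4.
articles = {
    "A1": [1.00, 0.00, 0.00, 0.00],
    "A2": [0.00, 0.33, 0.00, 0.67],
    "A3": [0.20, 0.00, 0.30, 0.50],
    "A4": [0.60, 0.10, 0.00, 0.30],
    "A5": [0.00, 0.80, 0.20, 0.00],
    "A6": [0.00, 0.00, 0.00, 1.00],
}

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

query = [0.0, 0.0, 0.0, 1.0]   # hypothetical query: keyword W4 only
scores = {name: cosine(vec, query) for name, vec in articles.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Article A6, which is only about W4, scores 1 and comes first, followed by A2, A3 and A4. A1 and A5 do not mention W4 and score 0.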
The VSM is a retrieval model which constitutes a fairly large class of retrieval
methods, each consisting of an indexing method and a retrieval function. The
indexing method generates description vectors, and the retrieval function
generates retrieval status values by comparing the query description vector
with the document description vectors.
The information seeker is assumed to have an information need, which he formulates
as a query. The query q and the document dj are indexed in two steps.
First, appropriate indexing features are identified in the query q and in the document
dj.
Secondly, these features are assigned weights, so that the query description and
the document descriptions are sets of weighted indexing features. These are
called the query vector and the document description vectors. The query description
and document descriptions are matched and a score is generated for every
query-document pair. These scores are called Retrieval Status Values (RSVs). For every
query, the documents are presented to the information seeker in descending
order of these RSVs.
Each document in a collection is represented by a document vector
whose components record the single or multiple occurrences of each term
i in document d.
Similarly, a query is represented by a query vector which denotes
the number of occurrences of terms in the query.
Both the document vector and query vector provide the locations
of the objects in the term-document space. There are two
common one-dimensional measures that every vector has,
length and angle with respect to a fixed point. The angle
between two vectors refers to the measure in degrees between
those two vectors. The document vector whose angle is closest
to the query vector’s angle is the best choice, yielding the
document most closely related to the query. It is measured in
terms of the cosine of the angle between the two vectors. If the cosine of
the angle is 1, then the angle between the document vector and
the query vector measures 0 degree, meaning the document
vector and the query vector move in the same direction. A
cosine measure of 0 would mean the document is unrelated to
the query vector. Thus, a cosine measure close to 1 means that
the document is closely related to the query.
$$\cos\theta = \frac{d_2 \cdot q}{\lVert d_2 \rVert \, \lVert q \rVert}$$
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval
Information storage and  retrieval





Information storage and retrieval

  • 1. INFORMATION STORAGE AND RETRIEVAL SYSTEM Dr. Utpal Das Dibrugarh University, Dibrugarh, Assam utpalishaan@gmail.com
  • 2. Break up of Terminology INFORMATION /STORAGE/ RETRIEVAL /SYSTEM
  • 4. MEDIA DATABASES: Bibliographic Full Text STORAGE stand-alone databases hypertext networked databases SYSTEM DBMS CLASSIFICATION SCHEMES INDEXES Books, Journals, Articles, Audio, Video, Cartographs Text, Sound, Image, Data
  • 6. System Mechanism Framework Mode of Arrangement Interconnected Network A set of Principle or Procedure Organized scheme or Method Modus Operandi
  • 7. Genesis The term “Information Retrieval System” was coined by Calvin Mooers in 1952. IRS gained popularity in the research community only in the early sixties, when computers were being introduced in information handling and management. These early information retrieval systems were essentially document retrieval systems, since they were designed to retrieve bibliographic information about documents stored in databases in response to a search request by the users.
  • 8. Genesis Though the basics of IRS are still the same, the application of advanced techniques has greatly widened its role and scope. The connotation of information retrieval has therefore changed, and information professionals and researchers have given it various names, such as: Information Storage and Retrieval System, Information Organization and Retrieval System, Information Processing and Retrieval System, Text Retrieval System, Information Representation and Retrieval System, Information Access System.
  • 9. Genesis The modern connotation implies that IRS now deals not only with textual information but also with multimedia information comprising text, audio, images and video. While many features of conventional text retrieval systems are equally applicable to multimedia information retrieval, the specific nature of audio, image and video information has called for the development of many new tools and techniques for information retrieval. Thus, modern information retrieval systems deal with storage, organization and access to text, as well as multimedia information resources.
  • 10. Meaning, Definition and Concept of ISRS: ISRS is a selective, systematic recall of logically stored information. ISRS is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertext networked databases such as the Internet, the World Wide Web or intranets, for text, sound, images or data.
  • 11. Meaning, Definition and Concept of ISRS: An ISRS is an information system, that is, a system used to store items of information that need to be processed, searched, retrieved, and disseminated to various user populations. It is a process of searching some collection of documents, using the term document in its widest sense, in order to identify those documents which deal with a particular subject. Any system that is designed to facilitate this literature searching may legitimately be called an information retrieval system.
  • 12. Meaning, Definition and Concept of ISRS: ISRS is the study of systems for indexing, searching, and recalling data, particularly text or other unstructured forms. Information retrieval may be defined as the technique and process of searching, recovering, and interpreting information from large amounts of stored data. It is the recovery of information, especially from a database stored in a computer.
  • 13. Meaning, Definition and Concept of ISRS: IR is essentially concerned with the structure and operation of devices that select documentary information in response to a search query. An IRS does not inform the user on (i.e. change his knowledge of) the subject of his enquiry; it merely informs him of the existence or non-existence and whereabouts of documents relating to his request.
  • 14. Meaning, Definition and Concept of ISRS: An information retrieval system is designed to analyse, process and store sources of information and retrieve those that match a particular user’s requirements [Chowdhury, G.G. (2004). Introduction to modern information retrieval. 2nd ed. London: Facet Publishing].
  • 15. Meaning, Definition and Concept of ISRS: Basic aspects of ISRS. An Information Storage and Retrieval (ISAR) system deals with three basic aspects: (1) information representation; (2) information storage and organisation; (3) information access.
  • 16. Meaning, Definition and Concept of ISRS: BROAD OUTLINE. Information sources -> Analysis & Representation -> Organised Information -> Matching -> Retrieved Information. Users -> Query Analysis -> Analysed Queries -> Matching.
  • 17. Meaning, Definition and Concept of ISRS Functional View of Standard IR System
  • 19. CHARACTERISTICS OF ISAR SYSTEMS: Information Facilitator. The ISAR system should act as a facilitator between the information (contained in documents) and the users. If a user approaches it with a subject term, the name of a contributor, the title of a document and so on, the system should help him find the desired information. The result could be the exact information or a reference to a document which contains it.
  • 20. CHARACTERISTICS OF ISAR SYSTEMS: Non-Ambiguous. The system should be organized so that search results are free from any kind of ambiguity. This requires identifying terms, setting their context and indexing them properly. For example, a search for the term ‘screw driver’ should not bring results like ‘truck driver’, ‘hardware driver’ and so on.
  • 21. OBJECTIVES OF ISAR SYSTEMS: Minimum Time. The system should be designed so that minimum effort and time are spent interrogating it. Searching should take minimum time, which means that the ISAR system should be capable of fast search. It is also best to have an online ISAR system, so that users do not need to walk to the library and can get whatever they want at their workplace.
  • 22. OBJECTIVES OF ISAR SYSTEMS: User Friendliness. Ease of use is an important consideration for any ISAR system, so it should have a user-friendly interface that highlights the important aspects of the system. Before using the system, a user should be properly introduced to all its features: the scope of the system, the available search options and, most importantly, how to perform a search. It is only through this interface that a user operates an ISAR system. Take the example of a library OPAC. It should have the following features: an introduction to the library; the scope of the collection; instructions for performing a search.
  • 23. CHARACTERISTICS OF ISAR SYSTEMS: User Friendliness. The search interface should facilitate framing the search in forms such as: keyword search; author and title search; combination search (using Boolean operators); proximity search, etc.
  • 24. CHARACTERISTICS OF ISAR SYSTEMS: Others. The desirability of making systems as readily usable as possible for their clienteles. The need to recognise the basic features of a retrieval system. The need to incorporate coordinating features such as vocabulary control, search strategies, user interface, information modelling aspects in general, etc.
  • 25. CHARACTERISTICS OF ISAR SYSTEMS: It should have the competence and compatibility for consolidated searching and retrieval of information from any client terminal and from any database within the system. It should be able to narrowcast or broadcast or relate the information need in a variety of associations to get optimum retrieval performance. It should have access facilities at multiple points. It should have a common command language facility to retrieve information from several databases of the system.
  • 26. CHARACTERISTICS OF ISAR SYSTEMS: It should be able to handle information access through entity-related or object-oriented approaches. It may also provide all other associations for accessing information. In a bibliographic or full-text database, the surrogates chosen should have indicative as well as informative features sufficient for end-users to select or reject the retrieved information according to their needs. It should have the ability to select, classify, process and consolidate the analysed information into a cohesive text ready for assimilation by the end-users.
  • 27. CHARACTERISTICS OF ISAR SYSTEMS: It should be able to orient the information to the specialist needs of the users from time to time. This calls for understanding and processing user profiles. It should be able to retrieve maximum information from a minimum number of clues. End-users' fuzzy queries should be clarified, and the final result should satisfy the searcher. It should have the capacity to interchange the information available in one database with another for the purposes of retrieval, relevance and end usage.
  • 28. CHARACTERISTICS OF ISAR SYSTEMS: It should have bibliographic data interchange capacity (using Z39.50 or a similar standard) so that records can be consolidated into a chosen format for networking and other purposes. Compatibility with standards at all levels must be the goal. It should be able to search for simple information quickly and easily, and also to multi-track complex questions and present the results in a simple, easy manner. User-friendly presentation is very important.
  • 29. FUNCTIONS: To identify the information (sources) relevant to the areas of interest of the target user community; this is a challenging job, especially in the web environment, where virtually everybody in the world is a potential user of a web-based information retrieval system. To analyse the contents of the sources (documents); this is becoming increasingly challenging as the size, volume and variety of information sources (documents) are increasing rapidly; in web information retrieval, this is carried out automatically using specially designed programs called spiders.
  • 30. FUNCTIONS: To represent the contents of analysed sources in a way that matches users’ queries; this is done by automatically creating one or more index files, and is becoming an increasingly complex task due to the volume and variety of content and increasing user demands. To analyse users’ queries and represent them in a form that is suitable for matching against the database; this is done in a number of ways, through the design of sophisticated search interfaces, including those that help users select appropriate search terms by means of dictionaries and thesauri, automatic spell checkers, a predefined set of search statements and so forth.
  • 31. FUNCTIONS To match the search statement with the stored database; a number of complex information retrieval models have been developed over the years that are used to determine the similarity of the query and stored documents. To retrieve relevant information; a variety of tools and techniques are used to determine the relevance of retrieved items and their ranking. To make continuous changes in all aspects of the system, keeping in mind the rapid developments in information and communication technologies (ICTs) relating to changing patterns of society, users and their information needs and expectations.
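The functions listed above (analysing sources, representing them in an index, analysing the query, matching, and ranking) can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular system: the sample documents, the function names `analyse`, `build_index` and `search`, and the simple TF-IDF weighting are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict

def analyse(text):
    """Lowercase and tokenise text (the analysis step for documents and queries)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Represent documents as an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(analyse(text)).items():
            index[term][doc_id] = tf
    return index

def search(index, n_docs, query):
    """Match an analysed query against the index and rank by a simple TF-IDF score."""
    scores = defaultdict(float)
    for term in analyse(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical sample collection.
docs = {
    "d1": "Information retrieval systems store and retrieve documents",
    "d2": "Library classification schemes and indexes",
    "d3": "Retrieval of multimedia information: audio, image and video",
}
index = build_index(docs)
print(search(index, len(docs), "information retrieval"))
```

Real systems replace each step with something more elaborate (stemming, stop words, richer retrieval models), but the division of labour is the same.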
  • 32. Design of Information Retrieval System: To design and develop an ISAR system, one needs to recognize the needs of the users, as all subsequent activities depend on them. The design of an ISAR system should follow the system development life cycle (SDLC) for greater efficiency and effectiveness.
  • 33. System Development Life Cycle Phases:
  • 34. 1. System Planning: i. Defining the problems; ii. Objectives and needs; iii. Resources (such as personnel and costs). After analyzing the data for planning, one has three choices: develop a new system, improve the current system, or leave the system as it is.
  • 35. 2. System Analysis: i. Determining end-users’ requirements; ii. Their expectations from the system; iii. Performance of the system; iv. Feasibility study. 3. System Design: i. Elements of a system; ii. Components; iii. Security level; iv. Modules; v. Architecture; vi. Interfaces; vii. Type of data. (The system design must meet all functional and technical requirements, logically and physically.)
  • 36. 4. Implementation and Deployment: i. This is the actual construction process. ii. In the Software Development Life Cycle, the actual code is written here. iii. In the Hardware Development Life Cycle, the implementation phase covers configuration and fine-tuning. iv. The system becomes ready to run live and be productive.
  • 37. 5. System Testing and Integration: i. Introducing different inputs to the system; ii. Obtaining its outputs and analyzing its behavior; iii. Observing the way it functions. (Testing is important to ensure customer satisfaction, and it requires no knowledge of coding, hardware configuration or design.) 6. System Maintenance: i. Periodic maintenance to prevent redundancy; ii. Replacing old hardware; iii. Periodic evaluation of the system’s performance; iv. Updating certain components with the latest technologies to face current security threats.
  • 38. Steps for Design of Information Retrieval System. Steps for designing an Information Retrieval System: i. Recognizing the need for development of an ISAR system; ii. Recognizing the information needs of the users; iii. Identification of users’ needs; iv. Type(s) of databases to be incorporated into the system; v. Features to be incorporated in the databases; vi. Preparation of structured queries; vii. Design and development of various components of the system, such as user interface, search agent, etc.; viii. Evaluation of the system; ix. Re-designing/modification of the ISAR system, if needed.
  • 39. Need & Purpose: The basic purpose of ISRS is to satisfy the information needs of various classes of users: a) Current Information Need, b) Exhaustive Information Need, c) Everyday Information Need, and d) Catching-up or Brushing-up Information Need.
  • 40. Need & Purpose An IRS is designed to retrieve the documents or information required by the user community. It should make the right information available to the right user. Thus, an information retrieval system aims to collect and organize information in one or more subject areas in order to provide it to users as soon as they ask for it. A writer presents a set of ideas in a document using a set of concepts.
  • 41. Need & Purpose: Somewhere there are users who require these ideas but may not be able to identify them; in other words, some people lack the ideas put forward by the author in the work. An IRS matches the writer’s ideas expressed in the document with the users’ requirements for them. Thus, an IRS serves as a bridge between the world of creators or generators of information and the users of that information.
  • 42. Components for Design of ISRS An ISAR system has 3 basic components: I. User Interface II. Knowledge Base III. Search Agent
  • 43. Components for Design of ISRS: I. User Interface. The user interface is the front page, front-end or (user’s) operational area of the system, which enables the user to enter a query and displays the results. It is of two types: i. Query Interface; ii. Result Interface.
  • 44. i. Query Interface: This is the end from which a user enters his/her search terms and initiates communication with the system. The Query Interface generally needs to have the following features: a) Understanding the user input statement: this front-end interface needs to understand the keywords given by the user and capture them to pass on to the search program. The front-end should have an understandable look and feel, distinguishable colour combinations, and search instructions.
  • 45. b) Refining the problem statement: The interface should be flexible enough to refine any query or statement further, to narrow a broad search down to a specific one, or to modify the search within the displayed results, with some arrangement among topical terms that facilitates browsing through the system. c) Search statement to search strategy translation: The system front-end should be able to translate a search statement into a search strategy expressed in the language understood by the Search Agent. For example, an interface built on a Relational Database Management System (RDBMS) expresses the search statement in Structured Query Language (SQL) and formulates the search strategy with the help of the Search Agent (using Boolean operators or other algorithms).
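As a rough illustration of translating a search statement into a search strategy, the sketch below turns a list of keywords into a parameterised SQL query and runs it with Python's built-in `sqlite3` module. The `books` table, its single `title` column, the sample rows and the helper name `to_sql` are all hypothetical; a real ISAR system would have its own schema and a much richer translator.

```python
import sqlite3

def to_sql(terms, operator="AND"):
    """Translate a list of keywords into a parameterised SQL search strategy."""
    if not terms:
        raise ValueError("at least one search term is required")
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    where = f" {operator} ".join("title LIKE ?" for _ in terms)
    params = [f"%{t}%" for t in terms]
    return f"SELECT title FROM books WHERE {where} ORDER BY title", params

# Hypothetical in-memory catalogue.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?)",
    [("Information Storage and Retrieval",),
     ("Library Classification",),
     ("Text Retrieval Systems",)],
)

sql, params = to_sql(["retrieval", "information"])
rows = [r[0] for r in conn.execute(sql, params)]
print(sql)
print(rows)
```

Passing the user's keywords as parameters, rather than pasting them into the SQL string, keeps the translation step safe from malformed or malicious input.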
  • 46. d) Modification of search strategy: If one does not get the desired output from the database, the ISAR system should provide a procedure for modifying the search strategy. The modification should be interactive. Vocabulary control devices can also be added to help users locate the terms of interest. For example, a search can be modified with options like ‘Contains’, ‘Exact’, ‘Begins with’, ‘Ends with’, etc.
  • 47. ii. Result Interface: In the Result Interface, the display of search results should be user friendly. The results should not only cater to the needs of individual users; the display should also be customizable (as in e-resource publishers' interfaces). Search results should also be ranked or rated against the search terms; statistical techniques can be used for this purpose.
  • 48. Components for Design of ISRS: II. Knowledge Base. The storehouse of any ISAR system is its Knowledge Base. It contains a list of facts or related facts (information). Any query is answered on the basis of the facts stored in the Knowledge Base. A Knowledge Base could be a Database Management System (DBMS). A knowledge base (KB) is a technology used to store complex structured and unstructured information used by a computer system. A knowledge-based system consists of a knowledge base that represents facts about the world and an inference engine that can reason about those facts and use rules and other forms of logic to deduce new facts or highlight inconsistencies.
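The ‘Contains’, ‘Exact’, ‘Begins with’ and ‘Ends with’ options can be pictured as interchangeable matching predicates. The sketch below is one possible way to model them; the `MATCH_MODES` table, the `refine` function and the sample titles are assumptions made for illustration only.

```python
# Each match mode is a predicate over (field value, search term).
MATCH_MODES = {
    "Contains":    lambda value, term: term in value,
    "Exact":       lambda value, term: value == term,
    "Begins with": lambda value, term: value.startswith(term),
    "Ends with":   lambda value, term: value.endswith(term),
}

def refine(titles, term, mode="Contains"):
    """Filter titles with the chosen match mode, ignoring case."""
    test = MATCH_MODES[mode]
    return [t for t in titles if test(t.lower(), term.lower())]

# Hypothetical titles.
titles = ["Information Retrieval", "Retrieval Evaluation", "Retrieval"]
for mode in MATCH_MODES:
    print(mode, refine(titles, "retrieval", mode))
```

Letting the user switch the mode and re-run the same term is a simple form of interactive search modification.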
  • 49. Retrieval of information from storage depends on two important aspects of Knowledge Base: A. Knowledge Representation B. Indexing and Clustering
  • 50. A. Knowledge Representation: The first and foremost objective in constructing an ISAR system is representation of facts within the Knowledge Base. There are different ways of representation of knowledge: a) Semantic Network Knowledge Representation b) Frame Based Knowledge Representation c) Rule-Based Knowledge Representation
  • 51. a) Semantic Network Knowledge Representation: A semantic network is a method of knowledge representation based on a network structure. It contains points called nodes connected by links called arcs. The nodes represent objects, concepts or events, in other words documents or information. The arcs represent the relations between the nodes and build a kind of hierarchy in the Knowledge Base. Arcs usually represent relations like is_a or has_part. Semantic networks are useful for representing sentences of natural language.
  • 52. Semantics is the linguistic and philosophical study of meaning in language, programming languages, formal logics, and semiotics. It is concerned with the relationship between signifiers (like words, phrases, signs, and symbols) and what they stand for in reality, their denotation.
  • 54. In LISP Programming Language: (setq *database* '((canary (is-a bird) (color yellow) (size small)) (penguin (is-a bird) (movement swim)) (bird (is-a vertebrate) (has-part wings) (reproduction egg-laying))))
  • 55. Also, setq can be used to assign different values to different variables. The first argument is bound to the value of the second argument, the third argument is bound to the value of the fourth argument, and so on. For example, you could use the following to assign a list of trees to the symbol trees and a list of herbivores to the symbol herbivores: (setq trees '(pine fir oak maple) herbivores '(gazelle antelope zebra))
  • 56. To set the value of the variable carnivores to the list '(lion tiger leopard) using setq, the following expression is used: (setq carnivores '(lion tiger leopard)) This is exactly the same as using set except the first argument is automatically quoted by setq. (The ‘q’ in setq means quote.) With set, the expression would look like this: (set 'carnivores '(lion tiger leopard))
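For readers more familiar with Python, the semantic network stored in `*database*` earlier can be mirrored as a dictionary of nodes and arcs. The sketch below also shows what the is-a arcs are for: a property missing on a node is looked up on its parent. The `lookup` function is an assumption added for illustration and is not part of the original LISP example.

```python
# Nodes map to their outgoing arcs (relation -> target), as in the LISP *database*.
database = {
    "canary":  {"is-a": "bird", "color": "yellow", "size": "small"},
    "penguin": {"is-a": "bird", "movement": "swim"},
    "bird":    {"is-a": "vertebrate", "has-part": "wings",
                "reproduction": "egg-laying"},
}

def lookup(node, relation):
    """Return the value of `relation` for `node`, inheriting along is-a arcs."""
    seen = set()
    while node in database and node not in seen:
        seen.add(node)              # guard against cycles in the network
        arcs = database[node]
        if relation in arcs:
            return arcs[relation]
        node = arcs.get("is-a")     # climb one level up the hierarchy
    return None

print(lookup("canary", "color"))      # stored directly on the node
print(lookup("canary", "has-part"))   # inherited from bird
```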
  • 57. Complexity in Semantic Network Knowledge Representation: The idea of semantic networks started out as a natural way to represent labelled connections between entities. But, as the representations are expected to support increasingly large ranges of problem-solving tasks, the representation schemes necessarily become increasingly complex. In particular, it becomes necessary to assign more structure to nodes, as well as to links. For example, in many cases we need node labels that can be computed, rather than being fixed in advance. It is natural to use database ideas to keep track of everything, and the nodes and their relations begin to look more like frames.
  • 58. b) Frame Based Knowledge Representation: The original idea of frames was developed by Minsky (1975), who defined them as “data structures for representing stereotyped situations”, such as going into a class room. It is an object-oriented approach. A frame represents an object (document or information), a class of objects (collection of documents or information) or several facts. When a frame represents a class of objects, it generalizes the group by identifying the overall properties its members share.
  • 59. The pointers where properties are stored are known as slots. If a frame represents an object, its slots represent the properties or attributes of that object, and each slot contains the value of that attribute. For example, a book in a library is an object, so it can be represented as a frame. The properties of the book, i.e. Title, Author, Place, Publisher and so on, are stored as slots, and each slot has a corresponding value.
  • 60. Frame: Book. Slots and values: Title = Information Storage & Retrieval; Author = G. G. Chaudhury; Publisher = Ess Ess Publication; Place = New Delhi; Size = 18 X 14 cm.
  • 61. The simplest type of frame is just a data structure with similar properties and possibilities for knowledge representation as a semantic network, with the same ideas of inheritance and default values. Frames become much more powerful when their slots can also contain instructions (procedures) for computing things from information in other slots or in other frames.
  • 62. Basic Idea: A frame consists of a selection of slots which can be filled by values, or procedures for calculating values, or pointers to other frames. For example: Frame: Class Room; is-a: Room; Location: Department; Contains: {Desk, Bench, Black Board, Table, Chairs, ...}. Frame: Class Room Chair; is-a: Chair; Location: Class Room; Height: 20-40 cm; Legs: 4; Comfortable: Yes; Use: Sitting.
  • 63. Frames of this type are now generally referred to as Scripts. Attached to each frame will then be several kinds of information. Some information can be about how to use the frame. Some can be about what one can expect to happen next, or what one should do next. Some can be about what to do if our expectations are not confirmed. Then, when one encounters a new situation, one can select from memory an appropriate frame and adapt it to fit reality by changing particular details as necessary. A complete frame-based representation will consist of a whole hierarchy or network of frames connected together by appropriate links/pointers.
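The three kinds of slot filler mentioned for frames (values, procedures for calculating values, and pointers to other frames) can be sketched with a small Python class. The `Frame` class, its `get` method and the `description` slot are hypothetical names chosen for this illustration; the slot values follow the Class Room Chair example.

```python
class Frame:
    """A frame: named slots holding values, procedures, or pointers to other frames."""

    def __init__(self, name, is_a=None, **slots):
        self.name = name
        self.is_a = is_a          # pointer to the parent frame, if any
        self.slots = slots

    def get(self, slot):
        """Look up a slot, falling back to the parent frame (inheritance)."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                value = frame.slots[slot]
                # A callable slot is a procedure that computes its value
                # from the frame on which the lookup started.
                return value(self) if callable(value) else value
            frame = frame.is_a
        return None

chair = Frame("Chair", legs=4, use="Sitting")
class_room = Frame("Class Room", location="Department")
class_room_chair = Frame(
    "Class Room Chair", is_a=chair,
    location=class_room,                                  # pointer to another frame
    height_cm=(20, 40),
    description=lambda f: f"{f.name} with {f.get('legs')} legs",  # procedure slot
)

print(class_room_chair.get("legs"))          # inherited from Chair
print(class_room_chair.get("description"))   # computed by the procedure slot
```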
  • 64. c) Rule-Based Knowledge Representation Rule based representation is a popular approach. Rules are employed to state the way in which the inference has to be done. Rules provide a formal way of representing recommendations, directives, or strategies. Rules are appropriate when the domain knowledge results from empirical associations developed through years of experience in solving problems in a given area.
  • 65. Rules are expressed in the form of IF-THEN statements. For example: IF search is in collection of BOOKS THEN display Title, Author, Place, Publisher, Year, Physical Description, ISBN. IF search is in collection of ARTICLES THEN display Title, Author, Name of Journal, Volume, Issue, Year, ISSN. Formalism: a rule relates an antecedent clause (condition) to a consequent clause (action) by implication, e.g. IF (A AND B) THEN S1.
  • 66. The syntax structure is IF <premise> THEN <action>. <premise> is Boolean; the logical connectives AND and, to a lesser degree, OR and NOT are possible. <action> is a series of statements.
  • 67. In a rule-based expert system, the domain knowledge is represented as a set of rules that are checked against a collection of facts or knowledge about the current situation. When the IF portion of a rule is satisfied by the facts, the action specified by the THEN portion is performed. When the condition is satisfied, the rule is said to ‘fire’ or ‘execute’. A rule interpreter is used to compare the IF portions of rules with the facts and execute the rule whose IF portion matches the facts. This is a real success story of AI: tens of thousands of working systems have been deployed into many aspects of life.
  • 68. Normally, the term 'rule-based system' is applied to systems involving human-crafted or curated rule sets. Rule-based systems constructed using automatic rule inference, such as rule-based machine learning, are normally excluded from this system type. Rule-based systems are used as a way to store and manipulate knowledge to interpret information in a useful way. They are often used in artificial intelligence applications and research. A rule-based system (RBS, or production system) is a knowledge-based system (KBS) in which the knowledge is stored as rules; an expert system is an RBS in which the rules come from human experts in a particular domain.
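A very small rule interpreter for the BOOKS and ARTICLES display rules might look like the sketch below. It makes a single pass over the rules and fires each one whose IF part holds; it is not a full forward-chaining engine, and the `RULES` layout and the `run` function are assumptions made for this example.

```python
RULES = [
    # (rule name, IF premise over the facts, THEN action as new facts)
    ("books",
     lambda f: f.get("collection") == "BOOKS",
     {"display": ["Title", "Author", "Place", "Publisher", "Year",
                  "Physical Description", "ISBN"]}),
    ("articles",
     lambda f: f.get("collection") == "ARTICLES",
     {"display": ["Title", "Author", "Name of Journal", "Volume",
                  "Issue", "Year", "ISSN"]}),
]

def run(facts):
    """Fire every rule whose IF part is satisfied; return new facts and fired rules."""
    facts = dict(facts)               # work on a copy of the supplied facts
    fired = []
    for name, premise, action in RULES:
        if premise(facts):            # compare the IF portion with the facts
            facts.update(action)      # perform the THEN portion
            fired.append(name)
    return facts, fired

facts, fired = run({"collection": "ARTICLES"})
print(fired, facts["display"])
```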
  • 69. B. Indexing and Clustering. Indexing: An index or database index is a data structure used to quickly locate and access the data in a database table. Indexing is a way to optimize the performance of a database by minimizing the number of disk accesses required when a query is processed.
  • 70. Indexes are created using some database columns: • The first column is the Search key that contains a copy of the primary key or candidate key of the table. These values are stored in sorted order so that the corresponding data can be accessed quickly (Note that the data may or may not be stored in sorted order). • The second column is the Data Reference which contains a set of pointers holding the address of the disk block where that particular key value can be found.
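The two-column structure described above (a sorted search key plus a data reference to a disk block) can be imitated in memory. In this sketch a list slice stands in for a disk block, and the records, `BLOCK_SIZE` and the `find` function are hypothetical; the point is only that the sorted keys allow a binary search even though the data file itself is unsorted.

```python
from bisect import bisect_left

# Hypothetical data file: records stored in arrival order, four per "disk block".
records = [(42, "Rao"), (7, "Das"), (19, "Sen"), (88, "Roy"),
           (3, "Ali"), (61, "Bora"), (25, "Paul"), (70, "Nath")]
BLOCK_SIZE = 4

# Index: (search key, block number), kept sorted by search key.
index = sorted((key, pos // BLOCK_SIZE) for pos, (key, _) in enumerate(records))
keys = [k for k, _ in index]

def find(key):
    """Binary-search the index, then scan only the block it points to."""
    i = bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return None                       # key is not in the index
    start = index[i][1] * BLOCK_SIZE      # follow the data reference
    for k, name in records[start:start + BLOCK_SIZE]:
        if k == key:
            return name
    return None

print(find(61))
```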
  • 71. Clustered Indexing: A clustering index is defined on an ordered data file, where the file is ordered on a non-key field. In some cases the index is created on non-primary-key columns, which may not be unique for each record. In such cases, to identify records faster, two or more columns are grouped together to obtain unique values, and the index is built on them. This method is known as a clustering index. Basically, records with similar characteristics are grouped together and indexes are created for these groups. In the example below, students are grouped by semester, i.e. 1st semester students, 2nd semester students, 3rd semester students, etc.
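The semester grouping can be sketched as follows. The student records and the function name `students_in` are invented for the example; the data file is ordered on the non-key field (semester) and the clustering index keeps one entry per distinct semester, pointing at the first record of that group.

```python
# Hypothetical student records: (roll number, name, semester).
students = [(101, "Anita", 2), (102, "Bikash", 1), (103, "Chandan", 2),
            (104, "Dipa", 3), (105, "Eli", 1)]

# Order the data file on the non-key field (semester), then on roll number.
data_file = sorted(students, key=lambda s: (s[2], s[0]))

# Clustering index: one entry per distinct semester -> position of its first record.
cluster_index = {}
for pos, (_, _, semester) in enumerate(data_file):
    cluster_index.setdefault(semester, pos)

def students_in(semester):
    """Jump to the first record of the cluster and read until the semester changes."""
    pos = cluster_index.get(semester)
    result = []
    while pos is not None and pos < len(data_file) and data_file[pos][2] == semester:
        result.append(data_file[pos][1])
        pos += 1
    return result

print(students_in(2))
```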
  • 72.
  • 73. III. Search Agent: Search Agents are vital components of any ISAR system. They are programs which take input from the Search Interface and search the Knowledge Base using the existing index. A good ISAR system means efficient retrieval. Thus, a good search agent must be equipped with the following features: facility for using Boolean operators; context setting for search terms; use of clustering algorithms; use of phonetic algorithms (Soundex and Metaphone algorithms).
  • 74. Boolean Operators Boolean Operators are simple words (AND, OR, NOT or AND NOT) used as conjunctions to combine or exclude keywords in a search, resulting in more focused and productive results. AND and NOT operators increase precision whereas OR increases recall of search results. The shaded area in the diagram represents retrieved records in the following example.
  • 75.
  • 76.
  • 77. Using these operators can greatly reduce or expand the number of records returned. Boolean operators save time by focusing searches on more 'on-target' results that are more appropriate to your needs, eliminating unsuitable or inappropriate ones. Each search engine or database collection uses Boolean operators in a slightly different way, or may require the operator to be typed in capitals or with special punctuation. The specific phrasing will be found in either the guide to the specific database found in Research Resources or the search engine's help screens.
  • 78. AND: requires both terms to be in each item returned. If one term is contained in the document and the other is not, the item is not included in the resulting list. (Narrows the search.) Example: a search on stock market AND trading returns results containing: stock market trading; trading on the stock market; and trading on the late afternoon stock market.
  • 79. OR: either term (or both) will be in the returned document. (Broadens the search.) Example: a search on ecology OR pollution returns documents containing the word ecology (but not pollution), documents containing the word pollution (but not ecology), and documents containing both ecology and pollution, in any order and any number of times.
  • 80. NOT or AND NOT (depending on the coding of the database's search engine): the first term is searched, then any records containing the term after the operator are subtracted from the results. (Use it with care, as the attempt to narrow the search may be too exclusive and eliminate good records.) If you need to search for the word not itself, that can usually be done by placing double quotes (" ") around it. Example: a search on Mexico AND NOT city returns results containing: New Mexico; the nation of Mexico; US-Mexico trade; but does not return Mexico City or This city's trade relationships with Mexico.
  • 81. Using Parentheses: using ( ) to enclose search statements customizes your results to reflect your topic more accurately. Search engines process the search statements within the parentheses first, then apply any statements that are not enclosed. Example: a search on (smoking OR tobacco) AND cancer returns articles containing: smoking and cancer; tobacco and cancer; smoking, cancer, and tobacco; but does not return articles on smoking or tobacco in which cancer is not mentioned.
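Over an inverted index, the Boolean operators correspond directly to set operations: AND is intersection, OR is union, and NOT (AND NOT) is difference. The sketch below uses a handful of invented document titles; the `term` helper is an assumption for the example, and real search engines add their own query syntax on top.

```python
import re

# Hypothetical documents.
docs = {
    1: "Smoking and lung cancer",
    2: "Tobacco advertising",
    3: "Cancer risk from tobacco",
    4: "Ecology and pollution",
}

# Inverted index: term -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for word in re.findall(r"[a-z]+", text.lower()):
        index.setdefault(word, set()).add(doc_id)

def term(t):
    """Documents containing the term t."""
    return index.get(t.lower(), set())

# (smoking OR tobacco) AND cancer: parentheses are evaluated first.
print((term("smoking") | term("tobacco")) & term("cancer"))
# tobacco AND NOT cancer: subtract the records containing the second term.
print(term("tobacco") - term("cancer"))
```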
  • 82. Context Setting: Context setting requires content analysis of the document. Here one analyses the document manually or automatically in order to preserve the context of each term in the index. It can be done in two ways: i. Conceptual Analysis; ii. Relational Analysis.
  • 83. Conceptual analysis: Conceptual analysis can be thought of as measuring the frequency of concepts. A concept can be represented by text as well as pictures. To analyze a concept, one looks for the appearance of words in the text. The same word need not always appear; synonymous terms may be present. For example, if one is checking whether a certain document is about freedom, one should also look for related words like liberation, independence, etc.
  • 84. Relational analysis: Relational analysis goes one step further by examining the relationships among concepts in a text. Here one looks at the words that appear next to the word in question, for example the words that appear next to freedom, and then determines the related concepts. Freedom: i. Freedom of speech and expression: Article 19 (1) (a) of the Constitution of India, Fundamental Rights & Duties, …. ii. Freedom of opinion and expression: Article 19 of the UN Universal Declaration of Human Rights, Citizen’s responsibility, ….
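Both kinds of analysis can be approximated with simple counting. In the sketch below, conceptual analysis counts every word belonging to a synonym set for the concept, and relational analysis counts the words found within a small window around the word in question. The synonym set `CONCEPT`, the window size and the sample text are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical synonym set for the concept "freedom".
CONCEPT = {"freedom", "liberation", "independence", "liberty"}

def tokens(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())

def concept_frequency(text, concept=CONCEPT):
    """Conceptual analysis: count occurrences of any term expressing the concept."""
    return sum(1 for w in tokens(text) if w in concept)

def neighbours(text, word, window=2):
    """Relational analysis: count words within `window` positions of `word`."""
    words = tokens(text)
    counts = Counter()
    for i, w in enumerate(words):
        if w == word:
            lo, hi = max(0, i - window), i + window + 1
            counts.update(x for x in words[lo:hi] if x != word)
    return counts

text = ("Freedom of speech and expression is a fundamental right. "
        "Independence brought freedom of opinion.")
print(concept_frequency(text))
print(neighbours(text, "freedom"))
```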
  • 85. Clustering Algorithms: Clustering is one of the most common exploratory data analysis techniques used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different. Clustering is a method by which large sets of data are grouped into clusters of smaller sets of similar data based on some characteristics. A cluster refers to a collection of data points aggregated together because of certain similarities. For example, in a group of players one can cluster players according to the game they specialise in: those who play cricket, those who play hockey and so on.
  • 86. A clustering algorithm attempts to identify natural groups of components or data based on some similarity in a given population. In other words, it is a method of creating subclasses in a given class. The first step in such algorithms is the identification of a core entity, also known as the centroid. A centroid is the imaginary or real location representing the center of the cluster. Similar entities are then identified around the centroid. The final goal of a clustering algorithm is to represent unordered data in an organized way by dividing it into clusters.
  • 87. K-means Algorithm: The K-means algorithm tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation there is within a cluster, the more homogeneous (similar) its data points are.
  • 88. K-Means Clustering: The K-means algorithm identifies k centroids, and then allocates every data point to the nearest cluster, while keeping the distances between the data points and their centroids as small as possible. The ‘means’ in K-means refers to averaging of the data; that is, finding the centroid.
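A common way to carry out K-means is Lloyd's iteration, which alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The sketch below is a minimal pure-Python version for 2-D points; the function name `kmeans`, the random initialisation from the data points and the sample data are assumptions for the example, and library implementations are considerably more careful about initialisation.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's iteration on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # start from k distinct data points
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: each centroid moves to the mean of its points.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append((sum(x for x, _ in members) / len(members),
                                      sum(y for _, y in members) / len(members)))
            else:
                new_centroids.append(centroids[c])   # keep an empty cluster's centroid
        if new_centroids == centroids:               # converged
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
print(labels)
```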
  • 89. Mean Shift Clustering Algorithm Mean Shift is an unsupervised clustering algorithm: it groups data directly, without being trained on labelled data. It works iteratively, moving candidate cluster centres step by step towards denser regions of the data. Mean Shift essentially starts off with a kernel, which is basically a circular sliding window. The bandwidth, i.e. the radius of this sliding window, is decided in advance by the user.
  • 90. A very high-level view of the algorithm: STEP 1: Pick any random point and place the window on that data point. STEP 2: Calculate the mean of all the points lying inside this window. STEP 3: Shift the window so that it is centred on the location of the mean. STEP 4: Repeat until convergence. Mean shift clustering aims to discover “blobs” in a smooth density of samples. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates and form the final set of centroids.
  • 91. Mean-Shift Clustering: in a single window What we are trying to achieve here is to keep shifting the window to a region of higher density. This is why we keep shifting the window towards the centroid of all the points in the window. This behaviour is what makes Mean Shift a hill-climbing algorithm.
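The four steps for a single window can be sketched as follows. This is a simplified illustration under several assumptions that are not in the slides: the data is one-dimensional, the window is a flat interval of radius `bandwidth` (every point inside counts equally), and the function name `shift_window`, the tolerance and the sample values are made up for the example.

```python
def shift_window(start, points, bandwidth, tolerance=1e-6, max_iterations=100):
    """Hypothetical helper: move one window uphill until its centre stops moving."""
    centre = start  # STEP 1: place the window on a chosen data point
    for _ in range(max_iterations):
        # STEP 2: mean of all points lying inside the window.
        inside = [p for p in points if abs(p - centre) <= bandwidth]
        if not inside:
            break  # defensive guard: nothing left in the window
        mean = sum(inside) / len(inside)
        # STEP 4: stop once the shift is negligible (convergence).
        if abs(mean - centre) < tolerance:
            return mean
        centre = mean  # STEP 3: shift the window onto the mean
    return centre


# Made-up 1-D sample with two dense regions, around 1.0 and around 5.0.
points = [0.8, 1.0, 1.2, 4.9, 5.0, 5.1]
left_mode = shift_window(0.8, points, bandwidth=1.0)
right_mode = shift_window(5.1, points, bandwidth=1.0)
print(left_mode, right_mode)
```

Windows started on either side of the data climb to different dense regions. A full mean shift run would start a window at many points and then merge the windows that end up at nearly the same location.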
  • 95. Expectation–Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
  • 97. Phonetic algorithm • A phonetic algorithm is an algorithm for indexing words by their pronunciation. Most phonetic algorithms were developed for use with the English language; consequently, applying the rules to words in other languages might not give a meaningful result. • They are necessarily complex algorithms with many rules and exceptions, because English spelling and pronunciation are complicated by historical changes in pronunciation and by words borrowed from many languages.
  • 98. Best Known phonetic Algorithms: i. Metaphone Algorithm (Metaphone, Double Metaphone, and Metaphone 3) ii. Soundex iii. Daitch–Mokotoff Soundex iv. Cologne phonetics v. New York State Identification and Intelligence System (NYSIIS) vi. Match Rating Approach vii. Caverphone
  • 99. Metaphone is an algorithm that encodes the pronunciation of a word. Rather than working strictly letter by letter, it encodes groups of letters, and so it embodies the rules of English pronunciation more accurately than Soundex. Such algorithms are well established for English. Both Soundex and Metaphone return all the words that exactly match the desired word as well as all similar-sounding names. Metaphone has gone through several versions in its development, such as Double Metaphone and Metaphone 3, each aiming at better accuracy than its predecessor.
  • 100. Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. Soundex and Metaphone are similar kinds of algorithm: both are based on the way a word is pronounced. In the Soundex algorithm, a numeric code is assigned to the consonants of a word, and when a search is performed, words with the same code are also brought out in the search result.
  • 101. Soundex is the most widely known of all phonetic algorithms, in part because it is a standard feature of popular database software (such as DB2, PostgreSQL, MySQL, SQLite, Ingres, MS SQL Server and Oracle), and it is often used (incorrectly) as a synonym for "phonetic algorithm".
  • 102. Common uses • Spell checkers often contain phonetic algorithms. The Metaphone algorithm, for example, can take an incorrectly spelled word and create a code. The code is then looked up in a directory for words with the same or a similar Metaphone code. Words that have the same or a similar Metaphone code become possible alternative spellings. • Search functionality often uses phonetic algorithms to find results that do not exactly match the term(s) used in the search. Searching for names can be difficult, as there are often multiple alternative spellings. An example is the name Claire. It has two alternatives, Clare and Clair, which are both pronounced the same. Searching for one spelling would not show results for the other two. With Soundex, all three variations produce the same code, C460, so searching names by their Soundex code returns all three variations.
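The Claire/Clare/Clair example can be checked with a short sketch of the commonly described American Soundex rules: keep the first letter, replace the remaining consonants with digits, code adjacent letters with the same digit only once, drop vowels, and pad or cut the result to four characters. The function name `soundex` and the lookup table below are written for this illustration only; database products may implement slightly different variants.

```python
# Digit groups of the commonly described American Soundex scheme.
GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def soundex(name):
    """Illustrative Soundex: first letter plus up to three digits, zero padded."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = CODES.get(ch, "")  # vowels (and y) have no code
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the "previous code" to empty
    return (result + "000")[:4]


for name in ("Claire", "Clare", "Clair"):
    print(name, soundex(name))
```

All three spellings map to C460, so an index keyed on the Soundex code would return them together for a search on any one of them.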
  • 103. Evaluation of ISAR systems Evaluation is a systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards. It can assist an organization, program, project or any other intervention or initiative to assess any aim, realisable concept/proposal, or any alternative, to help in decision making; or to ascertain the degree of achievement or value in regard to the aim and objectives and results of any such action that has been completed.
  • 104. Evaluation is the structured interpretation and giving of meaning to the predicted or actual impacts of proposals or results. It looks at the original objectives, at what was either predicted or accomplished, and at how it was accomplished. Evaluation can be formative, that is, taking place during the development of a concept or proposal, project or organization, with the intention of improving its value or effectiveness. It can also be summative, drawing lessons from a completed action, project or organization at a later point in time or circumstance.
  • 105. Evaluation is inherently a theoretically informed approach, and consequently any particular definition of evaluation would have to be tailored to its context: the theory, approach, needs, purpose, and methodology of the evaluation process itself. It can be described as a systematic, rigorous, and meticulous application of scientific methods to assess the design, implementation, improvement, or outcomes of a program. It is a resource-intensive process, frequently requiring evaluator expertise, labour, time, and a sizeable budget.
  • 106. Evaluation of information retrieval systems measures which of two existing systems performs better and tries to assess how the level of performance of a given system can be improved. That is, it measures two parameters: i. Effectiveness ii. Efficiency
  • 107. Effectiveness means the level up to which the given system attains its objectives. Thus, in an information retrieval system, effectiveness may be a measure of how far the system can retrieve relevant information accurately while withholding non-relevant information. A search engine that is extremely fast is of no use unless it produces good results.
  • 108. Efficiency means how economically the system achieves its objectives. In an information retrieval system, efficiency can be measured by factors such as cost. The cost factors have to be calculated indirectly. They include response time, i.e. the time taken by the system to provide an answer, and user effort, i.e. the amount of time and effort needed by a user to interact with the system and analyse the retrieved output in order to get the correct information.
  • 109. Lancaster states that the evaluation of an information retrieval system can be justified by the following three issues: 1. How well the system is satisfying its objectives 2. How efficiently it is satisfying its objectives, and 3. Whether the system justifies its existence.
  • 110. PURPOSE OF EVALUATION Swanson states seven purposes of evaluation: 1. To assess a set of goals, a programme plan, or a design prior to implementation. 2. To determine whether and how well goals or performance expectations are being fulfilled. 3. To determine specific reasons for success and failure. 4. To uncover principles underlying a successful programme. 5. To explore techniques for increasing programme effectiveness. 6. To establish a foundation for further research on the reasons for the relative success of alternative techniques, and 7. To improve the means employed for attaining objectives or to redefine sub-goals or goals in view of research findings.
  • 111. Keen gives three major purposes of evaluation for an information retrieval system: 1. The need for measures with which to make merit comparisons within a single test situation. In other words, evaluation studies are conducted to compare the merits or demerits of two or more systems. 2. The need for measures with which to make comparisons between results obtained in different test situations. 3. The need for assessing the merit of a real-life system.
  • 112. EVALUATION CRITERIA FOR ISRS Evaluation of information retrieval is conducted from two different viewpoints: 1. Managerial view: when evaluation is conducted from the managerial point of view, it is called managerial-oriented evaluation. 2. User view: when evaluation is conducted from the user's point of view, it is called user-oriented evaluation.
  • 113. Criteria for evaluation of ISRS (Managerial view) Lancaster in 1971 proposed five evaluation criteria: 1. Coverage of the system 2. Ability of the system to retrieve wanted items (i.e. recall) 3. Ability of the system to avoid retrieval of unwanted items (i.e. precision) 4. The response time of the system, and 5. The amount of effort required by the user
  • 114. Vickery advocates six criteria for the evaluation of ISRS, grouped into two sets as follows: Set 1 1. Coverage- the proportion of the total potentially useful literature that has been analyzed. 2. Recall- the proportion of such references that are retrieved in a search, and 3. Response time- the average time needed to obtain a response from the system.
  • 115. Set 2 4. Precision- the ability of the system to screen out irrelevant references 5. Usability- the value of the references retrieved, in terms of such factors as their reliability, comprehensibility, currency and 6. Presentation- the form in which search results are presented to the user.
  • 116. Cleverdon (1966) identified six criteria for the evaluation of ISRS: 1. Recall- the ability of the system to present all the relevant items. 2. Precision- the ability of the system to present only those items that are relevant. 3. Time lag- the average interval between the time the search request is made and the time an answer is provided. 4. Effort- the intellectual as well as physical effort required from the user in obtaining answers to the search request. 5. Form of presentation- the form of the search output, which affects the user's ability to make use of the relevant items, and 6. Coverage of the collection- the extent to which the system includes relevant matter.
  • 117. Criteria for evaluation of ISRS (User-Centred Evaluation) User-based evaluation is the most common evaluation approach, advocated by many information scientists. The criteria for evaluating an information retrieval system include: 1. Recall 2. Precision 3. Fallout 4. Generality
  • 118. The user centred approach examines the information seeking task in the context of human behaviour in order to understand more completely the nature of user interaction with an information system. User centred evaluation is based on the premise that understanding user behaviour facilitates more effective system design. These studies examine the user from a behavioural science perspective using methods common to psychology, sociology, and anthropology.
  • 119. In examining user-centred approaches, two methods can be applied: the qualitative method of evaluation and the quantitative method of evaluation.
  • 120. Qualitative method of evaluation Qualitative methods of evaluation such as case studies, focus groups or in-depth interviews can be combined with objective measures to produce more effective information retrieval research and evaluation. Quantitative method of evaluation In the quantitative method of evaluation, empirical methods such as experimentation are frequently employed to observe and probe subjective and affective factors quantitatively.
  • 121. According to Saracevic & Kantor (1988), the key to the future of information systems and searching processes lies not in increased sophistication of technology, but in increased understanding of human involvement with information. Therefore, there has been an increased interest in qualitative methods that capture the complexity and diversity of human experience in information storage and retrieval system and its process.
  • 122. Recall The term recall refers to a measure of whether a particular item is retrieved, or the extent to which the retrieval of wanted items occurs. Recall is defined as the proportion of the relevant documents stored in the collection that are retrieved.
  • 123. Whenever a user puts a query, it is the responsibility of the system to retrieve all the items that are relevant to that query. When the collection is large, it is usually not possible to retrieve all the relevant items. Thus, a system is able to retrieve only a proportion of the total relevant documents in response to a given query. The performance of a system is often measured by the recall ratio, which denotes the percentage of relevant items retrieved in a given situation.
  • 124. The general formula for calculating recall may be stated as: $\text{Recall} = \frac{\text{number of relevant items retrieved}}{\text{total number of relevant items in the collection}} \times 100$
  • 125. For example, if there are 100 documents in a collection that are relevant to a given query and 60 of these are retrieved in a given search, then the recall is said to be 60%: $\text{Recall} = \frac{\text{number of relevant items retrieved}}{\text{total number of relevant items in the collection}} \times 100 = \frac{60}{100} \times 100 = 60\%$. In other words, the system has been able to retrieve 60% of the relevant items.
  • 126. Precision Precision means how precisely a particular system functions. It is defined as the proportion of the retrieved documents that are relevant. The non-relevant items among those retrieved have to be discarded by the user. The general formula for calculating precision may be stated as: $\text{Precision} = \frac{\text{number of relevant items retrieved}}{\text{total number of items retrieved}} \times 100$
  • 127. For example, if in a given search the system retrieves 80 items, out of which 60 are relevant and 20 are non-relevant, the precision is 75%: $\text{Precision} = \frac{\text{number of relevant items retrieved}}{\text{total number of items retrieved}} \times 100 = \frac{60}{80} \times 100 = 75\%$
  • 128. Recall-precision matrix Recall relates to the ability of the system to retrieve relevant documents, and precision relates to its ability not to retrieve non-relevant documents. The ideal system would achieve 100% recall and 100% precision, but this is not possible in practice, because as the level of recall increases, precision tends to decrease. According to Lancaster, recall and precision tend to vary inversely.
  • 129. The following example shows the relationship between recall and precision for a given search. In a given situation a system: i. retrieves a + b documents, out of which ii. a documents are relevant, and iii. b documents are non-relevant (but retrieved). iv. c + d documents are left in the collection after the search has been conducted. v. Out of these c + d documents, c documents are relevant to the query but could not be retrieved, and vi. d documents are not relevant (and not retrieved) and thus have been correctly rejected.
  • 130. Recall-precision matrix Lancaster suggests that these statistics can be represented in a 2 x 2 matrix, as shown below:

    |               | Relevant   | Not relevant | Total         |
    | Retrieved     | a (Hits)   | b (Noise)    | a + b         |
    | Not retrieved | c (Misses) | d (Rejected) | c + d         |
    | Total         | a + c      | b + d        | a + b + c + d |
  • 131. The system retrieves a relevant documents along with b non-relevant documents. Following Lancaster, a denotes the hits and b denotes the noise. Out of the remaining c + d documents, the system misses c documents that should have been retrieved, but it correctly rejects d documents that are not relevant to the given query. The recall and precision ratios in this case can be calculated as $R = \frac{a}{a + c} \times 100$ and $P = \frac{a}{a + b} \times 100$.
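As a rough sketch, the four cells of the matrix can be counted directly from sets of document identifiers. The function name `contingency`, the integer identifiers and the collection size of 1000 are assumptions made purely for illustration; the counts are chosen to reproduce the worked figures used earlier (100 relevant documents, 80 retrieved, 60 of them relevant).

```python
def contingency(retrieved, relevant, collection):
    """Hypothetical helper: return the cells (a, b, c, d) of the 2 x 2 matrix."""
    retrieved, relevant, collection = set(retrieved), set(relevant), set(collection)
    a = len(retrieved & relevant)               # hits
    b = len(retrieved - relevant)               # noise
    c = len(relevant - retrieved)               # misses
    d = len(collection - retrieved - relevant)  # correctly rejected
    return a, b, c, d


# Made-up identifiers: documents 0..99 are relevant, documents 40..119 are retrieved.
collection = range(1000)
relevant = range(100)
retrieved = range(40, 120)

a, b, c, d = contingency(retrieved, relevant, collection)
recall = 100 * a / (a + c)      # R = a / (a + c) x 100
precision = 100 * a / (a + b)   # P = a / (a + b) x 100
print(a, b, c, d, recall, precision)
```

With these sets the cells come out as a = 60, b = 20, c = 40 and d = 880, giving a recall of 60% and a precision of 75%.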
  • 132. The value of recall can be increased by increasing the value of a, that is, by retrieving a greater number of relevant items. This can be achieved by increasing the number of retrieved documents, but as the number of items retrieved increases, so does the likelihood of retrieving non-relevant items, that is b, which decreases the value of precision. Lancaster therefore states that recall and precision tend to vary inversely. In a retrieval environment, when we want to retrieve more relevant items, we generally broaden our search.
  • 133. The relationship between recall and precision can be examined by considering searches held at different levels with the same set of documents and the same request. Beginning with very general search terms, high recall and low precision can be achieved, and as the search terms become more and more specific, recall tends to go down and precision tends to go up. In real-life situations, users normally do not want very high recall. In general, most users want a few documents in response to a query, meaning a moderate level of recall.
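The inverse tendency can be illustrated with a small sketch. It assumes, purely for illustration, that the system returns a ranked list and that "broadening the search" corresponds to looking further down that list; the relevance judgements and the function name `precision_recall_at_cutoffs` are made up for the example.

```python
def precision_recall_at_cutoffs(ranked_relevance, total_relevant):
    """Hypothetical helper: (cutoff, precision %, recall %) for every cutoff k."""
    rows = []
    hits = 0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        rows.append((k, 100 * hits / k, 100 * hits / total_relevant))
    return rows


# Made-up judgements for the top 10 retrieved items; 5 relevant items exist in total.
ranking = [True, True, False, True, False, False, True, False, False, False]
rows = precision_recall_at_cutoffs(ranking, total_relevant=5)
for k, precision, recall in rows:
    print(f"top {k:2d}: precision {precision:5.1f}%  recall {recall:5.1f}%")
```

In this made-up ranking, looking at only the top 2 items gives 100% precision but 40% recall, while looking at all 10 raises recall to 80% and lowers precision to 40%.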
  • 134. Limitations of recall and precision i. Difference in the level of recall required: Different users may want different levels of recall. A person preparing a state-of-the-art report on a topic would like to have all the items available on the topic and will therefore go for high recall, whereas a user who simply wants to know about a given topic will prefer to have a few items and thus will not require high recall.
  • 135. ii. Difference in judgment on degree of relevance Another drawback of recall is that it assumes that all relevant items have the same value, which is not true. The retrieved items may have different degrees of relevance, and this may vary from user to user, and even from time to time for the same user. Both recall and precision depend largely on the relevance judgment of the user.
  • 136. iii. Measures for system performance, not for relevance judgment Despite their apparent simplicity, these are slippery concepts, depending for their definition on relevance judgments which are subjective at best. Because these criteria are document-based, they measure only the performance of the system in retrieving items in response to the information need. They do not consider how the information will be used, or whether, in the judgment of the user, the documents fulfil the information need. These limitations of precision and recall have been acknowledged, and the need for additional measures and different criteria for effectiveness has been identified.
  • 137. Fallout The fallout ratio is the proportion of non-relevant items that have been retrieved out of all the non-relevant documents available in a given search: $\text{Fallout} = \frac{\text{number of retrieved non-relevant documents}}{\text{total number of non-relevant documents}} \times 100$
  • 138. Generality The generality ratio is the proportion of relevant items (retrieved and non-retrieved) in a given search: $\text{Generality} = \frac{\text{number of relevant documents}}{\text{total number of documents}} \times 100$
  • 139. Retrieval Measure

    | Symbol | Evaluation measure | Formula                   | Explanation                                   |
    | R      | Recall             | a / (a + c)               | Proportion of relevant items retrieved        |
    | P      | Precision          | a / (a + b)               | Proportion of retrieved items that are relevant |
    | F      | Fallout            | b / (b + d)               | Proportion of non-relevant items retrieved    |
    | G      | Generality         | (a + c) / (a + b + c + d) | Proportion of relevant items per query        |
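The four formulas in the table translate directly into code. The sketch below is illustrative only: the function name `retrieval_measures` and the sample counts are assumptions, the results are plain proportions (multiply by 100 for percentages), and no guard is included for empty rows or columns of the matrix, where a denominator would be zero.

```python
def retrieval_measures(a, b, c, d):
    """Hypothetical helper: the four measures from the cells of the 2 x 2 matrix.

    a = hits, b = noise, c = misses, d = correctly rejected.
    """
    return {
        "recall": a / (a + c),
        "precision": a / (a + b),
        "fallout": b / (b + d),
        "generality": (a + c) / (a + b + c + d),
    }


# Made-up counts: 80 items retrieved (60 relevant), 40 relevant items missed,
# 880 non-relevant items correctly left out of the result.
measures = retrieval_measures(a=60, b=20, c=40, d=880)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Note that fallout and generality need the size of the whole collection (through d), whereas recall and precision do not.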
  • 140. Assessment of Evaluation criteria Different stakeholders, such as information professionals, systems designers and users, may have different needs and expectations of an IR system, and the objectives, decisions, processes, design or actions of an IR system are set accordingly. Evaluation is a process whose main purpose is to assess whether the IR system is doing what it is expected to do. This assessment is done by measuring features such as the recall, precision, fallout and generality ratios. The analysis of these measurements determines the performance level of the IR system with respect to the following: i. Effectiveness ii. Usability iii. Satisfaction iv. Cost
  • 141. Effectiveness Effectiveness is the system’s ability to retrieve relevant information which meets the needs of the user. The two most commonly used measures of system performance are the recall ratio and the precision ratio.

    |               | Relevant | Not relevant |
    | Retrieved     | A        | B            |
    | Not retrieved | C        | D            |
    | Totals        | A + C    | B + D        |
  • 142. The search results in the table above may have four possible outcomes: 1. Relevant documents successfully retrieved: A (hits) 2. Non-relevant documents retrieved: B (noise) 3. Relevant documents that the system failed to retrieve: C (misses) 4. Non-relevant documents not retrieved, i.e. successfully rejected: D. Recall, the system’s ability to retrieve relevant information/documents: $\text{Recall} = \frac{\text{total relevant retrieved}}{\text{total relevant in system}} \times 100 = \frac{A}{A + C} \times 100$. Precision, the system’s ability to suppress irrelevant information/documents: $\text{Precision} = \frac{\text{total relevant retrieved}}{\text{total retrieved}} \times 100 = \frac{A}{A + B} \times 100$.
  • 143. Fallout, which also reflects the system’s ability to suppress irrelevant information/documents (the lower the value, the better): $\text{Fallout} = \frac{\text{total irrelevant retrieved}}{\text{total irrelevant}} \times 100 = \frac{B}{B + D} \times 100$. Thus, assessment of all the above factors, i.e. recall, precision and fallout, actually measures the effectiveness of an IR system. Indexing systems and search software should be designed to maximize both recall and precision, in other words, to minimize noise and misses. It may be difficult to measure the total number of relevant documents in an IRS, because it involves examining every document in the system for its potential relevance to a specific search query. For web search engines such as Google this is clearly impossible.
  • 144. Usability Usability is part of the broader term “user experience” and refers to the ease of access and/or use of a product or website. A design is not usable or unusable per-se; its features, together with the context of the user (what the user wants to do with it and the user’s environment), determine its level of usability. The official ISO 9241-11 definition of usability is: “the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use.”
  • 145. A usable interface has three main outcomes: • It should be easy for the user to become familiar with and competent in using the user interface during the first contact with the website. • It should be easy for users to achieve their objective through using the website. If a user has the goal of booking a flight, a good design will guide him/her through the easiest process to purchase that ticket. • It should be easy to recall the user interface and how to use it on subsequent visits. So, a good design on the travel agent’s site means the user should learn from the first time and book a second ticket just as easily. Usability is what determines whether a design’s existing attributes make it stand or fall
  • 147. Satisfaction There is no agreed definition of user satisfaction within the information science and information system communities. User satisfaction is a subjective variable, which can be influenced by several factors such as system effectiveness, user effectiveness, user effort, and user characteristics and expectations. Therefore, information retrieval evaluators should consider all these factors in obtaining user satisfaction and in using it as a criterion of system effectiveness. Applegate outlines three different models of searcher satisfaction, namely: i. The material satisfaction model ii. The emotional satisfaction- simple path model iii. The emotional satisfaction- multiple path model
  • 148. Search results would be an appropriate measure for the material satisfaction model. Both emotional satisfaction models are based upon subjective impressions and assessments, which may be affected by factors such as: i. Search task ii. Search setting iii. The searcher’s ability, quality and judgment in digital environments iv. Service quality v. Website quality vi. Literature used
  • 149. Cost Users may experience costs in terms of any payment that they need to make for system or document access, but the most significant cost is associated with the time that they spend searching a system. The search algorithm, the options for the display of hits, the seamlessness of the stages in individual systems and the interoperability between systems are important factors in satisfying users, regardless of monetary cost.
  • 150. Information Retrieval Models An Information Retrieval Model is a framework, process or method for matching an information need against, and retrieving information from, databases, knowledge bases and information systems. The goal of information retrieval (IR) is to provide users with those documents that will satisfy their information need. We use the word "document" as a general term that could also include non-textual information, such as multimedia objects.
  • 151. According to Marcus (1994) and Marchionini (1992), information seeking is a form of problem solving. It proceeds according to the interaction among eight sub-processes: i. problem recognition and acceptance, ii. problem definition, iii. search system selection, iv. query formulation, v. query execution, vi. examination of results (including relevance feedback), vii. information extraction, and viii. reflection/iteration/termination. To be able to perform effective searches, users also have to develop the following expertise: i. knowledge about various sources of information, ii. skills in defining search problems and applying search strategies, iii. competence in using electronic search tools.
  • 153. Figure: A general overview of the information retrieval process, adapted from Lancaster and Warner (1993).
  • 154. The Figure above represents a general model of the information retrieval process, where both the user's information need and the document collection have to be translated into the form of surrogates to enable the matching process to be performed. This figure has been adapted from Lancaster and Warner (1993).
  • 155. How a general IR Model works 1. Users have to formulate their information need in a form that can be understood by the retrieval mechanism 2. Likewise, the contents of large document collections need to be described in a form that allows the retrieval mechanism to identify the potentially relevant documents quickly. (In both cases, information may be lost in the transformation process leading to a computer-usable representation. Hence, the matching process is inherently imperfect)
  • 156. 3. Once the specified query has been executed by the IR system, the user is presented with the retrieved document surrogates. 4. Either the user is satisfied by the retrieved information, or they will evaluate the retrieved documents and modify the query to initiate a further search. The process of query modification based on user evaluation of the retrieved documents is known as relevance feedback. (Information retrieval is an inherently interactive process, and users can change direction by modifying the query surrogate, the conceptual query or their understanding of their information need.) 5. Studies investigating the information-seeking process describe information retrieval in terms of the cognitive and affective symptoms commonly experienced by a library user.
  • 157. How a general IR Model works 1. Users have to formulate their information need in a form that can be understood by the retrieval mechanism. The information need can be understood as forming a pyramid, where only its peak is made visible by users in the form of a conceptual query. The conceptual query captures the key concepts and the relationships among them. It is the result of a conceptual analysis that operates on the information need, which may be well or vaguely defined in the user's mind. This analysis can be challenging, because users are faced with the general "vocabulary problem" as they try to translate their information need into a conceptual query. This problem refers to the fact that a single word can have more than one meaning and, conversely, the same concept can be described by surprisingly many different words. Further, the concepts used to represent the documents can be different from the concepts used by the user. The conceptual query can take the form of a natural language statement, a list of concepts that can have degrees of importance assigned to them, or a statement that coordinates the concepts using Boolean operators. Finally, the conceptual query has to be translated into a query surrogate that can be understood by the retrieval system.
  • 158. 2. Likewise, the contents of large document collections need to be described in a form that allows the retrieval mechanism to identify the potentially relevant documents quickly. As with the information need in point 1, the meanings of documents need to be represented in the form of text surrogates that can be processed by computer. A typical surrogate can consist of a set of index terms or descriptors. The text surrogate can consist of multiple fields, such as the title, abstract and descriptor fields, to capture the meaning of a document at different levels of resolution or to focus on different characteristic aspects of a document.
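A minimal sketch of the matching step between a query surrogate and text surrogates is given below. Everything in it is hypothetical and made up for illustration: the document identifiers, titles and descriptors, and the helper names `match_all` (a Boolean AND over descriptors) and `match_any` (a Boolean OR). Real systems use far richer surrogates and index structures.

```python
# Made-up text surrogates: each document is represented by a title field
# and a set of descriptors (index terms).
surrogates = {
    "doc1": {"title": "Evaluating retrieval systems",
             "descriptors": {"information retrieval", "evaluation", "recall"}},
    "doc2": {"title": "Clustering methods",
             "descriptors": {"clustering", "k-means"}},
    "doc3": {"title": "Precision and recall",
             "descriptors": {"evaluation", "precision", "recall"}},
}


def match_all(surrogates, terms):
    """Boolean AND: documents whose descriptors contain every query term."""
    wanted = set(terms)
    return sorted(doc_id for doc_id, s in surrogates.items()
                  if wanted <= s["descriptors"])


def match_any(surrogates, terms):
    """Boolean OR: documents whose descriptors contain at least one query term."""
    wanted = set(terms)
    return sorted(doc_id for doc_id, s in surrogates.items()
                  if wanted & s["descriptors"])


print(match_all(surrogates, ["evaluation", "recall"]))
print(match_any(surrogates, ["clustering", "precision"]))
```

The sketch also shows why the matching process is imperfect: a document is found only if the indexer and the searcher happened to choose the same terms for the same concept.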
  • 159. 3. Once the specified query has been executed by the IR system, the user is presented with the retrieved document surrogates. (A typical document surrogate can consist of a set of index terms or descriptors. The text surrogate can consist of multiple fields, such as the title, abstract and descriptor fields, to capture the meaning of a document.)
  • 160. 4. Either the user is satisfied by the retrieved information, or they will evaluate the retrieved documents and modify the query to initiate a further search. The process of query modification based on user evaluation of the retrieved documents is known as relevance feedback. Information retrieval is an inherently interactive process, and users can change direction by modifying the query surrogate, the conceptual query or their understanding of their information need.
  • 161. 5. Studies investigating the information-seeking process describe information retrieval in terms of the cognitive and affective symptoms commonly experienced by a library user. Symptoms such as uncertainty, confusion, and frustration are nearly universal experiences in the early stages of the search process, and they decrease as the search process progresses, while feelings of being confident, satisfied, sure and relieved increase. The studies also indicate that cognitive attributes may affect the search process. Users' expectations of the information system and the search process may influence the way they approach searching and therefore affect their intellectual access to information. The findings of Kuhlthau et al. (1990) indicate that thoughts about the information need become clearer and more focused as users move through the search process.
  • 162. Search or Browsing? Analytical search strategies require the formulation of specific, well-structured queries and a systematic, iterative search for information. Browsing involves the generation of broad query terms and a scanning of much larger sets of information in a relatively unstructured fashion.
  • 163. Campagnoni et al. (1989) found in information retrieval studies of hypertext systems that the predominant search strategy is "browsing" rather than "analytical search". Many users, especially novices, are unwilling or unable to formulate their search objectives precisely, and browsing places less cognitive load on them. Furthermore, the research showed that search strategy is only one dimension of effective information retrieval.
  • 164. Irrespective of any retrieval environment, the following four main system components must be taken into account in formulation of the retrieval problem. a) The objects, documents, or records themselves (which in the aggregate constitute the information files to be processed); b) The information identifiers, terms, index terms, keywords, attributes, etc. (which characterise the records or documents and represent the information content in each case); c) The information requests (which enter into the system and are to be compared with the stored records for retrieval); and d) The relevance information (often supplied by the users of the system connecting the information requests to the stored information items).
  • 165. MODELS BASED ON INPUT/OUTPUT On the basis of input and the output, Information Retrieval Models can be grouped into three basic categories: i) Data Retrieval Model ii) Information Retrieval Model iii) Knowledge Retrieval Model.
  • 166. i) Data Retrieval Model The data retrieval model essentially handles data, which may be taken as unprocessed information or a preliminary phase of information. Data are unbiased facts from which information can be formed. Here, the expression of the information need should be very precise. Examples include population data, day-to-day temperature, daily rainfall and transaction status at an ATM. The data retrieval model is a simple model of information retrieval that needs specific matching techniques.
  • 167. ii) Information Retrieval Model The information retrieval model combines several data into a relational structure of information. It is therefore more complex than the data retrieval model, because it has to comprehend multi-dimensional relationships amongst data. It is not easily amenable to a taxonomic structure. The representation of information is to be based on a relational database structure using some associative mathematics. The expression of the information need is also complex and time consuming. It may draw out into a long conversational or browsing process, and the information retrieval model must incorporate the facilities and interfaces to support this.
  • 168. iii) Knowledge Retrieval Model Knowledge is a kind of integration of general types of information. It normally occurs in the human mind, which infers and integrates several coordinates with the information it receives. So, knowledge is assimilated information. In order to facilitate decision-making and problem solving, intelligent knowledge-based information retrieval models are coming up. Such systems comprise three basic aspects: a) knowledge base, b) inference engine, c) user interface.
  • 169. a) The knowledge base is a store of accumulated rules for converting information into knowledge. It also incorporates a knowledge acquisition system. b) The inference engine is capable of deriving appropriate information from a combination of rules in order to produce synthesized knowledge. This derivation is based on inferential logic using quantitative and non-quantitative techniques. c) The user interface is the conversational process in the model, which is capable of receiving information in conversation mode and converting it into database signals for interaction purposes. Thus, a knowledge retrieval model is a sophisticated model of information processing, organization and retrieval.
  • 170. MAJOR IR MODELS (BASED ON THEORIES AND TOOLS) 1. Boolean Retrieval 1.1 Standard Boolean 1.2 Narrowing and Broadening Techniques 1.3 Smart Boolean Models 1.4 Extended Boolean Models 2. Statistical Model 2.1 Vector Space Model 2.2 Probabilistic Model 2.3 Latent Semantic Indexing 3. Linguistic and Knowledge-based Approaches 3.1 DR-LINK Retrieval System
  • 171. 1.1 Standard Boolean Boolean logic allows a user to logically relate multiple concepts together to define what information is needed. The typical Boolean operators are AND, OR, and NOT. These operations are implemented using set intersection, set union and set difference procedures. A few systems introduced the concept of ‘Exclusive OR’ but it is not generally useful to users since most users do not understand it.
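The mapping from Boolean operators to set operations can be sketched in a few lines of Python. This is a minimal illustration only: the toy inverted index, the document ids and the helper name `postings` are hypothetical and do not come from any particular retrieval system.

```python
# Toy inverted index (hypothetical data): term -> set of document ids.
index = {
    "information": {1, 2, 3, 5},
    "retrieval": {2, 3, 4},
    "library": {1, 4},
}

def postings(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return index.get(term, set())

# information AND retrieval -> set intersection
and_result = postings("information") & postings("retrieval")
# information OR retrieval -> set union
or_result = postings("information") | postings("retrieval")
# information NOT library -> set difference
not_result = postings("information") - postings("library")

print(sorted(and_result), sorted(or_result), sorted(not_result))
```

Because every operator returns a plain set, there is no ordering among the retrieved documents, which is the exact-match behaviour the standard Boolean model is known for.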
  • 172. 1.1 Standard Boolean It has the following strengths: 1. It is easy to implement and it is computationally efficient [Frakes and Baeza-Yates 1992]. Hence, it is the standard model for the current large-scale, operational retrieval systems and many of the major on-line information services use it. 2. It enables users to express structural and conceptual constraints to describe important linguistic features [Marcus 1991]. Users find that synonym specifications (reflected by OR-clauses) and phrases (represented by proximity relations) are useful in the formulation of queries [Cooper 1988, Marcus 1991]. 3. The Boolean approach possesses great expressive power and clarity. Boolean retrieval is very effective if a query requires an exhaustive and unambiguous selection. 4. The Boolean method offers a multitude of techniques to broaden or narrow a query. 5. The Boolean approach can be especially effective in the later stages of the search process, because of the clarity and exactness with which relationships between concepts can be represented.
  • 173. The standard Boolean approach has the following shortcomings: 1. Users find it difficult to construct effective Boolean queries for several reasons [Cooper 1988, Fox and Koll 1988, Belkin and Croft 1992]. The natural language words AND, OR and NOT have a different meaning when used as query operators, so users who rely on their knowledge of English make errors when they form a Boolean query. 2. Only documents that satisfy a query exactly are retrieved. The AND operator is too severe because it does not distinguish between the case where none of the concepts are satisfied and the case where all except one are satisfied. Hence, no or very few documents are retrieved when more than three or four criteria are combined with the Boolean operator AND (referred to as the Null Output problem). On the other hand, the OR operator does not reflect how many concepts have been satisfied. Hence, often too many documents are retrieved (the Output Overload problem). 3. It is difficult to control the number of retrieved documents. Users are often faced with the null-output or the output-overload problem and are at a loss as to how to modify the query to retrieve a reasonable number of documents. 4. The traditional Boolean approach does not provide a relevance ranking of the retrieved documents, although modern Boolean approaches can make use of the degree of coordination, field level and degree of stemming present to rank them [Marcus 1991]. 5. It does not represent the degree of uncertainty or error due to the vocabulary problem [Belkin and Croft 1992].
  • 174.
  • 175. 1.2 Narrowing and Broadening Techniques A Boolean query can be described in terms of the following four operations: i. degree and type of coordination, ii. proximity constraints, iii. field specifications and iv. degree of stemming as expressed in terms of word/string specifications. If users want to (re)formulate a Boolean query then they need to make informed choices along these four dimensions to create a query that is sufficiently broad or narrow depending on their information needs.
  • 176. Most narrowing techniques lower recall as well as raise precision, and most broadening techniques lower precision as well as raise recall. Any query can be reformulated to achieve the desired precision or recall characteristics, but generally it is difficult to achieve both. Each of the four kinds of operations in the query formulation has particular operators, some of which tend to have a narrowing or broadening effect. For each operator with a narrowing effect, there is one or more inverse operators with a broadening effect [Marcus 1991]. Hence, users require help to gain an understanding of how changes along these four dimensions will affect the broadness or narrowness of a query.
  • 177. The four dimensions affect the broadness or narrowness of a query as follows: 1) Coordination: the Boolean operators AND, OR and NOT have the following effects when used to add a further concept to a query: a) the AND operator narrows a query; b) the OR operator broadens it; c) the effect of the NOT operator depends on whether it is combined with an AND or an OR operator. Typically, in searching textual databases, the NOT is connected to the AND, in which case it has a narrowing effect like the AND operator. 2) Proximity: The closer together two terms have to appear in a document, the narrower and more precise the query. The most stringent proximity constraint requires the two terms to be adjacent.
  • 178. 3) Field level: current document records have fields associated with them, such as the "Title", "Index", "Abstract" or "Full-text" field: a) the more fields that are searched, the broader the query; b) the individual fields have varying degrees of precision associated with them, where the "Title" field is the most specific and the "Full-text" field is the most general. 4) Stemming: The shorter the prefix that is used in truncation-based searching, the broader the query. By reducing a term to its morphological stem and using it as a prefix, users can retrieve many terms that are conceptually related to the original term [Marcus 1991].
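The broadening effect of a shorter truncation prefix can be seen with a small sketch. The vocabulary list and the function name `expand` are hypothetical, chosen only to illustrate prefix matching; real systems typically apply truncation against their own term dictionary.

```python
# Hypothetical index vocabulary used only for illustration.
vocabulary = [
    "communication", "company", "comparison",
    "compute", "computer", "computers", "computing",
]

def expand(prefix, vocab):
    """Return all vocabulary terms matched by a truncated query term (prefix*)."""
    return sorted(term for term in vocab if term.startswith(prefix))

narrow = expand("comput", vocabulary)  # longer prefix, fewer matching terms
broad = expand("comp", vocabulary)     # shorter prefix, more matching terms

print(len(narrow), narrow)
print(len(broad), broad)
```

Every term matched by the longer prefix is also matched by the shorter one, so shortening the prefix can only keep or enlarge the retrieved set. The extra matches ("company", "comparison") also show why the broader query tends to cost precision.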
  • 179.
  • 180. 1.3 Smart Boolean There have been attempts to help users overcome some of the disadvantages of the traditional Boolean discussed above. We will now describe such a method, called Smart Boolean, developed by Marcus [1991, 1994] that tries to help users construct and modify a Boolean query as well as make better choices along the four dimensions that characterize a Boolean query. We are not attempting to provide an in-depth description of the Smart Boolean method, but to use it as a good example that illustrates some of the possible ways to make Boolean retrieval more user-friendly and effective. Table 2.2 provides a summary of the key features of the Smart Boolean approach.
  • 181. Users start by specifying a natural language statement that is automatically translated into a Boolean Topic representation. If the statement consists of a list of factors or concepts, then these are automatically coordinated using the AND operator. If the user can or wants to include synonyms at this initial stage, then they are coordinated using the OR operator. Hence, the Boolean Topic representation connects the different factors using the AND operator, where each factor can consist of a single term or of several synonyms connected by the OR operator. One of the goals of the Smart Boolean approach is to make use of the structural knowledge contained in the text surrogates, where the different fields represent contexts of useful information. Further, the Smart Boolean approach exploits the fact that “related concepts can share a common stem”. For example, the concepts "computers" and "computing" have the common stem comput*.
  • 182. The initial strategy of the Smart Boolean approach is to start out with the broadest possible query within the constraints of how the factors and their synonyms have been coordinated. Hence, it turns the Boolean Topic representation into the query surrogate by using only the stems of the concepts and searching for them over all the fields. Once the query surrogate has been executed, users are guided in the process of evaluating the retrieved document surrogates. Feedback is collected through lists of reasons: users choose from a list of reasons to indicate why they consider certain documents relevant and, similarly, from a list of possible reasons to indicate why other documents are not relevant. This user feedback is used by the Smart Boolean system to automatically modify the Boolean Topic representation or the query surrogate, whichever is more appropriate. The Smart Boolean approach offers a rich set of strategies for modifying a query based on the received relevance feedback or the expressed need to narrow or broaden the query.
  • 183.
  • 184. Visualizing Boolean Queries through InfoCrystal: How can Boolean queries be visualized without limiting their expressive power? InfoCrystal can be used to make Boolean retrieval more transparent and easier to use. It makes it much easier for users to formulate and modify Boolean queries and to achieve the desired retrieval results. An InfoCrystal is a visual representation of a specified Boolean query. Each interior icon of the InfoCrystal represents a distinct Boolean relationship among the input criteria; hence, users can specify Boolean queries by interacting with a direct manipulation interface.
  • 185. The InfoCrystal acts as a Boolean calculator. Users do not have to use logical operators and parentheses explicitly to formulate queries. Hence, users do not have to concern themselves with the coordination problem. Instead, they need to recognize the relationships of interest and select them. If an interior icon is selected, then it changes its visual appearance. In the figures of this (manipulation) interface, the center area of a selected interior icon is displayed in black and that of an unselected one in white.
  • 186.
  • 187.
  • 188. 1.4 Extended (or Weighted) Boolean Models The P-norm and the Fuzzy Logic approaches extend the Boolean model and are generally used to address the following issues: 1) The Boolean operators are too strict and ways need to be found to soften them. 2) The standard Boolean approach has no provision for ranking. The Smart Boolean approach and the methods described in this section provide users with relevance ranking [Fox and Koll 1988, Marcus 1991]. 3) The Boolean model does not support the assignment of weights to the query or document terms. Both approaches are briefly discussed below.
  • 189. The P-norm method developed by Fox (1983) allows query and document terms to have weights, which have been computed by using term frequency statistics with the proper normalization procedures. These normalized weights can be used to rank the documents in the order of decreasing distance from the point (0, 0, ... , 0) for an OR query, and in order of increasing distance from the point (1, 1, ... , 1) for an AND query. Further, the Boolean operators have a coefficient P associated with them to indicate the degree of strictness of the operator (from 1 for least strict to infinity for most strict, i.e., the Boolean case). The P-norm uses a distance-based measure and the coefficient P determines the degree of exponentiation to be used. The exponentiation is an expensive computation, especially for P-values greater than one.
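The distance-based ranking can be sketched as follows. The slides do not state the formula, so the code assumes the form in which P-norm similarity is usually given in the literature: a weighted, normalized distance from the point (0, ..., 0) for OR and one minus the normalized distance from (1, ..., 1) for AND. The function names and sample weights are hypothetical, and P is assumed to be finite.

```python
def pnorm_or(doc_weights, query_weights, p):
    """Similarity for an OR query: normalized distance from (0, ..., 0)."""
    num = sum((a ** p) * (w ** p) for a, w in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return (num / den) ** (1.0 / p)

def pnorm_and(doc_weights, query_weights, p):
    """Similarity for an AND query: 1 minus normalized distance from (1, ..., 1)."""
    num = sum((a ** p) * ((1.0 - w) ** p) for a, w in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return 1.0 - (num / den) ** (1.0 / p)

# A document that fully matches the first query term and misses the second.
doc = [1.0, 0.0]
query = [1.0, 1.0]  # equal query term weights (assumed)

for p in (1, 2, 10):
    print(p, round(pnorm_or(doc, query, p), 4), round(pnorm_and(doc, query, p), 4))
```

With P = 1 the AND and OR scores coincide, so the operators are not distinguished at all. As P grows, the OR score of this half-matching document rises towards 1 and the AND score falls towards 0, approaching strict Boolean behaviour.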
  • 190. In Fuzzy Set theory, an element has a varying degree of membership in a set instead of the traditional binary membership choice. The weight of an index term for a given document reflects the degree to which this term describes the content of the document. Hence, this weight reflects the degree of membership of the document in the fuzzy set associated with the term in question. The degree of membership for the union and intersection of two fuzzy sets is equal to the maximum and minimum, respectively, of the degrees of membership of the elements of the two sets. In the "Mixed Min and Max" model developed by Fox and Sharat (1986), the Boolean operators are softened by taking the query-document similarity to be a linear combination of the min and max weights of the document.
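The fuzzy-set operators and the softened "Mixed Min and Max" combination can be sketched as below. The coefficient value 0.7 and the function names are hypothetical choices for illustration; the slides say only that the similarity is a linear combination of the min and max weights, and the sketch assumes the two coefficients sum to one.

```python
def fuzzy_and(weights):
    """Fuzzy intersection: the minimum degree of membership."""
    return min(weights)

def fuzzy_or(weights):
    """Fuzzy union: the maximum degree of membership."""
    return max(weights)

def mmm_and(weights, c=0.7):
    """Softened AND: mostly the min, with some credit for the max (c is assumed)."""
    return c * min(weights) + (1.0 - c) * max(weights)

def mmm_or(weights, c=0.7):
    """Softened OR: mostly the max, with some penalty from the min (c is assumed)."""
    return c * max(weights) + (1.0 - c) * min(weights)

# Weights of two query terms in one document (hypothetical values).
doc_term_weights = [0.9, 0.2]

print(fuzzy_and(doc_term_weights), fuzzy_or(doc_term_weights))
print(round(mmm_and(doc_term_weights), 2), round(mmm_or(doc_term_weights), 2))
```

The pure fuzzy AND scores this document 0.2 no matter how strongly the other term is represented. The mixed version lifts the score because the strong term is allowed to contribute, which is the softening effect described above.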
  • 191. Weighting is the process of assigning an importance to an index term’s use in an item. The weight should represent the degree to which the concept associated with the index term is represented in the item. The weight should help in discriminating the extent to which the concept is described in items of the database. The manual process of assigning weights adds additional overhead on the indexer and requires a more complex data structure to store the weights. In a weighted indexing system, an attempt is made to place a value on the index term’s representation of its associated concept in the document. An index term’s weight is based upon a function associated with the frequency of occurrence of the term in the item.
  • 192. Typically, values for the index terms are normalised between zero and one. The higher the weight, the more the term represents a concept discussed in the item. The weight can be adjusted to account for other information such as the number of items in the database that contain the same concept. The query process uses the weights along with any weights assigned to terms in the query to determine a scalar value (rank value) used in predicting the likelihood that an item satisfies the query. The results are presented to the user in order of the rank value from highest number to lowest number.
  • 193. The table above summarizes the defining characteristics of the Extended Boolean approach and lists its key advantages and disadvantages.
  • 194. If weights between 0.0 and 1.0 are assigned to the terms, they may be interpreted as the significance that users place on each term. The value 1.0 is assumed to be the strict interpretation of a Boolean query. The value 0.0 is interpreted to mean that the user places little value on the term. Under these assumptions, a term assigned a value of 0.0 should have no effect on the retrieved set. Thus: “A1 OR B0” should return the set of items that contain A as a term. “A1 AND B0” will also return the set of items that contain term A. “A1 NOT B0” also returns set A.
  • 196. Under the strict interpretation, “A1 OR B1” would include all items in all areas of the Venn diagram. “A1 OR B0” would be only those items in A (i.e., the Green area together with the green and Blue shaded overlap), which is everything except the items in “B NOT A” (the Blue area). Thus, as the value of query term B goes from 0.0 to 1.0, items from “B NOT A” are proportionally added until at 1.0 all of the items have been added. Similarly, under the strict interpretation, “A1 AND B1” would include only the items in the green and Blue shaded overlap. “A1 AND B0” will be all of the items in A, as described above. Thus, as the value of query term B goes from 1.0 to 0.0, items are proportionally added from “A NOT B” (the Green area) until at 0.0 all of them have been added.
  • 197. Finally, the strict interpretation of “A1 NOT B1” is the Green area, while “A1 NOT B0” is all of A. Thus, as the value of B goes from 1.0 to 0.0, items are proportionally added from “A AND B” (the green and Blue shaded area) until at 0.0 all of the items have been added. The final issue here is the determination of which items are to be added or dropped in interpreting the weighted values.
  • 198. 2. Statistical Model The vector space and probabilistic models are the two major examples of the statistical retrieval approach. Both models use statistical information in the form of term frequencies to determine the relevance of documents with respect to a query. Although they differ in the way they use the term frequencies, both produce as their output a list of documents ranked by their estimated relevance. The statistical retrieval models address some of the problems of Boolean retrieval methods, but they have disadvantages of their own.
  • 199. Statistical Model 1. Vector Space Model 2. Probabilistic Model 3. Latent Semantic Indexing
  • 200. 2.1 Vector Space Model • The vector space model, or term vector model, is an algebraic/statistical model for representing text documents (and objects in general) as vectors of identifiers such as index terms. It is used in information filtering, information retrieval, indexing and relevancy ranking. • The Vector Space Model (VSM) is a way of representing documents through the words that they contain. • The VSM allows decisions to be made about which documents are similar to each other and to keyword queries.
  • 201. A system based on the Vector Space Model uses weights as the foundation for information detection and stores these weights in vector form. In systems based upon a vector model, the semantics of every item are represented as a vector. What is a Vector? A vector is a one-dimensional set of values, where the order/position of each value in the set is fixed and represents a particular domain. Each vector represents a document, and each position in a vector corresponds to a different unique word used to represent the documents in the database.
  • 202. There are two approaches to the domain of values in the vector: binary and weighted. Binary: each position holds 1 or 0, where 1 represents the existence of the processing token in the item and 0 represents its non-existence. Weighted: each position holds a non-negative real number. The value assigned to each position is the weight of that term in the document, and a value of zero indicates that the word is not in the document.
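The difference between the two domains of values can be sketched for a single document. The vocabulary, the sample tokens and the choice of relative term frequency as the weight are all assumptions made for illustration; the slides do not prescribe a particular weighting function here.

```python
from collections import Counter

# Fixed vocabulary: each position in the vector corresponds to one term.
vocabulary = ["boolean", "index", "query", "retrieval", "vector"]

# Processing tokens of one hypothetical document.
doc_tokens = "query vector retrieval query index query".split()
counts = Counter(doc_tokens)

# Binary vector: 1 if the token occurs in the item, 0 otherwise.
binary_vector = [1 if term in counts else 0 for term in vocabulary]

# Weighted vector: relative term frequency (one possible weighting, assumed).
weighted_vector = [counts[term] / len(doc_tokens) for term in vocabulary]

print(binary_vector)
print([round(w, 2) for w in weighted_vector])
```

The binary vector records only that "query" is present, while the weighted vector also shows that it dominates the document.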
  • 203. Queries can be translated into the vector form. Search is accomplished by calculating the distance between the query vector and the document vector. The use of weights also provides a basis for determining the rank of an item. The vector approach allows for a mathematical and a physical representation using a vector space model.
  • 204. 1. Vector Space Model
  • 205.
  • 206. If a query (q) is considered to be a line in an imaginary space and the document (d) is also considered to be a line in that space, the geometrically determined angle between the two lines can be understood as measuring the degree to which the document is similar to the query. With a large angle the document is presumed to be dissimilar to the query; with a very small angle the document is presumed to be highly similar to the query.
  • 207. How does the Vector Space Model indexing procedure work? The Vector Space Model procedure can be divided into three stages. The first stage is document indexing, where the content-bearing terms are extracted from the document text. Many of the words in a document, such as the, is, are, in, to and of, do not describe its content. These are called non-significant words or stop words. In automatic indexing, these terms are removed from the document vector, so that the document is represented only by the content-bearing terms. In general, 40-50% of the total number of words in a document are stop words. These can be removed with the help of a stop word list.
  • 208. The second stage is the weighting of the indexed terms to enhance retrieval of documents relevant to the user. The last stage ranks the documents with respect to the query according to a similarity measure.
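The first stage, document indexing with a stop word list, can be sketched as follows. The stop word list is a small hypothetical sample and the tokenization (lowercasing, splitting on whitespace, stripping punctuation) is a deliberately simple assumption; production systems use much longer lists and more careful tokenizers.

```python
# Small sample stop word list (hypothetical; real lists are much longer).
STOP_WORDS = {"the", "is", "are", "in", "to", "of", "a", "and"}

def content_terms(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = [tok.strip(".,;:!?\"'()") for tok in text.lower().split()]
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

sentence = "The index is the key to retrieval of documents in the library."
terms = content_terms(sentence)
print(terms)
```

Only the content-bearing terms remain and are passed on to the weighting stage.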
  • 209. Documents and queries are represented as vectors: $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$ and $q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$, where $t$ is the number of terms. Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. One of the best known schemes is tf-idf (term frequency–inverse document frequency) weighting. The definition of a term depends on the application (e.g., whether the documents are articles, books, etc.). Typically, terms are single words, keywords, or longer phrases. If words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus). Vector operations can be used to compare documents with queries.
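A minimal tf-idf sketch is given below. It assumes one common variant, raw term count multiplied by log(N / df), where N is the number of documents and df the number of documents containing the term; other variants exist, and the slides do not say which one is intended. The corpus and the function name `tf_idf_vectors` are hypothetical.

```python
import math
from collections import Counter

# Tiny hypothetical corpus, already tokenized and stripped of stop words.
corpus = [
    ["boolean", "query", "retrieval"],
    ["vector", "space", "retrieval", "retrieval"],
    ["query", "vector", "weight"],
]

def tf_idf_vectors(docs):
    """Return the vocabulary and one tf-idf vector per document."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {term: sum(1 for doc in docs if term in doc) for term in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

vocab, vectors = tf_idf_vectors(corpus)
for vec in vectors:
    print([round(w, 3) for w in vec])
```

Terms that occur in fewer documents receive a larger idf factor, so "boolean" outweighs "query" in the first document even though both occur once.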
  • 210. Relevance rankings of documents in a keyword search can be calculated, using the assumptions of document similarity theory, by comparing the angle between each document vector and the original query vector, where the query is represented as a vector with the same dimension as the vectors that represent the documents. In practice, it is easier to calculate the cosine of the angle between the vectors than the angle itself: $\cos\theta = \dfrac{d_j \cdot q}{\|d_j\|\,\|q\|}$.
  • 211. Unlike the Boolean Retrieval Model, in which retrieval is based on an exact (hundred percent) match, the VSM retrieves the documents most similar to the query without requiring an exact match. The VSM can be explained in terms of a keyword-by-document matrix (A), in which the rows correspond to keywords (W) in the database and the columns correspond to documents (D):
           D1    D2    D3    D4    ...   Dn
      W1   A11   A12   A13   A14   ...   A1n
      W2   A21   A22   A23   A24   ...   A2n
      W3   A31   A32   A33   A34   ...   A3n
      W4   A41   A42   A43   A44   ...   A4n
      ...  ...   ...   ...   ...   ...   ...
      Wm   Am1   Am2   Am3   Am4   ...   Amn
  • 212. Let us take a hypothetical example in which an information seeker searches for information on “Education information retrieval system”. He uses four keywords: W1, W2, W3, and W4. After searching the database, he gets six articles: A1, A2, A3, A4, A5, and A6. After analysis, it is found that Article A1 talks only about W1; Article A2 devotes 33% to W2 and 67% to W4; Article A3 deals 20% with W1, 30% with W3 and 50% with W4; Article A4 deals 60% with W1, 10% with W2 and 30% with W4; Article A5 talks 80% about W2 and 20% about W3; Article A6 discusses only W4. This can be denoted in the form of a 4 × 6 matrix as below:
  • 213. Keyword-by-article matrix A:
           A1     A2     A3     A4     A5     A6
      W1   1.00   0.00   0.20   0.60   0.00   0.00
      W2   0.00   0.33   0.00   0.10   0.80   0.00
      W3   0.00   0.00   0.30   0.00   0.20   0.00
      W4   0.00   0.67   0.50   0.30   0.00   1.00
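The 4 × 6 example matrix can be used to sketch how a query is ranked against the six articles with the cosine measure. The article vectors are the columns of the matrix above; the query vector is a hypothetical one (equal interest in W1 and W4) introduced only for this illustration, since the slides do not specify a query weighting.

```python
import math

# Columns of the example matrix: weights of W1..W4 in each article.
articles = {
    "A1": [1.00, 0.00, 0.00, 0.00],
    "A2": [0.00, 0.33, 0.00, 0.67],
    "A3": [0.20, 0.00, 0.30, 0.50],
    "A4": [0.60, 0.10, 0.00, 0.30],
    "A5": [0.00, 0.80, 0.20, 0.00],
    "A6": [0.00, 0.00, 0.00, 1.00],
}

# Hypothetical query: equal weight on W1 and W4, none on W2 and W3.
query = [1.0, 0.0, 0.0, 1.0]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

scores = {name: cosine(vec, query) for name, vec in articles.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 4))
```

For this assumed query, article A4 ranks first because it covers both W1 and W4, while A5, which covers neither, scores zero. A Boolean "W1 AND W4" query would have retrieved only A3 and A4 and returned them unranked.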
  • 214. The VSM is a retrieval model which constitutes a fairly large class of retrieval methods, each consisting of an indexing method and a retrieval function. The indexing method generates description vectors, and the retrieval function generates retrieval status values by comparing the query description vector with the document description vectors. The information seeker is assumed to have an information need, which he formulates as a query. The query q and the document dj are indexed in two steps. First, appropriate indexing features are spotted in the query q and in the document dj. Secondly, these features are assigned weights, so that the query description and the document descriptions are sets of weighted indexing features. These are called the query description vector and the document description vectors. The query description is matched against each document description and a score is generated for every query-document pair. These scores are called Retrieval Status Values (RSVs). For every query, the documents are presented to the information seeker in descending order of these RSVs.
  • 215. Each document in a collection is represented by a document vector whose i-th component reflects the single or multiple occurrences of term i in document d. Similarly, a query is represented by a query vector which denotes the number of occurrences of terms in the query. Both the document vector and the query vector provide the locations of the objects in the term-document space. Every vector has two common measures: its length and its angle with respect to a fixed reference direction. The angle between two vectors is the measure in degrees between them. The document vector that makes the smallest angle with the query vector is the best choice, yielding the document most closely related to the query. Closeness is measured by the cosine of the angle between the two vectors. If the cosine of the angle is 1, then the angle between the document vector and the query vector measures 0 degrees, meaning that the two vectors point in the same direction. A cosine of 0 means that the document is unrelated to the query. Thus, a cosine close to 1 means that the document is closely related to the query.
  • 216. $\cos\theta = \dfrac{d_2 \cdot q}{\|d_2\|\,\|q\|}$