Indexing Techniques: Their Usage in Search Engines for Information Retrieval

Topics                                                                   Speakers
1. Introduction and Overview                                             Sayon Roy
2. Indexing Techniques – Transition from Manual to Automated System      Kaustav Saha
3. Usage in Modern Day Search Engines                                    Vikas Bhushan
4. Current Trends and Applications                                       Debashis Naskar
5. Conclusion                                                            Sumanta Bag

Indexing… an Overview
• Indexing is a crucial part of any information retrieval system. It is a challenging task
that requires attention to many theoretical and practical issues. While the move
towards digital information systems and automated indexing is thought to have
reduced the need for indexers in some areas, professional indexers are still much
needed, and the electronic environment has in fact posed new challenges
for them.
• Indexing is more a process of extraction than of
content analysis.
• The terms in an index represent certain concepts.

Subject Indexing and Subject Retrieval
• Subject indexing can be described as a system of classifying without notation. It
is the core theme of information science.
• Today subject retrieval is facilitated through the use of structured databases.
• The items that are retrieved are listed in the index.
• In OPACs, indexing is done manually to determine what a
resource is about. Once identified, the aboutness is translated
into the language of the vocabulary.

Schematic Illustration…
Conception of Subject Analysis and Indexing   Type of Subject Information   Indexing Method
Simplistic Conception                         Explicit Information          Extraction
Content-Oriented Conception                   Implicit Information          Assignment
Requirement-Oriented Conception

Early use of computers for Information Retrieval
• In 1948 a “machine called the Univac” capable of searching for text references
associated with a subject code was created.
• The machine could process “at the rate of 120 words per minute”. It appears that
this is the first reference to a computer being used to search for content.
• The impact of computers on IR was highlighted when Hollywood drew public
attention to the innovation with the comedy “Desk Set”, released in 1957.
It centred on a group of reference librarians who were about to be replaced by a
computer.
• IR as a research discipline was starting to emerge at this time with two important
developments: how to index documents and how to retrieve them.

Indexing and Information Retrieval… A Chronology
•Mortimer Taube’s Uniterm system was essentially a proposal to
index items by a list of keywords. As simple an idea as this seems today, it
was at the time a radical step.

Ranked retrieval
•The ranked retrieval approach to search was taken up by IR researchers,
who over the following decades refined and revised the means by which documents
were sorted in relation to a query.
•The superior effectiveness of this approach over Boolean search was demonstrated
in many experiments over those years.
• Work in the 1950s established computers as the definitive tool for search.

1960s …
•The 1960s witnessed the formalization of algorithms to rank documents relative to a
query.
•Relevance feedback was also introduced: a process to support iterative search, in which
documents previously retrieved could be marked as relevant in an IR system.
•Versions of this process are used in modern search engines, such as the “Related articles”
link on Google Scholar.

1970s…
•One of the key developments of this period was that Luhn’s term frequency (tf) weights
(based on the occurrence of words within a document) were complemented by
Spärck Jones’s work on word occurrence across the documents of a collection, which
introduced the idea of inverse document frequency (idf).
•An alternative means of modelling IR systems involved extending Maron, Kuhns and
Ray’s idea of using probability theory.
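
To make the two weights concrete, here is a minimal Python sketch of tf-idf weighting. It assumes the simplest textbook variant (raw term counts for tf and log(N / df) for idf) and whitespace tokenisation; the function name and the three sample documents are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = raw count of the term in the document
    idf = log(N / df), where df is the number of documents containing the term
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Illustrative documents (assumed for this example only).
docs = [
    "indexing supports retrieval",
    "retrieval of indexed documents",
    "ranking documents for retrieval",
]
weights = tf_idf(docs)
```

A term such as "retrieval", which occurs in every document, receives an idf of zero and therefore does not help to discriminate between documents, while a term confined to one document receives the highest idf.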

1980s – mid 1990s
•Building on the developments of the 1970s, variations of tf-idf weighting schemes were
produced and the formal models of retrieval were extended.
•The original probabilistic model did not include tf weights, and a number of researchers
worked to incorporate them in an effective and principled way.
•Amongst other achievements, this work ultimately led to the ranking function BM25,
which has proven to be highly effective and is still commonly used.
•Advances on the basic vector space model were also developed; probably the most
well-known is Latent Semantic Indexing (LSI).

Mid 1990s – present
•The arrival of the web initiated the study of new problems in IR.
•Search engine developers quickly realised that they could use the links between
web pages to construct a crawler or robot to traverse and gather most web pages on
the internet.
• The first full text search engine using a crawler was WebCrawler released in 1994.

Indexing Techniques – Transition from Manual to Automated System
- Kaustav Saha

What is an index?
•A Database where information (after being collected, parsed and
processed) is stored to allow for quick retrieval.
•Association of descriptors (keywords, concepts, metadata) to documents in
view of future retrieval
•The knowledge / expectation / behavior of the searcher needs to be
anticipated

Example of Indexing using POPSI
A report on the treatment of infectious disease of lungs in India during 1982-85
Discipline: Medical Science
Entity: Lung
Property: Infectious disease
Action: Treatment
Space modifier: India
Time modifier: 1982-85
Form modifier: Report
Subject heading
MEDICAL SCIENCE, LUNG
infectious disease, treatment, India, 1982-85
INFECTIOUS DISEASE, TREATMENT
medical science, lung, India, 1982-85
Cross Reference
Therapeutics see Treatment
Therapy see Treatment

Manual and Automatic Indexing
•Manual
•Human indexers assign index terms to documents
•A computer system may be used to record the descriptors generated by
the human
•Automatic
•The system extracts “typical”/ “significant” terms
•The human may contribute by setting the parameters or thresholds, or
by choosing components or algorithms
•Semi-automatic
•The system’s contribution may be supported by word lists,
thesauri, reference systems, etc., whether or not this follows the automatic
processing of the text

Manual vs. Automatic Indexing
•Manual
•Slow and expensive
•Is based on intellectual judgment and semantic interpretation (concepts, themes)
•Low consistency
•Automatic
•Fast and inexpensive
•Mechanical execution of algorithms, with no intelligent interpretation (aboutness /
relevance)
•Consistent

Vocabulary
•Vocabulary (indexing language)
•The set of concepts (terms or phrases) that can be used to index
documents in a collection
•Controlled
•Specific for specialized domains
•Potential for increased consistency of indexing and precision of
retrieval
•Un-controlled (free)
•Potentially all the terms in the documents
•Potential for increased recall

Thesauri
•Capture relationships between indexing terms
•Hierarchical
•Synonymous
•Related
•Creation of thesauri
•Manual vs. automatic
•Use of thesauri
•In manual / semi-automatic / automatic fashion
•Syntagmatic co-ordination / thesaurus-based query expansion during
indexing / searching
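
Thesaurus-based query expansion can be sketched in a few lines of Python. The example below is only an illustration: the thesaurus entries, the relation names ("synonyms", "narrower", "related") and the function name are assumptions made up for this sketch, loosely echoing the "Therapy see Treatment" cross references shown earlier.

```python
# Toy thesaurus (hypothetical entries, for illustration only).
THESAURUS = {
    "treatment": {"synonyms": ["therapy", "therapeutics"], "narrower": [], "related": []},
    "lung": {"synonyms": [], "narrower": ["bronchus"], "related": ["respiration"]},
}

def expand_query(terms, thesaurus, relations=("synonyms",)):
    """Add thesaurus terms for the chosen relations, keeping order and dropping repeats."""
    expanded = []
    for term in terms:
        candidates = [term]
        entry = thesaurus.get(term, {})
        for relation in relations:
            candidates.extend(entry.get(relation, []))
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

synonym_query = expand_query(["treatment", "lung"], THESAURUS)
broad_query = expand_query(["lung"], THESAURUS, relations=("narrower", "related"))
```

Expanding only by synonyms tends to raise recall with little loss of precision; adding narrower and related terms widens the search further and is a choice the searcher or system designer has to make.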

Steps of automatic indexing
[Figure: text is passed through lexical analysis, stemming and stop word removal to produce a representation; the figure also labels collection/document structure and data structure.]
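
The steps named on this slide can be sketched as a small Python pipeline. This is a simplified illustration under stated assumptions: the stop word list is a tiny hand-picked sample, and the `stem` function is a crude suffix stripper standing in for a real stemming algorithm; all function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}

def tokenize(text):
    """Lexical analysis: lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [stem(token) for token in tokenize(text) if token not in STOP_WORDS]

def represent(text):
    """Document representation: a bag of index terms with their counts."""
    return Counter(index_terms(text))

terms = index_terms("The indexing of documents")
bag = represent("Indexing the indexes")
```

Here stop words are removed before stemming so that the stop list can be matched against unmodified words; the slide lists the steps in a slightly different order, and either order works for this toy example.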

Role of Indexing in Information Retrieval
[Figure: role of indexing in information retrieval, showing the population of documents, selected documents, indexing, system vocabulary, document description, document store, database in printed or electronic form, search strategy, information needs and the population of database users.]

Usage in Modern Day Search Engines
- Vikas Bhushan
Search Engines
Use of search engines
Types of Search Engines
Software Components in Search Engines
Pictorial representation of Components
How Search Engines Work, with a Model
Post-coordinate Indexes

Search engines: An initiative towards correct
retrieval from a Labyrinth of Ideas
• Search engines do not search only for keywords; some
search for other things as well.
• They are not really “engines” in the classical sense,
but then a mouse is not a “mouse” either.
• Rather, they are computer programs that search for
particular keywords and return a list of the documents in which
they were found, especially services that scan documents on
the Internet.

Types of Search Engines
Crawler Based – Google, AltaVista
Human Based – Yahoo directory, Open directory, LookSmart
Hybrid Models – Yahoo, Google
Meta Search Engines – Dogpile, MetaCrawler

Use of search engines
… among others
As WebCrawler founder Brian Pinkerton puts it, "Imagine walking up to a librarian and
saying, 'travel'. They're going to look at you with a blank face."

Components in the Back-end & Front-end process
Software components
• Back-end: Crawler/Spider, Indexer, Index File Database
• Front-end: Search Engine Interface, Query Parser, Ranking Mechanism
Ranking mechanisms
• Google uses PageRank
• Teoma uses ExpertRank
• Yahoo uses TrustRank
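
The slide names PageRank without describing it. As a rough illustration only, the sketch below implements the commonly taught simplified form of PageRank by power iteration with a damping factor of 0.85. The three-page link graph, the function name and the handling of pages without outgoing links are assumptions for this example and say nothing about how any production search engine is implemented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this toy graph page C ends up with the highest rank because it is linked from both other pages, which is the intuition behind using link structure as a ranking signal.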

Pictorial representation of Front-end & Back-end Process
[Figure: How Search Engines Work (Sherman 2003). A crawler gathers pages (URL1 to URL4) from the Web and an indexer stores them in the search engine database. A query from your browser ("Eggs?") is matched against the database and returns ranked results: Eggs 90%, Eggo 81%, Ego 40%, Huh? 10%. The top result is "All About Eggs" by S. I. Am.]

Post-coordinate Indexes
An information retrieval system that allows the searcher to combine terms in
any way is frequently referred to as post-coordinate.
A modern computer-based system, operated online, can be considered a
direct descendant of the earlier manual systems.
The files of an online system comprise two major elements:
1. A complete set of document representations: bibliographic references or
similar, comparable to a search engine database.
2. A list of terms, sometimes referred to as an inverted file or a postings file.
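
The two files can be sketched in Python as a dictionary of document representations and an inverted (postings) file built from it, with post-coordinate searching expressed as set operations. The three document records and all function names are hypothetical, chosen only to illustrate the idea.

```python
# File 1: document representations (hypothetical records with assigned index terms).
documents = {
    1: {"political contenders", "assembly polls", "karnataka"},
    2: {"assembly polls", "constituencies"},
    3: {"political contenders", "constituencies", "karnataka"},
}

def build_postings(documents):
    """File 2: the inverted file, mapping each index term to the documents it is assigned to."""
    postings = {}
    for doc_id, terms in documents.items():
        for term in terms:
            postings.setdefault(term, set()).add(doc_id)
    return postings

def search_and(postings, *terms):
    """Documents indexed under all of the given terms."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def search_or(postings, *terms):
    """Documents indexed under at least one of the given terms."""
    return set().union(*(postings.get(term, set()) for term in terms))

def search_not(postings, all_ids, term):
    """Documents not indexed under the given term."""
    return set(all_ids) - postings.get(term, set())

postings = build_postings(documents)
both = search_and(postings, "political contenders", "karnataka")
either = search_or(postings, "assembly polls", "constituencies")
without = search_not(postings, documents, "karnataka")
```

Because the terms are combined only at search time, the indexer does not need to anticipate which combinations a searcher will ask for.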

Post-coordinate Indexes…
The subject matter discussed in a document, and represented by the index terms
assigned to it, is multidimensional in character.
Consider, for example, an article discussing
“Political Contenders in Assembly Polls of Karnataka”.
It has been indexed under the following terms:
• Political Contenders
• Constituencies
• Assembly Polls
• Karnataka

The index terms mentioned previously actually represent a network of relationships.
[Figure: the four index terms (Political Contenders, Constituencies, Assembly Polls, Karnataka) drawn as a connected network.]

Information Retrieval System Represented as a Matrix
[Figure: matrix with columns numbered 1 to 14 and rows lettered A to H, with an X in each cell where the row and column entries are associated.]

Current Trends and Applications
- Debashis Naskar

Current trends and applications
• The web creates new challenges for information retrieval. The amount of information on the web is
growing rapidly, as is the number of new users inexperienced in the art of web research.
• Automated search engines that rely on keyword matching usually return too many low-quality matches.
• A large-scale search engine makes heavy use of the additional structure present in hypertext to
provide much higher quality search results.

What is XML Indexing?
• XML indexing is a form of embedded indexing in which
tags are inserted into an XML document to mark the
occurrences of indexable terms or topics.
• The client's publishing process automatically generates
an index from these index elements. Because
this automated process handles all layout and
formatting, the indexer does not need to be concerned
with these issues.

What makes it work?
Index entries in DocBook are encoded using the parent element <indexterm>, which has five
child elements. They are summarized below:
• <indexterm> element: wrapper element for an index entry of any type.
• <primary> element: main entry.
• <secondary> element: subentry.
• <tertiary> element: sub-subentry.
• <see> element: ‘see’ references.
• <seealso> element: ‘seealso’ references.
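
How a publishing process might collect these elements into an index can be sketched with Python's standard `xml.etree.ElementTree` module. The sample below is a simplified, namespace-free fragment written for this illustration; using the enclosing paragraph's `id` attribute as the locator is an assumption, since real toolchains resolve entries to page numbers or links.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Simplified DocBook-style fragment (hypothetical content, no namespace).
SAMPLE = """
<chapter>
  <para id="p1">Inverted files<indexterm><primary>index</primary><secondary>inverted file</secondary></indexterm> speed up search.</para>
  <para id="p2">Ranking<indexterm><primary>ranking</primary><seealso>weighting</seealso></indexterm> orders results.</para>
</chapter>
"""

def collect_index(xml_text):
    """Map each index heading to the ids of the paragraphs that carry it."""
    root = ET.fromstring(xml_text)
    entries = defaultdict(list)
    for para in root.iter("para"):
        for term in para.iter("indexterm"):
            primary = term.findtext("primary")
            secondary = term.findtext("secondary")
            heading = primary if secondary is None else f"{primary}, {secondary}"
            entries[heading].append(para.get("id"))
    return dict(sorted(entries.items()))

index = collect_index(SAMPLE)
```

The indexer only marks up the entries; sorting, merging of repeated headings and layout are left to the generating program, which is the division of labour the slides describe.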

Future hopes for Indexers
• Indexers should offer XML-based services, which are a prerequisite for joining
the digital publishing revolution.
• Indexers are good with structures, and the use of XML indexing in publishing is
about the imposition of structure on text.

Results and Performance
• The most important measure of a search
engine is the quality of its search results.
• Here we highlight the performance of, and
experience with, Google. It produces
better results than the major commercial
search engines for most searches.

Search engine share (%) by month
Date      Google   Bing   Yahoo!   Baidu   Babylon   Others
2012-04   91.70    3.50   3.36     0.26    0.00      1.18
2012-05   92.04    3.36   3.26     0.22    0.00      1.12
2012-06   91.75    3.27   3.04     0.23    0.29      1.42
2012-07   91.17    3.22   2.95     0.45    0.54      1.67
2012-08   91.01    3.22   2.98     0.50    0.60      1.70
2012-09   91.04    3.16   2.91     0.49    0.60      1.80
2012-10   90.75    3.35   2.91     0.54    0.58      1.87
2012-11   90.75    3.32   2.84     0.58    0.60      1.92
2012-12   90.43    3.26   2.89     0.66    0.54      2.21
2013-01   90.47    3.19   2.88     0.63    0.48      2.35
2013-02   89.64    3.62   3.17     0.73    0.39      2.45
2013-03   89.89    3.59   3.20     0.93    0.29      2.11
2013-04   90.17    3.61   3.08     0.92    0.27      1.95

Models for Information Retrieval
• Boolean or vector space model of IR (Information Retrieval)
  - In these models, matching is done in a formally defined but semantically imprecise calculus of index terms.
• A number of retrieval models work on a probabilistic basis.
  - The Binary Independence Model (BIM) is the original and still the most influential of the
probabilistic retrieval models.

OKAPI BM25: model for Information Retrieval
• The BIM was originally designed for short catalogue records and
abstracts of fairly consistent length.
• For modern full-text search collections, a model should pay attention to term frequency and
document length.
• The BM25 weighting scheme, often called Okapi weighting after the system in which it was first
implemented, was developed as a way of building a probabilistic model sensitive to these quantities.

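The score of a document d under Okapi weighting is built up through the following
equations. Here N is the number of documents in the collection, df_t is the number of documents
containing term t, tf_{td} and tf_{tq} are the frequencies of t in document d and in query q, L_d is
the length of d, L_{ave} is the average document length, and k_1, b and k_3 are tuning parameters.
Equation 1. The simplest score for document d is just idf weighting of the query terms present:
$$RSV_d = \sum_{t \in q} \log \frac{N}{df_t}$$
Equation 2. Sometimes, an alternative version of idf is used. If, in the
absence of relevance feedback information, we estimate that S = s = 0, then we get an
alternative idf formulation as follows:
$$RSV_d = \sum_{t \in q} \log \frac{N - df_t + \frac{1}{2}}{df_t + \frac{1}{2}}$$

Equation 3. We can improve on Equation 1 by factoring in the frequency of each term and
document length:
$$RSV_d = \sum_{t \in q} \log\left[\frac{N}{df_t}\right] \cdot \frac{(k_1 + 1)\, tf_{td}}{k_1\left((1 - b) + b \times (L_d / L_{ave})\right) + tf_{td}}$$
Equation 4. If the query is long, then we might also use similar weighting for query terms. This
is appropriate if the queries are paragraph-long information needs, but unnecessary
for short queries:
$$RSV_d = \sum_{t \in q} \log\left[\frac{N}{df_t}\right] \cdot \frac{(k_1 + 1)\, tf_{td}}{k_1\left((1 - b) + b \times (L_d / L_{ave})\right) + tf_{td}} \cdot \frac{(k_3 + 1)\, tf_{tq}}{k_3 + tf_{tq}}$$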
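
As a rough illustration of Equation 3, the Python sketch below scores a few toy documents against a short query. It assumes whitespace tokenisation, the plain log(N / df) form of idf from Equation 1, and the commonly quoted values k1 = 1.2 and b = 0.75 as defaults; the function name and sample documents are invented for this example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query using idf, term frequency and length normalisation."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: count each term once per document.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # query term absent from this document
            idf = math.log(n_docs / df[term])
            length_norm = k1 * ((1 - b) + b * len(tokens) / avg_len)
            score += idf * (k1 + 1) * tf[term] / (length_norm + tf[term])
        scores.append(score)
    return scores

# Illustrative documents (assumed for this example only).
docs = ["eggs eggs recipe", "eggs and ham", "green ham"]
scores = bm25_scores("eggs recipe", docs)
```

The first document scores highest because it contains both query terms and repeats one of them; the third contains neither and scores zero. The query-term factor of Equation 4 is left out because the query here is short.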

Conclusion: Enhancement of Indexing Procedures
• When implementing indexing services, individual indexers may prefer different approaches.
• The effectiveness of an index as a search tool will depend on the number of access points provided.
• Different factors influence the recall and precision of any retrieved information.
• Indexing and its usage can be made more sophisticated through the application of certain concepts, such as:
  • Weighted Indexing
  • Linking of terms
  • Role Indicators
  • Subheading
  • Index Language Device
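
Since recall and precision are the measures against which these devices are judged, a small worked example may help. The Python sketch below uses the standard definitions (precision is the share of retrieved items that are relevant, recall is the share of relevant items that were retrieved); the document identifiers are made up for the illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search result."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search: four documents retrieved, three documents actually relevant.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```

Two of the four retrieved documents are relevant, so precision is one half, and two of the three relevant documents were found, so recall is two thirds.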

Weighted Indexing
• Many automatic systems include some form of weighting to allow the ranking of output.
• Weighted indexing lets the searcher vary the exhaustivity of a search.
• It simplifies the process of indexing.
• Weighted indexing assigns a numerical value to individual terms.
• A weighted index offers two ways of retrieval from the database: by major and by minor descriptors.
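
A minimal Python sketch of weighted indexing follows. It assumes a hypothetical 1 to 3 weighting scale in which the higher weights play the role of major descriptors; the records, the scale and the function name are illustrative only.

```python
# Hypothetical weighted index: each term carries a weight from 1 (minor) to 3 (major).
weighted_index = {
    "doc1": {"lung": 3, "treatment": 2, "india": 1},
    "doc2": {"lung": 1, "surgery": 3},
    "doc3": {"treatment": 3, "lung": 2},
}

def search_weighted(index, term, min_weight=1):
    """Return (document, weight) pairs for a term, heaviest first, above a weight threshold."""
    hits = [
        (doc, weights[term])
        for doc, weights in index.items()
        if weights.get(term, 0) >= min_weight
    ]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))

all_hits = search_weighted(weighted_index, "lung")
major_hits = search_weighted(weighted_index, "lung", min_weight=2)
```

Raising `min_weight` restricts the search to documents in which the term is a central topic, which trades recall for precision; lowering it does the opposite.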

Linking of terms
• Linking supports efficient and timely retrieval of appropriate and correct information.
• Inappropriate or irrelevant responses can be avoided by reducing the exhaustivity of the index.
• Unwanted or false associations are removed.
• False associations are avoided by linking index terms.
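
The effect of links can be shown with a small Python sketch. The two documents and their term groups are hypothetical: the first is imagined to discuss the welding of copper and, separately, the coating of aluminium, so a search for welding combined with aluminium should not retrieve it.

```python
# Each document is stored as a list of linked term groups (hypothetical data).
linked_documents = {
    "doc1": [{"welding", "copper"}, {"coating", "aluminium"}],
    "doc2": [{"welding", "aluminium"}],
}

def search_unlinked(documents, *terms):
    """Ignore links: a document matches if the terms occur anywhere in it."""
    wanted = set(terms)
    return sorted(
        doc for doc, groups in documents.items() if wanted <= set().union(*groups)
    )

def search_linked(documents, *terms):
    """Respect links: all terms must occur within the same linked group."""
    wanted = set(terms)
    return sorted(
        doc for doc, groups in documents.items() if any(wanted <= group for group in groups)
    )

loose = search_unlinked(linked_documents, "welding", "aluminium")
strict = search_linked(linked_documents, "welding", "aluminium")
```

Without links the first document is retrieved through a false association between terms that belong to different topics; with links only the second document matches.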

Role Indicators
• Role indicators play an important part in the retrieval of accurate information.
• They use syntax to reduce ambiguity.
• Role indicators were introduced into retrieval systems in the early 1960s.
• The first of its kind was the Engineers Joint Council (EJC) set of role indicators.
• The document surrogate was a ‘telegraphic abstract’ produced by means of a ‘semantic code dictionary’.

Subheadings
• With the advent of automated systems, the need to retrieve precise information gained importance.
• The problem of false or ambiguous associations is now less severe.
• Subheadings can be applied in many post-coordinate index systems.
• They have been successful in reducing ambiguity when searching electronic databases.

Index Language Devices
• Precision devices: Weighting, Links, Role indicators, Subheadings
• Recall devices: Synonym control, Inverse Relation

Before we Conclude…
• The entire discussion was based on the application of indexing techniques and principles to the design of search
engines.
• The aim is to develop software tools that allow the user to perform relatively specific subject searches
on resources of any type.
• Search engines operate by building ‘indexes’ to network resources.
• The concept of Boolean logic is followed for searching purposes.
• Search engines use inverted indexes.

Conclusion
Today the internet has become versatile and is treated as a significant source of information. The
transition from traditional to electronic forms of information resources has paved the way for the creation
of various software tools.
These provide enhanced navigation among resources available in electronic form and within networked
environments. However, various studies indicate that there is much ground to be covered before machines
become intelligent enough to completely replace humans. As of now the role of the human indexer remains
indispensable.
In the days to come, improved indexing techniques and principles will surely be developed, thereby
ensuring efficient and timely retrieval of information from a digitized environment.
