Phrase based Indexing and Information Retrieval

Phrase Based Indexing
By
Bala Abirami

• Introduction of Phrase Based Indexing
• What is Phrase Based Indexing?
• Background of the Invention
• Summary of the Invention
• Spam Detection

Introduction
• An information retrieval system uses phrases to
index, retrieve, organize and describe
documents.
• It originated as a US patent application filed by
Google engineer Anna Lynn Patterson.
• Application filed: July, 2004
• Published: January, 2006

Background of Invention
• Information retrieval systems, generally called
search engines, are now an essential tool for
finding information in large-scale, diverse, and
growing corpora such as the Internet.

• A document is retrieved in response to a query
containing a number of query terms, typically
based on having some number of query terms
present in the document.

• The retrieved documents are then ranked
according to other statistical measures, such as
frequency of occurrence of the query terms, host
domain, link analysis, and the like.

Cont…
• Concepts are often expressed in phrases, such
as "Australian Shepherd," "President of the
United States," or "Sundance Film Festival".
• Accordingly, there is a need for an information
retrieval system and methodology that can
identify phrases, index documents according to
phrases, and search and rank documents in
accordance with their phrases.

Summary
An information retrieval system and
methodology uses phrases to index, search,
rank, and describe documents in the document
collection.

1. Identifying Phrases and Related Phrases
2. Indexing Documents w.r.t Phrases
3. Ranking Documents w.r.t Phrases
4. Creating a description for each document
5. Elimination of Duplicate Documents

Identifying Phrases and Related
Phrases
• Phrases are identified based on their ability to
predict the presence of other phrases in a document.
• The system looks for phrases that have
frequent and/or distinctive usage.
• A prediction measure is used for identifying related
phrases.
• The prediction measure relates the actual
co-occurrence rate of two phrases to the expected
co-occurrence rate of the two phrases.
• Information gain = actual co-occurrence rate /
expected co-occurrence rate

Cont…
• Two phrases are related to each other
when the prediction measure exceeds the
prediction threshold.
• Example:
The phrase “President of the United States”
predicts related phrases such as “White House”
and “George Bush”.
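As a rough illustration of the prediction measure, the sketch below computes information gain as the ratio of actual to expected co-occurrence rate and compares it with a threshold. The function names and arguments are hypothetical. The expected rate is assumed here to be the product of the two phrases' individual document rates (what independence would give), a detail the slides do not spell out.

```python
def information_gain(cooccur_docs, docs_with_a, docs_with_b, total_docs):
    """Ratio of the actual to the expected co-occurrence rate of two phrases."""
    # Actual rate: fraction of documents containing both phrases.
    actual = cooccur_docs / total_docs
    # Expected rate: assumed to be the rate under independence (sketch assumption).
    expected = (docs_with_a / total_docs) * (docs_with_b / total_docs)
    return actual / expected


def is_related(gain, threshold):
    """Two phrases count as related when the gain exceeds the threshold."""
    return gain > threshold
```

For instance, if two phrases co-occur in 50 of 10,000 documents and appear individually in 100 and 200 documents, the gain is about 25, so they would be treated as related under any threshold below that value.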

Indexing documents based on
related Phrases
• An information retrieval system indexes
documents in the document collection by the
valid or good phrases.
• Posting list = the documents that contain the
phrase
• Second list = data indicating which of the
related phrases of a given phrase are also
present in each document containing the
given phrase
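The two lists can be pictured with a small sketch. It assumes documents have already been reduced to sets of good phrases and that a table of related phrases is available. The name `build_index` and the dictionary layout are illustrative choices, not the storage format of the patented system.

```python
from collections import defaultdict


def build_index(docs, related):
    """docs: {doc_id: set of phrases}; related: {phrase: list of related phrases}."""
    posting = defaultdict(list)          # phrase -> ids of documents containing it
    related_present = defaultdict(dict)  # phrase -> {doc_id: related phrases also present}
    for doc_id, phrases in sorted(docs.items()):
        for phrase in phrases:
            posting[phrase].append(doc_id)
            # Record which related phrases of this phrase occur in the same document.
            related_present[phrase][doc_id] = [
                r for r in related.get(phrase, []) if r in phrases
            ]
    return posting, related_present
```

Keeping the second list next to the posting list means the related phrases present in a document can be looked up without scanning the document again.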

Ranking

• Ranking documents is based on two factors
1. Ranking Documents based on Contained
Phrases
2. Ranking Documents based on Anchor
Phrases
• Document Score = Body Hit Score + Anchor Hit
Score
• For example: Body Hit Score = 0.30, Anchor
Hit Score = 0.70
• Document Score = 0.30 + 0.70 = 1.00
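A minimal sketch of this scoring rule follows, assuming the body hit score and anchor hit score of each document have already been computed. The function names and the input layout are hypothetical.

```python
def document_score(body_hit_score, anchor_hit_score):
    """Document score is the sum of the body hit and anchor hit scores."""
    return body_hit_score + anchor_hit_score


def rank(scores):
    """scores: {doc_id: (body_hit_score, anchor_hit_score)} -> doc ids, best first."""
    return sorted(scores, key=lambda d: document_score(*scores[d]), reverse=True)
```

With the example values above, a document scoring 0.30 and 0.70 would rank ahead of one scoring 0.20 and 0.10.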

Phrase Extension
• The information retrieval system is also adapted
to use the phrases when searching for
documents in response to a query.
• A user may enter an incomplete phrase in a
search query, such as "President of the".
Incomplete phrases such as this may be
identified and replaced by a phrase extension,
such as "President of the United States."
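One way to picture phrase extension is as a prefix lookup over known phrases. The sketch below assumes a table of phrase counts and picks the most frequent phrase that starts with the incomplete query. That selection rule and the name `extend_phrase` are assumptions made for illustration, not details taken from the slides.

```python
def extend_phrase(query, phrase_counts):
    """Return the most frequent known phrase that starts with the query, else the query."""
    q = query.lower().strip()
    # A candidate must continue the query with at least one more word.
    candidates = [p for p in phrase_counts if p.lower().startswith(q + " ")]
    if not candidates:
        return query
    return max(candidates, key=lambda p: phrase_counts[p])
```

A query with no known extension is returned unchanged, so complete phrases pass through as they are.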

Descriptions for Documents
• Phrase information is used to create a description
of each document.
• The system identifies the query phrases, related
phrases, and phrase extensions present in each
sentence and keeps a count for each sentence.
• It ranks the sentences by that count.
• It selects some number of top-ranking sentences
as the description and includes it in the search
results.
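The counting and ranking steps above can be sketched as follows. The sketch assumes the document is already split into sentences and that the query phrases, related phrases, and phrase extensions are passed in as one list. Plain case-insensitive substring counting stands in for real phrase matching, and `describe` is a hypothetical name.

```python
def describe(sentences, phrases, top_n=2):
    """Pick the top_n sentences containing the most phrase occurrences."""
    def count(sentence):
        s = sentence.lower()
        return sum(s.count(p.lower()) for p in phrases)

    # sorted() is stable, so sentences with equal counts keep document order.
    ranked = sorted(sentences, key=count, reverse=True)
    return ranked[:top_n]
```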

Eliminating Duplicate documents
• Duplicate documents are identified and eliminated either while
crawling a document or when processing a search
query.
• The description of every document is stored in a hash table,
in association with that document.
• For a newly crawled page, the system compares the hash of its
description with the values stored in the hash table. A
match indicates that the current document is a
duplicate.
• The system keeps whichever document has the higher page rank
or greater document significance and removes the duplicate,
which then no longer appears in search results for
any query.
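A minimal sketch of this duplicate check follows, assuming each document's description is a list of sentences and each document carries a single numeric significance value. SHA-256 is used here only as a convenient hash, and the function names and table layout are hypothetical.

```python
import hashlib


def description_hash(description_sentences):
    """Hash the joined description sentences (SHA-256 chosen for this sketch)."""
    joined = " ".join(description_sentences)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def add_document(table, doc_id, description_sentences, significance):
    """table: {hash: (doc_id, significance)}. Returns the id dropped as a duplicate, or None."""
    key = description_hash(description_sentences)
    if key not in table:
        table[key] = (doc_id, significance)
        return None
    kept_id, kept_sig = table[key]
    if significance > kept_sig:
        # The new document is more significant: keep it and drop the old one.
        table[key] = (doc_id, significance)
        return kept_id
    return doc_id
```

Only one document per description hash stays in the table, so the dropped id can be excluded from later search results.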

Functions of Indexing system

• Identifies phrases in documents
• Indexes documents according to their
phrases by accessing various websites

Functions of Front End Server

• Receives queries from a user
• Provides those queries to the search system

Functions of Searching System

• Searches for documents relevant to the
search query
• Identifies the phrases in the search query
• Ranks the documents

Functions of Presentation system

• Modifies the search results, including
removing duplicate content
• Generates topical descriptions of
documents and provides the modified
search results

Spam Detection
• “Spam” pages have little meaningful content,
but may instead be made up of large
collections of popular words and phrases.
These are sometimes referred to as “keyword
stuffed pages”.

• Pages containing specific words and phrases
that advertisers might be interested in are
often called “honeypots,” and are created for
search engines to display along with paid
advertisements.

Cont…
• A phrase based indexing system knows the
number of related phrases in a document.

• A normal, non-spam document will generally
have a relatively limited number of related
phrases, typically on the order of 8 to
20, depending on the document collection.

• A spam document will have an excessive
number of related phrases, for example on the
order of 100 to 1000 related phrases.
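The check described above can be sketched as a simple count against a cutoff. The sketch assumes a document is represented as a set of phrases and that a related-phrase table is available. The default cutoff of 100 is borrowed from the lower end of the spam range quoted above and is an illustrative assumption, not a value prescribed by the system.

```python
def count_related_phrases(doc_phrases, related):
    """Count distinct related phrases that are present in the document."""
    present = set()
    for phrase in doc_phrases:
        for r in related.get(phrase, ()):
            if r in doc_phrases:
                present.add(r)
    return len(present)


def looks_like_spam(doc_phrases, related, max_related=100):
    """Flag a document whose related-phrase count exceeds the cutoff."""
    return count_related_phrases(doc_phrases, related) > max_related
```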

Advantages of Phrase Based
Indexing

• Detecting Duplicate Pages
• Spam Detection
• Saving time

Other Patent Applications
• Phrase identification in an information retrieval system

• Phrase-based searching in an information retrieval system

• Phrase-based generation of document descriptions

• Detecting spam documents in a phrase based information
retrieval system

• Efficient Phrase Based Document Indexing for Document
Clustering

According to data collected from users of European Web
analytics provider OneStat, most people use 2- or 3-word
queries in search engines:

Two-word phrases -- 28.38 percent
Three-word phrases -- 27.15 percent
Four-word phrases -- 16.42 percent
One-word phrase -- 13.48 percent
Five-word phrases -- 8.03 percent
Six-word phrases -- 3.67 percent
Seven-word phrases -- 1.63 percent
Eight-word phrases -- 0.73 percent
Nine-word phrases -- 0.34 percent
Ten-word phrases -- 0.16 percent
