Phrase Based Indexing
           By
      Bala Abirami
•   Introduction of Phrase Based Indexing
•   What is Phrase Based Indexing?
•   Background of Invention
•   Summary on Invention
•   Spam Detection
Introduction
• An information retrieval system uses phrases to
  index, retrieve, organize and describe
  documents.
• It was described in a US patent application
  submitted by Google engineer Anna Lynn Patterson.
• Application filed: July, 2004
• Published: January, 2006
Background of Invention
• Information retrieval systems, generally called
  search engines, are now an essential tool for
  finding information in large scale, diverse, and
  growing corpuses such as the Internet.

• A document is retrieved in response to a query
  containing a number of query terms, typically
  based on having some number of query terms
  present in the document.

• The retrieved documents are then ranked
  according to other statistical measures, such as
  frequency of occurrence of the query terms, host
  domain, link analysis, and the like.
Cont…
• Concepts are often expressed in phrases, such
  as "Australian Shepherd," "President of the
  United States," or "Sundance Film Festival".
• Accordingly, there is a need for an information
  retrieval system and methodology that can
  identify phrases, index documents according to
  phrases, search and rank documents in
  accordance with their phrases.
Summary
  An information retrieval system and
  methodology uses phrases to index, search,
  rank, and describe documents in the document
  collection.

1. Identifying Phrases and Related Phrases
2. Indexing Documents w.r.t Phrases
3. Ranking Documents w.r.t Phrases
4. Creating description for the document
5. Elimination of Duplicate Documents
Identifying Phrase and Related
               Phrases
• Phrases are identified based on their ability
  to predict the presence of other phrases in a
  document.
• The system looks for phrases that have
  frequent and/or distinguished (unique)
  usage.
• A prediction measure is used to identify related
  phrases.
• The prediction measure relates the actual
  co-occurrence rate of two phrases to their
  expected co-occurrence rate.
• Information gain = actual co-occurrence rate /
  expected co-occurrence rate
Cont…
• Two phrases are related to each other
  when the prediction measure exceeds the
  prediction threshold.
• Example:
  The phrase “President of the United States”
  predicts related phrases such as “White House”
  and “George Bush”.
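The prediction measure and threshold test can be sketched in Python. This is a minimal illustration, not the patented method: it assumes the expected co-occurrence rate is the product of each phrase's individual document rate (an independence assumption), and the function names, toy documents, and threshold of 1.5 are all hypothetical.

```python
def information_gain(docs, phrase_a, phrase_b):
    """Ratio of actual to expected co-occurrence rate of two phrases.

    `docs` is a list of sets, each holding the phrases found in one document.
    Assumption: expected rate = P(a) * P(b), i.e. the phrases are independent.
    """
    total = len(docs)
    count_a = sum(1 for d in docs if phrase_a in d)
    count_b = sum(1 for d in docs if phrase_b in d)
    count_both = sum(1 for d in docs if phrase_a in d and phrase_b in d)
    if total == 0 or count_a == 0 or count_b == 0:
        return 0.0
    actual = count_both / total
    expected = (count_a / total) * (count_b / total)
    return actual / expected


def is_related(docs, phrase_a, phrase_b, threshold=1.5):
    """Two phrases are related when the measure exceeds the threshold."""
    return information_gain(docs, phrase_a, phrase_b) > threshold


# Toy collection: each set lists the phrases identified in one document.
docs = [
    {"president of the united states", "white house"},
    {"president of the united states", "white house"},
    {"australian shepherd"},
    {"sundance film festival"},
]
gain = information_gain(docs, "president of the united states", "white house")
```

In this toy collection the two phrases co-occur in half of the documents while independence would predict a quarter, so the measure exceeds the illustrative threshold.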
Indexing documents based on
           related Phrases
• An information retrieval system indexes
  documents in the document collection by the
  valid or good phrases.
• Posting list = the documents that contain the
  phrase
• Second list = data indicating which related
  phrases of the given phrase are also present
  in each document that contains the given
  phrase
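The two lists can be sketched as a nested dictionary in Python. This is a simplified illustration under assumed data shapes: `build_index`, the `docs` mapping, and the `related` mapping are hypothetical names, and a real system would use a more compact encoding than Python sets.

```python
from collections import defaultdict


def build_index(docs, related):
    """Build a phrase index.

    docs:    {doc_id: set of phrases found in that document}
    related: {phrase: list of its related phrases}
    Returns  {phrase: {doc_id: related phrases also present in that doc}}.
    The doc ids under a phrase form its posting list; the inner sets play
    the role of the second list.
    """
    index = defaultdict(dict)
    for doc_id, phrases in docs.items():
        for phrase in phrases:
            present = {r for r in related.get(phrase, []) if r in phrases}
            index[phrase][doc_id] = present
    return index


docs = {
    1: {"president of the united states", "white house"},
    2: {"president of the united states"},
    3: {"australian shepherd"},
}
related = {"president of the united states": ["white house", "george bush"]}

index = build_index(docs, related)
posting_list = sorted(index["president of the united states"])
```

Here `posting_list` holds the documents containing the phrase, and `index[phrase][doc_id]` records which related phrases appear alongside it in each of those documents.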
Ranking

•   Documents are ranked on two factors:
      1. Phrases contained in the document (body hits)
      2. Anchor phrases (anchor hits)
•   Document Score = Body Hit Score + Anchor Hit
    Score
•   For example: Body Hit Score = 0.30, Anchor
    Hit Score = 0.70
•   Document Score = 0.30 + 0.70 = 1.00
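The scoring rule can be written out as a short Python sketch. It only shows the sum and the resulting ordering; how the body and anchor hit scores themselves are computed is not covered here, and the document ids and the second document's scores are made up for illustration.

```python
def document_score(body_hit_score, anchor_hit_score):
    """Document score is the sum of the body and anchor hit scores."""
    return body_hit_score + anchor_hit_score


def rank(scores):
    """scores: {doc_id: (body_hit_score, anchor_hit_score)}.

    Returns the doc ids ordered from highest to lowest document score.
    """
    return sorted(scores, key=lambda d: document_score(*scores[d]), reverse=True)


# doc_a uses the example values from the slide; doc_b is hypothetical.
scores = {"doc_a": (0.30, 0.70), "doc_b": (0.20, 0.10)}
ranking = rank(scores)
```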
Phrase Extension
• The information retrieval system is also adapted
  to use the phrases when searching for
  documents in response to a query.
• A user may enter an incomplete phrase in a
  search query, such as "President of the".
  Incomplete phrases such as these may be
  identified and replaced by a phrase extension,
  such as "President of the United States."
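One way to picture phrase extension is a prefix lookup against the known phrases, sketched below in Python. The selection rule (pick the most frequent known phrase that begins with the incomplete phrase) is an assumption made for this example, and `extend_phrase`, the `known` table, and its frequencies are hypothetical.

```python
def extend_phrase(incomplete, known_phrases):
    """Return the most frequent known phrase that starts with `incomplete`.

    known_phrases: {phrase: frequency}. If no known phrase extends the
    input, the input is returned unchanged.
    Assumption: frequency decides between competing extensions.
    """
    prefix = incomplete.lower().strip()
    candidates = [p for p in known_phrases if p.startswith(prefix + " ")]
    if not candidates:
        return incomplete
    return max(candidates, key=lambda p: known_phrases[p])


# Hypothetical phrase table with made-up frequencies.
known = {
    "president of the united states": 120,
    "president of the senate": 15,
    "white house": 90,
}
extended = extend_phrase("President of the", known)
```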
Descriptions for Documents
• Phrase information is used to create a description
  of a document.
• The system identifies the query phrases,
  related phrases and phrase extensions present in
  each sentence and keeps a count for each sentence.
• It ranks the sentences by count.
• It selects some number of top-ranking sentences
  as the description and includes it in the search
  results.
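The counting and selection steps can be sketched in Python as follows. This is a simplified illustration: phrases are matched as lowercase substrings, the chosen sentences are returned in document order (a design choice made here, not stated in the slides), and `describe` and the sample sentences are hypothetical.

```python
def describe(sentences, phrases, top_n=2):
    """Pick the top_n sentences that contain the most of the given phrases.

    `phrases` stands for the query phrases, related phrases and phrase
    extensions combined. Ties keep their original order, and the selected
    sentences are returned in document order.
    """
    counts = [sum(1 for p in phrases if p in s.lower()) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: counts[i], reverse=True)
    chosen = sorted(ranked[:top_n])
    return [sentences[i] for i in chosen]


sentences = [
    "The President of the United States lives in the White House.",
    "The weather was mild that day.",
    "Reporters gathered outside the White House.",
]
phrases = ["president of the united states", "white house"]
description = describe(sentences, phrases, top_n=2)
```

The first sentence matches two phrases, the third matches one, and the second matches none, so the first and third sentences form the description.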
Eliminating Duplicate documents
• Duplicate documents are identified and eliminated while
  crawling a document or when processing the search
  query.
• The description of every document is stored, in
  hashed form, in a hash table.
• For a newly crawled page, the system hashes its
  description and looks up that value in the hash table.
  A match indicates that the current document is a
  duplicate.
• The system keeps the document with the higher page rank
  or greater document significance and removes the
  duplicate, which will not appear in future search results
  for any query.
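The hash-table check can be sketched in Python. This is an illustration under stated assumptions: SHA-256 stands in for whatever hash function a real system uses, a single numeric `significance` value stands in for page rank or document significance, and `description_hash`, `deduplicate`, and the sample documents are hypothetical.

```python
import hashlib


def description_hash(description):
    """Hash a document description (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(description.encode("utf-8")).hexdigest()


def deduplicate(docs):
    """docs: iterable of (doc_id, description, significance).

    For each distinct description hash, keep only the document with the
    highest significance. Returns the kept doc ids in sorted order.
    """
    kept = {}  # description hash -> (doc_id, significance)
    for doc_id, description, significance in docs:
        key = description_hash(description)
        if key not in kept or significance > kept[key][1]:
            kept[key] = (doc_id, significance)
    return sorted(doc_id for doc_id, _ in kept.values())


docs = [
    ("a", "The President lives in the White House.", 0.4),
    ("b", "The President lives in the White House.", 0.9),
    ("c", "Australian Shepherds are herding dogs.", 0.1),
]
survivors = deduplicate(docs)
```

Documents "a" and "b" share a description, so only "b", the more significant of the two, is kept.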
Functions of Indexing system

• Identifies phrases in documents
• Indexes documents according to their
  phrases by accessing various websites.

Functions of Front End Server

• Receives queries from a user
• Provides those queries to the search system
Functions of Searching System

• Searches for documents relevant to the
  search query
• Identifies the phrases in the search query
• Ranks the documents

Functions of Presentation system

• Modifies the search results, including
  removing duplicate content.
• Generates topical descriptions of
  documents and provides the modified
  search results.
Spam Detection
• “Spam” pages have little meaningful content,
  but may instead be made up of large
  collections of popular words and phrases.
  These are sometimes referred to as “keyword
  stuffed pages”.

• Pages containing specific words and phrases
  that advertisers might be interested in are
  often called “honeypots,” and are created for
  search engines to display along with paid
  advertisements.
Cont…
• A phrase based indexing system knows the
  number of related phrases in a document.

• A normal, non-spam document will generally
  have a relatively limited number of related
  phrases, typically on the order of between 8 and
  20, depending on the document collection.

• A spam document will have an excessive
  number of related phrases, for example on the
  order of between 100 and 1000 related phrases.
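The related-phrase count test can be sketched in Python. The threshold of 100 is an assumption taken from the lower end of the spam range quoted above, not a value the system is known to use, and the function names and synthetic phrase data are hypothetical.

```python
def count_related_phrases(doc_phrases, related):
    """Count the distinct related phrases present in a document.

    doc_phrases: set of phrases found in the document
    related:     {phrase: list of its related phrases}
    """
    found = set()
    for phrase in doc_phrases:
        found.update(r for r in related.get(phrase, []) if r in doc_phrases)
    return len(found)


def looks_like_spam(doc_phrases, related, threshold=100):
    """Flag a document whose related-phrase count reaches the threshold."""
    return count_related_phrases(doc_phrases, related) >= threshold


# Synthetic data: one seed phrase with 150 made-up related phrases.
related = {"seed": [f"phrase {i}" for i in range(150)]}
normal_doc = {"seed"} | set(related["seed"][:10])
spam_doc = {"seed"} | set(related["seed"])
```

With these inputs the normal document contains 10 related phrases and the keyword-stuffed one contains 150, so only the latter is flagged.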
Advantages of Phrase Based
            Indexing

• Detecting Duplicate Pages
• Spam Detection
• Saves time
Other Patent Applications
• Phrase identification in an information retrieval system

• Phrase-based searching in an information retrieval system

• Phrase-based generation of document descriptions

• Detecting spam documents in a phrase based information
  retrieval system

• Efficient Phrase Based Document Indexing for Document
  Clustering
According to data collected from users of European Web
 analytics provider OneStat, most people use 2- or 3-word
 queries in search engines:


Two-word phrases -- 28.38 percent
Three-word phrases -- 27.15 percent
Four-word phrases -- 16.42 percent
One-word phrases -- 13.48 percent
Five-word phrases -- 8.03 percent
Six-word phrases -- 3.67 percent
Seven-word phrases -- 1.63 percent
Eight-word phrases -- 0.73 percent
Nine-word phrases -- 0.34 percent
Ten-word phrases -- 0.16 percent
Thank you
