Phrase Based Indexing



Slide on Phrase Based Indexing concepts.






    • Phrase Based Indexing By Bala Abirami
      • Introduction of Phrase Based Indexing
      • What is Phrase Based Indexing?
      • Background of the Invention
      • Summary of the Invention
      • Spam Detection
    • Introduction
      • An information retrieval system uses phrases to index, retrieve, organize and describe documents.
      • It originated as a patent application filed in the US by Google engineer Anna Lynn Patterson.
      • Application filed: July, 2004
      • Published: January, 2006
    • Background of Invention
      • Information retrieval systems, generally called search engines, are now an essential tool for finding information in large-scale, diverse, and growing corpora such as the Internet.
      • A document is retrieved in response to a query containing a number of query terms, typically based on having some number of query terms present in the document.
      • The retrieved documents are then ranked according to other statistical measures, such as frequency of occurrence of the query terms, host domain, link analysis, and the like.
    • Cont…
      • Concepts are often expressed in phrases, such as "Australian Shepherd," "President of the United States," or "Sundance Film Festival".
      • Accordingly, there is a need for an information retrieval system and methodology that can identify phrases, index documents according to phrases, and search and rank documents in accordance with their phrases.
    • Summary
      • An information retrieval system and methodology uses phrases to index, search, rank, and describe documents in the document collection.
      • 1. Identifying Phrases and Related Phrases
      • 2. Indexing Documents w.r.t Phrases
      • 3. Ranking Documents w.r.t Phrases
      • 4. Creating description for the document
      • 5. Elimination of Duplicate Documents
    • Identifying Phrase and Related Phrases
      • Identification is based on a phrase's ability to predict the presence of other phrases in a document.
      • The system looks for phrases that have frequent and/or distinctive usage.
      • A prediction measure is used to identify related phrases.
      • The prediction measure relates the actual co-occurrence rate of two phrases to their expected co-occurrence rate.
      • Information gain = actual co-occurrence rate / expected co-occurrence rate
    • Cont…
      • Two Phrases are related to each other when the prediction measure exceeds the prediction threshold.
      • Example:
      • The phrase “President of the United States” predicts related phrases such as “White House” and “George Bush”.
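
The prediction measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: co-occurrence is counted at the level of whole documents, the expected rate assumes the two phrases occur independently, and the function name `related_phrases` and the threshold value are hypothetical rather than taken from the patent.

```python
from collections import Counter
from itertools import permutations


def related_phrases(docs, threshold=1.2):
    """Return {(j, k): gain} for phrase pairs whose information gain exceeds threshold.

    docs is a list of sets, one set of phrases per document (an assumed input shape).
    """
    n = len(docs)
    doc_freq = Counter()  # number of documents containing each phrase
    co_freq = Counter()   # number of documents containing both phrases of a pair
    for phrases in docs:
        doc_freq.update(phrases)
        co_freq.update(permutations(sorted(phrases), 2))

    related = {}
    for (j, k), together in co_freq.items():
        actual = together / n                             # actual co-occurrence rate
        expected = (doc_freq[j] / n) * (doc_freq[k] / n)  # rate expected under independence
        gain = actual / expected                          # information gain as a ratio
        if gain > threshold:
            related[(j, k)] = gain
    return related


docs = [
    {"president of the united states", "white house"},
    {"president of the united states", "white house"},
    {"australian shepherd"},
    {"australian shepherd", "white house"},
]
pairs = related_phrases(docs)
```

With this toy collection, only the pair "president of the united states" and "white house" exceeds the threshold, in both directions.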
    • Indexing documents based on related Phrases
      • An information retrieval system indexes documents in the document collection by the valid or good phrases.
      • Posting List = documents that contain the phrase
      • Second List = used to store data indicating which of the related phrases of the given phrase are also present in each document containing the given phrase
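
A minimal sketch of the two lists, assuming phrases have already been extracted per document and that each phrase has a fixed, ordered list of related phrases. The names `build_index`, `posting`, and `related_present` are hypothetical, and a plain list of booleans stands in for whatever compact encoding a real system would use.

```python
from collections import defaultdict


def build_index(docs, related):
    """Build the posting list and the related-phrase list for each phrase.

    docs:    {doc_id: set of phrases found in that document}
    related: {phrase: ordered list of its related phrases}
    """
    posting = defaultdict(list)          # phrase -> ids of documents containing it
    related_present = defaultdict(dict)  # phrase -> {doc_id: [flag per related phrase]}
    for doc_id in sorted(docs):
        phrases = docs[doc_id]
        for phrase in phrases:
            posting[phrase].append(doc_id)
            # One flag per related phrase: is it also present in this document?
            related_present[phrase][doc_id] = [
                r in phrases for r in related.get(phrase, [])
            ]
    return posting, related_present


docs = {
    1: {"president of the united states", "white house"},
    2: {"president of the united states"},
}
related = {"president of the united states": ["white house", "george bush"]}
posting, related_present = build_index(docs, related)
```

Here document 1 is flagged as containing the first related phrase but not the second, while document 2 contains neither.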
    • Ranking
      • Ranking documents is based on two factors
      • 1. Ranking Documents based on Contained Phrases
      • 2. Ranking Documents based on Anchor Phrases
      • Document Score = Body Hit Score + Anchor Hit Score
      • For example: Body Hit Score = 0.30, Anchor Hit Score = 0.70
      • Document Score = 0.30 + 0.70 = 1.00
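
The scoring rule on this slide amounts to a sum followed by a sort. The sketch below assumes the two component scores are already computed for each document; the function name `rank_documents` and the sample values for the second document are hypothetical.

```python
def rank_documents(hits):
    """Rank documents by body hit score plus anchor hit score, best first.

    hits: {doc_id: (body_hit_score, anchor_hit_score)}
    """
    scored = {doc_id: body + anchor for doc_id, (body, anchor) in hits.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


hits = {"doc_a": (0.30, 0.70), "doc_b": (0.10, 0.20)}
ranking = rank_documents(hits)
```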
    • Phrase Extension
      • The information retrieval system is also adapted to use the phrases when searching for documents in response to a query.
      • A user may enter an incomplete phrase in a search query, such as "President of the".
      • Incomplete phrases such as these may be identified and replaced by a phrase extension, such as "President of the United States."
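
One way to picture phrase extension is a prefix lookup over the known phrases. The sketch below is an assumption-laden illustration: choosing the most frequent matching phrase is a stand-in selection rule, and the function name `extend_phrase` and the frequency figures are hypothetical.

```python
def extend_phrase(query, known_phrases):
    """Replace an incomplete phrase with the most frequent known phrase that extends it.

    known_phrases: {phrase: frequency in the collection}
    Returns the query unchanged when no known phrase extends it.
    """
    prefix = query.strip().lower() + " "
    candidates = [p for p in known_phrases if p.lower().startswith(prefix)]
    if not candidates:
        return query
    return max(candidates, key=lambda p: known_phrases[p])


known = {
    "President of the United States": 50,
    "President of the Senate": 5,
    "Sundance Film Festival": 9,
}
extended = extend_phrase("President of the", known)
```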
    • Descriptions for Documents
      • Phrase information is used to create a description of a document.
      • The system identifies the query phrases, related phrases, and phrase extensions present in each sentence and keeps a count for each sentence.
      • Ranks the sentences based on the count.
      • Selects some number of top ranking sentences as description and includes it in the search results.
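
The sentence-counting steps above can be sketched as follows. This assumes simple case-insensitive substring matching and a pre-split list of sentences; the function name `describe`, the `top_n` parameter, and the choice to present the selected sentences in their original order are illustrative assumptions.

```python
def describe(sentences, phrases, top_n=2):
    """Build a description from the sentences that mention the most phrases.

    phrases holds the query phrases, related phrases, and phrase extensions.
    """
    def count(sentence):
        text = sentence.lower()
        return sum(text.count(p.lower()) for p in phrases)

    # Rank sentence positions by phrase count, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: count(sentences[i]), reverse=True)
    # Keep the top-ranked sentences, restored to document order.
    chosen = sorted(ranked[:top_n])
    return " ".join(sentences[i] for i in chosen)


sentences = [
    "The White House hosted a dinner.",
    "The President of the United States spoke at the White House.",
    "The weather was mild.",
]
phrases = ["president of the united states", "white house"]
summary = describe(sentences, phrases, top_n=1)
```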
    • Eliminating Duplicate documents
      • Duplicate documents are identified and eliminated while crawling a document or when processing a search query.
      • The description of every document is stored, in hashed form, in a hash table.
      • The description of a newly crawled page is hashed and compared with the hash values stored in the hash table. A match indicates that the current document is a duplicate.
      • The system keeps the document with the higher page rank or greater document significance and removes the duplicate, which then no longer appears in search results for any query.
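
A rough sketch of hash-based duplicate elimination, assuming each document arrives with a ready-made description and a page rank value. SHA-256 is used only as a convenient hash; the function name `deduplicate` and the tuple layout are hypothetical.

```python
import hashlib


def deduplicate(docs):
    """Keep one document per distinct description, preferring the higher page rank.

    docs: iterable of (doc_id, description, page_rank) tuples.
    Returns the sorted ids of the documents that are kept.
    """
    kept = {}  # description hash -> (doc_id, page_rank)
    for doc_id, description, page_rank in docs:
        key = hashlib.sha256(description.encode("utf-8")).hexdigest()
        current = kept.get(key)
        # A matching hash means a duplicate: keep whichever ranks higher.
        if current is None or page_rank > current[1]:
            kept[key] = (doc_id, page_rank)
    return sorted(doc_id for doc_id, _ in kept.values())


crawled = [
    ("a", "same description text", 0.2),
    ("b", "same description text", 0.9),
    ("c", "a different description", 0.1),
]
survivors = deduplicate(crawled)
```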
    • Functions of the Indexing System
      • Identifies phrases in documents.
      • Indexes documents according to their phrases by accessing various websites.
    • Functions of the Front End Server
      • Receives queries from a user.
      • Provides those queries to the search system.
    • Functions of the Search System
      • Searches for documents relevant to the search query.
      • Identifies the phrases in the search query.
      • Ranks the documents.
    • Functions of the Presentation System
      • Modifies the search results, including removing duplicate content.
      • Generates topical descriptions of documents and provides the modified search results.
    • Spam Detection
      • “Spam” pages have little meaningful content, but may instead be made up of large collections of popular words and phrases. These are sometimes referred to as “keyword stuffed pages”.
      • Pages containing specific words and phrases that advertisers might be interested in are often called “honeypots”, and are created for search engines to display along with paid advertisements.
    • Cont…
      • A phrase based indexing system knows the number of related phrases in a document.
      • A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection.
      • A spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases.
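
The related-phrase count lends itself to a simple threshold test. The sketch below assumes a precomputed map from each phrase to its related phrases; the function names are hypothetical, and the default threshold of 100 is only borrowed from the lower end of the spam range quoted above, not a value prescribed by the patent.

```python
def count_related_phrases(doc_phrases, related):
    """Count distinct phrases in the document that are related to another phrase in it.

    related: {phrase: iterable of its related phrases}
    """
    present = set(doc_phrases)
    found = set()
    for phrase in present:
        found.update(r for r in related.get(phrase, ()) if r in present)
    return len(found)


def looks_like_spam(doc_phrases, related, threshold=100):
    """Flag a document whose related-phrase count reaches the threshold."""
    return count_related_phrases(doc_phrases, related) >= threshold


related = {"a": ["b", "c"], "b": ["a"]}
n_related = count_related_phrases(["a", "b", "d"], related)
```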
    • Advantages of Phrase Based Indexing
      • Detecting Duplicate Pages
      • Spam Detection
      • Save time
    • Other Patent Applications
      • Phrase identification in an information retrieval system
      • Phrase-based searching in an information retrieval system
      • Phrase-based generation of document descriptions
      • Detecting spam documents in a phrase based information retrieval system
      • Efficient Phrase Based Document Indexing for Document Clustering
      • According to data collected from users of European Web analytics provider OneStat, most people use 2- or 3-word queries in search engines:
        • Two-word phrases: 28.38 percent
        • Three-word phrases: 27.15 percent
        • Four-word phrases: 16.42 percent
        • One-word phrases: 13.48 percent
        • Five-word phrases: 8.03 percent
        • Six-word phrases: 3.67 percent
        • Seven-word phrases: 1.63 percent
        • Eight-word phrases: 0.73 percent
        • Nine-word phrases: 0.34 percent
        • Ten-word phrases: 0.16 percent
    • Thank you