Transcript

  • 1. Phrase Based Indexing By Bala Abirami
  • 2.
    • Introduction of Phrase Based Indexing
    • What is Phrase Based Indexing?
    • Background of the Invention
    • Summary of the Invention
    • Spam Detection
  • 3. Introduction
    • An information retrieval system uses phrases to index, retrieve, organize and describe documents.
    • It originated as a US patent application filed by Google engineer Anna Lynn Patterson
    • Application filed: July, 2004
    • Published: January, 2006
  • 4. Background of Invention
    • Information retrieval systems, generally called search engines, are now an essential tool for finding information in large scale, diverse, and growing corpuses such as the Internet.
    • A document is retrieved in response to a query containing a number of query terms, typically based on having some number of query terms present in the document.
    • The retrieved documents are then ranked according to other statistical measures, such as frequency of occurrence of the query terms, host domain, link analysis, and the like.
  • 5. Cont…
    • Concepts are often expressed in phrases, such as "Australian Shepherd," "President of the United States," or "Sundance Film Festival".
    • Accordingly, there is a need for an information retrieval system and methodology that can identify phrases, index documents according to phrases, and search and rank documents in accordance with their phrases.
  • 6. Summary
    • An information retrieval system and methodology uses phrases to index, search, rank, and describe documents in the document collection.
    • 1. Identifying Phrases and Related Phrases
    • 2. Indexing Documents w.r.t Phrases
    • 3. Ranking Documents w.r.t Phrases
    • 4. Creating description for the document
    • 5. Elimination of Duplicate Documents
  • 7. Identifying Phrase and Related Phrases
    • Phrases are identified based on their ability to predict the presence of other phrases in a document.
    • The system looks for phrases that have frequent and/or distinctive usage.
    • A prediction measure is used for identifying related phrases.
    • The prediction measure relates the actual co-occurrence rate of two phrases to the expected co-occurrence rate of the two phrases.
    • Information gain = actual co-occurrence rate / expected co-occurrence rate
  • 8. Cont…
    • Two phrases are related to each other when the prediction measure exceeds the prediction threshold.
    • Example:
    • The phrase "President of the United States" predicts related phrases such as "White House" and "George Bush".
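The prediction measure from the two slides above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patent's implementation: each document is assumed to be already reduced to a set of phrases, and the expected co-occurrence rate is assumed to be the product of the two phrases' individual document rates (an independence assumption). The names `related_phrases` and `DOCS` and the threshold value 1.5 are hypothetical.

```python
from collections import Counter
from itertools import combinations


def related_phrases(docs, threshold=1.5):
    """Return {(phrase_a, phrase_b): gain} for pairs whose gain exceeds threshold.

    docs is a list of sets, each holding the phrases found in one document.
    """
    n = len(docs)
    phrase_count = Counter()  # number of documents containing each phrase
    pair_count = Counter()    # number of documents containing both phrases
    for phrases in docs:
        phrase_count.update(phrases)
        pair_count.update(combinations(sorted(phrases), 2))

    related = {}
    for (a, b), together in pair_count.items():
        actual = together / n
        # Assumption: phrases occurring independently would co-occur at P(a) * P(b).
        expected = (phrase_count[a] / n) * (phrase_count[b] / n)
        gain = actual / expected
        if gain > threshold:
            related[(a, b)] = gain
    return related


# Hypothetical toy collection of six documents.
DOCS = [
    {"president of the united states", "white house"},
    {"president of the united states", "white house"},
    {"australian shepherd"},
    {"australian shepherd", "white house"},
    {"sundance film festival"},
    {"sundance film festival"},
]

RELATED = related_phrases(DOCS, threshold=1.5)
print(RELATED)
```

In this toy collection only the pair "president of the united states" / "white house" clears the threshold, because the two phrases appear together twice as often as independence would predict.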
  • 9. Indexing documents based on related Phrases
    • An information retrieval system indexes documents in the document collection by the valid or good phrases.
    • Posting List = documents that contain the phrase
    • Second List = used to store data indicating which of the related phrases of the given phrase are also present in each document containing the given phrase
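One way to picture the two lists is a nested mapping: for each phrase, the keys are the documents containing it (the posting list), and each value records which of its related phrases also occur in that document (the second list). The sketch below assumes phrases are already extracted per document. `build_index`, `DOCS_BY_ID`, and `RELATED_TO` are hypothetical names, and a real index would use a more compact encoding.

```python
from collections import defaultdict


def build_index(docs, related):
    """Build {phrase: {doc_id: [related phrases also present in that doc]}}.

    docs maps a document id to the set of phrases found in it.
    related maps a phrase to the list of its related phrases.
    """
    index = defaultdict(dict)
    for doc_id, phrases in docs.items():
        for phrase in phrases:
            # Second list: related phrases of `phrase` that this document also contains.
            present = [r for r in related.get(phrase, []) if r in phrases]
            index[phrase][doc_id] = present
    return index


# Hypothetical sample data.
DOCS_BY_ID = {
    1: {"president of the united states", "white house"},
    2: {"president of the united states"},
}
RELATED_TO = {
    "president of the united states": ["white house", "george bush"],
}

INDEX = build_index(DOCS_BY_ID, RELATED_TO)
posting_list = sorted(INDEX["president of the united states"])
print(posting_list)
print(INDEX["president of the united states"][1])
```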
  • 10. Ranking
    • Ranking documents is based on two factors
    • 1. Ranking Documents based on Contained Phrases
    • 2. Ranking Documents based on Anchor Phrases
    • Document Score = Body Hit Score + Anchor Hit Score
    • For example: Body Hit Score = 0.30, Anchor Hit Score = 0.70
    • Document Score = 0.30 + 0.70 = 1.00
  • 11. Phrase Extension
    • The information retrieval system is also adapted to use the phrases when searching for documents in response to a query.
    • A user may enter an incomplete phrase in a search query, such as "President of the".
    • Incomplete phrases such as these may be identified and replaced by a phrase extension, such as "President of the United States."
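Phrase extension can be illustrated with a simple prefix lookup. The sketch below is an assumption about how a replacement might be chosen, not the method in the patent application: it picks the most frequent known phrase that starts with the incomplete query. `extend_phrase`, `KNOWN_PHRASES`, and the counts are hypothetical.

```python
def extend_phrase(query, known_phrases):
    """Return the most frequent known phrase that extends the query.

    known_phrases maps a phrase to a hypothetical usage count.
    If no known phrase extends the query, the query is returned unchanged.
    """
    prefix = query.strip().lower() + " "
    candidates = [p for p in known_phrases if p.lower().startswith(prefix)]
    if not candidates:
        return query
    # Assumption: the most frequent extension is the most likely intent.
    return max(candidates, key=known_phrases.get)


# Hypothetical phrase counts.
KNOWN_PHRASES = {
    "President of the United States": 120,
    "President of the Senate": 15,
    "Australian Shepherd": 40,
}

print(extend_phrase("President of the", KNOWN_PHRASES))
```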
  • 12. Descriptions for Documents
    • Phrase information is used to create a description of a document.
    • The system identifies the query phrases, related phrases, and phrase extensions present in each sentence and keeps a count for each sentence.
    • It ranks the sentences based on the count.
    • It selects some number of top-ranking sentences as the description and includes it in the search results.
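The sentence-ranking steps above can be sketched as follows. This is a simplified illustration: it assumes a naive sentence splitter on end punctuation and case-insensitive substring counting, and it treats query phrases, related phrases, and phrase extensions as one combined list. `describe`, `PAGE_TEXT`, and `PHRASES` are hypothetical names.

```python
import re


def describe(document, phrases, top_n=2):
    """Return the top_n sentences containing the most phrase occurrences."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    lowered = [p.lower() for p in phrases]

    def phrase_count(sentence):
        text = sentence.lower()
        return sum(text.count(p) for p in lowered)

    # sorted() is stable, so ties keep their original document order.
    ranked = sorted(sentences, key=phrase_count, reverse=True)
    return ranked[:top_n]


# Hypothetical sample page and phrase list.
PAGE_TEXT = (
    "The President of the United States lives in the White House. "
    "The weather was mild today. "
    "The White House hosted a dinner."
)
PHRASES = ["President of the United States", "White House"]

for sentence in describe(PAGE_TEXT, PHRASES, top_n=2):
    print(sentence)
```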
  • 13. Eliminating Duplicate documents
    • Duplicate documents are identified and eliminated while crawling a document or when processing the search query.
    • A hash of each document's description is stored in a hash table.
    • The system compares the description of a newly crawled page against the hash values stored in the hash table. A match indicates that the current document is a duplicate.
    • The system keeps the document with the higher PageRank or greater document significance and removes the duplicate, which then no longer appears in search results for any query.
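A minimal sketch of the duplicate check, assuming the description is a list of sentences that are joined and hashed, and that each page carries a single numeric score standing in for PageRank or document significance. SHA-256 is used only for illustration, since the slides do not name a hash function. `description_hash`, `deduplicate`, and the example URLs are hypothetical.

```python
import hashlib


def description_hash(sentences):
    """Hash a description (list of sentences) into a hex digest."""
    joined = " ".join(sentences)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def deduplicate(pages):
    """Keep one page per description hash, preferring the higher score.

    pages is an iterable of (url, description_sentences, score) tuples.
    Returns {hash: (url, score)} for the pages that survive.
    """
    kept = {}
    for url, sentences, score in pages:
        key = description_hash(sentences)
        # A hash already in the table means this page duplicates an earlier one.
        if key not in kept or score > kept[key][1]:
            kept[key] = (url, score)
    return kept


# Hypothetical crawl results: the first two pages share a description.
PAGES = [
    ("a.example/page", ["Phrases index documents.", "Phrases rank documents."], 0.4),
    ("b.example/copy", ["Phrases index documents.", "Phrases rank documents."], 0.9),
    ("c.example/other", ["A different description."], 0.2),
]

SURVIVORS = deduplicate(PAGES)
print(sorted(url for url, _ in SURVIVORS.values()))
```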
  • 14.  
  • 15.
    • Functions of Indexing system
    • Identifies phrases in documents
    • Indexes documents according to their phrases by accessing various websites
    • Functions of Front End Server
    • Receives queries from a user
    • Provides those queries to the search system
  • 16.
    • Functions of Searching System
    • Searches for documents relevant to the search query
    • Identifies the phrases in the search query
    • Ranks the documents
    • Functions of Presentation system
    • Modifies the search results, including removing duplicate content
    • Generating topical descriptions of documents and provides modified
  • 17. Spam Detection
    • "Spam" pages have little meaningful content, but may instead be made up of large collections of popular words and phrases. These are sometimes referred to as "keyword stuffed pages".
    • Pages containing specific words and phrases that advertisers might be interested in are often called "honeypots," and are created for search engines to display along with paid advertisements.
  • 18. Cont…
    • A phrase based indexing system knows the number of related phrases in a document.
    • A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection.
    • A spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases.
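The related-phrase count lends itself to a simple threshold test. The sketch below assumes a document is a set of phrases and uses the lower bound of the spam range quoted above (100) as a hypothetical cutoff; a real system would tune this to the document collection. `count_related_phrases` and `is_probable_spam` are hypothetical names.

```python
def count_related_phrases(doc_phrases, related):
    """Count distinct phrases in the document that are related to another phrase in it.

    doc_phrases is the set of phrases found in the document.
    related maps a phrase to an iterable of its related phrases.
    """
    found = set()
    for phrase in doc_phrases:
        found.update(r for r in related.get(phrase, ()) if r in doc_phrases)
    return len(found)


def is_probable_spam(doc_phrases, related, threshold=100):
    """Flag a document whose related-phrase count reaches the threshold."""
    return count_related_phrases(doc_phrases, related) >= threshold


# Hypothetical data: phrase "p0" has 150 related phrases, "p1" through "p150".
RELATED_MAP = {"p0": [f"p{i}" for i in range(1, 151)]}
NORMAL_DOC = {f"p{i}" for i in range(0, 11)}     # contains 10 related phrases
STUFFED_DOC = {f"p{i}" for i in range(0, 151)}   # contains all 150 related phrases

print(is_probable_spam(NORMAL_DOC, RELATED_MAP))
print(is_probable_spam(STUFFED_DOC, RELATED_MAP))
```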
  • 19. Advantages of Phrase Based Indexing
    • Detecting Duplicate Pages
    • Spam Detection
    • Save time
  • 20. Other Patent Applications
    • Phrase identification in an information retrieval system
    • Phrase-based searching in an information retrieval system
    • Phrase-based generation of document descriptions
    • Detecting spam documents in a phrase based information retrieval system
    • Efficient Phrase Based Document Indexing for Document Clustering
  • 21.
    • According to data collected from users of European Web analytics provider OneStat, most people use 2- or 3-word queries in search engines:
    • Two-word phrases: 28.38 percent
    • Three-word phrases: 27.15 percent
    • Four-word phrases: 16.42 percent
    • One-word phrases: 13.48 percent
    • Five-word phrases: 8.03 percent
    • Six-word phrases: 3.67 percent
    • Seven-word phrases: 1.63 percent
    • Eight-word phrases: 0.73 percent
    • Nine-word phrases: 0.34 percent
    • Ten-word phrases: 0.16 percent
  • 22. Thank you