Phrase Based Indexing By Bala Abirami
Phrase based Indexing and Information Retrieval
Published in: Technology

  1. Phrase Based Indexing By Bala Abirami
  2. • Introduction to Phrase Based Indexing
     • What is Phrase Based Indexing?
     • Background of the Invention
     • Summary of the Invention
     • Spam Detection
  3. Introduction
     • An information retrieval system uses phrases to index, retrieve, organize, and describe documents.
     • The approach comes from a US patent application submitted by Google engineer Anna Lynn Patterson.
     • Application filed: July 2004
     • Published: January 2006
  4. Background of Invention
     • Information retrieval systems, generally called search engines, are now an essential tool for finding information in large-scale, diverse, and growing corpora such as the Internet.
     • A document is retrieved in response to a query containing a number of query terms, typically because some number of those terms are present in the document.
     • The retrieved documents are then ranked according to other statistical measures, such as frequency of occurrence of the query terms, host domain, link analysis, and the like.
  5. Background of Invention (cont.)
     • Concepts are often expressed in phrases, such as "Australian Shepherd," "President of the United States," or "Sundance Film Festival."
     • Accordingly, there is a need for an information retrieval system and methodology that can identify phrases, index documents according to phrases, and search and rank documents in accordance with their phrases.
  6. Summary
     An information retrieval system and methodology uses phrases to index, search, rank, and describe documents in the document collection.
     1. Identifying Phrases and Related Phrases
     2. Indexing Documents w.r.t. Phrases
     3. Ranking Documents w.r.t. Phrases
     4. Creating a description for the document
     5. Elimination of Duplicate Documents
  7. Identifying Phrases and Related Phrases
     • Phrases are identified based on a phrase's ability to predict the presence of other phrases in a document.
     • The system looks for phrases that have frequent and/or distinguished (unique) usage.
     • A prediction measure is used to identify related phrases.
     • The prediction measure relates the actual co-occurrence rate of two phrases to their expected co-occurrence rate.
     • Information gain = actual co-occurrence rate : expected co-occurrence rate
  8. Identifying Phrases and Related Phrases (cont.)
     • Two phrases are related to each other when the prediction measure exceeds the prediction threshold.
     • Example: the phrase "President of the United States" predicts related phrases such as "White House" and "George Bush."
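The prediction measure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method from the patent application: each document is reduced to a set of phrases, the expected co-occurrence rate is taken to be the product of the two individual phrase rates (an independence assumption), and the function name `related_phrases` and the default threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def related_phrases(docs, threshold=1.0):
    """Return {(phrase_a, phrase_b): gain} for pairs whose gain exceeds threshold.

    docs is a list of sets of phrases (hypothetical input format).
    """
    n = len(docs)
    phrase_count = Counter()
    pair_count = Counter()
    for phrases in docs:
        phrase_count.update(phrases)
        # Sort so each unordered pair is always counted under the same key.
        pair_count.update(combinations(sorted(phrases), 2))

    related = {}
    for (a, b), together in pair_count.items():
        actual = together / n
        # Assumption: expected rate if the two phrases occurred independently.
        expected = (phrase_count[a] / n) * (phrase_count[b] / n)
        gain = actual / expected
        if gain > threshold:
            related[(a, b)] = gain
    return related
```

With a small collection in which "president of the united states" and "white house" usually appear together, only that pair clears a threshold of 1.0; pairs that co-occur less often than chance would suggest are dropped.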
  9. Indexing Documents Based on Related Phrases
     • The information retrieval system indexes documents in the document collection by their valid, or good, phrases.
     • Posting list = the documents that contain the phrase.
     • Second list = data indicating which of the related phrases of the given phrase are also present in each document containing the given phrase.
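The two lists can be pictured as a pair of dictionaries. The sketch below is a minimal in-memory illustration, assuming documents arrive as sets of phrases and that a `related` mapping has already been computed; the names `build_index`, `posting`, and `related_present` are hypothetical and say nothing about how the actual system stores its index.

```python
from collections import defaultdict


def build_index(docs, related):
    """docs: {doc_id: set of phrases}; related: {phrase: list of related phrases}."""
    posting = defaultdict(list)          # phrase -> doc ids containing it
    related_present = defaultdict(dict)  # phrase -> {doc id -> related phrases found}
    for doc_id in sorted(docs):
        phrases = docs[doc_id]
        for phrase in phrases:
            posting[phrase].append(doc_id)
            # Record which related phrases of this phrase also occur in the document.
            related_present[phrase][doc_id] = [
                r for r in related.get(phrase, []) if r in phrases
            ]
    return posting, related_present
```

Looking up a phrase in `posting` gives the candidate documents, and `related_present` then tells, per document, how many of its related phrases came along.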
  10. Ranking
      • Ranking documents is based on two factors:
        1. Ranking documents based on contained phrases
        2. Ranking documents based on anchor phrases
      • Document Score = Body Hit Score + Anchor Hit Score
      • For example: Body Hit Score = 0.30, Anchor Hit Score = 0.70
      • Document Score = 0.30 + 0.70 = 1.00
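The scoring rule is a plain sum, so a sketch is short. The function names and the `(body, anchor)` tuple layout are assumptions made for illustration; how each hit score is derived is outside what the slide states.

```python
def document_score(body_hit_score, anchor_hit_score):
    """Combine the two hit scores as a simple sum."""
    return body_hit_score + anchor_hit_score


def rank(scores):
    """scores: {doc_id: (body_hit_score, anchor_hit_score)}; best document first."""
    return sorted(scores, key=lambda d: document_score(*scores[d]), reverse=True)
```

For the slide's example, `document_score(0.30, 0.70)` gives a combined score of 1.0 (up to floating-point rounding).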
  11. Phrase Extension
      • The information retrieval system is also adapted to use phrases when searching for documents in response to a query.
      • A user may enter an incomplete phrase in a search query, such as "President of the." Incomplete phrases such as these may be identified and replaced by a phrase extension, such as "President of the United States."
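One way to picture phrase extension is as a prefix lookup against the known phrases. The sketch below assumes a hypothetical `known_phrases` mapping from phrase to a usage count and picks the most frequent match; that selection rule is an assumption, not something the slides specify.

```python
def extend_phrase(query, known_phrases):
    """Return the most frequent known phrase that extends query, else query itself.

    known_phrases: {phrase: usage count} (hypothetical input format).
    """
    prefix = query.lower().strip()
    candidates = [
        (count, phrase)
        for phrase, count in known_phrases.items()
        # Require a following word so a complete phrase is not "extended" to itself.
        if phrase.lower().startswith(prefix + " ")
    ]
    if not candidates:
        return query
    return max(candidates)[1]
```

A query with no matching longer phrase is returned unchanged, so the function is safe to apply to every incoming query.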
  12. Descriptions for Documents
      • Phrase information is used to create a description of a document.
      • The system identifies query phrases, related phrases, and phrase extensions in each sentence and keeps a count for each sentence.
      • It ranks the sentences based on the count.
      • It selects some number of top-ranking sentences as the description and includes it in the search results.
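The count-and-rank steps above can be sketched as follows. This is a simplified illustration: sentences are split with a naive regular expression, matching is case-insensitive substring counting, and ties keep document order. The name `describe` and the `top_n` parameter are hypothetical.

```python
import re


def describe(text, phrases, top_n=2):
    """Pick the top_n sentences containing the most phrase occurrences."""
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for position, sentence in enumerate(sentences):
        lowered = sentence.lower()
        count = sum(lowered.count(p.lower()) for p in phrases)
        # Negative count sorts high counts first; position breaks ties.
        scored.append((-count, position, sentence))
    scored.sort()
    return [sentence for _, _, sentence in scored[:top_n]]
```

Here `phrases` would hold the query phrases together with their related phrases and phrase extensions.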
  13. Eliminating Duplicate Documents
      • Duplicate documents are identified and eliminated while crawling a document or when processing the search query.
      • The description is stored in association with every document in a hash table.
      • The system checks the newly crawled page against the values stored in the hash table. If it finds a match, the current document is a duplicate.
      • The system keeps the document that has the higher page rank or greater document significance and removes the duplicate, which will not appear in future search results for any query.
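A minimal sketch of this duplicate check, assuming each document is a dictionary with hypothetical `id`, `description`, and `page_rank` fields. Hashing the description with SHA-256 is a choice made for the example; the slides do not name a hash function.

```python
import hashlib


def remove_duplicates(docs):
    """Keep one document per distinct description, preferring higher page rank."""
    best = {}  # description hash -> document kept so far
    for doc in docs:
        key = hashlib.sha256(doc["description"].encode("utf-8")).hexdigest()
        kept = best.get(key)
        # A hash match means a duplicate; keep whichever has the higher page rank.
        if kept is None or doc["page_rank"] > kept["page_rank"]:
            best[key] = doc
    return list(best.values())
```

Note that this only catches documents whose descriptions are identical; near-duplicates with slightly different descriptions would pass through.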
  14. Functions of Indexing System
      • Identifies phrases in documents
      • Indexes documents according to their phrases by accessing various websites
      Functions of Front End Server
      • Receives queries from a user
      • Provides those queries to the search system
  15. Functions of Searching System
      • Searches for documents relevant to the search query
      • Identifies the phrases in the search query
      • Ranks the documents
      Functions of Presentation System
      • Modifies the search results, including removing duplicate content
      • Generates topical descriptions of documents and provides the modified search results
  16. Spam Detection
      • "Spam" pages have little meaningful content, but may instead be made up of large collections of popular words and phrases. These are sometimes referred to as "keyword stuffed pages."
      • Pages containing specific words and phrases that advertisers might be interested in are often called "honeypots," and are created for search engines to display along with paid advertisements.
  17. Spam Detection (cont.)
      • A phrase based indexing system knows the number of related phrases in a document.
      • A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection.
      • A spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases.
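The related-phrase count lends itself to a simple threshold test. The sketch below is an illustration only: the `related` mapping, the function names, and the cutoff of 100 (taken from the lower end of the spam range quoted above) are assumptions, and a real system would likely tune the cutoff per document collection.

```python
def count_related_phrases(doc_phrases, related):
    """Count distinct phrases in the document that are related to another phrase in it.

    related: {phrase: set of related phrases} (hypothetical input format).
    """
    present = set(doc_phrases)
    found = set()
    for phrase in present:
        found |= related.get(phrase, set()) & present
    return len(found)


def looks_like_spam(doc_phrases, related, threshold=100):
    """Flag documents whose related-phrase count reaches the threshold."""
    return count_related_phrases(doc_phrases, related) >= threshold
```

A page with around ten related phrases falls well under the cutoff, while a keyword-stuffed page carrying 150 mutually related phrases is flagged.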
  18. Advantages of Phrase Based Indexing
      • Detects duplicate pages
      • Detects spam
      • Saves time
  19. Other Patent Applications
      • Phrase identification in an information retrieval system
      • Phrase-based searching in an information retrieval system
      • Phrase-based generation of document descriptions
      • Detecting spam documents in a phrase based information retrieval system
      • Efficient Phrase Based Document Indexing for Document Clustering
  20. According to data collected from users of European Web analytics provider OneStat, most people use 2- or 3-word queries in search engines:
      • Two-word phrases -- 28.38 percent
      • Three-word phrases -- 27.15 percent
      • Four-word phrases -- 16.42 percent
      • One-word phrases -- 13.48 percent
      • Five-word phrases -- 8.03 percent
      • Six-word phrases -- 3.67 percent
      • Seven-word phrases -- 1.63 percent
      • Eight-word phrases -- 0.73 percent
      • Nine-word phrases -- 0.34 percent
      • Ten-word phrases -- 0.16 percent
  21. Thank you
