Phrase Based Indexing and Information Retrieval

Slides on phrase-based indexing and information retrieval.

Published in: Technology, Business

  1. 1. Phrase Based Indexing By Bala Abirami
  2. 2. <ul><li>Introduction to Phrase Based Indexing </li></ul><ul><li>What is Phrase Based Indexing? </li></ul><ul><li>Background of the Invention </li></ul><ul><li>Summary of the Invention </li></ul><ul><li>Spam Detection </li></ul>
  3. 3. Introduction <ul><li>An information retrieval system uses phrases to index, retrieve, organize and describe documents. </li></ul><ul><li>The approach comes from a US patent application submitted by Google engineer Anna Lynn Patterson. </li></ul><ul><li>Application filed: July 2004 </li></ul><ul><li>Published: January 2006 </li></ul>
  4. 4. Background of Invention <ul><li>Information retrieval systems, generally called search engines, are now an essential tool for finding information in large scale, diverse, and growing corpuses such as the Internet. </li></ul><ul><li>A document is retrieved in response to a query containing a number of query terms, typically based on having some number of query terms present in the document. </li></ul><ul><li>The retrieved documents are then ranked according to other statistical measures, such as frequency of occurrence of the query terms, host domain, link analysis, and the like </li></ul>
  5. 5. Cont… <ul><li>Concepts are often expressed in phrases, such as &quot;Australian Shepherd,&quot; &quot;President of the United States,&quot; or &quot;Sundance Film Festival&quot;. </li></ul><ul><li>Accordingly, there is a need for an information retrieval system and methodology that can identify phrases, index documents according to phrases, and search and rank documents in accordance with their phrases. </li></ul>
  6. 6. Summary <ul><li>An information retrieval system and methodology uses phrases to index, search, rank, and describe documents in the document collection. </li></ul><ul><li>1. Identifying phrases and related phrases </li></ul><ul><li>2. Indexing documents with respect to phrases </li></ul><ul><li>3. Ranking documents with respect to phrases </li></ul><ul><li>4. Creating a description for each document </li></ul><ul><li>5. Eliminating duplicate documents </li></ul>
  7. 7. Identifying Phrases and Related Phrases <ul><li>Phrases are identified based on their ability to predict the presence of other phrases in a document. </li></ul><ul><li>The system looks for phrases that have frequent and/or distinctive usage. </li></ul><ul><li>A prediction measure is used to identify related phrases. </li></ul><ul><li>The prediction measure relates the actual co-occurrence rate of two phrases to their expected co-occurrence rate. </li></ul><ul><li>Information gain = actual co-occurrence rate / expected co-occurrence rate </li></ul>
  8. 8. Cont… <ul><li>Two phrases are related to each other when the prediction measure exceeds the prediction threshold. </li></ul><ul><li>Example: </li></ul><ul><li>The phrase “President of the United States” predicts related phrases such as “White House” and “George Bush”. </li></ul>
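The prediction measure from slides 7 and 8 can be sketched in a few lines of Python. This is only an illustration, not the patent's implementation: the function name `related_phrases`, the default threshold of 1.5, and the choice to estimate the expected rate as the product of the two phrases' individual document rates are all assumptions made for the example.

```python
from collections import Counter
from itertools import permutations


def related_phrases(docs, threshold=1.5):
    """Return {(j, k): gain} for phrase pairs whose gain exceeds the threshold.

    docs is a list of sets, one set of phrases per document.
    The threshold value is a placeholder, not a figure from the slides.
    """
    n = len(docs)
    doc_freq = Counter()  # number of documents containing each phrase
    co_freq = Counter()   # number of documents containing both j and k
    for phrases in docs:
        doc_freq.update(phrases)
        co_freq.update(permutations(sorted(phrases), 2))

    related = {}
    for (j, k), together in co_freq.items():
        actual = together / n
        # Assumed estimate: the rate expected if j and k occurred independently.
        expected = (doc_freq[j] / n) * (doc_freq[k] / n)
        gain = actual / expected
        if gain > threshold:
            related[(j, k)] = gain
    return related


P = "president of the united states"
W = "white house"
A = "australian shepherd"
docs = [{P, W}, {P, W}, {A, W}, {A}, {A}, {A}]
print(related_phrases(docs))
```

With this toy collection, "president of the united states" and "white house" co-occur twice as often as independence would predict, while "australian shepherd" and "white house" fall below the threshold.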
  9. 9. Indexing documents based on related Phrases <ul><li>An information retrieval system indexes documents in the document collection by the valid or good phrases. </li></ul><ul><li>Posting List = documents that contain the phrase </li></ul><ul><li>Second List = used to store data indicating which of the related phrases of the given phrase are also present in each document containing the given phrase </li></ul>
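A minimal sketch of the two lists on slide 9, assuming phrases and their related phrases have already been identified. The names `build_index`, `docs`, and `related` are hypothetical; a dictionary keyed by document ID stands in for the posting list, and the value stored for each document plays the role of the second list.

```python
from collections import defaultdict


def build_index(docs, related):
    """Index documents by phrase.

    docs:    {doc_id: set of phrases found in that document}
    related: {phrase: list of its related phrases}
    Returns  {phrase: {doc_id: [related phrases also present in doc_id]}}
    """
    index = defaultdict(dict)
    for doc_id, phrases in docs.items():
        for phrase in phrases:
            # "Second list": which related phrases of this phrase occur here too.
            present = [r for r in related.get(phrase, []) if r in phrases]
            index[phrase][doc_id] = present
    return index


docs = {
    1: {"president of the united states", "white house"},
    2: {"president of the united states"},
}
related = {"president of the united states": ["white house", "george bush"]}
index = build_index(docs, related)

# Posting list for a phrase: the documents that contain it.
print(sorted(index["president of the united states"]))
# Related-phrase data for document 1.
print(index["president of the united states"][1])
```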
  10. 10. Ranking <ul><li>Document ranking is based on two factors: </li></ul><ul><li>1. Ranking documents based on contained phrases </li></ul><ul><li>2. Ranking documents based on anchor phrases </li></ul><ul><li>Document Score = Body Hit Score + Anchor Hit Score </li></ul><ul><li>For example: Body Hit Score = 0.30, Anchor Hit Score = 0.70 </li></ul><ul><li>Document Score = 0.30 + 0.70 = 1.00 </li></ul>
  11. 11. Phrase Extension <ul><li>The information retrieval system is also adapted to use the phrases when searching for documents in response to a query. </li></ul><ul><li>A user may enter an incomplete phrase in a search query, such as &quot;President of the&quot;. </li></ul><ul><li>Incomplete phrases such as these may be identified and replaced by a phrase extension, such as &quot;President of the United States.&quot; </li></ul>
  12. 12. Descriptions for Documents <ul><li>Phrase information is used to create a description of a document. </li></ul><ul><li>For each sentence, the system counts the query phrases, related phrases, and phrase extensions that the sentence contains. </li></ul><ul><li>It ranks the sentences by that count. </li></ul><ul><li>It selects some number of top-ranking sentences as the description and includes it in the search results. </li></ul>
  13. 13. Eliminating Duplicate Documents <ul><li>Duplicate documents are identified and eliminated while crawling or when processing a search query. </li></ul><ul><li>The description of every document is stored, as a hash value, in a hash table. </li></ul><ul><li>The system compares the hash of a newly crawled page's description with the values stored in the hash table. A match indicates that the current document is a duplicate. </li></ul><ul><li>The system keeps the document with the higher page rank or greater document significance and removes the duplicate, which will not appear in future search results for any query. </li></ul>
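The arithmetic on slide 10 can be written out as a short Python sketch. The function names `document_score` and `rank` are hypothetical, and the scores for documents "b" and "c" are made-up sample values; only the 0.30 and 0.70 figures come from the slide.

```python
def document_score(body_hit_score, anchor_hit_score):
    """Combine the two factors from the slide into one score."""
    return body_hit_score + anchor_hit_score


def rank(scores):
    """Order document IDs from highest to lowest document score.

    scores: {doc_id: (body_hit_score, anchor_hit_score)}
    """
    return sorted(scores, key=lambda d: document_score(*scores[d]), reverse=True)


scores = {
    "a": (0.30, 0.70),  # the slide's example
    "b": (0.50, 0.10),  # invented sample values
    "c": (0.90, 0.40),  # invented sample values
}
print(rank(scores))
```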
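Phrase extension can be illustrated with a simple prefix lookup. The sketch below assumes a table of known phrases with usage counts and picks the most frequent phrase that begins with the incomplete query; the slides do not say how an extension is chosen, so that selection rule, the function name `extend_phrase`, and the counts are assumptions.

```python
def extend_phrase(query, known_phrases):
    """Return a known phrase that extends the query, or the query unchanged.

    known_phrases: {phrase: frequency}. Choosing the most frequent
    candidate is an assumption made for this example.
    """
    prefix = query.strip().lower() + " "
    candidates = [p for p in known_phrases if p.lower().startswith(prefix)]
    if not candidates:
        return query
    return max(candidates, key=lambda p: known_phrases[p])


known = {
    "President of the United States": 120,  # invented counts
    "President of the Senate": 15,
    "Australian Shepherd": 40,
}
print(extend_phrase("President of the", known))
```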
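The sentence-counting steps on slide 12 might look like the following in Python. It is a rough sketch: the function name `describe`, the regex used to split sentences, and plain case-insensitive substring matching are all simplifying assumptions.

```python
import re


def describe(text, phrases, top_n=2):
    """Pick the top_n sentences that contain the most of the given phrases.

    phrases should hold the query phrases plus their related phrases
    and phrase extensions.
    """
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def count(sentence):
        lower = sentence.lower()
        return sum(1 for p in phrases if p.lower() in lower)

    # sorted() is stable, so ties keep their original document order.
    return sorted(sentences, key=count, reverse=True)[:top_n]


text = (
    "The White House is in Washington. "
    "The President of the United States lives in the White House. "
    "Dogs are nice."
)
print(describe(text, ["President of the United States", "White House"]))
```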
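One way to read the duplicate-elimination slide is as a hash-table lookup keyed on each document's description. The sketch below takes that reading; the use of SHA-256, the names `description_hash` and `dedupe`, and the sample page-rank values are assumptions, not details from the slides.

```python
import hashlib


def description_hash(description_sentences):
    """Hash a document's description (its top-ranked sentences)."""
    joined = " ".join(description_sentences)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def dedupe(pages):
    """Keep one document per description hash, preferring higher page rank.

    pages: iterable of (doc_id, description_sentences, page_rank).
    Returns the set of doc_ids that survive.
    """
    kept = {}  # description hash -> (doc_id, page_rank)
    for doc_id, sentences, page_rank in pages:
        key = description_hash(sentences)
        if key not in kept or page_rank > kept[key][1]:
            kept[key] = (doc_id, page_rank)
    return {doc_id for doc_id, _ in kept.values()}


pages = [
    ("a", ["Phrases index documents."], 0.2),
    ("b", ["Phrases index documents."], 0.9),  # same description as "a"
    ("c", ["Spam pages stuff keywords."], 0.1),
]
print(dedupe(pages))
```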
  14. 15. <ul><li>Functions of the Indexing System </li></ul><ul><li>Identifies phrases in documents </li></ul><ul><li>Indexes documents according to their phrases by accessing various websites </li></ul><ul><li>Functions of the Front End Server </li></ul><ul><li>Receives queries from a user </li></ul><ul><li>Provides those queries to the search system </li></ul>
  15. 16. <ul><li>Functions of the Search System </li></ul><ul><li>Searches for documents relevant to the search query </li></ul><ul><li>Identifies the phrases in the search query </li></ul><ul><li>Ranks the documents </li></ul><ul><li>Functions of the Presentation System </li></ul><ul><li>Modifies the search results, including removing duplicate content </li></ul><ul><li>Generates topical descriptions of documents and provides the modified search results </li></ul>
  16. 17. Spam Detection <ul><li>“Spam” pages have little meaningful content, but may instead be made up of large collections of popular words and phrases. These are sometimes referred to as “keyword stuffed pages”. </li></ul><ul><li>Pages containing specific words and phrases that advertisers might be interested in are often called “honeypots,” and are created for search engines to display along with paid advertisements. </li></ul>
  17. 18. Cont… <ul><li>A phrase based indexing system knows the number of related phrases in a document. </li></ul><ul><li>A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection. </li></ul><ul><li>A spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases. </li></ul>
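The related-phrase count described on slides 17 and 18 suggests a simple threshold check. The Python sketch below is illustrative only: the function names and the `related` mapping are hypothetical, and the default cutoff of 100 is taken from the lower end of the spam range quoted on the slide rather than from any stated rule.

```python
def count_related_phrases(doc_phrases, related):
    """Count distinct phrases in the document that are related to
    some other phrase in the same document.

    doc_phrases: set of phrases found in the document
    related:     {phrase: iterable of its related phrases}
    """
    found = set()
    for phrase in doc_phrases:
        found.update(r for r in related.get(phrase, ()) if r in doc_phrases)
    return len(found)


def looks_like_spam(doc_phrases, related, max_expected=100):
    """Flag a document whose related-phrase count reaches the cutoff.

    max_expected=100 is an assumed cutoff based on the slide's ranges.
    """
    return count_related_phrases(doc_phrases, related) >= max_expected


related = {"a": ["b", "c"], "b": ["a"]}
doc = {"a", "b", "c"}
print(count_related_phrases(doc, related))
print(looks_like_spam(doc, related))
```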
  18. 19. Advantages of Phrase Based Indexing <ul><li>Detects duplicate pages </li></ul><ul><li>Detects spam </li></ul><ul><li>Saves time </li></ul>
  19. 20. Other Patent Applications <ul><li>Phrase identification in an information retrieval system </li></ul><ul><li>Phrase-based searching in an information retrieval system </li></ul><ul><li>Phrase-based generation of document descriptions </li></ul><ul><li>Detecting spam documents in a phrase based information retrieval system </li></ul><ul><li>Efficient Phrase Based Document Indexing for Document Clustering </li></ul>
  20. 21. <ul><li>According to data collected from users of European Web analytics provider OneStat, most people use 2- or 3-word queries in search engines: </li></ul><ul><li>Two-word phrases: 28.38 percent </li></ul><ul><li>Three-word phrases: 27.15 percent </li></ul><ul><li>Four-word phrases: 16.42 percent </li></ul><ul><li>One-word phrases: 13.48 percent </li></ul><ul><li>Five-word phrases: 8.03 percent </li></ul><ul><li>Six-word phrases: 3.67 percent </li></ul><ul><li>Seven-word phrases: 1.63 percent </li></ul><ul><li>Eight-word phrases: 0.73 percent </li></ul><ul><li>Nine-word phrases: 0.34 percent </li></ul><ul><li>Ten-word phrases: 0.16 percent </li></ul>
  21. 22. Thank you
