Grouper

Published in: Education

  1. 1. TECHNICAL SEMINAR ON GROUPER: A DYNAMIC CLUSTERING INTERFACE TO WEB SEARCH RESULTS. BY PREET KANWAL, Dr. AMBEDKAR INSTITUTE OF TECHNOLOGY, BANGALORE-56
  2. 2. OUTLINE: Problem Definition. Proposed Solution & Goals. How Grouper Works. Empirical Evaluation. Conclusion.
  3. 3. PROBLEM DEFINITION: Search engine results are not easy to browse.
  4. 4. Problem with search engines: Search engines return a long, ordered list of document “snippets”.
  5. 5. Disadvantages: Ranked-list presentation forces users to sift through the results to find relevant documents. Time is wasted. Precision is low.
  6. 6. Document clustering: An alternative method for organizing retrieval results. Algorithms group the documents based on their similarities. Advantages: Relevant documents are easy to locate. Gives an overview of the retrieved document set.
  7. 7. Document Clustering: Pre-retrieval method. Post-retrieval method.
  8. 8. Post-retrieval Document Clustering: Superior results. Clusters are computed from the returned document set. Cluster boundaries appropriately partition the set of documents at hand.
  9. 9. Pre-retrieval Document Clustering: Offline clustering of documents. Clustering is performed in advance on the collection as a whole. Clusters might be based on features that are infrequent in the retrieved set.
  10. 10. Problem with search engines: Severe resource constraints. They cannot dedicate enough CPU time to each query, so clustering at query time is NOT FEASIBLE. Hence clusters have to be PRE-COMPUTED.
  11. 11. PROPOSED SOLUTION: GROUPER, a document clustering interface to the HuskySearch meta-search service. HuskySearch meta-search engine: Based on MetaCrawler. Retrieves results from several popular web search engines. Results are clustered using the STC algorithm.
  12. 12. Advantages: Easily browsable. Addresses the scalability issue. No additional resource demands on the search engine. Fast. Runs on the client machine. Suitable for distributed IR systems.
  13. 13. Goals
      1) Coherent Clusters: Group similar documents together. Generate overlapping clusters when appropriate.
      2) Efficiently Browsable: Cluster descriptions must be concise and accurate.
      3) Speed: Algorithmic speed. Snippet tolerance. Clustering can be done in 2 ways: a) Clustering snippets. b) Download and cluster.
  14. 14. Overview of the STC (Suffix Tree Clustering) Algorithm: A linear-time clustering algorithm. Based on identifying phrases common to a group of documents. PHRASE: An ordered sequence of one or more words. BASE CLUSTER: A set of documents that share a common phrase.
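
The definitions of phrase and base cluster can be illustrated with a small Python sketch. It does not build a suffix tree: as a simplifying assumption it enumerates word n-grams up to a hypothetical `max_len` and indexes them in a dictionary, which finds the same base clusters for phrases up to that length but is not linear time. The name `base_clusters`, its parameters, and the rule that a base cluster needs at least two documents are illustrative choices, not taken from the slides.

```python
from collections import defaultdict


def base_clusters(docs, max_len=3, min_docs=2):
    """Map each phrase (tuple of words) to the set of document ids containing it.

    Assumption: word n-grams up to max_len stand in for the suffix tree.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for start in range(len(words)):
            for length in range(1, max_len + 1):
                if start + length > len(words):
                    break
                index[tuple(words[start:start + length])].add(doc_id)
    # Keep only phrases shared by at least min_docs documents.
    return {phrase: ids for phrase, ids in index.items() if len(ids) >= min_docs}


snippets = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
clusters = base_clusters(snippets)
```

Here the phrase `("ate",)` groups all three snippets, while `("cat", "ate")` groups only the first and third, so a document can belong to several base clusters at once.
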
  15. 15. STC has 3 logical steps
      1) Document “cleaning”: Transformation using a light stemming algorithm. Sentence boundaries are marked; non-word tokens are stripped. Eg: “Hello..!!” (sentence boundary; non-word token).
      2) Identification of Base Clusters: Inverted index of phrases, using a data structure called a SUFFIX TREE. Each base cluster is assigned a SCORE: SCORE(No. of doc’s, No. of words in phrase). A stoplist is maintained.
      3) Merging Base Clusters into clusters: Base clusters with a high degree of overlap are merged. Clusters are semantically coherent (shared phrases).
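
Scoring and merging of base clusters can be sketched in Python as follows. The slides only say that the score depends on the number of documents and the number of words in the phrase, and that base clusters with a high degree of overlap are merged. The product form of `score`, the cap of 6 words, the 0.5 overlap threshold, and all function names are assumptions made for illustration.

```python
def score(doc_ids, phrase, max_effective_len=6):
    """Hypothetical score: number of documents times capped phrase length."""
    return len(doc_ids) * min(len(phrase), max_effective_len)


def merge_base_clusters(base, threshold=0.5):
    """Merge base clusters whose document sets overlap heavily.

    Two base clusters are linked when the shared documents exceed `threshold`
    of each of them; clusters are the connected components of that graph.
    """
    phrases = list(base)
    parent = list(range(len(phrases)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            a, b = base[phrases[i]], base[phrases[j]]
            shared = len(a & b)
            if shared > threshold * len(a) and shared > threshold * len(b):
                parent[find(i)] = find(j)

    merged = {}
    for i, phrase in enumerate(phrases):
        group = merged.setdefault(find(i), {"phrases": [], "docs": set()})
        group["phrases"].append(phrase)
        group["docs"] |= base[phrase]
    return list(merged.values())


base = {
    ("cat", "ate"): {0, 2},
    ("ate",): {0, 1, 2},
    ("cheese",): {0, 1},
    ("salmon",): {3, 4},
}
clusters = merge_base_clusters(base)
```

In this toy input the first three base clusters end up in one cluster through chained overlaps, and `("salmon",)` stays on its own. The number of clusters falls out of the data and is not fixed in advance.
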
  16. 16. STC Characteristics: Overlapping clusters; shared phrases. Fast and incremental. Does not coerce the documents into a predefined number of clusters.
  17. 17. User Interface: Grouper’s Query Interface
  18. 18. A Query Result: Summary of clusters
  19. 19. Refine Query Based On This Cluster
  20. 20. DESIGN FOR SPEED: 3 characteristics that make Grouper fast:
      1) Incrementality of the Clustering Algorithm: STC is incremental. Grouper can use free CPU time. Produces the result immediately after the last document arrives.
      2) Efficient Implementation: STC performs a large no. of string comparisons. Each word is transformed into a unique integer for faster comparisons. Documents of each base cluster are encoded as a bit vector for efficient calculation of document overlap. Additional speedup: a) Remove leading and ending stopped words. Eg: “the vice president of” becomes “vice president”. b) Strip off words that do not appear in a minimal no. of documents.
      3) Ability to form coherent clusters based on snippets: 2 modes of clustering results: a) Cluster the snippets (fast). b) Download and cluster (high clustering quality).
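
The implementation ideas on this slide (words as unique integers, the documents of a base cluster as a bit vector, trimming stopped words) can be sketched in Python. Plain Python integers serve as bit vectors here. The helper names and the tiny stoplist are assumptions for illustration, not Grouper's actual code.

```python
STOPLIST = {"the", "of", "a", "an", "and"}  # hypothetical stoplist

word_ids = {}


def intern_word(word):
    """Return a unique integer for each distinct word."""
    return word_ids.setdefault(word, len(word_ids))


def to_bit_vector(doc_ids):
    """Encode a set of document ids as an integer with one bit per document."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def overlap(bits_a, bits_b):
    """Number of documents shared by two bit-vector-encoded base clusters."""
    return bin(bits_a & bits_b).count("1")


def trim_stopwords(phrase):
    """Remove leading and trailing stoplist words from a phrase."""
    words = phrase.split()
    while words and words[0] in STOPLIST:
        words.pop(0)
    while words and words[-1] in STOPLIST:
        words.pop()
    return " ".join(words)
```

With this encoding the overlap of two base clusters is a single bitwise AND followed by a bit count, and comparing two words is an integer comparison instead of a string comparison.
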
  21. 21. EMPIRICAL EVALUATION OF GROUPER
      Difficult: heterogeneous user population; searches for a wide variety of tasks.
      Documents retrieved in HuskySearch sessions were clustered using: the STC algorithm; the K-means clustering algorithm.
      Same no. of clusters. Calculate no. of clustered doc’s followed.
      STC produces coherent clusters: STC > K-means.
  22. 22. Comparison to a Ranked List Display: Compared with HuskySearch based on: 1) Number of documents followed. 2) Time spent. 3) Click distance.
  23. 23. No. of doc’s followed by users: 3 hypotheses made: 1) Easier to find interesting doc’s. 2) Helps find additional interesting doc’s. 3) Helps in tasks where several doc’s are required. The percentage of sessions in which users followed multiple documents is higher in Grouper.
  24. 24. Time spent on each doc followed
      Time spent = time spent in network delays + time in reading doc’s + time in traversing the results presentation.
      Time spent = time to download and view selected doc + time to find next doc of interest.
      Or: it is the time between a user’s request for a doc and the user’s previous request.
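
The last definition, the time between a request for a document and the user's previous request, is straightforward to compute from a request log. A minimal Python sketch, assuming a hypothetical list of request timestamps (in seconds) for one session:

```python
def time_spent_per_document(request_times):
    """Time attributed to each followed document: gap since the previous request."""
    return [later - earlier for earlier, later in zip(request_times, request_times[1:])]


session = [0, 40, 95, 100]  # hypothetical timestamps: the query, then three documents
gaps = time_spent_per_document(session)  # [40, 55, 5]
```
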
  25. 25. Click distance: Distance between a user’s successive clicks on the document set.
      In ranked list interface: Click distance = no. of snippets between 2 clicks.
      In clustering interface: Additional cost of skipping snippets. For any cluster visited, all snippets are scanned.
      [Figure: numbered snippets grouped into Cluster 1, Cluster 2, and Cluster 3; label: “22 snippets scanned”.]
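
The two click-distance measures can be sketched in Python under stated assumptions. The slide does not fix a counting convention, so the ranked-list distance is taken here as the difference in rank positions, and the clustering cost simply sums the sizes of the clusters visited. Both function names are hypothetical.

```python
def ranked_click_distance(prev_rank, next_rank):
    """Assumed convention: difference in rank positions of the two clicked snippets."""
    return abs(next_rank - prev_rank)


def clustered_snippets_scanned(visited_cluster_sizes):
    """Assumes every snippet in each visited cluster is scanned."""
    return sum(visited_cluster_sizes)
```
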
  26. 26. CONCLUSION
      • Grouper: Empirical assessment of user behavior given a clustering interface to web search results.
      • Comparison to the logs of HuskySearch.
      • Problems: 1) May fail to capture semantic distinctions that users expect while merging base clusters into clusters. 2) Difficult to navigate if the number of clusters is large.
      • Solution: Grouper II. 1) Allows users to view non-merged base clusters. 2) Supports a hierarchical and interactive interface.
