GROUPER: A DYNAMIC CLUSTERING
INTERFACE TO WEB SEARCH RESULTS
Dr. AMBEDKAR INSTITUTE OF TECHNOLOGY, BANGALORE-56
Proposed Solution & Goals
How does Grouper work?
Search engine results are not easy to browse
Problem with search engines
• Search engines return a long ordered list of documents.
Ranked list presentation.
Users are forced to sift through the list to find relevant documents.
Waste of time.
Alternative method for organizing retrieval results
Algorithms group the documents based on their similarity.
Easy to locate.
Overview of retrieved document set.
Post-retrieval document clustering
Clusters computed based on the returned document set.
Cluster boundaries appropriately partition the set of documents at hand.
Pre-Retrieval document clustering
Offline clustering of documents.
Document clustering performed in advance on the collection as a whole.
Might be based on features infrequent in the retrieved set.
Problem with search engines
Severe resource constraints.
Cannot dedicate enough CPU time to each
query – NOT FEASIBLE.
Hence clusters have to be PRE-COMPUTED.
Grouper: a clustering interface to the HuskySearch meta-search service.
HuskySearch meta-search engine:
Based on MetaCrawler.
Retrieves results from several popular web search engines.
Clusters results using STC algorithm.
Addresses scalability issue.
No additional resource demands on search engines.
Runs on client machine.
Suitable for distributed IR systems.
Group similar documents together; allow overlapping clusters when appropriate.
Cluster descriptions must be concise and accurate.
Clustering can be done in 2 ways:
a) Cluster the snippets.
b) Download and cluster.
Overview of STC Algorithm
Linear time clustering algorithm.
Based on identifying phrases common to groups of documents.
PHRASE: Ordered sequence of one or more words.
BASE CLUSTER: Set of documents that share a phrase.
STC has 3 logical steps
1) Transformation: using a light stemming algorithm.
Sentence boundaries are marked; non-word tokens are stripped.
2) Identification of base clusters:
Inverted index of phrases, using a data structure called a suffix tree.
Each base cluster is assigned a SCORE.
SCORE(no. of documents, no. of words in phrase).
Stoplist is maintained.
3) Merging base clusters into clusters:
Based on degree of overlap.
Clusters: coherent.
Overlapping clusters; shared phrases.
Fast and incremental.
Does not coerce the documents into a predefined number of clusters.
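The three steps above can be sketched roughly as follows. This is a minimal illustration, not Grouper's implementation: it swaps the suffix tree for a plain dictionary of word n-grams (so it is not linear time), and the stoplist, the `max_len` and `min_docs` limits, the exact scoring formula, and the 0.5 overlap threshold are all assumptions made for the example.

```python
from collections import defaultdict

# Illustrative stoplist (assumption, not the list used by Grouper).
STOPWORDS = {"a", "an", "the", "of", "to", "in", "and", "for", "on"}


def phrases(sentence, max_len=4):
    """Yield every contiguous word sequence of up to max_len words."""
    words = sentence.lower().split()
    for i in range(len(words)):
        for j in range(i + 1, min(i + max_len, len(words)) + 1):
            yield tuple(words[i:j])


def base_clusters(docs, min_docs=2):
    """Map each phrase shared by at least min_docs documents to their ids.

    A dictionary stands in for the suffix tree here.
    """
    index = defaultdict(set)
    for doc_id, sentences in enumerate(docs):
        for sentence in sentences:
            for phrase in phrases(sentence):
                index[phrase].add(doc_id)
    return {p: ids for p, ids in index.items() if len(ids) >= min_docs}


def score(phrase, doc_ids):
    """Assumed score: no. of documents times no. of non-stopped words (capped)."""
    effective = sum(1 for w in phrase if w not in STOPWORDS)
    return len(doc_ids) * min(effective, 6)


def merge(clusters, threshold=0.5):
    """Merge base clusters whose document sets overlap strongly."""
    keys = list(clusters)
    parent = {k: k for k in keys}

    def find(k):
        # Union-find lookup with path halving.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            common = len(clusters[a] & clusters[b])
            if (common / len(clusters[a]) > threshold
                    and common / len(clusters[b]) > threshold):
                parent[find(a)] = find(b)

    merged = defaultdict(set)
    for k in keys:
        merged[find(k)] |= clusters[k]
    return sorted(sorted(ids) for ids in merged.values())


docs = [
    ["suffix tree clustering"],
    ["suffix tree construction"],
    ["web search engine"],
    ["meta search engine"],
]
print(merge(base_clusters(docs)))
```

Each document is a list of sentences, so phrases never cross sentence boundaries. The number of final clusters falls out of the overlap test; it is not fixed in advance.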
DESIGN FOR SPEED
3 characteristics that make Grouper fast:
1) Incrementality of the clustering algorithm.
Grouper can use free CPU time.
Produces result immediately after last document arrives.
2) Efficient implementation of STC.
STC performs a large no. of string comparisons.
Each word transformed into a unique integer.
Faster comparisons.
Documents of each base cluster encoded as bit vector
for efficient calculation of document overlap.
Additional speedup:
a) Remove leading and ending stopped words. E.g.: "the vice president of" becomes "vice president".
b) Strip off words that do not appear in a minimal no. of documents.
3) Ability to form coherent clusters based on snippets.
2 modes of clustering results:
a) Cluster the snippets (fast).
b) Download and cluster (high clustering quality).
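A small sketch of the speedups listed above: words mapped to unique integers, base-cluster document sets held as bit vectors, and leading and ending stopped words removed. The function names and the stoplist passed in are hypothetical, chosen only for this illustration; Python integers stand in for fixed-width bit vectors.

```python
def encode_words(sentences):
    """Assign each distinct word a unique integer id, in order of first appearance."""
    vocab = {}
    encoded = []
    for sentence in sentences:
        encoded.append(
            [vocab.setdefault(w, len(vocab)) for w in sentence.lower().split()]
        )
    return vocab, encoded


def to_bits(doc_ids):
    """Encode a set of document ids as an integer bit vector."""
    bits = 0
    for d in doc_ids:
        bits |= 1 << d
    return bits


def overlap(bits_a, bits_b):
    """Count the documents shared by two bit-vector encoded base clusters."""
    return bin(bits_a & bits_b).count("1")


def strip_stopwords(phrase, stoplist):
    """Drop leading and ending stopped words from a phrase."""
    words = phrase.lower().split()
    while words and words[0] in stoplist:
        words.pop(0)
    while words and words[-1] in stoplist:
        words.pop()
    return " ".join(words)


vocab, encoded = encode_words(["cats eat cheese", "mice eat cheese"])
print(encoded)
print(overlap(to_bits({0, 2, 5}), to_bits({2, 5, 7})))
print(strip_stopwords("the vice president of", {"the", "of"}))
```

Comparing integers is cheaper than comparing strings, and the overlap of two base clusters reduces to one bitwise AND followed by a bit count.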
EMPIRICAL EVALUATION OF GROUPER
Heterogeneous user population.
Search for a wide variety of tasks.
Documents retrieved in HuskySearch sessions clustered using:
STC.
K-means clustering algorithm.
Same no. of clusters.
Calculate no. of clustered docs followed.
STC produces coherent clusters.
Comparison to a Ranked List
Compared with HuskySearch based on:
1. Number of documents followed
2. Time spent
3. Click distance
No. of doc’s followed by users
3 hypotheses made:
1) Easier to find interesting documents.
2) Helps find additional interesting documents.
3) Helps in tasks where several documents are required.
Percentage of sessions in which users followed multiple
documents is higher in Grouper.
Time spent on each doc followed
Time spent = time to download
+ time to view selected doc
+ time to find next doc of interest
Time spent = time spent in network delays + time in reading docs
+ time in traversing the results presentation.
It is the time between a user's request for a doc and the user's previous request.
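As a rough sketch, the time spent per document can be computed from a session log, assuming it is measured as the gap between each document request and the request before it. The log format (a list of request timestamps in seconds) and the function name are hypothetical.

```python
def time_spent(request_times):
    """Return the gap between each request and the one before it.

    request_times: timestamps (in seconds) of successive document
    requests within one session, in chronological order (assumed format).
    """
    return [later - earlier
            for earlier, later in zip(request_times, request_times[1:])]


print(time_spent([0, 12, 40]))
```

Each gap bundles download, viewing, and search-for-next-document time together; the log alone cannot separate them.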
Distance between successive user clicks on the document set.
In ranked list interface:
Click distance = no. of snippets between 2 clicks.
22 snippets scanned
In clustering interface:
Additional cost of skipping snippets.
For any cluster visited, all snippets are scanned.
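For the ranked list interface, click distance can be sketched as below. The sketch assumes clicks are logged as 1-based ranks and counts only the snippets strictly between two successive clicks; both choices are assumptions, and the clustering-interface variant is not modelled here.

```python
def click_distances(clicked_ranks):
    """Return the no. of snippets strictly between successive clicks.

    clicked_ranks: 1-based ranks of the snippets a user clicked,
    in click order (assumed log format).
    """
    return [max(abs(b - a) - 1, 0)
            for a, b in zip(clicked_ranks, clicked_ranks[1:])]


print(click_distances([1, 5, 3]))
```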
Empirical assessment of user behavior given a clustering interface
to web search results.
• Comparison to the logs of HuskySearch.
1) May fail to capture semantic distinctions that users expect while
merging base clusters into clusters.
2) Difficult to navigate if the number of clusters is large.
Solution: Grouper II
1)Allows users to view non merged base clusters.
2) Supports a hierarchical and interactive interface.