1. Aristotle University of Thessaloniki
School of Computer Science - Master Studies - Spring Semester
Course: Web Information Mining and Retrieval
Instructor: Vakali Athina
Kouroupetroglou
Praxitelis Nikolaos
Incremental Clustering
In Search Engines
2. Search engines and results retrieval
● Conventional document retrieval systems return long lists of ranked documents
● Search engines have low precision, so it is hard for users to find the information they are looking for
● Improvements: filtering methods, advanced pruning options, clustering
● (-) Most clustering algorithms rely on off-line clustering of the entire document collection
● Instead, clustering has to be applied to the much smaller set of documents returned in response to a query
3. Clustering and search engines - Key concepts
● Relevance: clusters should group the documents relevant to the user's query separately from the irrelevant ones
● Browsable Summaries: the user needs to see at a glance whether a cluster's contents are of interest
● Overlap: Since documents have multiple topics, it is important to avoid confining
each document to only one cluster
● Snippet-tolerance: the method should produce high-quality clusters even when it only has access to the snippets returned by the search engines, as most users are unwilling to wait while the system downloads the original documents off the Web
● Speed: fast clustering for impatient users
● Incrementality: To save time, the method should start to process each snippet as
soon as it is received over the Web.
4. Suffix Tree Clustering (STC)
● Developed at the Department of Computer Science and Engineering, University of Washington [1]
● A novel, incremental, O(n) time algorithm
● Treats a document as a string of words, not simply as a set of words
● This lets it use proximity information between words
● STC relies on a suffix tree to efficiently identify sets of documents that share common phrases
● It uses this information to create clusters and to summarize their contents
● Tested in practice with MetaCrawler-STC, a prototype that clusters MetaCrawler search results
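The idea of finding documents that share common phrases can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm from [1]: it uses an uncompressed word-level suffix trie instead of a true suffix tree, so it is not O(n), and it also reports prefixes such as "cat" that a compressed suffix tree would fold into "cat ate". The names `SuffixTrieNode`, `PhraseIndex`, `max_phrase_len` and `min_docs` are hypothetical.

```python
class SuffixTrieNode:
    """One node of a word-level suffix trie (hypothetical helper class)."""

    def __init__(self):
        self.children = {}    # next word -> child node
        self.doc_ids = set()  # documents containing the phrase that ends here


class PhraseIndex:
    """Naive stand-in for the suffix tree; assumes documents are already cleaned."""

    def __init__(self, max_phrase_len=6):
        self.root = SuffixTrieNode()
        self.max_phrase_len = max_phrase_len

    def add_document(self, doc_id, words):
        # Insert every suffix of the word list, truncated to max_phrase_len words.
        # Documents can be added one at a time, as they arrive.
        for start in range(len(words)):
            node = self.root
            for word in words[start:start + self.max_phrase_len]:
                node = node.children.setdefault(word, SuffixTrieNode())
                node.doc_ids.add(doc_id)

    def base_clusters(self, min_docs=2):
        # Walk the trie and keep each phrase shared by at least min_docs documents.
        clusters = {}
        stack = [((), self.root)]
        while stack:
            phrase, node = stack.pop()
            if phrase and len(node.doc_ids) >= min_docs:
                clusters[" ".join(phrase)] = set(node.doc_ids)
            for word, child in node.children.items():
                stack.append((phrase + (word,), child))
        return clusters


index = PhraseIndex()
index.add_document(0, "cat ate cheese".split())
index.add_document(1, "mouse ate cheese too".split())
index.add_document(2, "cat ate mouse too".split())
clusters = index.base_clusters()
print(sorted(clusters))
```

Each entry of `clusters` maps a shared phrase to the set of documents containing it, which is the role a base cluster plays in STC.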
5. STC Steps
● Step 1 - Document "Cleaning"
○ Light stemming (deleting word prefixes and suffixes, reducing plural to singular form)
○ Remove HTML tags
○ Transform each document into a string of words, keeping the original document string with pointers from each word to its position in the original
● Step 2 - Identifying Base Clusters
○ Build a suffix tree, constructed in linear time and incrementally as the documents are being read
○ Each node represents a phrase and holds the list of documents that share this phrase
● Step 3 - Combining Base Clusters
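○ Combine base clusters using a binary similarity function based on the overlap of their document sets
○ Similarity is 1 iff the overlap conditions are met (|Bm ∩ Bn| / |Bm| > 0.5 and |Bm ∩ Bn| / |Bn| > 0.5 in [1]), 0 otherwise
○ Usually only the top k clusters are kept, as these are the ones of interest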
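○ Score function: s(B) = |B| · f(|P|), where |B| is the number of documents in base cluster B and |P| is the number of words in its phrase P
● Images and functions from [1]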
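The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation from [1]: `clean` uses a deliberately naive plural rule in place of light stemming, the 0.5 overlap threshold follows [1], the numeric values inside `phrase_weight` are assumed (only its shape, penalising single words and growing linearly up to 6 words, comes from [1]), and merging is done as connected components of the similarity graph. All function names are hypothetical.

```python
import re


def clean(snippet):
    """Step 1: strip HTML tags, lowercase, and apply a naive plural-to-singular rule."""
    text = re.sub(r"<[^>]+>", " ", snippet)
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if len(w) > 3 and w.endswith("s") and not w.endswith("ss") else w
            for w in words]


def similar(docs_a, docs_b, threshold=0.5):
    """Binary similarity: 1 iff the overlap exceeds the threshold for both clusters."""
    overlap = len(docs_a & docs_b)
    return int(overlap / len(docs_a) > threshold and overlap / len(docs_b) > threshold)


def phrase_weight(num_words):
    # Assumed values for f: penalise single words, linear up to 6 words, then constant.
    return 0.5 if num_words <= 1 else float(min(num_words, 6))


def score(phrase, docs):
    """s(B) = |B| * f(|P|) for a base cluster with phrase P and document set B."""
    return len(docs) * phrase_weight(len(phrase.split()))


def top_k(base_clusters, k):
    """Keep the k highest-scoring base clusters."""
    ranked = sorted(base_clusters.items(), key=lambda kv: score(*kv), reverse=True)
    return dict(ranked[:k])


def merge_base_clusters(base_clusters):
    """Step 3: merge base clusters that are connected in the similarity graph."""
    phrases = list(base_clusters)
    parent = {p: p for p in phrases}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for i, a in enumerate(phrases):
        for b in phrases[i + 1:]:
            if similar(base_clusters[a], base_clusters[b]):
                parent[find(a)] = find(b)

    merged = {}
    for p in phrases:
        group = merged.setdefault(find(p), {"phrases": [], "docs": set()})
        group["phrases"].append(p)
        group["docs"] |= base_clusters[p]
    return list(merged.values())


# Hand-written base clusters (phrase -> document ids), standing in for Step 2.
base = {
    "suffix tree": {0, 1, 2},
    "tree clustering": {0, 1},
    "search engine": {3, 4},
    "engine": {3, 4, 5},
}
merged = merge_base_clusters(top_k(base, k=10))
for cluster in merged:
    print(sorted(cluster["docs"]), cluster["phrases"])
```

Because a base cluster may be linked to several others, documents can end up in more than one final cluster, which matches the overlap requirement from the key concepts.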
7. Advantages - Experiments
● STC is incremental: each new document is added to the suffix tree, and nodes are updated or created. The relevant base clusters are then updated and their similarity to the rest of the base clusters is recalculated.
● Linear time (cleaning and inserting each document, and creating new clusters)
Image from [1]
8. References
● [1] Oren Zamir and Oren Etzioni, "Web Document Clustering: A Feasibility Demonstration", Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, U.S.A.
● [2] Suffix Tree, https://en.wikipedia.org/wiki/Suffix_tree
● [3] Suffix Tree Clustering, https://en.wikipedia.org/wiki/Suffix_tree_clustering