This document discusses web clustering engines, which group search results returned by a search engine into a hierarchy of labeled clusters. It describes advantages like allowing better topic understanding. Key components of clustering engines are search result acquisition, preprocessing like tokenization, and clustering algorithms like agglomerative hierarchical clustering. Issues in implementing clusters are also outlined, as well as techniques to improve efficiency like client-side processing and using pretokenized documents.
2. CONTENTS
● Introduction
● Why web clustering engines?
● Advantages of cluster hierarchy
● Issues in implementation of clusters
● Architecture
● Data centric clustering algorithm
● Conclusion
3. Search engines ?
● Search engines are an invaluable tool for retrieving information from the web. In
response to a user query, they return a list of results ranked in order of relevance to
the query.
● E.g.: Google, Yahoo, Credo, etc.
6. Web clustering engines
● Search engine.
● Web clustering engines are systems that perform clustering of web search
results. These systems group the results returned by a search engine into a hierarchy of
labeled clusters (also called categories).
● Clustering is the act of grouping similar objects into sets.
● The distance between objects in the same cluster should be as small as possible.
● The distance between objects in different clusters should be as large as possible.
7. Web clustering engines -
1. Northern Light (predefined set of clusters)
2. Vivisimo (dynamically generated cluster labels)
3. Clusty
4. Grokker
5. Yippy
6. Lingo3G
7. Credo, etc.
8. Why web clustering engines?
● Conventional search engines handle ambiguous queries poorly.
● The results a conventional search engine returns for such a query mix several meanings
together in one list, so items irrelevant to the user's intent appear among the relevant ones.
This is where clustering of search results comes into the picture.
9. Main advantages of cluster hierarchy :
● It provides shortcuts to the items that relate to the same meaning.
● It allows better topic understanding.
● It favors systematic exploration of search results.
10. Issues in implementation of clusters :
● Short input description.
● Meaningful labels.
● Selection of similarity measure.
● Grouping of objects into clusters.
● Computational efficiency.
● Overlapping clusters.
● Unknown number of clusters.
12. 1. Search Result Acquisition :
● The task of search result acquisition is to provide input for the rest of the system.
● Based on the query, the acquisition component must deliver 50 to 500 results, each of
which should contain:
■ Title
■ Contextual snippet
■ URL pointing to the full text being referred to.
● The source of search results can be any public search engine, such as Google or Yahoo.
● The most elegant way of fetching results from such search engines is to use the application
programming interfaces (APIs) these engines provide.
13. 2. Preprocessing of search results :
● Preprocessing converts the contents of search results (output by the acquisition component)
into a sequence of features used by the actual clustering algorithm.
● Steps for feature extraction -
a. Language identification
b. Tokenization
c. Stemming
d. Selection of features.
14. b. Tokenization :
● During the tokenization step, the text of each search result gets split into a sequence of
basic independent units called tokens, which will usually represent single words, numbers,
symbols and so on.
● Tokenization becomes much more complex for languages where words are not separated by
white space (such as Chinese) or where the text may switch direction (such as Arabic text).
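A minimal sketch of the tokenization step for a whitespace-delimited language such as English. The regular expression and the function name `tokenize` are illustrative assumptions, not taken from any particular clustering engine; text without white space between words (for example Chinese) would need a dedicated segmenter instead.

```python
import re

# Either a run of word characters (a word or a number)
# or a single character that is neither a word character nor white space (a symbol).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into lowercase word, number and symbol tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("Polly had 2 dogs, and the dogs had Polly!")
print(tokens)
```

Lowercasing at this stage is a design choice: it lets "Polly" and "polly" map to the same feature later on.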
15. c. Stemming :
● The aim of stemming is to remove the inflectional prefixes and suffixes of each word and
thus reduce different grammatical forms of the word to a common base form called a stem.
● E.g.: Connected, Connecting and Interconnected → ‘Connect’
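The idea can be illustrated with a deliberately naive affix stripper. This is only a sketch under stated assumptions: the prefix and suffix lists and the name `naive_stem` are hypothetical, and real stemmers (such as the Porter stemmer) apply far more careful rules.

```python
# Hypothetical, very small affix lists chosen only to cover the example above.
PREFIXES = ("inter",)
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word, min_stem_len=3):
    """Strip at most one known prefix and one known suffix from a word."""
    stem = word.lower()
    for prefix in PREFIXES:
        # Only strip if enough of the word remains to be a plausible stem.
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_stem_len:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem_len:
            stem = stem[:-len(suffix)]
            break
    return stem

for w in ("Connected", "Connecting", "interconnected"):
    print(w, "->", naive_stem(w))
```

A rule set this crude over-stems easily (it would also cut "interest" down to "est"), which is why production systems rely on tested stemming algorithms.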
16. d. Selection of features :
● This step extracts features for each search result present in the input.
● Features are atomic entities by which we can describe an object and represent its most
important characteristics to an algorithm.
● The features can vary from single words and fixed-length tuples of words (n-grams) to
frequent phrases (variable-length sequences of words).
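As a small illustration of fixed-length word tuples, the sketch below extracts word n-grams from an already tokenized snippet. The function name `word_ngrams` is an assumption made for this example.

```python
def word_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["web", "clustering", "engines", "group", "results"]
print(word_ngrams(tokens, 1))  # single words
print(word_ngrams(tokens, 2))  # bigrams
```

Counting how often each tuple occurs across all snippets would be one simple way to move from n-grams toward frequent phrases.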
17. How to represent a feature/text?
● One method for representing a text is the vector space model (VSM).
● A document d is represented in the VSM as a vector $[w_{t_0}, w_{t_1}, \ldots, w_{t_n}]$, where
$t_0, t_1, \ldots, t_n$ is a global set of words (features) and $w_{t_i}$ expresses the weight
(importance) of feature $t_i$ to document d.
● E.g.: d → “Polly had a dog and the dog had Polly”
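The example sentence can be turned into a vector as sketched below. The weighting scheme is an assumption: raw term frequency is used here, and the vocabulary is simply the sorted set of words in d; other weightings (such as tf-idf) fit the same representation.

```python
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Weight each vocabulary term by how often it occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

d = "Polly had a dog and the dog had Polly"
vocabulary = sorted(set(d.lower().split()))
vector = term_frequency_vector(d, vocabulary)

print(vocabulary)  # ['a', 'and', 'dog', 'had', 'polly', 'the']
print(vector)      # [1, 1, 2, 2, 2, 1]
```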
18. 3. Cluster construction and labelling :
● The set of search results, along with the features extracted in the preprocessing step, is
given as input to the clustering algorithm.
● There are a number of algorithms available for clustering. We can classify them into two
different categories -
a. Data-centric clustering algorithms
b. Description-aware clustering algorithms
● Cluster labels should be unique, unambiguous, comprehensive and appropriate to the
content.
19. Data centric clustering algorithm :
● This kind of system uses VSM for text representation, and the clustering technique used is
agglomerative hierarchical clustering (AHC).
● It first clusters the collection of documents into a set of k clusters (scattering).
● At query time, the user selects the clusters of interest (gathering) and the system re-clusters
those documents.
● This process repeats until a small cluster of relevant documents is found.
20. Agglomerative Hierarchical Clustering(AHC) :
● Initially, each document is in its own cluster.
● Build a distance matrix (dissimilarity matrix) holding the distance between every pair of clusters.
● Merge the two closest clusters and rebuild the distance matrix, replacing the two merged clusters
with the new one.
● Continue this process until the desired number of clusters, k, is reached.
● The distance matrix has $O(n^2)$ entries, where n is the initial number of clusters (one per
document), so the algorithm needs at least $O(n^2)$ time and space; a naive implementation that
searches the whole matrix before every merge takes $O(n^3)$ time.
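The steps above can be sketched as follows. This is a naive illustration, not an engine's actual implementation: Euclidean distance and single-link (closest pair) cluster distance are assumptions, and for brevity the code recomputes pairwise distances before each merge instead of updating a stored matrix.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, vectors):
    """Single-link distance: the closest pair of members across two clusters."""
    return min(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)

def agglomerative_clustering(vectors, k):
    """Merge the two closest clusters until only k clusters remain."""
    # Start with every document (identified by its index) in its own cluster.
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > max(k, 1):
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = cluster_distance(clusters[a], clusters[b], vectors)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        # Replace the two merged clusters with the single new cluster.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters

docs = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
result = agglomerative_clustering(docs, k=3)
print(result)
```

With document vectors from the VSM, a cosine-based distance would be a common substitute for the Euclidean distance used here.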
21. Improve efficiency of clustering
1. Client-side processing : During periods of high query rates, response times can increase
significantly and degrade the user experience. To avoid this, some of the processing can be
done using client-side resources.
2. Pretokenized documents : Clustering engines can reuse the tokens that conventional search
engines have already produced.
22. Conclusion
● Web clustering engines organize search results by topic, thus offering a
complementary view to the flat-ranked list returned by conventional search engines.
● Because efficient methods for evaluating the performance of clustering engines are lacking,
these engines have not yet attracted wide attention.