2. Search Engine?
• Search engines are an invaluable tool for
retrieving information from the Web.
In response to a user query, they return a
list of results ranked in order of relevance
to the query.
• E.g.: Google, Yahoo, Credo, Grokker, etc.
Arun TR
14,S7CS
3. • Google (Flat Ranked Search Engine)
Flat Ranked VS Clustered
5. Why Web Clustering Engines?
• Conventional engines are not very effective
on ‘ambiguous’ queries.
• The results a conventional search engine
returns for such a query are mixed together
in one list, so items irrelevant to the
intended meaning appear among the relevant ones.
This is where clustering of search results
comes into the picture.
6. • Search engine
• Clustering is the act of grouping similar
objects into sets.
• The distance between objects in the
same cluster (intra-cluster variation)
should be minimal.
• The distance between objects in different
clusters (inter-cluster variation) should be
maximal.
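The two distance criteria above can be checked numerically. The following is a minimal sketch, assuming points are coordinate tuples compared by Euclidean distance; the function names `mean_within` and `mean_between` are hypothetical and not taken from the slides.

```python
import math
from itertools import combinations


def _mean(values):
    # Average of a list; 0.0 when there are no pairs to compare.
    return sum(values) / len(values) if values else 0.0


def mean_within(points, labels):
    """Average distance between pairs of points that share a cluster label."""
    pairs = combinations(range(len(points)), 2)
    return _mean([math.dist(points[i], points[j])
                  for i, j in pairs if labels[i] == labels[j]])


def mean_between(points, labels):
    """Average distance between pairs of points in different clusters."""
    pairs = combinations(range(len(points)), 2)
    return _mean([math.dist(points[i], points[j])
                  for i, j in pairs if labels[i] != labels[j]])


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
print(mean_within(points, labels), mean_between(points, labels))
```

A good clustering gives a small within-cluster average and a large between-cluster average, as in this toy data set.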
Web Clustering Engines?
7. • These systems group the results returned by
a search engine into a hierarchy of labeled
clusters (also called categories).
Web clustering engines:
1. Northern Light - predefined set of clusters
2. Vivísimo - dynamically generated cluster labels
3. Clusty
4. Grokker
5. KartOO
6. Lingo3G
7. CREDO, etc.
8. Main advantages of the cluster hierarchy
• It provides shortcuts to the items that relate to
the same meaning.
• It allows better topic understanding.
• It favors systematic exploration of search
results.
9. Issues in Implementation of Clusters
• Short input data description.
• Meaningful labels.
• Selection of similarity measure.
• Grouping of objects into clusters.
• Computational efficiency.
• Unknown number of clusters.
11. 1. Search Results Acquisition
• Provides input for the rest of the system.
• Based on the query, the acquisition
component must deliver 50 to 500 results,
each of which should contain a title, a
contextual snippet, and the URL.
• The source of search results can be any
public search engine, such as
Google, Yahoo, etc.
• Results are fetched from these search
engines through the APIs they provide.
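The acquisition step can be sketched as below. This is only an illustration under stated assumptions: `SearchResult`, `acquire`, and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever API client a real search engine provides (no real engine API is shown here).

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    title: str
    snippet: str
    url: str


def acquire(query, fetch, limit=200):
    """Collect up to `limit` complete results for `query`.

    `fetch(query, limit)` is a caller-supplied function (hypothetical) that
    returns an iterable of dicts with "title", "snippet" and "url" keys.
    """
    if not 50 <= limit <= 500:
        raise ValueError("limit should be between 50 and 500 results")
    results = []
    for raw in fetch(query, limit):
        title, snippet, url = raw.get("title"), raw.get("snippet"), raw.get("url")
        # Keep only results that carry all three fields.
        if title and snippet and url:
            results.append(SearchResult(title, snippet, url))
    return results[:limit]


def fake_fetch(query, limit):
    # Stand-in for a real search API, used only to exercise the sketch.
    return [
        {"title": "Jaguar cars", "snippet": "Luxury vehicles", "url": "http://a.example"},
        {"title": "Jaguar (animal)", "snippet": "A big cat", "url": "http://b.example"},
        {"title": "Broken entry", "snippet": "No URL here"},
    ]


results = acquire("jaguar", fake_fetch, limit=50)
print(len(results))
```

Passing the fetch function as a parameter keeps the rest of the pipeline independent of any particular search engine.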
12. 2. Preprocessing of Search Results
• The primary aim is to convert the search
results into ‘features’.
Steps:
i. Language identification
ii. Tokenization
iii. Stemming
iv. Feature selection
13. ii. Tokenization:
The text of each search result is split into a
sequence of basic independent units called
tokens, each representing a word, number, or
symbol.
Tokenization is more complex for languages that
do not separate words with white space (such as
Chinese) or in which text can switch direction
(such as Arabic).
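A minimal sketch of tokenization for whitespace-delimited text follows. It assumes a simple regular expression is enough, which does not hold for languages such as Chinese; the names `TOKEN_PATTERN` and `tokenize` are hypothetical.

```python
import re

# A token is either a run of word characters (a word or a number)
# or a single symbol that is neither a word character nor white space.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, number and symbol tokens, dropping white space."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Web clustering: 50 results!"))
# ['Web', 'clustering', ':', '50', 'results', '!']
```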
14. iii. Stemming:
Remove the inflectional prefixes and suffixes
of each word to reduce the different grammatical
forms of the word to a common base form
called a ‘stem’.
E.g.:
connected, connecting, interconnection
↓ ↓ ↓
‘connect’
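The example above can be reproduced with a toy affix stripper. This is a deliberately naive sketch, not a real stemming algorithm such as Porter's; the affix lists and the name `stem` are assumptions chosen only to cover the three example words.

```python
# Hypothetical, very small affix lists chosen for illustration only.
PREFIXES = ("inter",)
SUFFIXES = ("ing", "ion", "ed", "s")
MIN_STEM_LENGTH = 3  # avoid stripping a word down to almost nothing


def stem(word):
    """Strip at most one known prefix and one known suffix from a word."""
    w = word.lower()
    for prefix in PREFIXES:
        if w.startswith(prefix) and len(w) - len(prefix) >= MIN_STEM_LENGTH:
            w = w[len(prefix):]
            break
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= MIN_STEM_LENGTH:
            w = w[:-len(suffix)]
            break
    return w


for word in ("connected", "connecting", "interconnection"):
    print(word, "->", stem(word))
```

All three words reduce to the stem `connect`, so they are treated as the same feature in later steps.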
15. iv. Feature selection:
• Extract features for each search result
present in the input.
• Features are atomic entities by which we
can describe an object and represent its
most important characteristics to an
algorithm.
• Features vary from single words to tuples of
words.
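One simple way to obtain features ranging from single words to word tuples is to collect word n-grams. The sketch below assumes the input is already tokenized; the function name `extract_features` and the `max_n` parameter are hypothetical.

```python
def extract_features(tokens, max_n=2):
    """Return every run of 1..max_n consecutive tokens as a tuple."""
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(tuple(tokens[start:start + n]))
    return features


print(extract_features(["web", "clustering", "engine"]))
```

With `max_n=2` the result holds the three single words followed by the two adjacent word pairs.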
16. How can we represent a feature/text?
• Vector Space Model (VSM)
• Document d is represented in the VSM as a
vector [w_t0, w_t1, ..., w_tn]
where t0, t1, ..., tn is a set of words/features
and w_ti is the weight/importance of feature ti.
E.g.:
d → “Polly had a dog and the dog had Polly”
[Figure: VSM representation of d]
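A VSM vector for the example sentence can be sketched as follows. The weighting scheme is an assumption: raw term frequency is used here because the slide does not say which weights it intends, and the name `to_vector` is hypothetical.

```python
from collections import Counter


def to_vector(text, vocabulary=None):
    """Represent text as term-frequency weights over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    if vocabulary is None:
        # Assumption: use the document's own words, in alphabetical order.
        vocabulary = sorted(counts)
    return vocabulary, [counts[term] for term in vocabulary]


vocab, vector = to_vector("Polly had a dog and the dog had Polly")
print(vocab)   # ['a', 'and', 'dog', 'had', 'polly', 'the']
print(vector)  # [1, 1, 2, 2, 2, 1]
```

Passing the same `vocabulary` for every search result yields vectors of equal length, which is what a similarity measure needs.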
17. 3. Cluster Construction & Labelling
• The set of search results, along with their
features, is the input to the clustering algorithm,
which builds the clusters and labels them.
Two types of algorithms:
→ Data-centric clustering algorithms
→ Description-aware algorithms (related to Suffix Tree Clustering, STC)
• Each cluster created should be aptly labeled. A label should be:
i. Unique ii. Unambiguous iii. Comprehensive
iv. Sensitive to the content
18. Data Centric Clustering Algorithm
• The scatter/gather approach is similar to
Agglomerative Hierarchical Clustering (AHC)
with an average-link merge criterion.
• It starts with an initial clustering of a
collection of documents into a set of k clusters (scatter).
• At query time the user selects the clusters of
interest (gather) and the system re-clusters
those documents.
• The process repeats until a small cluster of
relevant documents is found.
20. • Bottom-up approach: initially each
document is in its own cluster.
• Build a distance matrix for every pair of
clusters. Merge the 2 closest clusters and
rebuild the distance matrix, replacing
the merged pair by one cluster.
• Continue this process until the desired
number k of clusters is reached.
• The complexity of this algorithm is at least
O(n^2), where n is the number of documents
(the initial number of clusters), since a
distance is computed for every pair.
• Another data-centric algorithm is
k-means clustering.
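The bottom-up procedure can be sketched in a few lines. This is a simplified illustration, assuming documents are already numeric vectors compared by Euclidean distance and using the average-link criterion; for brevity it recomputes cluster distances on every pass instead of maintaining a distance matrix, so it is slower than a careful implementation. The names `average_link` and `agglomerate` are hypothetical.

```python
import math


def average_link(points, a, b):
    """Mean distance over all pairs with one point in cluster a and one in b."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(points, k):
    """Merge the two closest clusters repeatedly until only k remain."""
    clusters = [[i] for i in range(len(points))]  # one cluster per document
    while len(clusters) > k:
        best = None  # (distance, index of first cluster, index of second)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_link(points, clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # replace the pair by one cluster
        del clusters[j]
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (50.0, 50.0)]
print(agglomerate(points, 3))
# [[0, 1], [2, 3], [4]]
```

Each cluster is a list of indices into `points`, so the original documents can be looked up after clustering.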
21. Difficulties in Data Centric Algorithms
• None of these algorithms is incremental in
nature. An incremental algorithm would “clean”
each document as it arrives from the web and
add it to the available model.
• The resulting clusters lack meaningful labels.
22. 4. Visualization of Clustered Results
• One prominent approach is based on hierarchical folders
• Clusty, CREDO, Lingo3G - hierarchical folder visualization
approach
• Grokker - nesting and zooming approach
• KartOO - graph-based interface
25. Improve Efficiency of Clustering
• Client-side processing: during periods of
high query rate, response times can increase
significantly, so some processing can be moved
to client-side resources.
• Incremental processing: as each
document arrives from the web, we “clean”
it and add it to the available model.
• Pretokenized documents: clustering
engines can reuse the tokens already produced
by the conventional search engines.
26. Conclusion
Web clustering engines organize search results by
topic, thus offering a complementary view to the
flat-ranked list returned by conventional search
engines. Advances are still needed in cluster
labeling, coherence of the cluster structure,
performance evaluation, and visualization
techniques. Only then can Web clustering
engines entirely fulfill the promise of being the
PageRank of the future.
Because there is no efficient method for
evaluating the performance of clustering engines,
they have not yet attracted much attention.