2. Search Engine?
• Search engines are an invaluable tool for
retrieving information from the Web.
In response to a user query, they return a
list of results ranked in order of relevance
to the query.
• E.g.: Google, Yahoo, Credo, Grokker, etc.
3. • Google (Flat Ranked Search Engine)
Flat Ranked VS Clustered
5. Why Web Clustering
Engines?
• Conventional engines do not handle
‘ambiguous’ queries well.
• For such a query, the results returned by
a conventional search engine are mixed
together in a single list, so irrelevant
items appear among the relevant ones.
In this context, clustering of search results
comes into the picture!
6. • Search engine
• Clustering is the act of grouping similar
objects into sets.
• The distance between objects in the
same cluster (intra-cluster variation)
should be minimal.
• The distance between objects in different
clusters (inter-cluster variation) should be
maximal.
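As a rough illustration of the two distance criteria, the sketch below compares the average distance within one cluster to the average distance between two clusters. The 2-D points, the Euclidean metric, and the helper names are assumptions made for the example; the slides do not prescribe a particular distance measure.

```python
from itertools import combinations, product
from math import dist


def mean_intra_distance(cluster):
    """Average distance between pairs of points in the same cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def mean_inter_distance(cluster_a, cluster_b):
    """Average distance between points taken from two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


# Two hypothetical, well-separated clusters of 2-D points.
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
cluster_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

intra_a = mean_intra_distance(cluster_a)
inter_ab = mean_inter_distance(cluster_a, cluster_b)
print(f"intra-cluster: {intra_a:.2f}, inter-cluster: {inter_ab:.2f}")
```

A good clustering keeps the first number small and the second number large.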
Web Clustering Engines?
7. • These systems group the results returned by
a search engine into a hierarchy of labeled
clusters (also called categories).
Web clustering engines:
1. Northern Light - predefined set of clusters
2. Vivísimo - cluster labels were dynamically generated
3. Clusty
4. Grokker
5. KartOO
6. Lingo3G
7. CREDO, etc.
8. • Short input data description.
• Meaningful labels.
• Selection of similarity measure.
• Grouping of objects into clusters.
• Computational efficiency.
• Unknown number of clusters.
Issues in Implementation Of
clusters
10. Search Results Acquisition
• Provides input for the rest of the system.
• Based on the query, the acquisition
component must deliver 50 to 500 results,
each of which should contain a title, a
contextual snippet, and the URL
• The source of search results can be any
public search engine, such as
Google, Yahoo, etc.
• Results are fetched from these
engines through their APIs.
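The acquisition step described above might be sketched as follows. The `SearchResult` class, the `acquire_results` function, and the `fake_fetch` stand-in are all hypothetical names introduced for illustration; no real search engine API is called, and a real client would replace `fake_fetch`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    """One search result: title, contextual snippet, and URL."""
    title: str
    snippet: str
    url: str


def acquire_results(query, fetch, max_results=500):
    """Collect complete results from `fetch`, capped at `max_results`."""
    results = []
    for raw in fetch(query):
        # Skip results that lack any of the three required fields.
        if not all(raw.get(key) for key in ("title", "snippet", "url")):
            continue
        results.append(SearchResult(raw["title"], raw["snippet"], raw["url"]))
        if len(results) == max_results:
            break
    return results


def fake_fetch(query):
    """Stand-in for a search engine API (hypothetical data only)."""
    yield {"title": "incomplete result", "snippet": "has no URL"}
    for i in range(60):
        yield {
            "title": f"{query} result {i}",
            "snippet": f"... text mentioning {query} ...",
            "url": f"https://example.com/{i}",
        }


results = acquire_results("jaguar", fake_fetch)
print(len(results), results[0].url)
```

The sketch only enforces the upper bound of 500; whether to reject a query that yields fewer than 50 results is left as a design decision.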
11. Preprocessing of Search
results
• The primary aim is to convert the search
results into ‘features’.
Steps:
i. Language identification
ii. Tokenization
iii. Stemming
iv. Feature selection
12. ii. Tokenization:
The text of each search result is split into a
sequence of basic independent units called
tokens, each representing a word, number, or
symbol.
This is more complex for languages where white
spaces are not present (such as Chinese)
or where the text switches direction (such as Arabic).
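For whitespace-delimited text, tokenization can be approximated with a regular expression. This is a minimal sketch, assuming a simple pattern that treats runs of word characters (words and numbers) and single punctuation marks as tokens; it does not address the harder cases such as Chinese or Arabic text mentioned above.

```python
import re

# A token is either a run of word characters (a word or a number)
# or a single non-space symbol such as punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, number, and symbol tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Polly had 2 dogs, right?")
print(tokens)
# ['Polly', 'had', '2', 'dogs', ',', 'right', '?']
```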
13. iii. Stemming:
Remove the inflectional prefixes and suffixes
of each word to reduce different grammatical
forms of the word to a common base form
called a ‘stem’.
E.g.:
connected, connecting & interconnection
↓ ↓ ↓
‘connect’
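The example above can be reproduced with a deliberately tiny affix stripper. This is a toy sketch, not a real stemming algorithm such as Porter's: the prefix and suffix lists are assumptions chosen only to cover the three example words, and the function will over-strip many ordinary words.

```python
# Toy affix lists, chosen only to cover the example words.
PREFIXES = ("inter",)
SUFFIXES = ("ing", "ion", "ed")
MIN_STEM_LENGTH = 3


def toy_stem(word):
    """Strip at most one known prefix and one known suffix from a word."""
    word = word.lower()
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LENGTH:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            word = word[:-len(suffix)]
            break
    return word


stems = [toy_stem(w) for w in ("connected", "connecting", "interconnection")]
print(stems)
# ['connect', 'connect', 'connect']
```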
14. iv. Feature selection:
•Extract features for each search result
present in the input.
•Features are atomic entities by which we
can describe an object and represent its
most important characteristics to an
algorithm.
•Features vary from single words to tuples of
words.
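One simple way to obtain features ranging from single words to tuples of words is to enumerate word n-grams. The sketch below is an assumption about how this could be done; the function name and the `max_n` parameter are illustrative, and the slides do not specify how candidate features are generated or filtered.

```python
def extract_features(tokens, max_n=2):
    """Return all word n-grams of length 1..max_n as tuples."""
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append(tuple(tokens[i:i + n]))
    return features


features = extract_features(["web", "clustering", "engine"])
print(features)
# [('web',), ('clustering',), ('engine',),
#  ('web', 'clustering'), ('clustering', 'engine')]
```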
15. How can we represent a feature/text?
• Vector Space Model (VSM)
• Document d is represented in the VSM as a
vector [wt0, wt1, ..., wtn]
where t0, t1, ..., tn is a set of words/features
and wti is the weight/importance of feature ti.
Eg:
d→“Polly had a dog and the dog had Polly”
vsm representation
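A possible VSM vector for the example document can be computed as below. This sketch assumes raw term frequency as the weight and an alphabetically sorted, lowercased vocabulary; both are assumptions for illustration, since the weighting scheme is not stated here (TF-IDF is a common alternative).

```python
from collections import Counter


def to_vector(text, vocabulary):
    """Represent text as term-frequency weights over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


document = "Polly had a dog and the dog had Polly"
vocabulary = sorted(set(document.lower().split()))
vector = to_vector(document, vocabulary)

print(vocabulary)  # ['a', 'and', 'dog', 'had', 'polly', 'the']
print(vector)      # [1, 1, 2, 2, 2, 1]
```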