WEB CLUSTERING
ENGINES
ARUN TR
14
12130413
S7CS,CEA
Search Engine?
• Search engines are an invaluable tool for
retrieving information from the Web.
In response to a user query, they return a
list of results ranked in order of relevance
to the query.
• E.g.: Google, Yahoo, Credo, Grokker, etc.
Arun TR
14,S7CS
Flat Ranked vs. Clustered
• Google (flat-ranked search engine)
• Yippy (web clustering engine)
Why Web Clustering
Engines?
• Conventional engines are not very
effective for ambiguous queries.
• For such a query, the results returned by
a conventional search engine mix the
different meanings together in one list, so
irrelevant items appear among the relevant ones.
• This is where clustering of search results
comes into the picture.
Web Clustering Engines?
• Search engine
• Clustering is the act of grouping similar
objects into sets.
• The distance between objects in the
same cluster (intra-cluster variation)
should be minimal.
• The distance between objects in different
clusters (inter-cluster variation) should be
maximal.
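The two distance criteria can be checked numerically. The sketch below is only an illustration: it assumes objects are 2-D points compared with Euclidean distance, and the function names `mean_distance_within` and `mean_distance_between` are hypothetical, not taken from any clustering engine.

```python
import math
from itertools import combinations

def mean_distance_within(clusters):
    """Average distance between pairs of points that share a cluster."""
    dists = [math.dist(p, q)
             for cluster in clusters
             for p, q in combinations(cluster, 2)]
    return sum(dists) / len(dists)

def mean_distance_between(clusters):
    """Average distance between pairs of points that lie in different clusters."""
    dists = [math.dist(p, q)
             for a, b in combinations(clusters, 2)
             for p in a
             for q in b]
    return sum(dists) / len(dists)

# Two groupings of the same four points (made-up data).
good = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
bad = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]

for name, clusters in (("good", good), ("bad", bad)):
    print(name, mean_distance_within(clusters), mean_distance_between(clusters))
```

A good clustering keeps the first number small and the second large; the "bad" grouping does the opposite.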
• These systems group the results returned by
a search engine into a hierarchy of labeled
clusters (also called categories).
Web clustering engines:
1. Northern Light - predefined set of clusters
2. Vivísimo - dynamically generated cluster labels
3. Clusty
4. Grokker
5. KartOO
6. Lingo3G
7. CREDO, etc.
Main advantages of the
cluster hierarchy
• It provides shortcuts to the items that share
the same meaning.
• It allows better topic understanding.
• It favors systematic exploration of search
results.
Issues in Implementation of
Clusters
• Short input data description.
• Meaningful labels.
• Selection of similarity measure.
• Grouping of objects into clusters.
• Computational efficiency.
• Unknown number of clusters.
Architecture & Techniques
1. Search Results Acquisition
• Provides input for the rest of the system.
• Based on the query, the acquisition
component must deliver 50 to 500 results,
each of which should contain a title, a
contextual snippet, and the URL.
• The source of search results can be any
public search engine, such as
Google, Yahoo, etc.
• Results are fetched from these engines
through their APIs.
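A minimal sketch of what the acquisition component hands to the rest of the system. The `SearchResult` record and the `acquire` helper are hypothetical names, and the input is assumed to be a list of dictionaries already fetched from some engine's API; no real search API is called here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchResult:
    """One result as required above: title, contextual snippet, URL."""
    title: str
    snippet: str
    url: str

def acquire(raw_results, max_results=500):
    """Keep only complete results and cap the list at max_results.

    raw_results is assumed to be a list of dicts with the keys
    'title', 'snippet' and 'url' (a hypothetical API response shape).
    """
    complete = [
        SearchResult(r["title"], r["snippet"], r["url"])
        for r in raw_results
        if r.get("title") and r.get("snippet") and r.get("url")
    ]
    return complete[:max_results]

# Made-up data standing in for an API response.
raw = [{"title": f"t{i}", "snippet": "s", "url": f"http://example.com/{i}"}
       for i in range(600)]
raw[0]["url"] = ""  # an incomplete result that should be dropped
results = acquire(raw)
print(len(results), results[0])
```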
2. Preprocessing of Search
Results
• The primary aim is to convert the search
results into ‘features’.
Steps:
i. Language identification
ii. Tokenization
iii. Stemming
iv. Feature selection
ii. Tokenization:
The text of each search result is split into a
sequence of basic independent units called
tokens, each representing a word, a number, or a
symbol.
Tokenization is more complex for languages that
do not separate words with white space (such as
Chinese) or whose text can switch direction
(such as Arabic).
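For white-space-delimited languages, tokenization can be sketched with a single regular expression. This is an illustrative assumption, not the tokenizer of any particular engine; the `tokenize` name and the choice to lowercase the text are ours.

```python
import re

# A run of word characters (word or number), or a single non-space symbol.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into lowercase word, number and symbol tokens."""
    return TOKEN_RE.findall(text.lower())

print(tokenize("Web clustering, 2 engines!"))
# ['web', 'clustering', ',', '2', 'engines', '!']
```

On Chinese text this pattern would return a whole run of characters as one token, which is why such languages need a dedicated word segmenter.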
iii. Stemming:
Remove the inflectional prefixes and suffixes
of each word to reduce the different grammatical
forms of the word to a common base form
called a ‘stem’.
E.g.:
connected, connecting, interconnection → ‘connect’
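The example above can be reproduced with a toy affix stripper. The prefix and suffix lists below are assumptions chosen only to cover these three words; production stemmers such as the Porter stemmer use much larger rule sets and generally strip suffixes only.

```python
# Hypothetical, deliberately tiny affix lists for illustration.
PREFIXES = ("inter",)
SUFFIXES = ("ing", "ion", "ed")

def toy_stem(word, min_stem=3):
    """Strip at most one listed prefix and one listed suffix from word."""
    word = word.lower()
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[:-len(suffix)]
            break
    return word

for w in ("connected", "connecting", "interconnection"):
    print(w, "->", toy_stem(w))
```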
iv. Feature selection:
• Extract features for each search result
present in the input.
• Features are atomic entities by which we
can describe an object and represent its
most important characteristics to an
algorithm.
• Features range from single words to tuples of
words.
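One simple way to obtain features that range from single words to tuples of words is to take word n-grams. The sketch below assumes the tokens have already been produced by the earlier steps; the function names and the default of stopping at pairs are illustrative choices.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, each as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(tokens, max_n=2):
    """Single words (n=1) up to tuples of max_n consecutive words."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features

print(extract_features(["web", "clustering", "engine"]))
```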
How can we represent a feature/text?
• Vector Space Model (VSM)
• A document d is represented in the VSM as a
vector [w_t0, w_t1, ..., w_tn],
where t0, t1, ..., tn is a set of words/features
and w_ti is the weight/importance of feature ti.
E.g.:
d → “Polly had a dog and the dog had Polly”
VSM representation of d (figure)
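As a sketch of the example sentence in the VSM, the code below uses the raw term frequency as the weight of each feature. That weighting is an assumption made for illustration; other schemes (such as tf-idf) are common, and the original figure may have used a different one.

```python
from collections import Counter

def vsm_vector(text, vocabulary=None):
    """Return (vocabulary, weights), with raw term frequency as the weight."""
    counts = Counter(text.lower().split())
    if vocabulary is None:
        vocabulary = sorted(counts)  # fixed, alphabetical feature order
    return vocabulary, [counts[t] for t in vocabulary]

vocab, weights = vsm_vector("Polly had a dog and the dog had Polly")
print(vocab)    # ['a', 'and', 'dog', 'had', 'polly', 'the']
print(weights)  # [1, 1, 2, 2, 2, 1]
```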
3. Cluster Construction &
Labelling
• The set of search results, along with their
features, is the input to the clustering algorithm,
which builds and labels the clusters.
Two types of algorithms:
→ Data-centric clustering algorithms
→ Description-aware algorithms (STC, Suffix Tree
Clustering, and related)
• Each cluster should be aptly labeled. A label should be:
i. Unique ii. Unambiguous iii. Comprehensive
iv. Sensible to the content
Data-Centric Clustering Algorithm
• A typical example is Scatter/Gather, which is
similar to Agglomerative Hierarchical
Clustering (AHC) with an average-link
merge criterion.
• It first clusters a collection of
documents into a set of k clusters (scatter).
• At query time, the user selects the clusters of
interest (gather), and the system re-clusters
the documents they contain.
• The process repeats until a small cluster of
relevant documents is found.
Function of a Scatter/Gather system
• AHC is a bottom-up approach. Initially each
document is in its own cluster.
• Build a distance matrix for every pair of
clusters. Merge the 2 closest clusters and
rebuild the distance matrix, replacing
the two merged clusters with the new one.
• Continue this process until the desired number
of clusters, k, is reached.
• The complexity of this algorithm is at least
O(n^2), where n is the initial number of clusters
(one per document), because a distance is needed
for every pair.
• Another data-centric algorithm is
K-means clustering.
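The bottom-up procedure can be sketched as follows, assuming documents are already represented as numeric vectors compared with Euclidean distance, and using the average-link criterion. The function names are illustrative, and this naive version recomputes all pairwise cluster distances at every merge instead of maintaining a distance matrix, so it is slower than an optimized implementation.

```python
import math
from itertools import combinations

def average_link(a, b):
    """Average pairwise distance between the points of clusters a and b."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerate(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # each point starts in its own cluster
    while len(clusters) > k:
        # Indices of the closest pair of clusters under average link.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

# Made-up 2-D points standing in for document vectors.
result = agglomerate([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(result)
```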
Difficulties in Data-Centric
Algorithms
• None of these algorithms is incremental in
nature: they cannot take each document as it
arrives from the web, “clean” it, and add it to
the available model.
• They do not produce meaningful labels.
4. Visualization of Clustered
Results
• One prominent approach is based on hierarchical folders.
• Clusty, CREDO, Lingo3G - hierarchical folder visualization
approach
• Grokker - nesting and zooming approach
• KartOO - graph-based interface
Credo - hierarchical folder visualization approach
Grokker – Nesting and Zooming
Improving the Efficiency of
Clustering
• Client-side processing: During periods of
high query rate, response times can increase
significantly. Some of the processing can be
moved to the client to use its resources.
• Incremental processing: As each
document arrives from the web, we “clean”
it and add it to the available model.
• Pretokenized documents: Clustering
engines can reuse the tokens already produced
by conventional search engines.
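Incremental processing can be sketched as a model that is updated one document at a time instead of being rebuilt from the full result list. The slides do not say what the "model" contains, so the running term counts below are purely an assumption, as are the class name `IncrementalModel` and the cleaning rule (lowercase, keep alphanumeric tokens).

```python
from collections import Counter

class IncrementalModel:
    """Running term statistics, updated as each document arrives."""

    def __init__(self):
        self.term_counts = Counter()
        self.doc_count = 0

    def add(self, snippet):
        """'Clean' one arriving snippet and fold it into the model."""
        tokens = [t for t in snippet.lower().split() if t.isalnum()]
        self.term_counts.update(tokens)
        self.doc_count += 1

model = IncrementalModel()
for snippet in ("Jaguar cars", "jaguar habitat !"):  # made-up snippets
    model.add(snippet)
print(model.doc_count, model.term_counts["jaguar"])
```

The point of the design is that `add` touches only the new document, so work can start before all results have been fetched.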
Conclusion
Web clustering engines organize search results by
topic, thus offering a complementary view to the
flat-ranked list returned by conventional search
engines. A number of advances are still needed:
better cluster labels, a more coherent cluster
structure, performance evaluation studies, and advanced
visualization techniques. Only then can web clustering
engines fully deliver on the promise of being the
PageRank of the future.
For lack of an efficient method to evaluate the
performance of clustering engines, they
have not yet attracted wide attention.
THANK YOU
QUESTIONS?
