Web Clustering Engines
YASH DARAK
206117026
CONTENTS
● Introduction
● Why web clustering engines?
● Advantages of cluster hierarchy
● Issues in implementation of clusters
● Architecture
● Data centric clustering algorithm
● Conclusion
Search engines ?
● Search engines are an invaluable tool for retrieving information from the web. In
response to a user query, they return a list of results ranked in order of relevance to
the query.
● E.g.: Google, Yahoo, Credo, etc.
Google (Flat ranked search engine)
Yippy (Web clustering engine)
Web clustering engines
● Search engine.
● Web clustering engines are systems that cluster web search results. These systems
group the results returned by a search engine into a hierarchy of
labeled clusters (also called categories).
● Clustering is the act of grouping similar objects into sets.
● The distance between objects in the same cluster should be as small as possible.
● The distance between objects in different clusters should be as large as possible.
Web clustering engines -
1. Northern Light (predefined set of clusters)
2. Vivisimo (cluster labels were dynamically generated)
3. Clusty
4. Grokker
5. Yippy
6. Lingo3G
7. Credo, etc.
Why web clustering engines ?
● Conventional engines do not handle ambiguous queries well.
● For such a query, results that relate to different meanings are mixed together in
one ranked list, so items irrelevant to the intended meaning appear among the relevant ones.
This is where clustering of search results comes into the picture.
Main advantages of cluster hierarchy :
● It provides shortcuts to the items that relate to the same meaning.
● It allows better topic understanding.
● It favors systematic exploration of search results.
Issues in implementation of clusters :
● Short input description.
● Meaningful labels.
● Selection of similarity measure.
● Grouping of objects into clusters.
● Computational efficiency.
● Overlapping clusters.
● Unknown number of clusters.
Architecture :
1. Search Result Acquisition :
● The task of search result acquisition is to provide input for the rest of the system.
● Based on the query, the acquisition component must deliver 50 to 500 results, each of
which should contain -
■ Title
■ Contextual snippet
■ URL pointing to the full text being referred to.
● The source of search results can be any public search engine, such as Google or Yahoo.
● The most elegant way of fetching results from such search engines is to use the application
programming interfaces (APIs) these engines provide.
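As a minimal sketch of what the acquisition component hands to the rest of the system, the record below carries the three fields listed above. The names `SearchResult` and `text_for_clustering` are hypothetical and not taken from any search engine API; joining title and snippet into one string is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    """One result delivered by the acquisition component (hypothetical type)."""
    title: str
    snippet: str
    url: str


def text_for_clustering(result: SearchResult) -> str:
    # Assumption: later stages work on the title plus the contextual snippet;
    # the URL only points to the full text and is not clustered itself.
    return f"{result.title}. {result.snippet}"


result = SearchResult(
    title="Jaguar",
    snippet="The jaguar is a large cat.",
    url="https://example.org/jaguar",
)
print(text_for_clustering(result))
```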
2. Preprocessing of search results :
● Preprocessing converts the contents of search results (output by the acquisition component) into a
sequence of features used by the actual clustering algorithm.
● Steps for feature extraction -
a. Language identification
b. Tokenization
c. Stemming
d. Selection of features.
b. Tokenization :
● During the tokenization step, the text of each search result is split into a sequence of
basic independent units called tokens, which usually represent single words, numbers,
symbols and so on.
● Tokenization becomes much more complex for languages that do not separate words with
white space (such as Chinese) or where the text may switch direction (such as Arabic).
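For whitespace-delimited text, the tokenization step can be sketched with a single regular expression. This is a simplified illustration, assuming English-like input; the pattern and the function name `tokenize` are choices made here, and the sketch does not address the harder cases mentioned above.

```python
import re

# Assumption: a token is either a run of word characters (words, numbers)
# or a single non-space symbol such as punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word, number and symbol tokens."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("Web clustering, 2 engines!"))
# ['web', 'clustering', ',', '2', 'engines', '!']
```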
c. Stemming :
● The aim of stemming is to remove the inflectional prefixes and suffixes of each word and
thus reduce different grammatical forms of the word to a common base form called a stem.
● E.g.
Connected, Connecting and Interconnected -> 'Connect'
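The example above can be reproduced with a toy affix stripper. This is only a sketch: the prefix and suffix lists are assumptions chosen to cover the three example words, and real stemmers (for instance the Porter stemmer) apply far more careful rules to avoid over-stemming.

```python
# Hypothetical, deliberately tiny affix lists; not a real stemming rule set.
PREFIXES = ("inter",)
SUFFIXES = ("ing", "ed", "s")
MIN_STEM_LENGTH = 3


def stem(word: str) -> str:
    """Strip at most one known prefix and one known suffix from a word."""
    word = word.lower()
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LENGTH:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            word = word[:-len(suffix)]
            break
    return word


for w in ("Connected", "Connecting", "Interconnected"):
    print(w, "->", stem(w))
```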
d. Selection of features :
● This step extracts features for each search result present in the input.
● Features are atomic entities by which we can describe an object and represent its most
important characteristics to an algorithm.
● The features can vary from single words and fixed-length tuples of words (n-grams) to
frequent phrases (variable-length sequences of words).
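Fixed-length word tuples can be extracted with a sliding window over the token sequence. A minimal sketch, where the function name `ngrams` is chosen here for illustration:

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["web", "clustering", "engines"]
print(ngrams(tokens, 1))  # single words
print(ngrams(tokens, 2))  # word pairs (bigrams)
```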
How to represent a feature/text ?
● One method for representing a text is the Vector Space Model (VSM).
● A document d is represented in the VSM as a vector [w_t0, w_t1, ..., w_tn], where t0, t1, ..., tn is
a global set of words (features) and w_ti expresses the weight (importance) of feature ti to
document d.
● E.g. :
d-> “Polly had a dog and the dog had Polly”
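One way to work through this example is to use the raw term frequency as the weight of each word. That weighting is an assumption made for this sketch (the slide does not say which weighting it uses), and the vocabulary is simply the sorted set of words in the document.

```python
from collections import Counter


def tf_vector(text: str, vocabulary: list[str]) -> list[int]:
    """Weight each vocabulary term by how often it occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


doc = "Polly had a dog and the dog had Polly"
vocabulary = sorted(set(doc.lower().split()))
vector = tf_vector(doc, vocabulary)

print(vocabulary)  # ['a', 'and', 'dog', 'had', 'polly', 'the']
print(vector)      # [1, 1, 2, 2, 2, 1]
```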
3. Cluster construction and labelling :
● The set of search results, along with the features extracted in the preprocessing step, is
given as input to the clustering algorithm.
● There are a number of algorithms available for clustering. We can classify them into two
different categories -
a. Data-centric clustering algorithms
b. Description-aware clustering algorithms
● Cluster labels should be unique, unambiguous, comprehensive and faithful to the
content of the cluster.
Data centric clustering algorithm :
● A typical data-centric system uses the VSM for text representation and
agglomerative hierarchical clustering (AHC) as the clustering technique.
● It first clusters a collection of documents into a set of k clusters (scattering).
● At query time, the user selects the clusters of interest (gathering) and the system re-clusters
the documents in them.
● This process repeats until a small cluster with relevant documents is found.
Agglomerative Hierarchical Clustering(AHC) :
● Initially each document is in its own cluster.
● Build a distance matrix (dissimilarity matrix) holding the distance between every pair of clusters.
● Merge the 2 closest clusters and rebuild the distance matrix, replacing the two merged clusters by one
cluster.
● Continue this process until the desired number of clusters, k, is reached.
● Building the distance matrix takes O(n^2) time and space, where n is the number of documents
(the initial number of clusters). A naive implementation that rescans the matrix for the closest
pair at every merge takes O(n^3) time overall.
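The steps above can be sketched as follows. This is a naive illustration under stated assumptions, not an efficient implementation: documents are given as numeric vectors, distance is Euclidean, clusters are compared by single linkage (distance of their closest members), and distances are recomputed on every pass instead of being kept in a matrix. None of these choices come from the slides.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1, c2, vectors):
    # Single linkage (an assumption): distance between the closest members.
    return min(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)


def agglomerative(vectors, k):
    """Merge the two closest clusters until only k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Initially each document (identified by its index) is its own cluster.
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Replace the two merged clusters by one cluster.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


docs = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerative(docs, 2))  # [[0, 1], [2, 3]]
```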
Improve efficiency of clustering
1. Client-side processing : During periods of high query rates, response times can increase
significantly and degrade the user experience. To avoid this, part of the processing can be
moved to the client side and run on the client's resources.
2. Pretokenized documents : Clustering engines can reuse the tokens already produced by the
conventional search engine instead of tokenizing the results again.
Conclusion
● Web clustering engines organize search results by topic, thus offering a
complementary view to the flat-ranked list returned by conventional search engines.
● Because efficient methods for evaluating the performance of clustering engines are lacking,
these engines have not yet attracted much attention.
Thank you all for your kind
attention!!
