WEB CLUSTERING ENGINES
Deepak Sharma
MCA
1409114016
Search Engine?
• Search engines are an invaluable tool for retrieving information from the Web.
• In response to a user query, they return a list of results ranked in order of relevance to the query.
• E.g.: Google, Yahoo, CREDO, Grokker, etc.
Flat Ranked vs. Clustered
• Google (flat ranked search engine)
• Yippy (web clustering engine)
Why Web Clustering Engines?
• Conventional engines are not very effective for ambiguous queries.
• The results a conventional search engine returns for such a query are mixed together in a single list, so items irrelevant to the intended meaning appear among the relevant ones.
• This is where clustering of search results comes into the picture.
• Clustering is the act of grouping similar objects into sets.
• The distance between objects in the same cluster (intra-cluster variation) should be minimal.
• The distance between objects in different clusters (inter-cluster variation) should be maximal.
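The two distance criteria above can be checked numerically. The following is a minimal sketch, assuming Euclidean distance and two small hand-made clusters of 2-D points; the points and the function names `within_cluster_distance` and `between_cluster_distance` are illustrative and do not come from the slides.

```python
from itertools import combinations, product
from math import dist


def within_cluster_distance(cluster):
    """Average pairwise distance between points of a single cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def between_cluster_distance(cluster_a, cluster_b):
    """Average distance between points taken from two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# Hypothetical 2-D points, chosen only for illustration.
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
cluster_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

within_a = within_cluster_distance(cluster_a)
between_ab = between_cluster_distance(cluster_a, cluster_b)

# A good clustering keeps the first value small and the second large.
print(within_a < between_ab)
```

A real engine would compare text features rather than 2-D points, but the goal is the same: small distances inside a cluster, large distances between clusters.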
Web Clustering Engines?
• These systems group the results returned by a search engine into a hierarchy of labeled clusters (also called categories).
Web clustering engines:
1. Northern Light - predefined set of clusters
2. Vivísimo - cluster labels were dynamically generated
3. Clusty
4. Grokker
5. KartOO
6. Lingo3G
7. CREDO, etc.
Issues in Implementation of Clusters
• Short input data description.
• Meaningful labels.
• Selection of similarity measure.
• Grouping of objects into clusters.
• Computational efficiency.
• Unknown number of clusters.
Architecture & Techniques
Search Results Acquisition
• Provides input for the rest of the system.
• Based on the query, the acquisition component must deliver 50 to 500 results, each of which should contain a title, a contextual snippet, and the URL.
• The source of search results can be any public search engine, such as Google or Yahoo.
• Results are fetched from these search engines through their APIs.
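The acquisition step described above can be sketched as a small data structure plus a wrapper around a fetch function. This is only an illustration: `SearchResult`, `acquire_results`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real search engine API, which the slides do not specify.

```python
from dataclasses import dataclass
from typing import Callable, List

MIN_RESULTS, MAX_RESULTS = 50, 500  # range stated on the slide


@dataclass(frozen=True)
class SearchResult:
    title: str
    snippet: str
    url: str


def acquire_results(query: str,
                    fetch: Callable[[str, int], List[SearchResult]],
                    wanted: int = 100) -> List[SearchResult]:
    """Fetch results for a query and keep only complete ones."""
    if not MIN_RESULTS <= wanted <= MAX_RESULTS:
        raise ValueError(f"wanted must be between {MIN_RESULTS} and {MAX_RESULTS}")
    results = fetch(query, wanted)
    # Each result needs a title, a snippet, and a URL to be usable downstream.
    complete = [r for r in results if r.title and r.snippet and r.url]
    return complete[:wanted]


def fake_fetch(query: str, count: int) -> List[SearchResult]:
    """Hypothetical stand-in for a call to a public search engine API."""
    return [
        SearchResult(
            title=f"{query} result {i}",
            snippet=f"Snippet about {query} #{i}",
            url=f"https://example.com/{i}",
        )
        for i in range(count)
    ]


results = acquire_results("jaguar", fake_fetch, wanted=50)
print(len(results), results[0].url)
```

Passing the fetch function as a parameter keeps the rest of the pipeline independent of whichever search engine supplies the results.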
Preprocessing of Search Results
• The primary aim is to convert the search results into ‘features’.
Steps:
i. Language identification
ii. Tokenization
iii. Stemming
iv. Feature selection
ii. Tokenization:
The text of each search result is split into a sequence of basic independent units called tokens, each representing a word, a number, or a symbol.
This is more complex for languages that do not separate words with white space (such as Chinese) or whose text switches direction (such as Arabic).
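For whitespace-delimited text, tokenization can be sketched with a single regular expression. This is a minimal example, not a production tokenizer: the pattern and the lower-casing step are assumptions, and it does not segment languages such as Chinese.

```python
import re

# A token is either a run of word characters (letters, digits, underscore)
# or a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, number, and symbol tokens (lower-cased)."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize("Polly had 2 dogs, and the dogs had Polly!")
print(tokens)
# ['polly', 'had', '2', 'dogs', ',', 'and', 'the', 'dogs', 'had', 'polly', '!']
```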
iii. Stemming:
Remove the inflectional prefixes and suffixes of each word to reduce its different grammatical forms to a common base form called a ‘stem’.
E.g.:
connected, connecting, interconnection
↓
‘connect’
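The example above can be reproduced with a toy affix stripper. This sketch is not the Porter stemmer or any other published algorithm: the prefix and suffix lists are assumptions chosen to cover only the three example words, and it will mis-stem many real words (for example, "interest").

```python
# Illustrative affix lists; a real stemmer uses far more elaborate rules.
PREFIXES = ("inter",)
SUFFIXES = ("ing", "ion", "ed")


def toy_stem(word, min_stem_length=3):
    """Strip at most one known prefix and one known suffix from a word."""
    stem = word.lower()
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_stem_length:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem_length:
            stem = stem[:-len(suffix)]
            break
    return stem


stems = [toy_stem(w) for w in ("connected", "connecting", "interconnection")]
print(stems)
# ['connect', 'connect', 'connect']
```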
iv. Feature selection:
• Extract features for each search result present in the input.
• Features are atomic entities by which we can describe an object and represent its most important characteristics to an algorithm.
• Features vary from single words to tuples of words.
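One common way to obtain features that range from single words to tuples of words is to extract word n-grams. The sketch below assumes that approach; the function name `word_ngrams` and the choice of `max_n=2` are illustrative and not taken from the slides.

```python
def word_ngrams(tokens, max_n=2):
    """Return all contiguous word tuples of length 1 up to max_n."""
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(tuple(tokens[start:start + n]))
    return features


tokens = ["web", "clustering", "engine"]
features = word_ngrams(tokens, max_n=2)
print(features)
# [('web',), ('clustering',), ('engine',),
#  ('web', 'clustering'), ('clustering', 'engine')]
```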
How can a feature/text be represented?
• Vector Space Model (VSM)
• A document d is represented in the VSM as a vector [w_t0, w_t1, ..., w_tn], where t0, t1, ..., tn is a set of words/features and w_ti is the weight/importance of feature ti.
E.g.:
d → “Polly had a dog and the dog had Polly”
[Figure: VSM representation of d]
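A worked version of this example, as a sketch: it assumes the weight of each term is its raw frequency in the document, which is only one possible weighting (the slides do not say which one is used), and it orders the vocabulary alphabetically for readability.

```python
from collections import Counter


def term_frequency_vector(text):
    """Return (vocabulary, vector) using raw term counts as weights."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    vocabulary = sorted(counts)  # fixed, alphabetical feature order
    vector = [counts[term] for term in vocabulary]
    return vocabulary, vector


vocabulary, vector = term_frequency_vector("Polly had a dog and the dog had Polly")
print(vocabulary)  # ['a', 'and', 'dog', 'had', 'polly', 'the']
print(vector)      # [1, 1, 2, 2, 2, 1]
```

Each position in the vector corresponds to one term of the vocabulary, so documents that share a vocabulary can be compared with a similarity measure such as the cosine of the angle between their vectors.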
THANK YOU
