Incremental clustering in search engines

Praxitelis Nikolaos Kouroupetroglou

Aristotle University of Thessaloniki
School of Computer Science - Master Studies - Spring Semester
Course: Web Information Mining and Retrieval
Instructor: Vakali Athina
Kouroupetroglou Praxitelis Nikolaos
Incremental Clustering in Search Engines

Search engines and results retrieval
● Conventional document retrieval systems return long ranked lists of documents
● Search engine results often have low precision
● As a result, it is hard for users to find the information they are looking for
● Possible improvements: filtering methods, advanced pruning options, clustering
● (-) Conventional clustering algorithms rely on off-line clustering of the entire document collection
● For search results, clustering has to be applied instead to the much smaller set of documents returned in response to a query

Clustering and search engines - Key concepts
● Relevance: group documents by their context and their relevance to the user's query
● Browsable summaries: the user needs to see at a glance whether a cluster's contents are of interest
● Overlap: since documents can cover multiple topics, it is important to avoid confining each document to only one cluster
● Snippet tolerance: the method should produce high-quality clusters even when it only has access to the snippets returned by the search engines, as most users are unwilling to wait while the system downloads the original documents off the Web
● Speed: fast clustering for impatient users
● Incrementality: to save time, the method should start to process each snippet as soon as it is received over the Web

Suffix Tree Clustering (STC)
● Developed at the Department of Computer Science and Engineering, University of Washington [1]
● A novel, incremental, O(n)-time algorithm
● Treats a document as a string rather than just a set of words
● This lets it use proximity information between words
● Relies on a suffix tree to efficiently identify sets of documents that share common phrases
● Uses this information to create clusters and to summarize their contents
● MetaCrawler-STC was built to test it out
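
The idea of grouping documents by shared phrases can be illustrated with a small sketch. This is not the suffix tree from [1]: it is a naive word-level phrase index that enumerates every phrase of up to `max_len` words, and the function name `shared_phrases` and the parameter `max_len` are assumptions made for illustration only.

```python
from collections import defaultdict


def shared_phrases(docs, max_len=3):
    """Map each phrase (tuple of words) to the ids of the documents containing it.

    Naive stand-in for a suffix tree: each word-level suffix is cut to at most
    max_len words and every prefix of it is indexed.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for start in range(len(words)):
            stop = min(start + max_len, len(words))
            for end in range(start + 1, stop + 1):
                index[tuple(words[start:end])].add(doc_id)
    # Keep only phrases that occur in at least two documents.
    return {p: ids for p, ids in index.items() if len(ids) >= 2}


docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
phrases = shared_phrases(docs)
for phrase, ids in sorted(phrases.items()):
    print(" ".join(phrase), sorted(ids))
```

Because word order is preserved, "ate cheese" is found as a phrase shared by the first two documents, which a bag-of-words representation would not capture.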

STC Steps
● Step 1 - Document "Cleaning"
○ Light stemming (deleting prefixes and suffixes, reducing plural to singular form)
○ Remove HTML tags
○ Transform each document into a string, keeping pointers from each word back to its position in the original text
● Step 2 - Identifying Base Clusters
○ Build a suffix tree, which can be constructed in linear time and incrementally as the documents are being read
○ Each node corresponds to a phrase and holds the list of documents that share this common phrase
● Step 3 - Combining Base Clusters
○ Combine base clusters using a binary similarity function
○ Sim is 1 iff the overlap prerequisites defined in [1] are met, 0 otherwise
○ Usually only the top k clusters are kept, as these are the ones of interest
○ Score function: given as an image in the original slides (see [1])
● Images and functions from [1]
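
Step 3 can be sketched as follows. As recalled from [1], two base clusters count as similar when their shared documents make up more than half of each cluster, and a base cluster's score is its number of documents multiplied by a function of its phrase length. The 0.5 default threshold, the exact shape of `phrase_weight`, and all function names below are assumptions for illustration, not the paper's implementation.

```python
from itertools import combinations


def similar(a, b, threshold=0.5):
    """Binary similarity: True iff the overlap exceeds `threshold` of BOTH clusters."""
    overlap = len(a & b)
    return overlap / len(a) > threshold and overlap / len(b) > threshold


def phrase_weight(n_words):
    """Hypothetical weight: penalise single words, linear up to 6 words, then flat."""
    if n_words <= 1:
        return 0.5
    return float(min(n_words, 6))


def score(doc_ids, phrase):
    """Base cluster score: number of documents times the phrase weight."""
    return len(doc_ids) * phrase_weight(len(phrase))


def merge_base_clusters(base):
    """base maps phrase -> set of doc ids; returns the merged document sets.

    Base clusters are nodes of a graph, similar pairs are edges, and each
    connected component becomes one final cluster (union-find).
    """
    phrases = list(base)
    parent = {p: p for p in phrases}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for p, q in combinations(phrases, 2):
        if similar(base[p], base[q]):
            parent[find(p)] = find(q)

    merged = {}
    for p in phrases:
        merged.setdefault(find(p), set()).update(base[p])
    return list(merged.values())


base = {
    ("cat", "ate"): {0, 2},
    ("ate",): {0, 1, 2},
    ("cheese",): {0, 1},
    ("mouse",): {1, 2},
    ("too",): {1, 2},
    ("ate", "cheese"): {0, 1},
}
clusters = merge_base_clusters(base)
```

In this toy input every base cluster is similar to the "ate" cluster, so all of them end up in a single merged cluster. Keeping only the top k clusters would amount to sorting by `score` and truncating.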

Suffix Tree Structure
● Image from [1]

Advantages - Experiments
● STC is incremental: each new document is added to the suffix tree, and nodes are updated or created. The relevant base clusters are then updated, and their similarity to the rest of the base clusters is recalculated.
● Linear time (cleaning and inserting documents and creating new clusters)
● Image from [1]
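
The incremental behaviour can be sketched with a small class that processes one document at a time and reports which base clusters were created or updated by it. This is a naive phrase index, not the linear-time suffix tree of [1]; the class name `IncrementalPhraseIndex`, its `add` method, and the `max_len` limit are hypothetical.

```python
from collections import defaultdict


class IncrementalPhraseIndex:
    """Naive incremental stand-in for the suffix tree (hypothetical helper)."""

    def __init__(self, max_len=3):
        self.max_len = max_len
        self.index = defaultdict(set)  # phrase -> ids of documents containing it
        self.n_docs = 0

    def add(self, text):
        """Insert one document; return the base clusters it created or updated."""
        doc_id = self.n_docs
        self.n_docs += 1
        words = text.lower().split()
        touched = set()
        for start in range(len(words)):
            stop = min(start + self.max_len, len(words))
            for end in range(start + 1, stop + 1):
                phrase = tuple(words[start:end])
                self.index[phrase].add(doc_id)
                touched.add(phrase)
        # Only phrases now shared by two or more documents are base clusters.
        return {p: set(self.index[p]) for p in touched if len(self.index[p]) >= 2}


index = IncrementalPhraseIndex()
first = index.add("cat ate cheese")
second = index.add("mouse ate cheese too")
```

Only the base clusters returned by `add` need to have their similarities rechecked, which is the point of the incremental update described above.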

References
● [1] Oren Zamir and Oren Etzioni, "Web Document Clustering: A Feasibility Demonstration", Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, U.S.A.
● [2] Suffix Tree, https://en.wikipedia.org/wiki/Suffix_tree
● [3] Suffix Tree Clustering, https://en.wikipedia.org/wiki/Suffix_tree_clustering
