Survey on Parallel/Distributed Search Engines
Yu Liu@NII
Sep. 20th, 2013
Yu Liu@NII Survey on Parallel/Distributed Search Engines
Background
In web search, an information retrieval system needs to:
crawl billions of documents stored on millions of computers;
index, rank, and cluster terabytes of documents;
respond to thousands of queries at the same time.
I also did this survey to find parallel/distributed applications related to my current research.
Such tasks can (almost) only be done in a parallel/distributed way.
The basic idea of a distributed search engine:
many machines work on one task to finish it faster than one large machine alone;
better fault tolerance (the system continues operating properly when a component fails).
Distributed (Web) Search Engine
Definition
A distributed search engine is a search engine model in which the tasks of Web crawling, indexing, and query processing are distributed among multiple computers and networks.
Distributed Crawling/Indexing/Ranking
Relatively simple, e.g., Google's MapReduce-based approach (Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI'04).
Distributed Indexing
(Pic. from Dean and Ghemawat (OSDI’04))
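The original picture is not available here, so the following is a minimal sketch of how an inverted index could be built in the MapReduce style. It runs in a single process and only imitates the map, shuffle, and reduce phases; the function names (`map_doc`, `shuffle`, `reduce_postings`, `build_index`) and the toy documents are assumptions made for illustration, not code from Dean and Ghemawat.

```python
from collections import defaultdict
from itertools import chain


def map_doc(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term of a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_postings(term, doc_ids):
    """Reduce: turn the doc ids collected for one term into a sorted posting list."""
    return term, sorted(doc_ids)


def build_index(docs):
    """Run the three phases in sequence over a {doc_id: text} dictionary."""
    pairs = chain.from_iterable(map_doc(d, t) for d, t in docs.items())
    return dict(reduce_postings(t, ids) for t, ids in shuffle(pairs).items())


# Hypothetical toy collection.
docs = {1: "parallel search engine", 2: "distributed search", 3: "parallel index"}
index = build_index(docs)
```

In a real cluster the map calls run on the machines that hold the document splits and each reducer receives the pairs for a subset of the terms, so no single machine needs the whole collection.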
Distributed Search/Query
An Informal Definition
Finding information from multiple “nodes” where searchable resources (indices/documents) are stored.
Each “node” contains only a part of the whole resource set and returns a partial result.
The final result is produced by aggregating all partial results.
Federated search is an example: each “node” is a search engine.
(Pic. from Wikipedia)
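The informal definition can be sketched as a scatter-gather query: every node scores only its own documents and returns a partial top-k list, and a broker merges those lists. The scoring rule (number of matched query terms), the node layout, and the names `search_node` and `distributed_search` are assumptions chosen to keep the sketch short.

```python
import heapq


def search_node(node_index, query_terms, k):
    """Partial result of one node: its top-k (score, doc_id) pairs.

    The score is simply the number of query terms a document matches.
    """
    scores = {}
    for term in query_terms:
        for doc_id in node_index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))


def distributed_search(nodes, query, k=3):
    """Send the query to every node, then aggregate the partial results."""
    terms = query.lower().split()
    partials = [search_node(idx, terms, k) for idx in nodes]  # parallel in practice
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [doc_id for _, doc_id in merged]


# Two hypothetical nodes, each holding the inverted index of its own documents.
nodes = [
    {"parallel": ["a1", "a2"], "search": ["a1"]},
    {"search": ["b1"], "engine": ["b1"], "parallel": ["b2"]},
]
result = distributed_search(nodes, "parallel search engine", k=2)
```

Merging raw scores like this is only meaningful when all nodes score documents in a comparable way, which is the result-merging problem discussed later in the slides.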
Distributed Search/Query
Peer-to-Peer search is a decentralized search engine technology.
(Pic. from YaCy)
Some Examples of Search Engines that Support
Distributed Search/Query
Google’s search engine and Microsoft Bing ...(of course)
Indri: http://www.lemurproject.org/indri.php
Apache Solr/Lucene: http://lucene.apache.org/
YaCy (P2P): http://yacy.net/
Grub (P2P) Grub.org...
Challenges in Distributed Search
Distributed search applications must carry out three additional, important functions (W. Bruce Croft et al., Search Engines: Information Retrieval in Practice):
Resource representation: generating a description of the resources (documents) stored in each node
E.g., a language model description, generated by query-based sampling
Resource selection: selecting some resources based on their descriptions
E.g., the top-k nodes ranked by query likelihood
Result merging: merging the ranked result lists from multiple nodes
Depends on the query model and on the availability of global statistics
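As a minimal sketch of the first two functions, each node below is represented by a unigram language model built from a few sampled documents, and nodes are ranked by the smoothed query likelihood. The sampling itself is not simulated (the samples are given directly), and the Jelinek-Mercer smoothing, the weight `lam`, and all names are assumptions rather than the textbook's exact formulation.

```python
import math
from collections import Counter


def language_model(sampled_docs):
    """Resource representation: term counts of documents sampled from one node."""
    return Counter(t for doc in sampled_docs for t in doc.lower().split())


def query_log_likelihood(query, node_lm, background_lm, lam=0.5):
    """Log P(query | node), mixing the node model with a background model."""
    node_total = sum(node_lm.values())
    bg_total = sum(background_lm.values())
    vocab = len(background_lm) + 1
    score = 0.0
    for term in query.lower().split():
        p_node = node_lm[term] / node_total
        # Add-one smoothing keeps the background probability above zero.
        p_bg = (background_lm[term] + 1) / (bg_total + vocab)
        score += math.log(lam * p_node + (1 - lam) * p_bg)
    return score


def select_nodes(query, node_lms, k):
    """Resource selection: the k nodes with the highest query likelihood."""
    background = sum(node_lms.values(), Counter())
    ranked = sorted(
        node_lms,
        key=lambda n: query_log_likelihood(query, node_lms[n], background),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical samples obtained from three nodes.
samples = {
    "sports": ["football match score", "tennis match"],
    "tech": ["parallel search engine", "distributed index search"],
    "food": ["pasta recipe", "bread recipe"],
}
node_lms = {name: language_model(docs) for name, docs in samples.items()}
selected = select_nodes("distributed search", node_lms, k=1)
```

Only the selected nodes then receive the query, which saves work on nodes that are unlikely to hold relevant documents.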
Challenges in Distributed Search
For the three important functions:
Resource representation: better representations
Resource selection: better replication and partitioning of indices/documents, and better ranking of search results
Result merging: better effectiveness
Challenges in Distributed Search
More concrete examples: limitations of distributed search in Solr:
Each document indexed must have a unique key. If Solr
discovers duplicate document IDs, Solr selects the first
document and discards subsequent ones.
Inverse-document frequency (IDF) calculations cannot be
distributed.
The index for distributed searching may become momentarily
out of sync if a commit happens between the first and second
phase of the distributed search.
Distributed searching supports only sorted-field faceting, not
date faceting
The number of shards is limited by the number of characters allowed in the URI of a GET request.
Update commands may be sent to any server with distributed
indexing configured correctly.
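To see why the IDF limitation matters, the sketch below compares the IDF that each shard would compute from its own statistics with the IDF over the whole collection. The shard sizes and document frequencies are invented numbers, and `idf` uses the plain log(N/df) form, which is an assumption and not Solr's exact scoring formula.

```python
import math


def idf(num_docs, doc_freq):
    """Inverse document frequency in its basic form: log(N / df)."""
    return math.log(num_docs / doc_freq)


# Hypothetical shards: (number of documents, document frequency of one term).
shards = {"shard1": (1000, 10), "shard2": (1000, 500)}

# IDF as each shard sees it, using only local statistics.
local_idf = {name: idf(n, df) for name, (n, df) in shards.items()}

# IDF over the whole collection, which needs statistics from every shard.
total_docs = sum(n for n, _ in shards.values())
total_df = sum(df for _, df in shards.values())
global_idf = idf(total_docs, total_df)
```

With these numbers the term looks rare on shard1 and common on shard2, so the same document would get a different score depending on where it is stored, and the merged ranking becomes skewed unless term distributions are similar across shards.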
Challenges in Distributed Search
Relation to my research:
Index updating: incremental MapReduce computation
The base data already exist
MapReduce provides the parallel processing functionality
Incremental computation makes the computation efficient
Currently, our study has not considered the “momentarily out of sync” problem.
Survey of Parallel Implementations of Clustering
Algorithms
Clustering algorithms usually fall into two categories: hierarchical and partitioning.
Hierarchical clustering
A hierarchical clustering is a sequence of partitions in which each
partition is nested into the next partition in the sequence.
Hierarchical clusterings generally fall into two categories: top-down and bottom-up.
The more popular hierarchical agglomerative clustering (HAC) algorithms use a bottom-up approach to merge items into a hierarchy of clusters.
Agglomerative hierarchical clustering
An HAC clustering is typically visualized as a dendrogram as shown
in this figure.
The y-coordinate of the horizontal line is the similarity of the two
clusters that were merged.
The sequential and parallel implementations of hierarchical
clustering
Methods to determine the distances between clusters (Olson 95)[1]
Graph methods
Single link
Average link
Complete link
Geometric methods
Centroid
Median (group-average)
Minimum Variance
Single-link Algorithm (naive)
The naive single-link algorithm takes O(N^3) time and O(N^2) space (Introduction to Information Retrieval, ISBN 0521865719, p. 381).
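The cubic cost comes from performing N-1 merges and scanning the whole similarity matrix before each one. A minimal sketch, assuming the input is a symmetric N x N similarity matrix given as a list of lists (the function name and the toy matrix are illustrative, not the textbook's pseudocode):

```python
def naive_single_link(sim):
    """Naive single-link HAC; returns the merges as (kept, absorbed, similarity)."""
    n = len(sim)
    c = [row[:] for row in sim]      # working copy of the matrix: O(N^2) space
    active = [True] * n
    merges = []
    for _ in range(n - 1):           # N-1 merges in total
        best, pair = float("-inf"), None
        for i in range(n):           # O(N^2) scan for the most similar active pair
            if not active[i]:
                continue
            for j in range(i + 1, n):
                if active[j] and c[i][j] > best:
                    best, pair = c[i][j], (i, j)
        i, j = pair
        merges.append((i, j, best))
        for m in range(n):           # single-link update: keep the larger similarity
            if active[m] and m != i and m != j:
                c[i][m] = c[m][i] = max(c[i][m], c[j][m])
        active[j] = False            # cluster j is absorbed into cluster i
    return merges


# Hypothetical similarity matrix for four documents.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
merges = naive_single_link(sim)
```

The list of merges, read in order, is the dendrogram from the bottom up: the third component of each entry is the similarity at which the two clusters were joined.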
Single-link Algorithm in O(N^2)
An efficient single-link algorithm uses a next-best-merge (NBM) array as an optimization (Introduction to Information Retrieval, ISBN 0521865719, p. 386).
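The idea is to remember, for every cluster, its best merge candidate, so that choosing the next merge costs O(N) instead of O(N^2). The sketch below follows that idea under the assumption of a symmetric similarity matrix given as a list of lists; variable names such as `rep` and `nbm` are chosen here for illustration and the code is a paraphrase, not a copy of the textbook's pseudocode.

```python
def single_link_nbm(sim):
    """Single-link HAC with a next-best-merge array; O(N^2) time overall."""
    n = len(sim)
    if n < 2:
        return []
    c = [row[:] for row in sim]
    rep = list(range(n))             # rep[i]: representative of the cluster holding i
    # nbm[i] = (similarity, index) of the best merge candidate of cluster i
    nbm = [max((c[i][j], j) for j in range(n) if j != i) for i in range(n)]
    merges = []
    for _ in range(n - 1):
        # O(N): pick the active cluster whose best candidate is most similar.
        i1 = max((i for i in range(n) if rep[i] == i), key=lambda i: nbm[i][0])
        i2 = rep[nbm[i1][1]]         # the candidate may have been merged already
        merges.append((i1, i2, nbm[i1][0]))
        for i in range(n):           # O(N): update one row and relabel members
            if rep[i] == i and i != i1 and i != i2:
                c[i1][i] = c[i][i1] = max(c[i1][i], c[i2][i])
            if rep[i] == i2:
                rep[i] = i1
        candidates = [(c[i1][i], i) for i in range(n) if rep[i] == i and i != i1]
        if candidates:               # empty only after the final merge
            nbm[i1] = max(candidates)
    return merges


# Hypothetical similarity matrix for four documents.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
merges = single_link_nbm(sim)
```

Only the merged cluster's entry of the NBM array is recomputed. This is safe for single-link because the best candidate of any other cluster remains the best after a merge, which is not true for the other linkage criteria.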
Complexity
Table: Comparison of HAC algorithms.
method         combination similarity               time complexity   optimal
single-link    max inter-similarity of any 2 docs   O(N^2)            yes
complete-link  min inter-similarity of any 2 docs   O(N^2 log N)      no
median         average of all sims                  O(N^2 log N)      no
centroid       average inter-similarity             O(N^2 log N)      no
In practice, the difference in complexity is rarely a concern when choosing one of the algorithms. For most document clustering tasks, median is a good choice.
The Complexity
The efficient single-link algorithm uses O(N^2) time and O(N^2) space; the situation is similar for the other algorithms [Olson 94].
A possible problem for a MapReduce implementation is that the input is usually very large, far beyond local memory. Even a requirement of N^2/p memory per cluster node (for p nodes) is still unacceptable.
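A quick back-of-the-envelope calculation makes the memory argument concrete. The collection size, node count, and 4 bytes per matrix entry below are assumed values picked only to show the order of magnitude; `matrix_bytes_per_node` is a hypothetical helper.

```python
def matrix_bytes_per_node(n_docs, n_nodes, bytes_per_entry=4):
    """Bytes per node if the N x N similarity matrix is split evenly over p nodes."""
    return n_docs * n_docs * bytes_per_entry / n_nodes


# Assumed scenario: 10^8 documents, 1000 nodes, 4-byte similarity values.
per_node_bytes = matrix_bytes_per_node(n_docs=10**8, n_nodes=1000)
per_node_tb = per_node_bytes / 10**12  # decimal terabytes
```

Under these assumptions every node would still have to hold about 40 TB of the matrix, far more than the main memory of a typical cluster node.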
Parallel HAC Algorithms
There are many studies of parallel HAC algorithms on SIMD array processors, n-hypercubes, n-butterflies, and PRAMs (Li, Fang 89 [3]; Li 90 [2]). Basically, they can compute in O(N^2/p) time.
Manoranjan Dash et al. introduced an approach that computes in O(N^2/cp) time (Dash Manoranjan 04) [4].
MapReduce-able?
There is no real parallel implementation of HAC on MapReduce.
Some MapReduce approaches reduce the size (dimensions) of the input data so that it fits in local memory
Some approaches use the Buckshot approach to improve k-means
The Apache Mahout project has a fake MapReduce implementation (http://mahout.apache.org/)
Difficulties for parallelization with MapReduce
A single similarity matrix must be kept consistent among all computing nodes, which requires communication whenever updates are performed. But MapReduce has no broadcast method.
Input data are initially split among the computing nodes
The N × N pairwise similarities of the bottom-level items of the dendrogram can be computed with the matrix-multiply approach on MapReduce
No way to parallelize the N-1 merge operations is currently known.
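One way to read the matrix-multiply remark: if A is the document-term weight matrix, the pairwise dot-product similarities are the entries of A times its transpose, and that product decomposes by term. The single-process sketch below imitates such a job, with the map step working on one term's posting list and the reduce step summing the partial products of each document pair. The input layout, weights, and function names are assumptions, and this is only one possible formulation, not necessarily the one the slides refer to.

```python
from collections import defaultdict
from itertools import combinations


def map_term(postings):
    """Map: for one term's postings [(doc, weight), ...], emit partial products."""
    for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
        yield (d1, d2), w1 * w2


def reduce_pair(partials):
    """Reduce: the similarity of a document pair is the sum of its partial products."""
    return sum(partials)


def pairwise_similarity(inverted_index):
    """Imitate map, shuffle, and reduce over a {term: postings} inverted index."""
    groups = defaultdict(list)
    for postings in inverted_index.values():
        for pair, product in map_term(postings):
            groups[pair].append(product)      # shuffle: group by document pair
    return {pair: reduce_pair(vals) for pair, vals in groups.items()}


# Hypothetical weighted inverted index over three documents.
inverted_index = {
    "search": [("d1", 1.0), ("d2", 2.0)],
    "parallel": [("d1", 3.0), ("d3", 1.0)],
    "index": [("d1", 1.0), ("d2", 1.0), ("d3", 2.0)],
}
sims = pairwise_similarity(inverted_index)
```

This covers only the bottom level of the dendrogram. The subsequent merges depend on one another, which is exactly the part that does not map cleanly onto MapReduce.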
Yu Liu@NII Survey on Parallel/Distributed Search Engines
Yu Liu@NII Survey on Parallel/Distributed Search Engines

More Related Content

What's hot

Apache mahout
Apache mahoutApache mahout
Apache mahout
Puneet Gupta
 
Instance-Based Ontological Knowledge Acquisition
Instance-Based Ontological Knowledge AcquisitionInstance-Based Ontological Knowledge Acquisition
Instance-Based Ontological Knowledge Acquisition
Lihua Zhao
 
The Landscape of Ontology Reuse in Linked Data - OEDW2012
The Landscape of Ontology Reuse in Linked Data - OEDW2012The Landscape of Ontology Reuse in Linked Data - OEDW2012
The Landscape of Ontology Reuse in Linked Data - OEDW2012
María Poveda Villalón
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
Ajit Koti
 
Introduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache MahoutIntroduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache Mahout
sscdotopen
 
Machine Learning and Apache Mahout : An Introduction
Machine Learning and Apache Mahout : An IntroductionMachine Learning and Apache Mahout : An Introduction
Machine Learning and Apache Mahout : An Introduction
Varad Meru
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
sscdotopen
 
Graph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsGraph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media Analytics
NYC Predictive Analytics
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
Save Manos
 
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...
Geoffrey Fox
 
Machine Learning 101 | Essential Tools for Machine Learning
Machine Learning 101 | Essential Tools for Machine LearningMachine Learning 101 | Essential Tools for Machine Learning
Machine Learning 101 | Essential Tools for Machine Learning
Hafiz Muhammad Attaullah
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the why
Korea Sdec
 
A Model of the Scholarly Community
A Model of the Scholarly CommunityA Model of the Scholarly Community
A Model of the Scholarly Community
Marko Rodriguez
 
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERINGAN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
International Journal of Technical Research & Application
 
Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)
Cataldo Musto
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用
James Chen
 
Intro to Mahout
Intro to MahoutIntro to Mahout
Intro to Mahout
Uri Lavi
 
Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - Recommendation
Cataldo Musto
 
Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel 
Geoffrey Fox
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014
Cataldo Musto
 

What's hot (20)

Apache mahout
Apache mahoutApache mahout
Apache mahout
 
Instance-Based Ontological Knowledge Acquisition
Instance-Based Ontological Knowledge AcquisitionInstance-Based Ontological Knowledge Acquisition
Instance-Based Ontological Knowledge Acquisition
 
The Landscape of Ontology Reuse in Linked Data - OEDW2012
The Landscape of Ontology Reuse in Linked Data - OEDW2012The Landscape of Ontology Reuse in Linked Data - OEDW2012
The Landscape of Ontology Reuse in Linked Data - OEDW2012
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
 
Introduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache MahoutIntroduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache Mahout
 
Machine Learning and Apache Mahout : An Introduction
Machine Learning and Apache Mahout : An IntroductionMachine Learning and Apache Mahout : An Introduction
Machine Learning and Apache Mahout : An Introduction
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
 
Graph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsGraph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media Analytics
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
 
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...
 
Machine Learning 101 | Essential Tools for Machine Learning
Machine Learning 101 | Essential Tools for Machine LearningMachine Learning 101 | Essential Tools for Machine Learning
Machine Learning 101 | Essential Tools for Machine Learning
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the why
 
A Model of the Scholarly Community
A Model of the Scholarly CommunityA Model of the Scholarly Community
A Model of the Scholarly Community
 
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERINGAN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
 
Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用
 
Intro to Mahout
Intro to MahoutIntro to Mahout
Intro to Mahout
 
Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - Recommendation
 
Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel 
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014
 

Similar to Survey on Parallel/Distributed Search Engines

A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopA Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
IJTET Journal
 
Cluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector Machine
CSCJournals
 
Ontology based clustering algorithms
Ontology based clustering algorithmsOntology based clustering algorithms
Ontology based clustering algorithms
Ikutwa
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
Anandh Arumugakan
 
IR tutorial
IR tutorialIR tutorial
IR tutorial
Hussein Hazimeh
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_Habib
El Habib NFAOUI
 
A survey of web clustering engines
A survey of web clustering enginesA survey of web clustering engines
A survey of web clustering engines
unyil96
 
A Survey on Improve Efficiency And Scability vertical mining using Agriculter...
A Survey on Improve Efficiency And Scability vertical mining using Agriculter...A Survey on Improve Efficiency And Scability vertical mining using Agriculter...
A Survey on Improve Efficiency And Scability vertical mining using Agriculter...
Editor IJMTER
 
Parallel and Distributed Information Retrieval System
Parallel and Distributed Information Retrieval SystemParallel and Distributed Information Retrieval System
Parallel and Distributed Information Retrieval System
vimalsura
 
G1803054653
G1803054653G1803054653
G1803054653
IOSR Journals
 
Paper id 37201536
Paper id 37201536Paper id 37201536
Paper id 37201536
IJRAT
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Net
gramana
 
Hierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyondHierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyond
Frank Kelly
 
Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...
Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...
Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...
idescitation
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
David Smiley
 
Comparative analysis for_ddp_frameworks
Comparative analysis for_ddp_frameworksComparative analysis for_ddp_frameworks
Comparative analysis for_ddp_frameworks
ElenaEtchemendy1
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
Duncan Hull
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
acijjournal
 
clustering_classification.ppt
clustering_classification.pptclustering_classification.ppt
clustering_classification.ppt
HODECE21
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_Habib
El Habib NFAOUI
 

Similar to Survey on Parallel/Distributed Search Engines (20)

A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopA Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
 
Cluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector Machine
 
Ontology based clustering algorithms
Ontology based clustering algorithmsOntology based clustering algorithms
Ontology based clustering algorithms
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
IR tutorial
IR tutorialIR tutorial
IR tutorial
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_Habib
 
A survey of web clustering engines
A survey of web clustering enginesA survey of web clustering engines
A survey of web clustering engines
 
A Survey on Improve Efficiency And Scability vertical mining using Agriculter...
A Survey on Improve Efficiency And Scability vertical mining using Agriculter...A Survey on Improve Efficiency And Scability vertical mining using Agriculter...
A Survey on Improve Efficiency And Scability vertical mining using Agriculter...
 
Parallel and Distributed Information Retrieval System
Parallel and Distributed Information Retrieval SystemParallel and Distributed Information Retrieval System
Parallel and Distributed Information Retrieval System
 
G1803054653
G1803054653G1803054653
G1803054653
 
Paper id 37201536
Paper id 37201536Paper id 37201536
Paper id 37201536
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Net
 
Hierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyondHierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyond
 
Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...
Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...
Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
 
Comparative analysis for_ddp_frameworks
Comparative analysis for_ddp_frameworksComparative analysis for_ddp_frameworks
Comparative analysis for_ddp_frameworks
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
 
clustering_classification.ppt
clustering_classification.pptclustering_classification.ppt
clustering_classification.ppt
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_Habib
 

More from Yu Liu

A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoA TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with Presto
Yu Liu
 
Cloud Era Transactional Processing -- Problems, Strategies and Solutions
Cloud Era Transactional Processing -- Problems, Strategies and SolutionsCloud Era Transactional Processing -- Problems, Strategies and Solutions
Cloud Era Transactional Processing -- Problems, Strategies and Solutions
Yu Liu
 
Introduction to NTCIR 2016 MedNLPDoc
Introduction to NTCIR 2016 MedNLPDocIntroduction to NTCIR 2016 MedNLPDoc
Introduction to NTCIR 2016 MedNLPDoc
Yu Liu
 
高性能データ処理プラットフォーム (Talk on July Tech Festa 2015)
高性能データ処理プラットフォーム (Talk on July Tech Festa 2015)高性能データ処理プラットフォーム (Talk on July Tech Festa 2015)
高性能データ処理プラットフォーム (Talk on July Tech Festa 2015)
Yu Liu
 
Paper introduction to Combinatorial Optimization on Graphs of Bounded Treewidth
Paper introduction to Combinatorial Optimization on Graphs of Bounded TreewidthPaper introduction to Combinatorial Optimization on Graphs of Bounded Treewidth
Paper introduction to Combinatorial Optimization on Graphs of Bounded Treewidth
Yu Liu
 
Paper Introduction: Combinatorial Model and Bounds for Target Set Selection
Paper Introduction: Combinatorial Model and Bounds for Target Set SelectionPaper Introduction: Combinatorial Model and Bounds for Target Set Selection
Paper Introduction: Combinatorial Model and Bounds for Target Set Selection
Yu Liu
 
An accumulative computation framework on MapReduce ppl2013
An accumulative computation framework on MapReduce ppl2013An accumulative computation framework on MapReduce ppl2013
An accumulative computation framework on MapReduce ppl2013
Yu Liu
 
An Enhanced MapReduce Model (on BSP)
An Enhanced MapReduce Model (on BSP)An Enhanced MapReduce Model (on BSP)
An Enhanced MapReduce Model (on BSP)
Yu Liu
 
A Homomorphism-based Framework for Systematic Parallel Programming with MapRe...
A Homomorphism-based Framework for Systematic Parallel Programming with MapRe...A Homomorphism-based Framework for Systematic Parallel Programming with MapRe...
A Homomorphism-based Framework for Systematic Parallel Programming with MapRe...
Yu Liu
 
An Introduction of Recent Research on MapReduce (2011)
An Introduction of Recent Research on MapReduce (2011)An Introduction of Recent Research on MapReduce (2011)
An Introduction of Recent Research on MapReduce (2011)
Yu Liu
 
A Generate-Test-Aggregate Parallel Programming Library on Spark
A Generate-Test-Aggregate Parallel Programming Library on SparkA Generate-Test-Aggregate Parallel Programming Library on Spark
A Generate-Test-Aggregate Parallel Programming Library on Spark
Yu Liu
 
Introduction of A Lightweight Stage-Programming Framework
Introduction of A Lightweight Stage-Programming FrameworkIntroduction of A Lightweight Stage-Programming Framework
Introduction of A Lightweight Stage-Programming Framework
Yu Liu
 
Start From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize AlgorithmStart From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize Algorithm
Yu Liu
 
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Yu Liu
 
On Extending MapReduce - Survey and Experiments
On Extending MapReduce - Survey and ExperimentsOn Extending MapReduce - Survey and Experiments
On Extending MapReduce - Survey and Experiments
Yu Liu
 
Tree representation in map reduce world
Tree representation  in map reduce worldTree representation  in map reduce world
Tree representation in map reduce world
Yu Liu
 
Introduction to Ultra-succinct representation of ordered trees with applications
Introduction to Ultra-succinct representation of ordered trees with applicationsIntroduction to Ultra-succinct representation of ordered trees with applications
Introduction to Ultra-succinct representation of ordered trees with applications
Yu Liu
 
On Implementation of Neuron Network(Back-propagation)
On Implementation of Neuron Network(Back-propagation)On Implementation of Neuron Network(Back-propagation)
On Implementation of Neuron Network(Back-propagation)
Yu Liu
 
ScrewDriver Rebirth: Generate-Test-and-Aggregate Framework on Hadoop
ScrewDriver Rebirth: Generate-Test-and-Aggregate Framework on HadoopScrewDriver Rebirth: Generate-Test-and-Aggregate Framework on Hadoop
ScrewDriver Rebirth: Generate-Test-and-Aggregate Framework on Hadoop
Yu Liu
 
A Homomorphism-based MapReduce Framework for Systematic Parallel Programming
A Homomorphism-based MapReduce Framework for Systematic Parallel ProgrammingA Homomorphism-based MapReduce Framework for Systematic Parallel Programming
A Homomorphism-based MapReduce Framework for Systematic Parallel Programming
Yu Liu
 

More from Yu Liu (20)

A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoA TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with Presto
 
Cloud Era Transactional Processing -- Problems, Strategies and Solutions
Cloud Era Transactional Processing -- Problems, Strategies and SolutionsCloud Era Transactional Processing -- Problems, Strategies and Solutions
Cloud Era Transactional Processing -- Problems, Strategies and Solutions
 
Introduction to NTCIR 2016 MedNLPDoc
Introduction to NTCIR 2016 MedNLPDocIntroduction to NTCIR 2016 MedNLPDoc
Introduction to NTCIR 2016 MedNLPDoc
 
高性能データ処理プラットフォーム (Talk on July Tech Festa 2015)
高性能データ処理プラットフォーム (Talk on July Tech Festa 2015)高性能データ処理プラットフォーム (Talk on July Tech Festa 2015)
高性能データ処理プラットフォーム (Talk on July Tech Festa 2015)
 
Paper introduction to Combinatorial Optimization on Graphs of Bounded Treewidth

Survey on Parallel/Distributed Search Engines

  • 1. Survey on Parallel/Distributed Search Engines. Yu Liu@NII, Sep. 20th, 2013
  • 2. Background: In web search, the information retrieval system needs to: crawl billions of documents stored on millions of computers; index, rank, and cluster terabytes of documents; respond to thousands of queries at the same time. I also did this survey to find parallel/distributed applications related to my current research.
  • 3. Background: In web search, the information retrieval system needs to: crawl billions of documents stored on millions of computers; index, rank, and cluster terabytes of documents; respond to thousands of queries at the same time. Such tasks can (almost) only be done in a parallel/distributed way. The basic idea of a distributed search engine: many machines work on one task to get it done more quickly than one large machine alone; better fault tolerance (continuing to operate properly in the event of a failure).
  • 4. Distributed (Web) Search Engine. Definition: A distributed search engine is a search engine model in which the tasks of Web crawling, indexing, and query processing are distributed among multiple computers and networks.
  • 5. Distributed Crawling/Indexing/Ranking: Relatively simple, e.g., Google's MapReduce-based approach (footnote 1). Footnote 1: MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, OSDI'04.
  • 7. Distributed Indexing (Pic. from Dean and Ghemawat (OSDI'04))
  • 8. Distributed Search/Query. An informal definition: finding information from multiple "nodes" where searchable resources (indices/documents) are stored. Each "node" contains only a part of the whole resources and returns a partial result. A final result is produced by aggregating all partial results. Federated search is an example: each "node" is a search engine. (Pic. from Wikipedia)
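The aggregation step described in this slide can be sketched as follows. This is a minimal illustration, not the method of any particular engine: the node contents, the scores, and the helper names `search_node` and `merge_partial_results` are all hypothetical, and the sketch assumes that scores from different nodes are directly comparable, which is exactly the hard part of result merging in practice.

```python
import heapq
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (score, doc_id); illustrative representation


def search_node(partial_scores: Dict[str, float], k: int) -> List[Hit]:
    """Return one node's local top-k hits, best first (hypothetical helper)."""
    hits = [(score, doc_id) for doc_id, score in partial_scores.items()]
    return heapq.nlargest(k, hits)


def merge_partial_results(partials: List[List[Hit]], k: int) -> List[Hit]:
    """Aggregate the per-node partial results into one global top-k list."""
    all_hits = [hit for partial in partials for hit in partial]
    return heapq.nlargest(k, all_hits)


# Made-up data: each dict stands for the scores one node computed for a query.
nodes = [
    {"d1": 0.9, "d2": 0.2},
    {"d3": 0.7, "d4": 0.4},
    {"d5": 0.8},
]
partials = [search_node(node, k=2) for node in nodes]
top3 = merge_partial_results(partials, k=3)
print(top3)
```

Each node only ever returns its own top-k list, so the coordinator handles at most k hits per node regardless of how many documents a node stores.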
  • 9. Distributed Search/Query: Peer-to-peer search is a decentralized search engine technology. (Pic. from YaCy)
  • 10. Some Examples of Search Engines that Support Distributed Search/Query: Google's search engine and Microsoft Bing ... (of course); Indri: http://www.lemurproject.org/indri.php; Apache Solr/Lucene: http://lucene.apache.org/; YaCy (P2P): http://yacy.net/; Grub (P2P): Grub.org ...
  • 11. Challenges in Distributed Search: Distributed search applications must carry out three additional, important functions (footnote 2). Resource representation: generating a description of the resources (documents) stored in each node, e.g., a language model description generated by query-based sampling. Resource selection: selecting some resources based on their descriptions, e.g., the top-k nodes ranked by query likelihood. Result merging: merging the ranked result lists from multiple nodes; this depends on the query model and on global statistics. Footnote 2: Textbook: Search Engines: Information Retrieval in Practice, by W. Bruce Croft et al.
  • 12. Challenges in Distributed Search: For the three important functions: better resource representation; resource selection: better replication, partitioning of indices/documents, and ranking of search results; result merging: better effectiveness.
  • 13. Challenges in Distributed Search: More concrete examples from Solr, limitations of distributed search: Each indexed document must have a unique key. If Solr discovers duplicate document IDs, Solr selects the first document and discards subsequent ones. Inverse document frequency (IDF) calculations cannot be distributed. The index for distributed searching may become momentarily out of sync if a commit happens between the first and second phase of the distributed search. Distributed searching supports only sorted-field faceting, not date faceting. The number of shards is limited by the number of characters allowed in the GET method's URI. Update commands may be sent to any server with distributed indexing configured correctly.
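To illustrate why the IDF limitation matters, the sketch below computes IDF for one term on two hypothetical shards and on their union. The smoothed formula log((N + 1) / (df + 1)) is one common textbook variant chosen for illustration; it is an assumption, not Solr's actual scoring formula, and the shard contents are made up.

```python
import math
from typing import List, Set


def idf(term: str, docs: List[Set[str]]) -> float:
    """Smoothed IDF, log((N + 1) / (df + 1)); an illustrative variant only."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((len(docs) + 1) / (df + 1))


# Hypothetical shards: each document is reduced to its set of terms.
shard_a = [{"search", "engine"}, {"search", "index"}, {"search", "query"}]
shard_b = [{"cluster", "index"}, {"cluster", "merge"}, {"search", "merge"}]

local_a = idf("search", shard_a)               # term occurs in every document
local_b = idf("search", shard_b)               # term is rare on this shard
global_idf = idf("search", shard_a + shard_b)  # statistics over the whole collection

print(local_a, local_b, global_idf)
```

Because each shard sees a different document frequency, the same term gets a different weight on each shard, so scores computed locally are not directly comparable when the result lists are merged.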
  • 14. Challenges in Distributed Search: Relation to my research: index updating as incremental MapReduce computation. Base data already exist; MapReduce provides parallel processing functionality; incremental computation makes the computation efficient. Currently, our study has not considered the problem of "momentarily out of sync".
  • 15. Survey of Parallel Implementations of Clustering Algorithms: Usually, clustering algorithms fall into two categories: hierarchical and partitioning.
  • 16. Hierarchical clustering: A hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence. Hierarchical clusterings generally fall into two categories: top-down and bottom-up. The more popular hierarchical agglomerative clustering (HAC) algorithms use a bottom-up approach to merge items into a hierarchy of clusters.
  • 17. Agglomerative hierarchical clustering: An HAC clustering is typically visualized as a dendrogram as shown in this figure. The y-coordinate of the horizontal line is the similarity of the two clusters that were merged.
  • 18. The sequential and parallel implementations of hierarchical clustering: Methods to determine the distances between clusters (Olson 95)[1]. Graph methods: single link, average link, complete link. Geometric methods: centroid, median (group-average), minimum variance.
  • 20. Single-link Algorithm (naive): The naive single-link algorithm takes O(N^3) time and O(N^2) space (footnote 3). Footnote 3: Introduction to Information Retrieval. ISBN: 0521865719, p. 381.
  • 21. Single-link Algorithm, O(N^2): An efficient single-link algorithm uses a next-best-merge (NBM) array as an optimization (footnote 4). Footnote 4: Introduction to Information Retrieval. ISBN: 0521865719, p. 386.
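The next-best-merge idea can be sketched as below. This is a minimal reading of the NBM optimization, not a transcription of the textbook's pseudocode: the function name `single_link_nbm`, the variable names, and the toy similarity matrix (negative distances between points on a line) are assumptions made for illustration. Each of the N-1 merge steps does O(N) work, which gives the O(N^2) total after the O(N^2) initialization.

```python
from typing import List, Tuple


def single_link_nbm(sim: List[List[float]]) -> List[Tuple[int, int, float]]:
    """Single-link HAC with a next-best-merge array; returns (kept, absorbed, similarity)."""
    n = len(sim)
    c = [row[:] for row in sim]   # working copy of the similarity matrix
    rep = list(range(n))          # rep[i] == i means i still represents a cluster
    # nbm[i]: index of the most similar other item for cluster i
    nbm = [max((j for j in range(n) if j != i), key=lambda j: c[i][j])
           for i in range(n)]
    merges = []
    for _ in range(n - 1):
        active = [i for i in range(n) if rep[i] == i]
        # Pick the active cluster whose best merge has the highest similarity.
        i1 = max(active, key=lambda i: c[i][nbm[i]])
        i2 = rep[nbm[i1]]
        merges.append((i1, i2, c[i1][nbm[i1]]))
        for i in range(n):
            if rep[i] == i and i != i1 and i != i2:
                # Single link: similarity to the merged cluster is the maximum.
                c[i1][i] = c[i][i1] = max(c[i1][i], c[i2][i])
            if rep[i] == i2:
                rep[i] = i1       # members of i2 now belong to i1
        candidates = [i for i in range(n) if rep[i] == i and i != i1]
        if candidates:
            nbm[i1] = max(candidates, key=lambda i: c[i1][i])
    return merges


# Toy input: points on a line, similarity = negative distance (illustrative only).
points = [0.0, 1.0, 3.0, 7.0]
sim = [[-abs(a - b) for b in points] for a in points]
merges = single_link_nbm(sim)
print(merges)
```

Only the NBM entry of the surviving cluster is recomputed after a merge. The entries of the other clusters stay valid because, under single link, merging two clusters never changes another cluster's best similarity value.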
  • 22. Complexity. Table: Comparison of HAC algorithms.
      method | combination similarity | time complexity | optimal
      single-link | max inter-similarity of any 2 | O(N^2) | yes
      complete-link | min inter-similarity of any 2 | O(N^2 log N) | no
      median | average of all sims | O(N^2 log N) | no
      centroid | average inter-similarity | O(N^2 log N) | no
    In practice, the difference in complexity is rarely a concern when choosing one of the algorithms. For most cases of document clustering, median is a good choice.
  • 23. The Complexity: The efficient single-link algorithm uses O(N^2) time and O(N^2) space; the situation is similar for the other algorithms [Olson 94]. A possible problem for a MapReduce implementation is that the input is usually very large, far beyond local memory. Even a requirement of N^2/p memory per cluster node is still not acceptable.
  • 24. Parallel HAC Algorithms: There are many studies of parallel HAC algorithms on SIMD array processors, n-hypercube, n-butterfly, and PRAM (Li, Fang 89 [3]; Li 90 [2]). Basically, they can compute in O(N^2/p) time. Manoranjan Dash et al. introduced an approach that computes in O(N^2/cp) time (Dash Manoranjan 04) [4].
  • 25. MapReduce-able? There is no real parallel implementation of HAC on MapReduce. Some MapReduce approaches reduce the size (dimensions) of the input data to fit local memory. Some approaches use the buckshot approach to improve k-means. The Apache Mahout project has a fake MapReduce implementation (footnote 5). Footnote 5: http://mahout.apache.org/
  • 26. Difficulties for parallelization with MapReduce: A single similarity matrix must be kept consistent among all computing nodes, which requires communication whenever updates are performed, but there is no broadcast method in MapReduce. Input data are initially split across the computing nodes. The N × N pairwise bottom-level items of the dendrogram can be computed by the matrix-multiply approach of MapReduce. A parallel method for the N-1 merge operations is not currently known.