Survey on Parallel/Distributed Search Engines

Yu Liu@NII
Sep. 20th, 2013

Background
In web search, an information retrieval system needs to:
Crawl billions of documents stored on millions of computers;
Index, rank, and cluster terabytes of documents;
Respond to thousands of queries at the same time.
I also did this survey to find parallel/distributed applications related to my current research.

Such tasks can (almost) only be done in a parallel/distributed way.
The basic idea of a distributed search engine:
many machines work on one task to finish it more quickly than one large machine alone;
better fault tolerance (the system continues operating properly when some components fail).

Distributed (Web) Search Engine
Definition
A distributed search engine is a search engine model in which the tasks of Web crawling, indexing, and query processing are distributed among multiple computers and networks.

Distributed Crawling/Indexing/Ranking
Relatively simple, e.g., Google’s MapReduce-based approach (1)
(1) MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, OSDI’04

Distributed Indexing
(Pic. from Dean and Ghemawat (OSDI’04))
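
To make the MapReduce style of indexing concrete, here is a minimal single-process sketch of building an inverted index (one of the example applications listed in the MapReduce paper). The function names map_phase, shuffle, reduce_phase and build_index, and the toy documents, are assumptions made for illustration; a real framework would run the map and reduce calls on many machines and perform the shuffle itself.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id

def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(term, doc_ids):
    """Reduce: produce the sorted posting list for one term."""
    return term, sorted(doc_ids)

def build_index(docs):
    """Run map, shuffle and reduce in sequence over a dict of doc_id -> text."""
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())

# Hypothetical toy collection.
docs = {1: "distributed search engine", 2: "parallel search", 3: "distributed index"}
index = build_index(docs)
```

Because each map call touches only one document and each reduce call only one term, both phases can be spread over many machines without coordination.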

Distributed Search/Query
An Informal Definition
Finding information from multiple “nodes” where searchable resources (indices/documents) are stored.
Each “node” contains only a part of the whole set of resources and returns a partial result.
A final result is produced by aggregating all partial results.
Federated search is an example: each “node” is a search engine.
(Pic. from Wikipedia)
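
The informal definition above (each node returns a partial result, and the partial results are aggregated) can be sketched as a scatter/gather query. This is only a toy model: the scoring (number of matching query terms), the node contents, and the names search_node and distributed_search are assumptions, and document IDs are assumed to be unique across nodes.

```python
import heapq

def search_node(node_index, query_terms, k):
    """Score documents on one node by counting matching query terms (toy scoring)."""
    scores = {}
    for term in query_terms:
        for doc_id in node_index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    # Each node returns only its local top-k as (score, doc_id) pairs.
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))

def distributed_search(nodes, query, k=3):
    terms = query.lower().split()
    # Scatter: every node answers the query from its own part of the index.
    partial = [search_node(index, terms, k) for index in nodes]
    # Gather: merge the partial results into one global top-k list.
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [(doc_id, score) for score, doc_id in merged]

# Two hypothetical nodes, each holding a small term -> document-ID index.
nodes = [
    {"distributed": ["a1", "a2"], "search": ["a1"]},
    {"search": ["b1"], "engine": ["b1", "b2"]},
]
results = distributed_search(nodes, "distributed search engine", k=2)
```

The merge is only this simple because all nodes score in the same way; when nodes use different scoring or local statistics, merging becomes one of the challenges discussed later.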

Peer-to-Peer search is a decentralized search engine technology.
(Pic. from YaCy)

Some Examples of Search Engines that Support Distributed Search
Google’s search engine and Microsoft Bing ... (of course)
Indri: http://www.lemurproject.org/indri.php
Apache Solr/Lucene: http://lucene.apache.org/
YaCy (P2P): http://yacy.net/
Grub (P2P): Grub.org ...

Challenges in Distributed Search
Distributed search applications must carry out three additional, important functions (2):
Resource representation: generating a description of the resources (documents) stored in each node
E.g., a language model description, generated by query-based sampling
Resource selection: selecting some resources based on their descriptions
E.g., the top-k nodes ranked by query likelihood
Result merging: merging the ranked result lists from multiple nodes
Depends on the query model and on whether global statistics are available
(2) Textbook: Search Engines: Information Retrieval in Practice, by W. Bruce Croft et al.
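
The first two functions can be sketched together: build a unigram language model per node from sampled documents, then rank nodes by the likelihood of the query under each model. This is a minimal sketch, not the textbook's exact formulation; the Jelinek-Mercer smoothing weight lam, the add-one background estimate, the sample documents and all function names are assumptions.

```python
import math
from collections import Counter

def language_model(sampled_docs):
    """Resource representation: term counts over documents sampled from a node."""
    counts = Counter(t for doc in sampled_docs for t in doc.lower().split())
    return counts, sum(counts.values())

def query_log_likelihood(query, model, background, lam=0.5):
    """Smoothed log P(query | node), mixing node and background probabilities."""
    counts, total = model
    bg_counts, bg_total = background
    score = 0.0
    for term in query.lower().split():
        p_node = counts[term] / total if total else 0.0
        # Add-one estimate keeps the background probability strictly positive.
        p_bg = (bg_counts[term] + 1) / (bg_total + len(bg_counts) + 1)
        score += math.log(lam * p_node + (1 - lam) * p_bg)
    return score

def select_resources(query, models, k=1):
    """Resource selection: return the k nodes with the highest query likelihood."""
    background = (sum((m[0] for m in models.values()), Counter()),
                  sum(m[1] for m in models.values()))
    ranked = sorted(models,
                    key=lambda n: query_log_likelihood(query, models[n], background),
                    reverse=True)
    return ranked[:k]

# Hypothetical samples obtained from two nodes (e.g., by query-based sampling).
samples = {
    "node_a": ["parallel clustering algorithms", "hierarchical clustering"],
    "node_b": ["web search engine", "distributed search index"],
}
models = {name: language_model(docs) for name, docs in samples.items()}
selected = select_resources("distributed search", models, k=1)
```

Only the selected nodes would then receive the query, which reduces load at the cost of possibly missing relevant documents on unselected nodes.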

For the three important functions
Better resource representation:
Resource selection: better replication, partitioning of
indices/documents and ranking of search results
Result merging: better effectiveness

More concrete examples from Solr: limitations of distributed search:
Each document indexed must have a unique key. If Solr
discovers duplicate document IDs, Solr selects the first
document and discards subsequent ones.
Inverse-document frequency (IDF) calculations cannot be
distributed.
The index for distributed searching may become momentarily
out of sync if a commit happens between the first and second
phase of the distributed search.
Distributed searching supports only sorted-field faceting, not date faceting.
The number of shards is limited by the number of characters allowed in the URI of a GET request.
Update commands may be sent to any server with distributed
indexing configured correctly.
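
One way to read the IDF limitation: if every shard scores with its own local statistics, the same term gets a different weight on each shard, so scores from different shards are not directly comparable. The sketch below illustrates this with made-up shard sizes and document frequencies; the log(N / df) form is one common IDF variant and is not claimed to be the exact formula Solr uses.

```python
import math

def idf(doc_freq, num_docs):
    """A common IDF form: log(N / df). Other variants exist."""
    return math.log(num_docs / doc_freq)

# Hypothetical shards: (number of documents, document frequency of one term).
shards = {"shard1": (1000, 10), "shard2": (1000, 500)}

# IDF as each shard would compute it from local statistics only.
local_idf = {name: idf(df, n) for name, (n, df) in shards.items()}

# IDF computed from statistics aggregated over the whole collection.
global_idf = idf(sum(df for _, df in shards.values()),
                 sum(n for n, _ in shards.values()))
```

Here the term looks rare on shard1 and common on shard2, so its local weight is too high on one shard and too low on the other compared with the global value.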

Relation to my research:
Index updating: incremental MapReduce computation
Base data already exist
MapReduce provides parallel processing functionality
Incremental computation makes computation efficient
Currently, our study has not considered the problem of being “momentarily out of sync”.

Survey of Parallel Implementations of Clustering
Algorithms
Clustering algorithms usually fall into two categories: hierarchical and partitioning.

Hierarchical clustering
A hierarchical clustering is a sequence of partitions in which each
partition is nested into the next partition in the sequence.
Hierarchical clusterings generally fall into two categories: top-down and bottom-up.
The more popular hierarchical agglomerative clustering (HAC) algorithms use a bottom-up approach to merge items into a hierarchy of clusters.

Agglomerative hierarchical clustering
An HAC clustering is typically visualized as a dendrogram, as shown in this figure.
The y-coordinate of each horizontal line is the similarity of the two clusters that were merged.

The sequential and parallel implementations of hierarchical
clustering
Methods to determine the distances between clusters (Olson 95) [1]
Graph methods: single link, average link, complete link
Geometric methods: centroid, median (group-average), minimum variance

Single-link Algorithm (naive)
The naive single-link algorithm takes O(N^3) time and O(N^2) space (3)
(3) Introduction to Information Retrieval. ISBN: 0521865719, p. 381.
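
A minimal sketch of the naive approach, assuming the input is a symmetric N x N similarity matrix given as a list of lists. Each of the N-1 iterations rescans all pairs, which is where the O(N^3) time comes from. The function name and the toy matrix are illustrative and not taken from the textbook.

```python
def naive_single_link(sim):
    """Naive single-link HAC; returns merges as (kept, absorbed, similarity)."""
    n = len(sim)
    sim = [row[:] for row in sim]           # work on a copy: O(N^2) space
    active = [True] * n
    merges = []
    for _ in range(n - 1):                  # N-1 merges in total
        best, pair = float("-inf"), None
        for i in range(n):                  # O(N^2) scan for the most similar pair
            for j in range(i):
                if active[i] and active[j] and sim[i][j] > best:
                    best, pair = sim[i][j], (j, i)
        a, b = pair                         # a < b; cluster b is merged into a
        merges.append((a, b, best))
        for m in range(n):                  # single-link update: keep the maximum
            if active[m] and m != a and m != b:
                sim[a][m] = sim[m][a] = max(sim[a][m], sim[b][m])
        active[b] = False
    return merges

# Hypothetical similarities between four documents.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.4],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.4, 0.8, 1.0],
]
merges = naive_single_link(sim)
```

The returned list is the merge sequence of the dendrogram, from the most similar pair to the final merge.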

Single-link Algorithm O(N^2)
An efficient single-link algorithm uses a next-best-merge (NBM) array as an optimization (4).
(4) Introduction to Information Retrieval. ISBN: 0521865719, p. 386.
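
The idea can be sketched as follows: keep, for every cluster, the index of its currently best merge partner, so that each of the N-1 iterations needs only O(N) work instead of a full matrix scan. This is a sketch in the spirit of the cited algorithm, not a transcription of it; the variable names (C, rep, nbm) and the toy matrix are assumptions.

```python
def single_link_nbm(sim):
    """Single-link HAC with a next-best-merge array; O(N^2) time overall."""
    n = len(sim)
    C = [row[:] for row in sim]             # working copy of the similarities
    rep = list(range(n))                    # rep[i] == i marks an active cluster

    def best_partner(i):
        """Most similar active cluster other than i (O(N))."""
        candidates = [j for j in range(n) if rep[j] == j and j != i]
        return max(candidates, key=lambda j: C[i][j])

    nbm = [best_partner(i) for i in range(n)]   # O(N^2) initialisation
    merges = []
    for _ in range(n - 1):
        active = [i for i in range(n) if rep[i] == i]
        i1 = max(active, key=lambda i: C[i][nbm[i]])   # best merge overall
        i2 = rep[nbm[i1]]                   # partner may have been merged earlier
        merges.append((i1, i2, C[i1][nbm[i1]]))
        for i in range(n):
            if rep[i] == i and i != i1 and i != i2:
                C[i1][i] = C[i][i1] = max(C[i1][i], C[i2][i])
            if rep[i] == i2:
                rep[i] = i1                 # i2 and its members now belong to i1
        if len(active) > 2:                 # only i1's entry must be recomputed
            nbm[i1] = best_partner(i1)
    return merges

# Hypothetical similarities between four documents.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.4],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.4, 0.8, 1.0],
]
merges = single_link_nbm(sim)
```

The NBM entries of the other clusters can stay untouched because, under single link, the similarity to a merged cluster is the maximum of the similarities to its parts, so a previously best partner remains best after the merge.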

Complexity
Table: Comparison of HAC algorithms.
method          combination similarity                    time complexity   optimal
single-link     max inter-similarity of any 2 documents   O(N^2)            yes
complete-link   min inter-similarity of any 2 documents   O(N^2 log N)      no
median          average of all sims                       O(N^2 log N)      no
centroid        average inter-similarity                  O(N^2 log N)      no
In practice, the difference in complexity is rarely a concern when choosing one of the algorithms. For most document clustering tasks, median is a good choice.

The Complexity
The efficient single-link algorithm uses O(N^2) time and O(N^2) space; the situation is similar for the other algorithms [Olson 94].
A possible problem for a MapReduce implementation is that the input is usually huge, far beyond the local memory of a node. Even a requirement of N^2/p memory per cluster node, with p nodes, is still not acceptable.
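
A quick back-of-the-envelope calculation shows why N^2/p can still be too much. The collection size, node count and 4 bytes per matrix entry below are hypothetical numbers chosen only to illustrate the order of magnitude.

```python
def matrix_memory_gib(n_docs, n_nodes=1, bytes_per_entry=4):
    """Memory per node, in GiB, for an N x N matrix split evenly over p nodes."""
    return n_docs ** 2 * bytes_per_entry / n_nodes / 2 ** 30

# Hypothetical setting: 10 million documents on 1000 nodes.
per_node = matrix_memory_gib(n_docs=10_000_000, n_nodes=1000)
```

Under these assumptions each node would still need roughly 370 GiB for its share of the similarity matrix.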

Parallel HAC Algorithms
There are many studies of parallel HAC algorithms on SIMD array processors, n-hypercubes, n-butterflies, and PRAM (Li, Fang 89 [3]; Li 90 [2]). Basically, they can compute in O(N^2/p) time.
Manoranjan Dash et al. introduced an approach that computes in O(N^2/cp) time (Dash Manoranjan 04) [4].

MapReduce-able?
There is no real parallel implementation of HAC on MapReduce.
Some MapReduce approaches reduce the size (dimensions) of the input data so that it fits in local memory
Some approaches use the buckshot approach to improve k-means
The Apache Mahout project has a fake MapReduce implementation (5)
(5) http://mahout.apache.org/

Difficulties for parallelization with MapReduce
A single similarity matrix must be kept consistent among all computing nodes, which requires communication whenever updates are performed. But MapReduce has no broadcast method.
Input data are initially split among the computing nodes
The N × N pairwise similarities for the bottom-level items of the dendrogram can be computed with the matrix-multiplication approach of MapReduce
How to parallelize the N-1 merge operations is currently not known.
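
The part that does fit MapReduce, computing the pairwise similarities, can be sketched as a product of the document-term matrix with its transpose: map over each term's posting list and emit one partial product per document pair, then sum the partial products per pair in the reducer. This is one possible formulation under stated assumptions (dot-product similarity, a weighted inverted index as input, hypothetical function names); it is not claimed to be the specific matrix-multiply method the slide refers to.

```python
from collections import defaultdict
from itertools import combinations

def map_term(term, postings):
    """Map: for one term's posting list, emit a partial product per document pair."""
    for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
        yield (d1, d2), w1 * w2

def reduce_pair(pair, partials):
    """Reduce: sum partial products to get the dot-product similarity of a pair."""
    return pair, sum(partials)

def pairwise_similarity(inverted_index):
    groups = defaultdict(list)              # stands in for the shuffle step
    for term, postings in inverted_index.items():
        for pair, value in map_term(term, postings):
            groups[pair].append(value)
    return dict(reduce_pair(pair, values) for pair, values in groups.items())

# Hypothetical weighted inverted index: term -> [(doc_id, weight), ...].
inverted_index = {
    "search": [("d1", 1.0), ("d2", 2.0)],
    "index": [("d1", 3.0), ("d2", 1.0), ("d3", 1.0)],
}
sims = pairwise_similarity(inverted_index)
```

Each map call needs only one posting list and each reduce call only one document pair, so no shared similarity matrix is required at this stage; the difficulty starts with the subsequent merge steps, which must see and update the whole matrix.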
