Scientific discovery and innovation in an era of data-intensive science
William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator
The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data intensive science, presents current solutions, and lays out a roadmap to the future where new information technologies significantly increase the pace of scientific discovery and innovation.
Opening Keynote: The Many and the One: BCE themes in 21st century data curation
Allen Renear, Professor and Interim Dean, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Two scientists can be using "the same data" even though the computer files involved appear to be quite different. This is familiar enough and, for the most part, in small communities with shared practices and familiar datasets, raises few problems. But these informal understandings do not scale to 21st century data curation. To get full value from cyberinfrastructure we must support huge quantities of heterogeneous data developed and used by diverse communities -- often with widely varying methods, tools, and purposes. To accomplish this, our informal practices and understandings must be replaced, or at least supplemented, by a shared framework of standard terminology for describing complex cascades of representational levels and relationships. Fundamental problems in data curation -- in particular problems involving provenance, identifiers, and data citation -- cannot be fully resolved without such a framework. Although the deepest problems here have ancient origins, useful practical measures are now within reach. Some recent work toward this end being carried out at the Center for Informatics Research in Science and Scholarship (CIRSS) at the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign will be described.
Data Equivalence
Mark Parsons, Lead Project Manager, Senior Associate Scientist, National Snow and Ice Data Center
Data citation, especially using persistent identifiers like Digital Object Identifiers (DOIs), is an increasingly accepted scientific practice. Recently, several respected organizations have developed guidelines for data citation. The different guidelines are largely congruent in that they agree on the basic practice and elements of data citation, especially for relatively static, whole data collections. There is less agreement on the more subtle nuances of data citation that are sometimes necessary to ensure precise reference and scientific reproducibility -- the core purpose of data citation. We need to be sure that if you follow a data reference you get to the precise data that were used, or at least their scientific equivalent. Identifiers such as DOIs are necessary but not sufficient for the precise, detailed references that are needed. This talk discusses issues around data set versioning, micro-citation, and scientific equivalence. I propose some interim solutions and suggest research strategies for the future.
EZID: Easy dataset identification & management
Joan Starr, Manager, Strategic and Project Planning and EZID Service Manager, California Digital Library
Data and data curation are assuming a growing role in today's research library. New approaches are needed both to address the resulting challenges and to take advantage of the emerging opportunities. Long-term identifiers represent one such tool. In this presentation, Joan Starr will introduce identifiers and an application designed to make them easy to create and manage: EZID. She will provide a closer look at two identifier types, DOIs and ARKs, and discuss what bringing an identifier service to your institution might mean.
Efficient Query Processing in Geographic Web Search Engines (Yen-Yu Chen)
Geographic web search engines allow users to constrain and order search results in an intuitive manner by focusing a query on a particular geographic region. Geographic search technology, also called local search, has recently received significant interest from major search engine companies. Academic research in this area has focused primarily on techniques for extracting geographic knowledge from the web. In this paper, we study the problem of efficient query processing in scalable geographic search engines. Query processing is a major bottleneck in standard web search engines, and the main reason for the thousands of machines used by the major engines. Geographic search engine query processing is different in that it requires a combination of text and spatial data processing techniques. We propose several algorithms for efficient query processing in geographic search engines, integrate them into an existing web search query processor, and evaluate them on large sets of real data and query traces.
Difference Between Crawling, Indexing and Caching (Laxman Kotte)
Many people are confused about the terms crawling, indexing, and caching. Read this document to understand the difference between crawling, indexing, and caching.
Blogging With WordPress - Social Media Bootcamp (wesleyzhao)
Another presentation made to the Portland office of CH2M Hill, introducing employees to the very fundamentals of blogging using the WordPress front end.
Search Engine Optimization - Social Media Bootcamp (wesleyzhao)
An installment of the Social Media Bootcamp (CH2M Hill office in Portland, OR) that deals with Google and Search Engine Optimization. This covers the basics of how a search engine works and how to use that to your advantage to get more search hits. This covers key words, page links, web crawlers, and more.
Crawling, indexing, ranking: Make the search engine crawlers and algorithms y... (SEO monitor)
Long before your website achieves any top ranking, search engines digest it by crawling and indexing it. In both stages of evaluation there are capacity limits, and they can become bottlenecks. Learn to understand each judgement stage, which tools there are for steering it, and how to adjust things to get the maximum out of your SEO efforts.
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli... (lucenerevolution)
Presented by M.C. Srivas | MapR. See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
This session addresses the biggest issue facing Big Data: search, discovery, and analytics need to be integrated. While creating and maintaining separate Solr and Hadoop clusters is time consuming, error prone, and difficult to keep in sync, most Hadoop installations do not integrate Solr within the same cluster. Find out how to easily integrate these capabilities into a single cluster. The session will also touch on some of the technical aspects of Big Data search, including how to: protect against silent index corruption that permeates large distributed clusters; overcome the shard distribution problem by leveraging Hadoop to ensure accurate distributed search results; and provide real-time indexing for distributed search, including support for streaming data capture. Srivas will also share relevant experiences from his days at Google, where he ran one of the major search infrastructure teams and GFS, BigTable, and MapReduce were used extensively.
Using the LucidWorks REST API to Support User-Configuration Big Data Search E... (lucenerevolution)
Presented by Mark Davis, CTO Kitenga - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
Kitenga's Analyst system uses the LucidWorks Enterprise REST API in a variety of ways, including for configuring collections and managing Solr schema. As part of the Kitenga platform, the ZettaSearch Designer empowers the end-user to dynamically drag-and-drop search widgets to create a specialized search interface. For users to effectively design search UIs that meet their needs, they need to be able to understand the available schema fields that populate a given collection. ZettaSearch Designer interrogates the Solr infrastructure using the Lucid REST API to provide an overview of the available metadata. It is then easy for the user to build rich, faceted search experiences around the metadata library indexed into the collection. In this implementation overview, I will describe the design of ZettaSearch Designer, how it interacts with big data technologies like Hadoop as part of the indexing pipeline, and how it uses the LucidWorks API to enable user discovery of the metadata needed to create novel search user interfaces on the fly.
Introducing the Big Data Ecosystem with Caserta Concepts & Talend (Caserta)
In this one-hour webinar, Caserta Concepts and Talend described an approach to achieve an architectural framework and roadmap to extend a traditional enterprise data warehouse environment, into a Big Data ecosystem.
They illustrated the architectural components involved for collecting, analyzing and delivering Big Data, with a focus on the importance of Hadoop, Data Integration, Machine Learning, NoSQL, Business Intelligence and Analytics.
Attendees learned:
Which Big Data technologies can’t be ignored
Considerations when extending the data ecosystem
What happens to your existing investment
What are the points of integration
Does Big Data = better data?
To access the recorded webinar or to learn more, visit http://www.casertaconcepts.com/.
Life Science Database Cross Search and Metadata (Maori Ito)
Life science databases are sometimes difficult to understand due to a lack of information. I'd like to add metadata to these databases and improve search results.
Kitenga's ZettaVox and ZettaSearch products support SOLR and Lucene ecosystems at both the ingestion point and for the search user. In this talk, I will show how ZettaVox, our professional content mining platform on Hadoop, can be used to index content and rich metadata into a LucidWorks Enterprise installation. Being built on Hadoop, ZettaVox scales up by scaling out. I will then create an end-user search and analytics experience using our ZettaSearch solution that leverages the faceted metadata to enhance information discovery and analysis. All in about 20 minutes.
Keynote at the Dutch-Belgian Information Retrieval Workshop, November 2016, Delft, Netherlands.
Based on KDD 2016 tutorial with Sara Hajian and Francesco Bonchi.
KDD 2016 tutorial on Algorithmic Bias, Parts I and II.
Video:
Part I: https://www.youtube.com/watch?v=mJcWrfoGup8
Part II: https://www.youtube.com/watch?v=nKemhMbaYcU
Part III: https://www.youtube.com/watch?v=ErgHjxJsEKA
By Sara Hajian, Francesco Bonchi, and Carlos Castillo.
http://francescobonchi.com/algorithmic_bias_tutorial.html
KDD 2016 tutorial on Algorithmic Bias, Parts III and IV.
Video: https://www.youtube.com/watch?v=ErgHjxJsEKA
By Sara Hajian, Francesco Bonchi, and Carlos Castillo.
http://francescobonchi.com/algorithmic_bias_tutorial.html
Various examples of observational studies, mostly for the analysis of social media.
Lecture for the M. Sc. Data Science, Sapienza University of Rome, Spring 2016.
Basic concepts about natural experiments, based mostly on Dunning's book.
Lecture for the M. Sc. Data Science, Sapienza University of Rome, Spring 2016.
Predictions of links in graphs based on content and information propagations.
Lecture for the M. Sc. Data Science, Sapienza University of Rome, Spring 2016.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Smart TV Buyer Insights Survey 2024 by 91mobiles (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Challenges in Distributed Information Retrieval [RBY] (ICDE 2007, Turkey)
1. Challenges in Distributed Information Retrieval
Ricardo Baeza-Yates (1,2)
Joint work with: C. Castillo (1), F. Junqueira (1), V. Plachouras (1) and F. Silvestri (3)
(1) Yahoo! Research Barcelona – Catalunya, Spain
(2) Yahoo! Research Latin America – Santiago, Chile
(3) ISTI-CNR – Pisa, Italy
Outline: Crawling, Indexing, Query Processing, Caching
5-6. Crawling
In theory it is simple: fetch, parse, fetch, parse, ...
In practice it is difficult: it implies using other people's resources (web servers' CPU and network).
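The fetch-parse loop can be sketched as a breadth-first traversal over a URL frontier. This is a minimal illustrative sketch, not the crawler from the talk: the `fetch` callback and the in-memory `web` dictionary are hypothetical stand-ins for real HTTP requests and HTML link extraction.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed, fetch, max_pages=100):
    # Minimal fetch-parse loop: take a URL off the frontier, fetch it,
    # extract its links, and enqueue any link not yet seen.
    frontier = deque([seed])
    seen = {seed}
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        text, links = fetch(url)           # real crawlers must be polite here
        pages[url] = text
        for link in links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages

# Toy in-memory "web" standing in for actual HTTP fetching and parsing.
web = {
    "http://a/":  ("page a", ["/b", "http://c/"]),
    "http://a/b": ("page b", []),
    "http://c/":  ("page c", ["http://a/"]),
}
result = crawl("http://a/", lambda u: web.get(u, ("", [])))
```

In practice this loop is exactly where the difficulties named above appear: per-host rate limiting, robots.txt, retries, and the cost of using other people's servers.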
7-10. Issues
How to partition the crawling task?
What to do when one agent fails?
How to communicate among agents?
How to deal with external factors?
11-12. Partitioning
Host-based partitioning exploits locality of links.
Balance improves if large and small hosts are treated differently.
Performance improves if geographic location is considered.
Consistent hashing allows agents to be added to and removed from the pool [Boldi et al., 2004].
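The consistent-hashing idea can be illustrated with a short sketch. The agent and host names are made up for illustration; the point (as in Boldi et al.) is that hashing at the host level preserves link locality, and that removing an agent reassigns only the hosts that agent owned.

```python
import bisect
import hashlib

def _h(key):
    # Stable 64-bit position on the hash ring.
    return int(hashlib.md5(key.encode()).hexdigest()[:16], 16)

class ConsistentHashRing:
    """Each crawl agent is placed at several points on a hash ring; a host is
    assigned to the first agent clockwise from the host's hash. Adding or
    removing an agent only moves the hosts adjacent to that agent's points."""
    def __init__(self, agents, replicas=64):
        self.replicas = replicas
        self._ring = []  # sorted list of (position, agent)
        for agent in agents:
            self.add(agent)

    def add(self, agent):
        for i in range(self.replicas):
            bisect.insort(self._ring, (_h(f"{agent}#{i}"), agent))

    def remove(self, agent):
        self._ring = [(p, a) for p, a in self._ring if a != agent]

    def agent_for(self, host):
        # First ring point at or after the host's position, wrapping around.
        i = bisect.bisect(self._ring, (_h(host), ""))
        return self._ring[i % len(self._ring)][1]

ring = ConsistentHashRing(["agent-0", "agent-1", "agent-2"])
hosts = [f"host{i}.example" for i in range(200)]
before = {h: ring.agent_for(h) for h in hosts}
ring.remove("agent-1")
after = {h: ring.agent_for(h) for h in hosts}
```

After the removal, every host that stays with its old agent keeps its assignment; only hosts previously on `agent-1` are redistributed, which is precisely the property that makes adding and removing agents cheap.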
13. Communication
Host-based partitioning reduces communication.
Highly-linked URLs should be cached.
Communication with the server can be improved if the server cooperates.
14. External factors
DNS can be a bottleneck.
Varying quality of HTTP implementations.
Varying quality of HTML coding.
Varying quality of service in general.
Spam.
16. What's Indexing

Indexing, in both databases and IR, is the process of building an index over a collection of documents
Inverted indexes are typically used in IR
  Lexicon: contains the distinct terms appearing in the collection's documents
  Posting lists: contain descriptions of the occurrences of each term within the corresponding documents
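A lexicon with posting lists can be built in a few lines. This toy sketch (the function name and sample documents are invented for illustration) maps each distinct term to a posting list of (document id, term frequency) pairs:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns the lexicon as a dict
    mapping each distinct term to its posting list, a sorted list of
    (doc_id, term_frequency) pairs."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    # Posting lists are kept sorted by doc_id, as in a real index.
    return {t: sorted(postings.items()) for t, postings in index.items()}

docs = {1: "distributed search engines", 2: "distributed indexing for search"}
index = build_inverted_index(docs)
# index["distributed"] == [(1, 1), (2, 1)]
```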
20. Index and Distributed Indexing

[Figure: the T × D term-document matrix, with terms T1, T2, ..., Tn as rows and documents D1, D2, ..., Dm as columns. A term partition slices the matrix horizontally; a document partition slices it vertically.]
21. Document Partitioning

Split the collection into several sub-collections and index each one of them separately (corresponding to vertically slicing the T × D matrix)
Pros:
  higher throughput
  new documents are easily added to existing indexes
  load is balanced
Cons:
  high number of disk operations
  high volume of data read from disk
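Query processing over a document-partitioned index is scatter-and-merge: the broker broadcasts the query to every shard, each shard answers from its own sub-index, and the broker merges the partial top-k lists. A toy sketch (the term-frequency scoring is a stand-in for a real ranking function, and the function names are invented):

```python
import heapq

def search_shard(shard_index, query_terms, k):
    """Score documents in one sub-collection by summed term frequency
    (a stand-in for a real ranking function) and return its local top k."""
    scores = {}
    for term in query_terms:
        for doc_id, tf in shard_index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + tf
    return heapq.nlargest(k, scores.items(), key=lambda x: (x[1], -x[0]))

def broker_search(shards, query_terms, k=3):
    """Broadcast the query to every shard, then merge the partial results."""
    partial = []
    for shard in shards:
        partial.extend(search_shard(shard, query_terms, k))
    return heapq.nlargest(k, partial, key=lambda x: (x[1], -x[0]))
```

Every shard is contacted for every query, which is where the extra disk operations listed above come from, but shards work in parallel and only small top-k lists travel to the broker.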
29. Term Partitioning

Split the terms of the lexicon (and the corresponding inverted lists) among the search systems (corresponding to horizontally slicing the T × D matrix)
Pros:
  reduced number of disk accesses
  reduced volume of exchanged data
Cons:
  requires the entire index to be built before slicing it into partitions
  not scalable with large collections
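Under term partitioning, only the nodes holding the query's terms are contacted; each returns the posting lists for its terms, and the broker combines them. A toy sketch (the hash-based term assignment and function names are assumptions for illustration, and the broker simply intersects document sets):

```python
def term_partition(index, n_nodes):
    """Assign each term's posting list to one node by hashing the term."""
    nodes = [dict() for _ in range(n_nodes)]
    for term, postings in index.items():
        nodes[hash(term) % n_nodes][term] = postings
    return nodes

def route_query(nodes, query_terms, n_nodes):
    """Contact only the nodes holding the query's terms; each returns
    its posting list, and the broker intersects the document sets."""
    posting_sets = []
    for term in query_terms:
        node = nodes[hash(term) % n_nodes]
        posting_sets.append({doc for doc, _ in node.get(term, [])})
    return set.intersection(*posting_sets) if posting_sets else set()
```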
36. Partitioning Goals

Partitioning is the first design issue to be faced in distributed indexing
A distributed index should allow for efficient query routing and resolution
Reducing the number of nodes queried is desirable too
39. Partitioning Techniques

Random partitioning
  documents are assigned uniformly at random to the various partitions
Topical organization using clustering (e.g., k-means [Larkey et al., 2000, Liu and Croft, 2004])
  documents are first clustered, and each partition is then composed of one (or more) cluster(s)
Usage-induced partitioning (e.g., the Query-Vector Document Model [Puppin et al., 2006])
  clustering is induced by the way users interact with the index
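Topical partitioning can be sketched with plain k-means: cluster document vectors and make each cluster a partition. This is a simplified sketch (Euclidean k-means on small dense vectors), not the actual procedure of [Larkey et al., 2000] or [Liu and Croft, 2004]; the function name and toy vectors are invented.

```python
import random

def kmeans_partition(doc_vectors, k, iters=20, seed=0):
    """Cluster document vectors with plain Euclidean k-means;
    each resulting cluster becomes one partition of the collection."""
    rnd = random.Random(seed)
    centroids = rnd.sample(doc_vectors, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: each document goes to its nearest centroid.
        for v in doc_vectors:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(v, centroids[c])))
            clusters[nearest].append(v)
        # Update step: recompute centroids (keep old one if cluster empty).
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters
```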
45. Load Balancing Issues

In document-partitioned indexes not adopting collection selection strategies, load is almost balanced among all the query processors
In term-partitioned indexes (even with the new pipelined schema [Webber et al., 2006]), load balancing is an issue
In federated document-partitioned systems where collection selection is applied, balancing the load is still an unexplored issue

[Figure: load percentage (0-100) per query processor (1-8), for a document-distributed index and a pipelined term-distributed index.]
46. Exploiting Usage Information

Query logs contain features that are critical for optimizing the efficiency of different parts of search engines:
  query distribution
  query arrival times
  clickthrough information
  ...
51. Usage Information in Term-Partitioned Systems

The frequency of query terms can be exploited to partition a collection with the aim of balancing the load of the query processors
  bin-packing approach [Moffat et al., 2006]
  data-mining approach [Lucchese et al., 2007]
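The bin-packing idea can be illustrated with the classic greedy heuristic: sort terms by query-log frequency and repeatedly assign the heaviest unassigned term to the currently least loaded node. This is a sketch of the general technique, not the exact algorithm of [Moffat et al., 2006]; the function name and toy frequencies are invented.

```python
import heapq

def bin_pack_terms(term_query_freq, n_nodes):
    """Greedy longest-processing-time bin packing: each term (heaviest
    query-log frequency first) goes to the least loaded node, so the
    expected query load is balanced across the nodes."""
    heap = [(0, node) for node in range(n_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = {}
    for term, freq in sorted(term_query_freq.items(), key=lambda x: -x[1]):
        load, node = heapq.heappop(heap)   # least loaded node so far
        assignment[term] = node
        heapq.heappush(heap, (load + freq, node))
    return assignment
```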
54. Usage Information in Document-Partitioned Systems

Random partitioning does not ensure load balancing [Badue et al., 2006]
Broadcast-based systems perform unnecessary operations on sub-collections containing few or no relevant documents
Usage-based mapping can be adopted to build sub-collections that can be effectively discriminated upon query receipt [Puppin et al., 2006]
57. Challenges in Distributed Indexing

In document-partitioned systems, we need to find partitioning strategies that enhance the effectiveness of collection selection
In both kinds of systems, it is a challenge to find effective load-balancing strategies
59. Query Processing

System components
  Clients submitting queries
  Sites consisting of servers
  Servers are commodity computers
Query processing
  The system receives a query
  Query routing: forwarding the query to appropriate sites
  Merging results
Challenges
  Determining appropriate sites on the fly
  WAN communication is costly
60. Challenges in More Detail

Large-scale systems
  Large amounts of data
  Large data structures
  Large numbers of clients and servers
Partitioning of data structures
  Necessary due to very large data structures
  Enables parallel processing
  e.g., the document collection split by topic, language, or region
Replication of data structures
  For availability, throughput, and response time
  Conflicts with resource utilization
61. Framework for Distributed Query Processing

[Figure: a client connected through a WAN to three sites in different regions (Site A in Region X, Site B in Region Y, Site C in Region Z).]

The query processor matches documents to the received queries
The coordinator receives queries and routes them to the appropriate sites
The cache stores the results of previous queries
62. Currently...

Multiple sites
Sites are full replicas of each other
Simple query routing: dynamic DNS
Under the previous framework, there is an opportunity to
  use storage resources more efficiently
  adopt more sophisticated query routing mechanisms
  apply effective partitioning strategies (e.g., language-based strategies)
63. Partitioning

Goals
  Achieve cost-effective scalability
  Reduce response times
Potential solutions
  Partition large data structures by topic, language, etc.
  Effective query routing, first to local sites and then to global sites
  Incremental presentation of results to alleviate network latencies
64. Dependability

Goals
  Availability of query processors
  Consistency of replicated query data (can be weak)
  Consistency of user state, e.g., personalization and user preferences
Potential solutions
  More network resources: multi-homed sites
  Replication, both within and across sites
  Consistency: techniques for weak consistency (replicas eventually converge)
  Caching: improves availability when query processors are unavailable
65. Dependability

Achieving availability is not straightforward
The BIRN system was studied by Junqueira and Marzullo [Junqueira and Marzullo, 2005]
Partitions are quite frequent

[Figure: average number of sites (0-12) whose monthly availability falls below 100%, 99.8%, 99%, 98%, and 97%.]
66. Communication

Message latency
  Communication is costly in wide-area networks
  Latency is not negligible
  The capacity of the servers is reduced as the latency to process a query increases
Potential solutions
  Reduce as much as possible the number of sites contacted to process a query
  Have most queries processed by sites that are close in terms of network distance
67. Caching Query Results or Postings [Baeza-Yates et al., 2007]

Caching query answers:
  44% of queries are singletons (appear only once)
  88% of the unique queries are singletons
  An infinite cache would achieve a 56% hit ratio
Caching postings of terms:
  4% of terms are singletons
  73% of the unique terms (the vocabulary) are singletons
  An infinite cache would achieve a 96% hit ratio
Note: all statistics and graphs on caching refer to a one-year query log from yahoo.co.uk
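Statistics like those above can be computed from any request stream: an infinite cache misses each distinct item exactly once, so its hit ratio is one minus the fraction of distinct items, and singletons are the items seen exactly once. A small sketch (the function name is invented; the example stream is not the yahoo.co.uk log):

```python
from collections import Counter

def cache_stats(stream):
    """Compute singleton shares and the infinite-cache hit ratio for a
    stream of requests (queries or query terms)."""
    counts = Counter(stream)
    total = sum(counts.values())
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "singleton_share": singletons / total,          # of all requests
        "unique_singleton_share": singletons / len(counts),  # of distinct items
        # An infinite cache misses each distinct item exactly once.
        "infinite_cache_hit_ratio": 1 - len(counts) / total,
    }
```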
68. Static or Dynamic Caching of Postings

Static caching of postings (Qtf)
  Cache the terms with the highest query-log frequency fq(t)
However, there is a tradeoff between fq(t) and fd(t)
  Terms with a high query-log frequency fq(t) are good for the cache
  Terms with a high document frequency fd(t) occupy too much space
Static caching of postings as a knapsack problem (QtfDf)
  Cache the posting lists of the terms with the highest ratio fq(t)/fd(t)
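The QtfDf policy amounts to a greedy knapsack: rank terms by fq(t)/fd(t), using the posting-list length fd(t) as the space cost, and cache down the ranking until the memory budget runs out. A sketch (the function name and the toy term statistics are invented):

```python
def static_posting_cache(terms, budget):
    """terms: dict of term -> (fq, fd), where fq is the query-log
    frequency and fd is the document frequency (posting-list length,
    used as the space cost). Greedily caches the posting lists with the
    highest fq/fd ratio until the space budget is exhausted (QtfDf)."""
    ranked = sorted(terms.items(),
                    key=lambda kv: kv[1][0] / kv[1][1],
                    reverse=True)
    cached, used = [], 0
    for term, (fq, fd) in ranked:
        if used + fd <= budget:  # skip lists that no longer fit
            cached.append(term)
            used += fd
    return cached
```

Frequent but very long posting lists (typically stopwords) lose to moderately frequent terms with short lists, which is exactly the fq(t) versus fd(t) tradeoff on the slide.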
70. Analysis of Static Caching

Trade-offs between caching postings and caching answers
  Caching postings results in more hits
  Caching answers is faster
  Comparing the two requires considering time and space parameters
Problem: given a fixed amount of memory and the average response times of a system, how much should be allocated to caching answers and how much to caching postings?
71. Analysis of Static Caching

Scenario 1: centralized retrieval system, complete/partial query evaluation, un/compressed postings
  The postings cache can answer more queries than the answers cache
  Most of the available memory goes to caching postings
Scenario 2: WAN-distributed system, complete/partial query evaluation, un/compressed postings
  Network time dominates
  Most of the available memory goes to caching answers
Query dynamics
  Slowly changing query dynamics make static caching viable
72. References

Badue, C., Baeza-Yates, R., Ribeiro-Neto, B., Ziviani, A., and Ziviani, N. (2006). Analyzing imbalance among homogeneous index servers in a web search system. Information Processing & Management.

Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Silvestri, F., and Plachouras, V. (2007). The impact of caching on search engines. In Proceedings of the International ACM SIGIR Conference (to appear), Amsterdam, Netherlands.

Boldi, P., Codenotti, B., Santini, M., and Vigna, S. (2004). UbiCrawler: a scalable fully distributed web crawler. Software, Practice and Experience, 34(8):711-726.
73. References (continued)

Junqueira, F. and Marzullo, K. (2005). Coterie availability in sites. In Proceedings of the International Conference on Distributed Computing (DISC), number 3724 in LNCS, pages 3-17, Krakow, Poland. Springer Verlag.

Larkey, L. S., Connell, M. E., and Callan, J. (2000). Collection selection and results merging with topically organized U.S. patents and TREC data. In CIKM '00: Proceedings of the ninth international conference on Information and knowledge management, pages 282-289, New York, NY, USA. ACM Press.

Liu, X. and Croft, W. B. (2004). Cluster-based retrieval using language models. In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 186-193, New York, NY, USA. ACM Press.
74. References (continued)

Lucchese, C., Orlando, S., Perego, R., and Silvestri, F. (2007). Mining query logs to optimize index partitioning in parallel web search engines. To appear in Proceedings of the 2nd International Conference on Scalable Information Systems (INFOSCALE 2007).

Moffat, A., Webber, W., and Zobel, J. (2006). Load balancing for term-distributed parallel retrieval. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 348-355, New York, NY, USA. ACM Press.

Puppin, D., Silvestri, F., and Laforenza, D. (2006). Query-driven document partitioning and collection selection. In InfoScale '06: Proceedings of the 1st international conference on Scalable information systems, page 34, New York, NY, USA. ACM Press.
75. References (continued)

Webber, W., Moffat, A., Zobel, J., and Baeza-Yates, R. (2006). A pipelined architecture for distributed text query evaluation. Information Retrieval. Published online October 5, 2006.