1. DISTRIBUTED INFORMATION RETRIEVAL
Distributed computing is a field of computer science that studies distributed systems. A distributed system is a
system whose components are located on different networked computers, which communicate and coordinate
their actions by passing messages to one another.
Distributed computing is a model in which components of a software system are shared among multiple
computers or nodes.
Telephone and cellular networks are also examples of distributed networks. The telephone network has been
around for over a century and began as an early example of a peer-to-peer network. Cellular networks are
distributed networks with base stations physically distributed in areas called cells.
Distributed computing allows different users or computers to share information. Distributed computing can
allow an application on one machine to leverage processing power, memory, or storage on another machine.
A multi-database model of distributed information retrieval is presented in which people are assumed to have
access to many searchable text databases. In such an environment, full-text information retrieval consists of
discovering database contents, ranking databases by their expected ability to satisfy the query, searching a small
number of databases, and merging the results returned by different databases.
2. DISTRIBUTED IR
Distributed IR can be viewed as a MIMD parallel processor with
relatively slow interprocessor communication.
There is freedom to employ a heterogeneous collection of processors in the system.
A single processing node in distributed computing could be a parallel computer in its own right.
If they support the same public interface and protocol for invoking their services, the computers in the system can
be owned and operated by different parties.
Two main differences:
i. Subtasks run on different computers, and communication between the subtasks is performed using TCP/IP
rather than shared-memory-based inter-process communication.
ii. A procedure is employed for selecting a subset of distributed servers to process a particular request
rather than broadcasting every request to every server.
Shared-memory MIMD architecture characteristics:
i. Comprises a group of memory modules and processors.
ii. Any processor is able to directly access any memory module by means of an interconnection
network.
iii. The group of memory modules defines a universal address space that is shared among the
processors.
4. Distributed computing usually involves computation and data
split into coarse-grained operations with relatively little communication required
between the operations.
Parallel IR based on document partitioning fits this model well.
Documents are typically grouped into collections,
either for administrative purposes or to combine similar documents into one source.
A collection provides a natural granularity for distributing data across servers and
partitioning the computation.
Consider both the engineering issues of distributed computing and the algorithmic issues of IR.
Engineering issues involve:
i. Defining a search protocol for transmitting requests and results.
ii. Designing a server that can efficiently accept a request and initiate a subprocess or
thread to service the request.
iii. Exploiting any locality inherent in the processing using appropriate caching techniques.
iv. Designing a broker that can submit search requests to multiple
servers in parallel and combine the intermediate results into a final end-user
response.
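As an illustration of issue iv, a broker can fan a query out to several servers concurrently and merge whatever comes back. This is a minimal sketch in which each "server" is just a callable returning (doc_id, score) pairs, standing in for a real network request:

```python
from concurrent.futures import ThreadPoolExecutor

def broker_search(servers, query):
    # Submit the query to every server in parallel; each "server" here is
    # a callable returning a list of (doc_id, score) pairs -- a stand-in
    # for a real network request in this sketch.
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        partials = pool.map(lambda s: s(query), servers)
        merged = [hit for part in partials for hit in part]
    # Combine the intermediate results into a single ranked response.
    return sorted(merged, key=lambda hit: hit[1], reverse=True)
```

A real broker would add timeouts and error handling per server; the parallel fan-out and merge structure stays the same.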
Algorithmic issues involve:
i. How to distribute documents across the distributed search servers
ii. How to select which servers should receive a particular search request
iii. How to combine the results from the different servers
5. A protocol should allow a client to:
i. Obtain information about a search server, e.g. the list of databases available for searching
at the server.
ii. Submit a search request for one or more of the available databases using a well-defined
query language.
iii. Receive search results in a well-defined format.
iv. Retrieve the items identified in the search results.
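A minimal sketch of what such protocol messages might look like, using JSON over an assumed request/response transport; every field name here ("op", "dbs", "query", "ids") is an illustrative assumption, not part of any real protocol:

```python
import json

def describe_request():
    # i. Ask the server what databases it offers.
    return json.dumps({"op": "describe"})

def search_request(dbs, query):
    # ii. Submit a search against one or more databases.
    return json.dumps({"op": "search", "dbs": dbs, "query": query})

def fetch_request(ids):
    # iv. Retrieve items identified in earlier results.
    return json.dumps({"op": "fetch", "ids": ids})

# iii. One possible well-defined result format a server could return:
example_result = {"db": "news", "hits": [{"id": "d42", "score": 0.83}]}
```

Real systems use standardized protocols for this purpose; the point is only that each of the four client operations maps to a well-defined message.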
6. COLLECTION PARTITIONING:
Partitioning specifies how the incoming data is partitioned or collected
before an operation is performed.
It can also specify that the data should be sorted before being operated on.
Collection partitioning in a decentralized system:
Distributed document collections are built and maintained independently.
There is no central control of the document partitioning procedure.
Each server is focused on a particular subject area.
Collection partitioning in a centralized system:
The first option is simple replication of the collection across all of the search servers;
parallelism is exploited via multitasking, and the broker's job is to route queries to the
search servers and balance the load on them.
The second option is random distribution of the documents.
The final option is explicit semantic partitioning of the documents.
7. SOURCE SELECTION
The process of determining which of the distributed collections are most likely to contain
relevant documents for the current query and therefore should receive the query for
processing.
There are two approaches:
1. Simple approach: assume that every collection is likely to contain relevant
documents and always broadcast the query to all collections.
Appropriate when documents are randomly partitioned.
2. Collections can also be ranked according to their likelihood of containing relevant documents.
Appropriate when:
i. documents are partitioned into semantically meaningful collections
ii. it is prohibitively expensive to search every collection every time
The basic technique is to treat each collection as if it were a single large
document:
Index the collections.
Evaluate the query against the collection indexes to produce a ranked listing of
collections.
Apply the standard cosine similarity measure using a query vector and collection
vectors.
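The collection ranking step above can be sketched as follows, with sparse term-to-weight dictionaries standing in for the query and collection vectors:

```python
import math

def cosine(q, c):
    # q and c are sparse term -> weight vectors (plain dicts).
    dot = sum(w * c.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nc = math.sqrt(sum(w * w for w in c.values()))
    return dot / (nq * nc) if nq and nc else 0.0

def rank_collections(query_vec, collection_vecs):
    # Score every collection vector against the query, rank descending.
    scored = [(name, cosine(query_vec, vec))
              for name, vec in collection_vecs.items()]
    return sorted(scored, key=lambda nv: nv[1], reverse=True)
```

The broker would then forward the query only to the top-ranked collections.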
8. Term weights in the collection vectors are calculated in tf-idf style:
The term frequency tf_ij is the total number of occurrences of term i in collection j.
The inverse document frequency idf_i for term i is log(N/n_i), where N is the total number of collections
and n_i is the number of collections in which term i appears.
The problem with this approach is that there may be no individual document within a highly
scored collection that receives a high query relevance score, essentially resulting in a false drop and unnecessary
work in scoring the collection.
To avoid this problem, Moffat and Zobel proposed indexing each collection as a
series of blocks, where each block contains B documents.
When B = 1, this is equivalent to indexing all of the documents as a single, monolithic collection.
When B equals the number of documents in each collection, this is equivalent to the original
solution.
By varying B, a trade-off is made between collection index size and the likelihood of false drops.
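The tf-idf weighting above can be sketched as follows, with plain token lists standing in for real collection indexes:

```python
import math

def collection_vectors(collections):
    # collections: {name: list of term occurrences (tokens)}.
    # Weight of term i in collection j is tf_ij * log(N / n_i), where
    # tf_ij counts occurrences of i in j, N is the number of collections,
    # and n_i is the number of collections containing term i.
    N = len(collections)
    tfs = {}
    for name, tokens in collections.items():
        tf = {}
        for t in tokens:
            tf[t] = tf.get(t, 0) + 1
        tfs[name] = tf
    ni = {}
    for tf in tfs.values():
        for t in tf:
            ni[t] = ni.get(t, 0) + 1
    return {name: {t: tf[t] * math.log(N / ni[t]) for t in tf}
            for name, tf in tfs.items()}
```

Note that a term appearing in every collection gets weight zero, since log(N/N) = 0; such terms cannot discriminate between collections.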
9. QUERY PROCESSING
Proceeds as follows:
I. Select collections to search
II. Distribute query to selected collections
III. Evaluate query at distributed collections in parallel
IV. Combine results from distributed collections into final result
Step I can be eliminated if the query is always broadcast to every document collection;
otherwise, one of the source selection algorithms is used for this step.
Each of the participating search servers then evaluates the query on the selected collections using its own local
search algorithm.
Finally, results are merged
MERGING THE RESULTS:
A number of scenarios arise when merging the results:
If the query is Boolean, Boolean result sets are returned and the final result is the union of the result
sets.
If the query involves free-text ranking, a number of techniques are available, ranging from simple to complex.
Simplest approach: combine the ranked result lists using round-robin interleaving:
1: 1st doc from 1st list
2: 1st doc from 2nd list
3: 1st doc from 3rd list
(and so on; once every list has contributed its 1st document, take each list's 2nd document, etc.)
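Round-robin interleaving can be sketched as:

```python
def round_robin_merge(lists):
    # Take the i-th document from each ranked list in turn until every
    # list is exhausted; the relevance scores are ignored entirely.
    merged, i = [], 0
    while any(i < len(lst) for lst in lists):
        for lst in lists:
            if i < len(lst):
                merged.append(lst[i])
        i += 1
    return merged
```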
This is likely to produce poor quality results, since hits from irrelevant collections are given status equal to
hits from relevant collections.
11. A better approach is to merge by relevance score, where proper global term statistics are used to compute the document scores.
If documents are randomly distributed such that global term statistics are
consistent across all of the distributed collections, the merging based on
relevance score is sufficient.
If the documents are semantically partitioned, then reranking must be
performed.
RERANKING: weight the document scores based on their collection's similarity score
computed during the source selection step.
The weight for a collection can be calculated as:
w = 1 + |C| * (s - s_mean) / s_mean
where |C| is the number of collections searched,
s is the collection's score,
and s_mean is the mean of the collection scores.
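A sketch of this reranking, applying the collection weight to every document score before the global sort (the data layout is an illustrative assumption):

```python
def rerank(results_by_collection, collection_scores):
    # Weight each document score by its collection's merge weight
    # w = 1 + |C| * (s - s_mean) / s_mean, computed from the source
    # selection scores, then re-sort the pooled results globally.
    C = len(collection_scores)
    s_mean = sum(collection_scores.values()) / C
    merged = []
    for name, hits in results_by_collection.items():
        w = 1 + C * (collection_scores[name] - s_mean) / s_mean
        merged.extend((doc, w * score) for doc, score in hits)
    return sorted(merged, key=lambda hit: hit[1], reverse=True)
```

A collection scoring above the mean boosts its documents (w > 1); one scoring below the mean demotes them (w < 1).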
A more accurate technique for merging ranked result lists is to use accurate global term
statistics.
If the collections have been indexed for source selection, that index will contain
global term statistics across all of the distributed collections.
The broker can include these statistics in the query when it distributes the query to
the search servers.
The servers can use these statistics in their processing and produce relevance scores
that can be merged directly.
12. If a collection index is unavailable, query distribution can proceed in two rounds of communication.
In the first round, the broker distributes the query and gathers collection statistics from each server.
These statistics are combined by the broker and distributed back to the servers in the second round.
The search protocol can also require that the servers return the global query term statistics and per-document query
term statistics.
The broker is then free to rerank every document using the query term statistics and a ranking algorithm of its
choice.
The end result is a list of documents from the distributed collections ranked in the same order as if all of the
documents had been indexed in a single collection.
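The broker's combination step between the two rounds can be sketched as follows, assuming each server reports its document count and per-term document frequencies in a simple dictionary format (the format is an assumption of this sketch):

```python
def combine_statistics(per_server_stats):
    # Round 1 output from each server: {"N": doc count, "df": {term: n}}.
    # The broker sums these into global statistics for round 2, so that
    # every server computes idf from the same global N and df values.
    N = sum(s["N"] for s in per_server_stats)
    df = {}
    for s in per_server_stats:
        for term, n in s["df"].items():
            df[term] = df.get(term, 0) + n
    return {"N": N, "df": df}
```

In round 2 the broker sends this combined dictionary along with the query, and the per-server relevance scores become directly comparable.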