1. DISTRIBUTED INFORMATION RETRIEVAL
Distributed computing is a field of computer science that studies distributed systems. A distributed system is a
system whose components are located on different networked computers, which communicate and coordinate
their actions by passing messages to one another.
Distributed computing is a model in which components of a software system are shared among multiple
computers or nodes.
Telephone and cellular networks are also examples of distributed networks. The telephone network has been
around for over a century and began as an early example of a peer-to-peer network. Cellular networks are
distributed networks with base stations physically distributed in areas called cells.
Distributed computing allows different users or computers to share information. Distributed computing can
allow an application on one machine to leverage processing power, memory, or storage on another machine.
A multi-database model of distributed information retrieval is presented in which people are assumed to have
access to many searchable text databases. In such an environment, full-text information retrieval consists of
discovering database contents, ranking databases by their expected ability to satisfy the query, searching a small
number of databases, and merging the results returned by different databases.
2. DISTRIBUTED IR
Distributed IR can be viewed as a MIMD parallel processor with
relatively slow interprocessor communication.
There is freedom to employ a heterogeneous collection of processors in the system.
A single processing node in distributed computing could be a parallel computer in its own right.
If they support the same public interface and protocol for invoking their services, the computers in the system can
be owned and operated by different parties.
Two main differences:
i. Subtasks run on different computers, and communication between the subtasks is performed using TCP/IP
rather than shared-memory-based inter-process communication.
ii. A procedure is employed for selecting a subset of distributed servers to process a particular request
rather than broadcasting every request to every server.
Shared-memory MIMD architecture characteristics:
i. Comprises a group of memory modules and processors.
ii. Any processor is able to directly access any memory module by means of an interconnection
network.
iii. The group of memory modules defines a universal address space that is shared among the
processors.
4. Distributed computing usually involves computation and data
split into coarse-grained operations with relatively little communication required
between the operations.
Parallel IR based on document partitioning fits this model well.
Documents are typically grouped into collections,
either for administrative purposes or to combine similar documents into one source.
A collection provides a natural granularity for distributing data across servers and
partitioning the computation.
Consider both the engineering issues of distributed computing and the algorithmic issues of IR.
Engineering issues involve:
i. Defining a search protocol for transmitting requests and results.
ii. Designing a server that can efficiently accept a request and initiate a subprocess or
thread to service the request.
iii. Exploiting any locality inherent in the processing using appropriate caching techniques.
iv. Designing a broker that can submit search requests to multiple
servers in parallel and combine the intermediate results into a final end-user
response.
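As an illustration of issue iv, a broker can fan a query out to several servers concurrently and merge whatever comes back. This is a minimal sketch in which each "server" is just a callable returning (doc_id, score) pairs, standing in for a real network request:

```python
from concurrent.futures import ThreadPoolExecutor

def broker_search(servers, query):
    # Submit the query to every server in parallel; each "server" here is
    # a callable returning a list of (doc_id, score) pairs -- a stand-in
    # for a real network request in this sketch.
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        partials = pool.map(lambda s: s(query), servers)
        merged = [hit for part in partials for hit in part]
    # Combine the intermediate results into a single ranked response.
    return sorted(merged, key=lambda hit: hit[1], reverse=True)
```

A real broker would add timeouts and error handling per server; the parallel fan-out and merge structure stays the same.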
Algorithmic issues involve:
i. How to distribute documents across the distributed search servers
ii. How to select which servers should receive a particular search request
iii. How to combine the results from the different servers
5. A protocol should allow a client to:
i. Obtain information about a search server, e.g. the list of databases available for searching
at the server.
ii. Submit a search request for one or more of the available databases using a well-defined
query language.
iii. Receive search results in a well-defined format.
iv. Retrieve the items identified in the search results.
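A minimal sketch of what such protocol messages might look like, using JSON over an assumed request/response transport; every field name here ("op", "dbs", "query", "ids") is an illustrative assumption, not part of any real protocol:

```python
import json

def describe_request():
    # i. Ask the server what databases it offers.
    return json.dumps({"op": "describe"})

def search_request(dbs, query):
    # ii. Submit a search against one or more databases.
    return json.dumps({"op": "search", "dbs": dbs, "query": query})

def fetch_request(ids):
    # iv. Retrieve items identified in earlier results.
    return json.dumps({"op": "fetch", "ids": ids})

# iii. One possible well-defined result format a server could return:
example_result = {"db": "news", "hits": [{"id": "d42", "score": 0.83}]}
```

Real systems use standardized protocols for this purpose; the point is only that each of the four client operations maps to a well-defined message.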
6. COLLECTION PARTITIONING:
Partitioning specifies how the incoming data is partitioned or collected
before an operation is performed.
It can also specify that the data should be sorted before being operated on.
Collection partitioning in a decentralized system:
Distributed document collections are built and maintained independently.
There is no central control of the document partitioning procedure.
Each server is focused on a particular subject area.
Collection partitioning in a centralized system:
The first option is simple replication of the collection across all of the search servers;
parallelism is exploited via multitasking, and the broker's job is to route queries to the
search servers and balance the load on them.
The second option is random distribution of the documents.
The final option is explicit semantic partitioning of the documents.
7. SOURCE SELECTION
The process of determining which of the distributed collections are most likely to contain
relevant documents for the current query and therefore should receive the query for
processing.
There are two approaches:
1. Simple approach: assume that every collection is likely to contain relevant
documents and always broadcast the query to all collections.
Appropriate when documents are randomly partitioned.
2. Collections can also be ranked according to their likelihood of containing relevant documents.
Appropriate when:
i. documents are partitioned into semantically meaningful collections
ii. it is prohibitively expensive to search every collection every time
The basic technique is to treat each collection as if it were a single large
document:
Index the collections.
Evaluate the query against the collection indexes to produce a ranked listing of
collections.
Apply the standard cosine similarity measure using a query vector and collection
vectors.
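The collection ranking step above can be sketched as follows, with sparse term-to-weight dictionaries standing in for the query and collection vectors:

```python
import math

def cosine(q, c):
    # q and c are sparse term -> weight vectors (plain dicts).
    dot = sum(w * c.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nc = math.sqrt(sum(w * w for w in c.values()))
    return dot / (nq * nc) if nq and nc else 0.0

def rank_collections(query_vec, collection_vecs):
    # Score every collection vector against the query, rank descending.
    scored = [(name, cosine(query_vec, vec))
              for name, vec in collection_vecs.items()]
    return sorted(scored, key=lambda nv: nv[1], reverse=True)
```

The broker would then forward the query only to the top-ranked collections.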
8. Term weights in the collection vectors are calculated in tf-idf style:
The term frequency tf_ij is the total number of occurrences of term i in collection j.
The inverse document frequency idf_i for term i is log(N/n_i), where N is the total number of collections
and n_i is the number of collections in which term i appears.
The problem with this approach is that there may be no individual document within a highly
scored collection that receives a high query relevance score, essentially resulting in a false drop and unnecessary
work in scoring the collection.
To avoid this problem, Moffat and Zobel proposed indexing each collection as a
series of blocks, where each block contains B documents.
When B = 1, this is equivalent to indexing all of the documents as a single, monolithic collection.
When B equals the number of documents in each collection, this is equivalent to the original
solution.
By varying B, a trade-off is made between collection index size and the likelihood of false drops.
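The tf-idf weighting above can be sketched as follows, with plain token lists standing in for real collection indexes:

```python
import math

def collection_vectors(collections):
    # collections: {name: list of term occurrences (tokens)}.
    # Weight of term i in collection j is tf_ij * log(N / n_i), where
    # tf_ij counts occurrences of i in j, N is the number of collections,
    # and n_i is the number of collections containing term i.
    N = len(collections)
    tfs = {}
    for name, tokens in collections.items():
        tf = {}
        for t in tokens:
            tf[t] = tf.get(t, 0) + 1
        tfs[name] = tf
    ni = {}
    for tf in tfs.values():
        for t in tf:
            ni[t] = ni.get(t, 0) + 1
    return {name: {t: tf[t] * math.log(N / ni[t]) for t in tf}
            for name, tf in tfs.items()}
```

Note that a term appearing in every collection gets weight zero, since log(N/N) = 0; such terms cannot discriminate between collections.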
9. QUERY PROCESSING
Proceeds as follows:
I. Select collections to search
II. Distribute query to selected collections
III. Evaluate query at distributed collections in parallel
IV. Combine results from distributed collections into final result
Step I can be eliminated if the query is always broadcast to every document collection;
otherwise, one of the source selection algorithms is used for this step.
Each of the participating search servers then evaluates the query on the selected collections using its own local
search algorithm.
Finally, results are merged
MERGING THE RESULTS:
A number of scenarios arise when merging the results:
If the query is Boolean, Boolean result sets are returned and the final result is the union of the result
sets.
If the query involves free-text ranking, a number of techniques are available, ranging from simple to complex.
Simplest approach: combine the ranked result lists using round-robin interleaving:
1: 1st doc from 1st list
2: 1st doc from 2nd list
3: 1st doc from 3rd list
(and so on; once every list has contributed its 1st document, take each list's 2nd document, etc.)
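Round-robin interleaving can be sketched as:

```python
def round_robin_merge(lists):
    # Take the i-th document from each ranked list in turn until every
    # list is exhausted; the relevance scores are ignored entirely.
    merged, i = [], 0
    while any(i < len(lst) for lst in lists):
        for lst in lists:
            if i < len(lst):
                merged.append(lst[i])
        i += 1
    return merged
```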
This is likely to produce poor quality results, since hits from irrelevant collections are given status equal to
hits from relevant collections.
11. A better approach is to merge by relevance score, where proper global term statistics are used to compute the document scores.
If documents are randomly distributed such that global term statistics are
consistent across all of the distributed collections, the merging based on
relevance score is sufficient.
If the documents are semantically partitioned, then reranking must be
performed.
RERANKING: weight the document scores based on their collection's similarity score
computed during the source selection step.
The weight for a collection can be calculated as:
w = 1 + |C| * (s - s_mean) / s_mean
where |C| is the number of collections searched,
s is the collection's score,
and s_mean is the mean of the collection scores.
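A sketch of this reranking, applying the collection weight to every document score before the global sort (the data layout is an illustrative assumption):

```python
def rerank(results_by_collection, collection_scores):
    # Weight each document score by its collection's merge weight
    # w = 1 + |C| * (s - s_mean) / s_mean, computed from the source
    # selection scores, then re-sort the pooled results globally.
    C = len(collection_scores)
    s_mean = sum(collection_scores.values()) / C
    merged = []
    for name, hits in results_by_collection.items():
        w = 1 + C * (collection_scores[name] - s_mean) / s_mean
        merged.extend((doc, w * score) for doc, score in hits)
    return sorted(merged, key=lambda hit: hit[1], reverse=True)
```

A collection scoring above the mean boosts its documents (w > 1); one scoring below the mean demotes them (w < 1).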
A more accurate technique for merging ranked result lists is to use accurate global term
statistics.
If the collections have been indexed for source selection, that index will contain
global term statistics across all of the distributed collections.
The broker can include these statistics in the query when it distributes the query to
the search servers.
The servers can use these statistics in their processing and produce relevance scores
that can be merged directly.
12. If a collection index is unavailable, query distribution can proceed in two rounds of communication.
In the first round, the broker distributes the query and gathers collection statistics from each server.
These statistics are combined by the broker and distributed back to the servers in the second round.
The search protocol can also require that the servers return the global query term statistics and per-document query
term statistics.
The broker is then free to rerank every document using the query term statistics and a ranking algorithm of its
choice.
The end result is a list of documents from the distributed collections ranked in the same order as if all of the
documents had been indexed in a single collection.
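The broker's combination step between the two rounds can be sketched as follows, assuming each server reports its document count and per-term document frequencies in a simple dictionary format (the format is an assumption of this sketch):

```python
def combine_statistics(per_server_stats):
    # Round 1 output from each server: {"N": doc count, "df": {term: n}}.
    # The broker sums these into global statistics for round 2, so that
    # every server computes idf from the same global N and df values.
    N = sum(s["N"] for s in per_server_stats)
    df = {}
    for s in per_server_stats:
        for term, n in s["df"].items():
            df[term] = df.get(term, 0) + n
    return {"N": N, "df": df}
```

In round 2 the broker sends this combined dictionary along with the query, and the per-server relevance scores become directly comparable.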