DISTRIBUTED INFORMATION RETRIEVAL
 Distributed computing is a field of computer science that studies distributed systems. A distributed system is a
system whose components are located on different networked computers, which communicate and coordinate
their actions by passing messages to one another.
 Distributed computing is a model in which components of a software system are shared among multiple
computers or nodes.
 Telephone and cellular networks are also examples of distributed networks. Telephone networks have been
around for over a century and started as an early example of a peer-to-peer network. Cellular networks are
distributed networks with base stations physically distributed in areas called cells.
 Distributed computing allows different users or computers to share information. Distributed computing can
allow an application on one machine to leverage processing power, memory, or storage on another machine.
 A multi-database model of distributed information retrieval is presented, in which people are assumed to have
access to many searchable text databases. In such an environment, full-text information retrieval consists of
discovering database contents, ranking databases by their expected ability to satisfy the query, searching a small
number of databases, and merging the results returned by the different databases.
DISTRIBUTED IR
 Can be viewed as a MIMD parallel processor with:
 Relatively slow interprocessor communication
 Freedom to employ a heterogeneous collection of processors in the system.
 A single processing node in a distributed computing (DC) system could be a parallel computer in its own right
 If they support the same public interface and protocol for invoking their services, computers in the system can
be owned and operated by different parties
 Two main differences from parallel MIMD processing:
i. Subtasks run on different computers, and communication between the subtasks is performed using a network protocol such as TCP/IP
rather than shared-memory inter-process communication.
ii. It employs a procedure for selecting a subset of the distributed servers to process a particular request,
rather than broadcasting every request to every server.
 Shared Memory MIMD architecture characteristics:
i. It consists of a group of memory modules and processors.
ii. Any processor is able to directly access any memory module by means of an interconnection
network.
iii. The memory modules together define a global address space that is shared among the
processors.
 Applications suited to DC usually involve computation and data
 that can be split into coarse-grained operations with relatively little communication required
between the operations.
 Parallel IR based on document partitioning fits this profile well.
 Documents are almost always grouped into collections,
 either for administrative purposes or to combine similar documents into one source
 A collection provides a natural granularity for distributing data across servers and
partitioning the computation.
 Consider both the engineering issues of DC and the algorithmic issues of IR.
 Engineering issues involve:
i. Defining a search protocol for transmitting requests and results
ii. Designing a server that can efficiently accept a request and initiate a subprocess or
thread to service the request.
iii. Exploiting any locality inherent in the processing using appropriate caching techniques
iv. Designing a broker that can submit asynchronous search requests to multiple
servers in parallel and combine the intermediate results into a final end-user
response.
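A minimal sketch of the broker described in item iv, assuming in-memory dictionaries stand in for remote search servers and a simple term-count score stands in for a real ranking function. The names `SERVERS`, `search_server`, and `broker_search` are hypothetical, and a thread pool is only one of several ways to issue the requests in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "servers": each maps a document id to its text.
SERVERS = {
    "server_a": {"a1": "distributed search systems", "a2": "cooking recipes"},
    "server_b": {"b1": "parallel search engines", "b2": "search result merging"},
}

def search_server(name, query):
    """Stand-in for a network call: return (doc_id, score) pairs from one server."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in SERVERS[name].items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)  # toy score: term occurrences
        if score > 0:
            hits.append((doc_id, score))
    return hits

def broker_search(query, server_names):
    """Send the query to all servers in parallel and combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(server_names)) as pool:
        partial_results = list(pool.map(lambda n: search_server(n, query), server_names))
    merged = [hit for hits in partial_results for hit in hits]
    # Sort by descending score, then by doc id so the order is deterministic.
    return sorted(merged, key=lambda h: (-h[1], h[0]))

results = broker_search("search merging", ["server_a", "server_b"])
print(results)
```

In a real deployment the call inside `search_server` would be a request over the search protocol, and the broker would also need timeouts and error handling for servers that do not respond.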
 Algorithmic issues involve:
i. How to distribute documents across the distributed search servers
ii. How to select which servers should receive a particular search request
iii. How to combine the results from the different servers
 A protocol should allow a client to:
i. Obtain information about a search server, e.g. the list of databases available for searching
at the server.
ii. Submit a search request for one or more of the available databases using a well-defined
query language
iii. Receive search results in a well-defined format
iv. Retrieve the items identified in the search results
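The four protocol operations can be illustrated as a small interface. This is only a sketch under stated assumptions: `ToySearchServer`, `SearchHit`, and the method names are hypothetical and do not correspond to any real search protocol, and the "query language" is reduced to whitespace-separated keywords.

```python
from dataclasses import dataclass

@dataclass
class SearchHit:
    """Well-defined result format: where the hit came from and how it scored."""
    database: str
    doc_id: str
    score: float

class ToySearchServer:
    """Hypothetical server exposing the four protocol operations."""

    def __init__(self, databases):
        # databases: {database name: {doc_id: text}}
        self.databases = databases

    def list_databases(self):
        """(i) Describe the server: which databases can be searched."""
        return sorted(self.databases)

    def search(self, query, database_names):
        """(ii) and (iii) Accept a keyword query and return ranked hits."""
        terms = query.lower().split()
        hits = []
        for name in database_names:
            for doc_id, text in self.databases[name].items():
                words = text.lower().split()
                score = float(sum(words.count(t) for t in terms))
                if score > 0:
                    hits.append(SearchHit(name, doc_id, score))
        return sorted(hits, key=lambda h: (-h.score, h.database, h.doc_id))

    def retrieve(self, hit):
        """(iv) Fetch the item identified by a search hit."""
        return self.databases[hit.database][hit.doc_id]

server = ToySearchServer({
    "news": {"n1": "election results announced", "n2": "search engine launch"},
    "papers": {"p1": "distributed search protocol design"},
})
available = server.list_databases()
hits = server.search("search protocol", available)
top_text = server.retrieve(hits[0])
```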
COLLECTION PARTITIONING:
Partitioning specifies how the incoming data is partitioned or collected before an operation is performed.
It can also specify that the data should be sorted before being operated on.
Collection partitioning in a decentralized system:
 Distributed document collections are built and maintained independently.
 There is no central control of the document partitioning procedure
 Each server is focused on a particular subject area
Collection partitioning in a centralized system:
 The first option is simple replication of the collection across all of the search servers
 Parallelism is exploited via multitasking
 The broker’s job is to route queries to the search servers and balance the load on the
servers
 The second option is random distribution of the documents
 The final option is explicit semantic partitioning of the documents
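The three centralized options can be sketched as follows. The function names, the fixed random seed, and the topic labels are assumptions made for illustration; in practice the semantic assignment would come from a classifier or clustering step that is not shown here.

```python
import random

def replicate(doc_ids, num_servers):
    """Option 1: every server holds a full copy of the collection."""
    return [list(doc_ids) for _ in range(num_servers)]

def random_partition(doc_ids, num_servers, seed=0):
    """Option 2: shuffle the documents and deal them out evenly."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    return [shuffled[i::num_servers] for i in range(num_servers)]

def semantic_partition(doc_topics, topic_to_server, num_servers):
    """Option 3: place each document on the server responsible for its topic."""
    parts = [[] for _ in range(num_servers)]
    for doc_id, topic in doc_topics.items():
        parts[topic_to_server[topic]].append(doc_id)
    return parts

docs = [f"d{i}" for i in range(10)]
print(random_partition(docs, 3))
print(semantic_partition({"d0": "sports", "d1": "law", "d2": "sports"},
                         {"sports": 0, "law": 1}, 2))
```

With replication the broker only balances load; with random or semantic partitioning it must also decide which servers to query and how to merge their results.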
SOURCE SELECTION
 The process of determining which of the distributed collections are most likely to contain
documents relevant to the current query and should therefore receive the query for
processing
 There are two approaches:
1. Simple approach: assume that every collection is equally likely to contain relevant
documents and always broadcast the query to all collections
 Appropriate when documents are randomly partitioned
2. Ranking approach: collections are ranked according to their likelihood of containing relevant documents
 Appropriate when
i. documents are partitioned into semantically meaningful collections, or
ii. it is prohibitively expensive to search every collection every time
 The basic technique is to treat each collection as if it were a single large
document
 Index the collections
 Evaluate the query against the collections to produce a ranked listing of
collections
 Apply the standard cosine similarity measure using a query vector and collection
vectors
 Term weights in the collection vector are calculated tf-idf style:
 The term frequency $tf_{i,j}$ is the total number of occurrences of term $i$ in collection $j$
 The inverse document frequency for term $i$ is $idf_i = \log(N/n_i)$, where $N$ is the total number of collections
and $n_i$ is the number of collections in which term $i$ appears
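The ranking step described above can be sketched in a few lines. This is an illustrative implementation under simple assumptions (whitespace tokenization, raw tf times idf weights, no stemming or stop words); `rank_collections` and the sample collections are hypothetical.

```python
import math
from collections import Counter

def rank_collections(query, collections):
    """collections: {name: list of document strings}. Returns [(name, score)], best first."""
    # Treat each collection as one large document:
    # tf[j][i] = total occurrences of term i in collection j.
    tf = {name: Counter(w for doc in docs for w in doc.lower().split())
          for name, docs in collections.items()}
    n_collections = len(collections)
    # n_i = number of collections in which term i appears.
    df = Counter(term for counts in tf.values() for term in counts)
    idf = {term: math.log(n_collections / n) for term, n in df.items()}

    def vector(counts):
        # tf-idf weights; terms unknown to every collection are dropped.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q_vec = vector(Counter(query.lower().split()))
    scores = {name: cosine(q_vec, vector(counts)) for name, counts in tf.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

collections = {
    "sports": ["football match report", "tennis match results"],
    "cooking": ["pasta recipe", "bread recipe and baking"],
    "tech": ["search engine index", "distributed search engine"],
}
ranked = rank_collections("search engine", collections)
print(ranked)
```

The broker would then forward the query only to the top-ranked collections, for example the first one or two entries of `ranked`.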
 A problem with this approach is that a collection may receive a high query relevance score even though no individual document within it
receives a high query relevance score, essentially resulting in a false drop and unnecessary
work to score the collection
 To avoid this problem, Moffat and Zobel proposed indexing each collection as a
series of blocks, where each block contains B documents
 When B = 1, this is equivalent to indexing all of the documents as a single, monolithic collection
 When B equals the number of documents in each collection, this is equivalent to the original
solution
 By varying B, a trade-off is made between collection index size and the likelihood of false drops.
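The effect of the block size B on false drops can be illustrated with a toy example. The scoring used here (number of distinct query terms found in a block, with the best block's score taken as the collection's score) is an assumption made for illustration and is not Moffat and Zobel's actual ranking formula; the function names are hypothetical.

```python
def make_blocks(docs, block_size):
    """Split a collection into consecutive blocks of at most block_size documents."""
    return [docs[i:i + block_size] for i in range(0, len(docs), block_size)]

def block_score(query_terms, block):
    """Toy score: how many distinct query terms occur somewhere in the block."""
    words = {w for doc in block for w in doc.lower().split()}
    return sum(1 for t in query_terms if t in words)

def collection_score(query, docs, block_size):
    """Score a collection by its best-matching block."""
    terms = query.lower().split()
    return max(block_score(terms, block) for block in make_blocks(docs, block_size))

# In "scattered" the two query terms never occur in the same document.
scattered = ["distributed databases", "image retrieval", "cooking tips"]
focused = ["distributed retrieval systems", "cooking tips", "gardening"]
query = "distributed retrieval"

for b in (3, 1):
    print(b, collection_score(query, scattered, b), collection_score(query, focused, b))
```

With B equal to the collection size, both collections look equally good, so searching `scattered` would be a false drop. With B = 1 only `focused` keeps the top score, at the cost of indexing every document separately.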
QUERY PROCESSING
 Proceeds as follows:
I. Select collections to search
II. Distribute query to selected collections
III. Evaluate query at distributed collections in parallel
IV. Combine results from distributed collections into final result
 Step I can be eliminated if the query is always broadcast to every document collection
 Otherwise, one of the source selection algorithms is used for this step
 Each of the participating search servers then evaluates the query on the selected collections using its own local
search algorithm.
 Finally, the results are merged
 MERGING THE RESULTS:
 A number of scenarios arise when merging the results
 If the query is Boolean, Boolean result sets are returned and the final result is the union of the result
sets
 If the query involves free-text ranking, a number of techniques are available, ranging from simple to complex
 Simplest approach: combine the ranked result lists using round-robin interleaving
1: 1st doc from 1st list
2: 1st doc from 2nd list
3: 1st doc from 3rd list
4: 2nd doc from 1st list, and so on
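Round-robin interleaving takes one result from each list in turn. A minimal sketch, assuming each server returns its hits already ranked best first (the function name and sample document ids are hypothetical):

```python
from itertools import zip_longest

def round_robin_merge(result_lists):
    """Interleave ranked lists: first hit of each list, then second hit of each, and so on."""
    missing = object()  # sentinel marking lists that have run out of hits
    merged = []
    for tier in zip_longest(*result_lists, fillvalue=missing):
        merged.extend(doc for doc in tier if doc is not missing)
    return merged

merged = round_robin_merge([["a1", "a2", "a3"], ["b1"], ["c1", "c2"]])
print(merged)
```

Note that the merge ignores both document scores and collection quality, so every list contributes equally regardless of how relevant its collection is.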
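 This is likely to produce poor-quality results, since hits from irrelevant collections are given status equal to
that of hits from highly relevant collections.
ROUND ROBIN PARTITIONING
 An improvement is to merge the result lists by relevance score, which gives correct results only if proper global term statistics are used to compute the document scores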
 If documents are randomly distributed such that global term statistics are
consistent across all of the distributed collections, then merging based on
relevance score is sufficient.
 If the documents are semantically partitioned, then reranking must be
performed.
 RERANKING: documents are reranked by weighting their scores according to their collection's similarity to the query,
computed during the source selection step.
 The weight for a collection can be calculated as:
$w = 1 + |C| \cdot (s - \bar{s}) / \bar{s}$
where $|C|$ is the number of collections searched,
$s$ is the collection's score, and
$\bar{s}$ is the mean of the collection scores
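The collection weight can be applied to document scores before merging. The sketch below assumes collection scores are available from the source selection step and that each server returns `(doc_id, score)` pairs; the function names and sample values are hypothetical.

```python
def collection_weights(collection_scores):
    """w = 1 + |C| * (s - mean) / mean for each searched collection."""
    n = len(collection_scores)
    mean = sum(collection_scores.values()) / n
    return {name: 1 + n * (s - mean) / mean for name, s in collection_scores.items()}

def merge_with_weights(results, collection_scores):
    """Multiply each document score by its collection's weight, then sort."""
    weights = collection_weights(collection_scores)
    merged = [(doc_id, score * weights[name])
              for name, hits in results.items()
              for doc_id, score in hits]
    return sorted(merged, key=lambda h: (-h[1], h[0]))

scores = {"A": 6.0, "B": 3.0, "C": 3.0}          # from source selection
results = {"A": [("a1", 0.5)], "B": [("b1", 0.9)], "C": [("c1", 0.4)]}
weights = collection_weights(scores)
reranked = merge_with_weights(results, scores)
print(weights)
print(reranked)
```

Here `b1` has the highest raw score, but it comes from a below-average collection, so after weighting `a1` from the best-matching collection is ranked first.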
 A more accurate technique for merging ranked result lists is to use accurate global term
statistics.
 If the collections have been indexed for source selection, that index will contain
global term statistics across all of the distributed collections
 The broker can include these statistics in the query when it distributes the query to
the search servers.
 The servers can use these statistics in their processing and produce relevance scores
that can be merged directly.
 If a collection index is unavailable, query distribution can proceed in two rounds of communication
 In the first round, the broker distributes the query and gathers collection statistics from each server
 These statistics are combined by the broker and distributed back to the servers in the second round
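The two-round exchange can be sketched as follows. The `Server` class, its method names, and the simple tf times idf score are assumptions for illustration; a real system would exchange these statistics over the search protocol.

```python
import math
from collections import Counter

class Server:
    """Hypothetical search server holding {doc_id: text}."""

    def __init__(self, docs):
        self.docs = docs

    def local_stats(self, terms):
        """Round 1: report the collection size and per-term document frequencies."""
        df = Counter()
        for text in self.docs.values():
            words = set(text.lower().split())
            df.update(t for t in terms if t in words)
        return len(self.docs), df

    def search(self, terms, global_n, global_df):
        """Round 2: score documents using the global statistics from the broker."""
        hits = []
        for doc_id, text in self.docs.items():
            counts = Counter(text.lower().split())
            score = sum(counts[t] * math.log(global_n / global_df[t])
                        for t in terms if global_df[t] > 0)
            if score > 0:
                hits.append((doc_id, score))
        return hits

def two_round_search(query, servers):
    terms = sorted(set(query.lower().split()))
    global_n, global_df = 0, Counter()
    for server in servers:                      # round 1: gather and combine
        n, df = server.local_stats(terms)
        global_n += n
        global_df.update(df)
    merged = [hit for server in servers         # round 2: search with global stats
              for hit in server.search(terms, global_n, global_df)]
    return sorted(merged, key=lambda h: (-h[1], h[0]))

s1 = Server({"a1": "distributed search", "a2": "search search engine"})
s2 = Server({"b1": "distributed retrieval"})
distributed = two_round_search("distributed search", [s1, s2])
single = two_round_search("distributed search", [Server({**s1.docs, **s2.docs})])
print(distributed)
```

Because every server scores with the same global statistics, the merged list matches what a single server holding all of the documents would produce, which the comparison with `single` checks.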
 Alternatively, the search protocol can require that the servers return global query term statistics and per-document query
term statistics
 The broker is then free to rerank every document using the query term statistics and a ranking algorithm of its
choice
 The end result is a list of documents from the distributed collections, ranked in the same order as if all of the
documents had been indexed in a single collection.

