Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 7 (book chapter 9):
Parallel and Distributed IR
Alexander Gelbukh
www.Gelbukh.com
Previous Chapter: Conclusions
 How to accelerate search? Same results as sequential
 Ideas:
 Quick-and-dirty rejection of bad objects, 100% recall
 Fast data structure for search (based on clustering)
 Careful check of all found candidates
 Solution: mapping into fewer-D feature space
 Condition: lower-bounding of the distance
 Assumption: skewed spectrum distribution
 Few coefficients concentrate energy, rest are less important
Previous Chapter: Research topics
 Object detection (pattern and image recognition)
 Automatic feature selection
 Spatial indexing data structures (more than 1D)
 New types of data.
 What features to select? How to determine them?
 Mixed-type data (e.g., webpages, or images with
sound and description)
 What clustering/IR methods are better suited for
what features? (What features for what methods?)
 Similar methods in data mining, ...
The problem
 Very large document collections
 Google: 4,000,000,000 pages
 Slow response?
 Solution: parallel computing
 Google: 10,000 computers
Parallel architectures
                      Data stream
Instruction stream    Single              Multiple
Single                SISD (classical)    SIMD (simple)
Multiple              MISD (rare)         MIMD (many SISD)
MIMD architecture
 The most common
 Can be
 tightly coupled
 loosely coupled
 Distributed
 Many computers interacting via network
 PC Clusters
 Similar to MIMD computers, but greater cost of
communication
 very loosely coupled
 More coarse-grained programs
Performance improvement
Time: speedup S
 Ideally, N times (N = number of processors)
 In practice, this is not attainable:
 The problem does not decompose into N equal parts
 Communication and control overhead
 S < 1 / f, where f is the fraction of the problem that must be
computed sequentially (Amdahl's law)
Cost
 Efficiency (speedup per processor): S / N
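The bound on S can be made concrete with a small calculation. This is a minimal sketch, assuming f denotes the fraction of the work that has to run sequentially (the usual Amdahl's-law reading); the function names are illustrative and not from the lecture.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Upper bound on speedup with n processors when a fraction f of the
    work is sequential. Tends to 1 / f as n grows."""
    if not 0.0 <= f <= 1.0 or n < 1:
        raise ValueError("need 0 <= f <= 1 and n >= 1")
    return 1.0 / (f + (1.0 - f) / n)


def efficiency(speedup: float, n: int) -> float:
    """Speedup per processor, S / N."""
    return speedup / n
```

Under this assumption, with f = 0.1 no number of processors can give a speedup of 10 or more, which is why the efficiency S / N drops as N grows.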
Two approaches to parallelism
 Build new algorithms
 E.g., neural nets
 Naturally parallel
 Problem: to define the retrieval task
 Adapt the existing techniques to parallelism
 Allows relying on well-studied approaches
 We will consider this option
Ways to use parallelism
 Multitasking
 N search engines
 Good for processing many queries
Problems:
 A single query is not sped up
 Bottleneck: disk access (index)
 Possible solution: replicating (part of) data. RAIDs
 Parallel algorithms
 IR = data. Main question: how to partition the data
 Document / index term matrix
(terms can be LSI dimensions, signature bits, etc)
Possible partitionings
 Horizontal: document partitioning. Union of results
 Vertical: term partitioning. Basically, intersect results
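The two ways of cutting the document / term matrix can be sketched as follows. This is an illustration only: the toy collection, the function names, and the round-robin and byte-sum assignment rules are assumptions made for the example, not part of the lecture.

```python
# Toy collection: doc id -> set of index terms (hypothetical data).
docs = {
    0: {"parallel", "ir"},
    1: {"distributed", "ir"},
    2: {"parallel", "index"},
    3: {"ir", "index"},
}


def document_partition(docs, n):
    """Horizontal split: each of n processors holds whole documents (rows)."""
    parts = [dict() for _ in range(n)]
    for doc_id, terms in docs.items():
        parts[doc_id % n][doc_id] = terms
    return parts


def term_partition(docs, n):
    """Vertical split: each of n processors holds whole posting lists (columns)."""
    parts = [dict() for _ in range(n)]
    for doc_id, terms in sorted(docs.items()):
        for term in terms:
            owner = sum(term.encode()) % n  # deterministic toy assignment
            parts[owner].setdefault(term, []).append(doc_id)
    return parts
```

With document partitioning every processor can answer any query on its own rows, so the partial answers are united; with term partitioning a processor only knows the columns it owns, so the partial answers have to be combined term by term.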
Inverted files: Logical partitioning
 Logical vs. physical document partitioning
 Logical: for each term, use pointers into inverted file data for
each processor, to indicate its portion
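One way to picture the per-processor pointers is as slice boundaries into each shared posting list. The sketch below assumes that posting lists are sorted by document id and that each processor owns a contiguous interval of ids; the name `logical_pointers` and the `boundaries` parameter are hypothetical.

```python
from bisect import bisect_left


def logical_pointers(postings, boundaries):
    """postings: term -> sorted list of doc ids (one shared inverted file).
    boundaries: ascending doc-id cut points, e.g. [0, 4, 8] for two
    processors owning doc ids [0, 4) and [4, 8).
    Returns term -> list of (start, end) slices, one per processor."""
    pointers = {}
    for term, plist in postings.items():
        cuts = [bisect_left(plist, b) for b in boundaries]
        pointers[term] = list(zip(cuts[:-1], cuts[1:]))
    return pointers
```

Processor p then reads only `postings[term][start:end]` for its own (start, end) pair, so the index itself is never copied or split.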
Inverted files: Logical partitioning
Construction and updating
 Also parallel
Construction
 Assign docs to processors
 Order docs such that each processor has an interval
 Process in parallel
 Merge. Each piece is ordered already
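The construction steps above can be sketched as follows. This is a simplified illustration with hypothetical names; threads stand in for separate processors here, and in CPython they would not actually speed up this CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def build_partial_index(doc_interval):
    """Index one contiguous interval of (doc_id, text) pairs."""
    index = {}
    for doc_id, text in doc_interval:
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index


def build_index(docs, n):
    """docs: list of (doc_id, text) sorted by doc_id; n: number of workers."""
    if not docs:
        return {}
    size = -(-len(docs) // n)  # ceiling division: documents per interval
    intervals = [docs[i:i + size] for i in range(0, len(docs), size)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(build_partial_index, intervals))
    merged = {}
    # Intervals arrive in doc-id order, so appending keeps every list sorted.
    for partial in partials:
        for term, plist in partial.items():
            merged.setdefault(term, []).extend(plist)
    return merged
```

Because each worker handles an interval of document ids, the merge step is a plain concatenation per term rather than a full sort.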
Inverted files:
Physical document partitioning
 Several separate collections, one per processor
 Separate indices
 Then the lists are merged (they are already ordered)
 Priority queue is used
 The result is not sorted; Insertion is quick
 The maximal element can be found quickly
 First k elements can be found rather quickly
 Details in the book
 Consistent scores are needed
 Global statistics are needed. They can be computed at index
time
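The merge of per-processor result lists with a priority queue can be sketched with the standard `heapq` module. It assumes each processor returns (score, doc_id) pairs already sorted by descending score and that scores are comparable across processors, i.e. computed from global statistics; `merge_top_k` is an illustrative name.

```python
import heapq
from itertools import islice


def merge_top_k(result_lists, k):
    """result_lists: per-processor lists of (score, doc_id), each sorted by
    descending score. Returns the k best hits overall."""
    # heapq.merge keeps one head element per list in a heap and yields
    # hits lazily, so only the first k are ever pulled.
    merged = heapq.merge(*result_lists, key=lambda hit: -hit[0])
    return list(islice(merged, k))
```

Only as many elements as are needed for the top k are taken from the queue, which matches the remark that the first k elements can be found fairly quickly without sorting everything.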
Logical or physical partitioning?
 Logical requires less communication
 Faster
 Physical is more flexible. Simpler implementation
 Simpler conversion of existing systems
Inverted files: Term partitioning
 Each processor processes a part of the inverted file
 The results are intersected (for AND)
 (or as appropriate for Boolean operations, OR and NOT)
 When term distribution in user queries is skewed,
then document partitioning is better
 When uniform, term partitioning is better.
 By a factor of about two for long queries, 5 – 10 for short (Web-like) ones
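For an AND query over a term-partitioned index, each processor can intersect the posting lists of the query terms it owns, and a broker intersects the partial results. The sketch below is an assumption-laden illustration (the name `evaluate_and` and the dict-per-processor layout are hypothetical), and it covers AND only.

```python
def evaluate_and(query_terms, term_parts):
    """AND query over a term-partitioned index.
    term_parts: one dict per processor, mapping each term it owns
    to a list of doc ids."""
    partials = []
    seen = set()
    for part in term_parts:  # conceptually, this loop body runs on each processor
        owned = [t for t in query_terms if t in part]
        if owned:
            seen.update(owned)
            partials.append(set.intersection(*(set(part[t]) for t in owned)))
    # A term that no processor indexes makes the whole conjunction empty.
    if not partials or seen != set(query_terms):
        return []
    return sorted(set.intersection(*partials))  # broker combines partial results
```

Only processors that own at least one query term do any work, which is one reason the balance between processors depends on how terms are distributed across queries.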
Suffix arrays
 Array construction can be parallelized
 merges are parallel
 Document partitioning is applied straightforwardly
 Each processor maintains its own suffix array
 Term partitioning can be applied
 Each processor owns a branch of the tree (lexicographic
interval)
 Bottleneck: all processors need access to the entire text
Signature files
 Document partitioning: straightforward
 Create query signature, distribute to each processor
 Merge results (using Boolean operations if needed)
 Term partitioning: shorter signatures
 Merging and eliminating false drops is slow
 This method is not recommended
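Document-partitioned signature search can be sketched as below. The signature width, the number of bits per term, and the use of SHA-256 to pick bit positions are all illustrative assumptions; the lecture does not prescribe a particular signature scheme.

```python
import hashlib

SIG_BITS = 64        # signature width (illustrative)
BITS_PER_TERM = 3    # bits set per term (illustrative)


def term_signature(term):
    """Set a few pseudo-random bits for one term."""
    digest = hashlib.sha256(term.encode()).digest()
    sig = 0
    for i in range(BITS_PER_TERM):
        sig |= 1 << (digest[i] % SIG_BITS)
    return sig


def signature(terms):
    """Superimpose (OR together) the signatures of all terms."""
    sig = 0
    for term in terms:
        sig |= term_signature(term)
    return sig


def search_partition(partition, query_terms):
    """partition: list of (doc_id, set_of_terms, doc_signature) on one processor."""
    q = signature(query_terms)
    # Cheap bitwise filter: every query bit must be set in the document signature.
    candidates = [(d, terms) for d, terms, s in partition if (s & q) == q]
    # Check the candidates against the real terms to eliminate false drops.
    return [d for d, terms in candidates if set(query_terms) <= terms]
```

The query signature is computed once and sent to every processor; each processor filters its own documents, and the per-processor hit lists are then merged.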
SIMD computers
 Single Instruction, Multiple data
 Uncommon
 Good for simple operations
 Bit operations in signature files
 Details in the book
 Ranking is supported in hardware in some computers
 If signature file does not fit into memory, can be
processed in batches
 I/O overhead
 Use multiple queries with the same batch
 This improves throughput, but not response time
… SIMD computers
 Inverted files are difficult to adapt to SIMD
 The inverted file is restructured
 Details in the book
Distributed IR
 MIMD with
 Slow communication
 Not all nodes are used for a given query
 Encryption issues
 Document partitioning is usually used
 Term partitioning imposes greater communication
overhead
 Document clustering can be useful (to distribute docs
by processors)
 Index clusters and then search only the best ones
 Another approach: use training queries, then similarity of
the user query to these
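The idea of indexing clusters and searching only the best ones can be sketched as a node-selection step. This assumes each node is summarized by a centroid of term weights and that cosine similarity is used to rank nodes; both choices, and the names below, are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def select_nodes(query_vec, centroids, m):
    """centroids: node name -> term-weight dict summarizing its documents.
    Returns the m nodes most similar to the query."""
    ranked = sorted(centroids,
                    key=lambda node: cosine(query_vec, centroids[node]),
                    reverse=True)
    return ranked[:m]
```

The query is then forwarded only to the selected nodes, which reduces communication at the possible cost of missing relevant documents stored elsewhere.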
Research topics
 How to evaluate the speedup
 New algorithms
 Adaptation of existing algorithms
 Merging the results is a bottleneck
 Meta search engines
 Creating large collections with judgements
 Is recall important?
Conclusions
 Parallel computing can improve
 response time for each query and/or
 throughput: number of queries processed per unit of time
 Document partitioning is simple
 good for distributed computing
 Term partitioning is good for some data structures
 Distributed computing is MIMD computing with slow
communication
 SIMD machines are good for Signature files
 Both (SIMD machines and signature files) are out of favor now
Thank you!
Till May 17? 18?, 6 pm

Parallel and Distributed Information Retrieval System

  • 1. Special Topics in Computer ScienceSpecial Topics in Computer Science Advanced Topics in Information RetrievalAdvanced Topics in Information Retrieval Lecture 7Lecture 7 (book chapter 9)(book chapter 9):: Parallel and Distributed IRParallel and Distributed IR Alexander Gelbukh www.Gelbukh.com
  • 2. Previous Chapter: ConclusionsPrevious Chapter: Conclusions  How to accelerate search? Same results as sequential  Ideas:  Quick-and-dirty rejection of bad objects, 100% recall  Fast data structure for search (based on clustering)  Careful check of all found candidates  Solution: mapping into fewer-D feature space  Condition: lower-bounding of the distance  Assumption: skewed spectrum distribution  Few coefficients concentrate energy, rest are less important
  • 3. Previous Chapter: Research topicsPrevious Chapter: Research topics  Object detection (pattern and image recognition)  Automatic feature selection  Spatial indexing data structures (more than 1D)  New types of data.  What features to select? How to determine them?  Mixed-type data (e.g., webpages, or images with sound and description)  What clustering/IR methods are better suited for what features? (What features for what methods?)  Similar methods in data mining, ...
  • 4. The problemThe problem  Very large document collections  Google: 4,000,000,000 pages  Slow response?  Solution: parallel computing  Google: 10,000 computers
  • 5. Parallel architecturesParallel architectures Data stream Single Multiple Instructionstream Single SISD classical SIMD simple Multiple MISD (rare) MIMD many SISD
  • 6. MIMD architectureMIMD architecture  The most common  Can be  tightly coupled  loosely coupled  Distributed  Many computers interacting via network  PC Clusters  Similar to MIMD computers, but greater cost of communication  very loosely coupled  More coarse-grained programs
  • 7. Performance improvementPerformance improvement Time: speedup S  Ideally, N times (number of processors)  In practice impossible  The problem does not decompose into N equal parts  Communication and control overhead  < 1 / f, where f is the largest separable fraction of the problem Cost  Per processor: S / N
  • 8. Two approaches to parallelismTwo approaches to parallelism  Build new algorithms  E.g., neural nets  Naturally parallel  Problem: to define the retrieval task  Adapt the existing techniques to parallelism  Allows relying on well-studied approaches  We will consider this option
  • 9. Ways to use parallelism
      • Multitasking
          • N search engines
          • Good for processing many queries
          • Problems:
              • A single query is not sped up
              • Bottleneck: disk access (index)
              • Possible solution: replicating (part of) the data; RAIDs
      • Parallel algorithms
          • IR = data. Main question: how to partition the data
          • Document / index term matrix (terms can be LSI dimensions, signature bits, etc.)
  • 10. Possible partitionings
      • Horizontal: document partitioning. Union of results
      • Vertical: term partitioning. Basically, intersect results
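
The two partitionings can be illustrated on a toy document / term matrix. This is only a sketch: the collection, the way documents and terms are assigned to processors, and all function names are assumptions made for the example, and the "processors" are plain loop iterations.

```python
from typing import Dict, List, Set

# Toy document/term matrix: document id -> set of index terms (made-up data).
DOCS: Dict[int, Set[str]] = {
    1: {"parallel", "retrieval"},
    2: {"distributed", "retrieval"},
    3: {"parallel", "distributed", "retrieval"},
    4: {"signature", "files"},
}


def split_by_documents(docs: Dict[int, Set[str]], n: int) -> List[Dict[int, Set[str]]]:
    """Horizontal split: each processor receives whole documents (rows)."""
    parts: List[Dict[int, Set[str]]] = [{} for _ in range(n)]
    for doc_id, terms in docs.items():
        parts[doc_id % n][doc_id] = terms
    return parts


def split_by_terms(docs: Dict[int, Set[str]], n: int) -> List[Dict[str, Set[int]]]:
    """Vertical split: each processor receives whole posting lists (columns)."""
    parts: List[Dict[str, Set[int]]] = [{} for _ in range(n)]
    for doc_id, terms in docs.items():
        for term in terms:
            owner = sum(map(ord, term)) % n  # deterministic toy assignment
            parts[owner].setdefault(term, set()).add(doc_id)
    return parts


def and_query_document_partitioned(query: Set[str], parts) -> Set[int]:
    """Every processor answers the full query on its documents; answers are united."""
    if not query:
        return set()
    result: Set[int] = set()
    for part in parts:
        result |= {doc_id for doc_id, terms in part.items() if query <= terms}
    return result


def and_query_term_partitioned(query: Set[str], parts) -> Set[int]:
    """Every processor returns postings for the terms it owns; lists are intersected."""
    if not query:
        return set()
    postings = []
    for term in query:
        owned = [part[term] for part in parts if term in part]
        if not owned:  # term absent from the collection
            return set()
        postings.append(owned[0])
    return set.intersection(*postings)
```

Both routes give the same answer for an AND query; they differ in what each processor stores and in how much data has to be exchanged before the final union or intersection.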
  • 11. Inverted files: Logical partitioning
      • Logical vs. physical document partitioning
      • Logical: for each term, use pointers into the inverted file data for each processor, to indicate its portion
  • 12. Inverted files: Logical partitioning. Construction and updating
      • Also parallel
      • Construction
          • Assign docs to processors
          • Order docs such that each processor has an interval
          • Process in parallel
          • Merge. Each piece is ordered already
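
The construction steps on this slide can be sketched as follows. It is a minimal illustration under stated assumptions: documents are token lists identified by their position, intervals are equal-sized, and a thread pool stands in for the processors (in CPython this shows the structure rather than a real speedup). The function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def index_interval(docs: List[List[str]], start: int, stop: int) -> Dict[str, List[int]]:
    """Build a partial inverted index for the documents start..stop-1."""
    partial: Dict[str, List[int]] = defaultdict(list)
    for doc_id in range(start, stop):
        for term in set(docs[doc_id]):
            partial[term].append(doc_id)  # doc ids arrive in increasing order
    return partial


def build_index(docs: List[List[str]], n_processors: int) -> Dict[str, List[int]]:
    """Assign each processor one interval of doc ids, index in parallel, merge."""
    if not docs:
        return {}
    size = -(-len(docs) // n_processors)  # ceiling division
    intervals: List[Tuple[int, int]] = [
        (lo, min(lo + size, len(docs))) for lo in range(0, len(docs), size)
    ]
    with ThreadPoolExecutor(max_workers=n_processors) as pool:
        # map() returns the partial indices in interval order
        partials = list(pool.map(lambda iv: index_interval(docs, *iv), intervals))
    merged: Dict[str, List[int]] = defaultdict(list)
    for partial in partials:
        for term, postings in partial.items():
            # Each piece is already ordered and the intervals do not overlap,
            # so merging a posting list is plain concatenation.
            merged[term].extend(postings)
    return dict(merged)
```

Because every processor owns a contiguous interval of document ids, no sorting is needed at merge time, which is the point of ordering the documents first.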
  • 13. Inverted files: Physical document partitioning
      • Several separate collections, one per processor
      • Separate indices
      • Then the lists are merged (they are already ordered)
      • A priority queue is used
          • The result is not sorted; insertion is quick
          • The maximal element can be found quickly
          • The first k elements can be found rather quickly
          • Details in the book
      • Consistent scores are needed
          • Global statistics are needed. They can be computed at index time
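
Merging the per-processor result lists with a priority queue can be sketched with Python's standard `heapq` module. The sketch assumes each processor returns (score, document id) pairs already sorted by decreasing score, and that the scores were computed with the same global statistics so that they are comparable; the name `top_k` is illustrative.

```python
import heapq
from itertools import islice
from typing import Iterable, List, Tuple

Hit = Tuple[float, str]  # (score, document id)


def top_k(ranked_lists: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Return the k best hits from several lists sorted by decreasing score.

    heapq.merge keeps only one candidate per input list in its heap, so the
    first k results are produced without merging the lists completely.
    """
    merged = heapq.merge(*ranked_lists, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))


if __name__ == "__main__":
    per_processor = [
        [(0.9, "d1"), (0.4, "d7")],
        [(0.8, "d3"), (0.5, "d2")],
        [(0.7, "d9")],
    ]
    print(top_k(per_processor, 3))
```

If the processors scored documents with local statistics only, the merged order would be meaningless, which is why the slide asks for consistent scores.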
  • 14. Logical or physical partitioning?
      • Logical requires less communication
          • Faster
      • Physical is more flexible. Simpler implementation
          • Simpler conversion of existing systems
  • 15. Inverted files: Term partitioning
      • Each processor processes a part of the inverted file
      • The results are intersected (for AND)
          • (or combined as appropriate for the other Boolean operations, OR and NOT)
      • When the term distribution in user queries is skewed, document partitioning is better
      • When it is uniform, term partitioning is better
          • By a factor of two for long queries, 5 to 10 for short (Web-like) queries
  • 16. Suffix arrays
      • Array construction can be parallelized
          • Merges are parallel
      • Document partitioning is applied straightforwardly
          • Each processor maintains its own suffix array
      • Term partitioning can be applied
          • Each processor owns a branch of the tree (lexicographic interval)
          • Bottleneck: all processors need access to the entire text
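
Document partitioning for suffix arrays can be sketched as follows: every processor builds a suffix array over its own text only and answers queries locally. The construction shown is the naive one (sorting all suffixes), chosen for brevity and not the parallel merge-based construction the slide refers to; all names are illustrative.

```python
from typing import Dict, List


def build_suffix_array(text: str) -> List[int]:
    """Naive construction: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text: str, suffix_array: List[int], pattern: str) -> List[int]:
    """All positions where pattern occurs, located by binary search."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # lower bound: first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo]):
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


def search_partitions(partitions: Dict[str, str], pattern: str) -> Dict[str, List[int]]:
    """Each processor searches only its own text with its own suffix array."""
    results = {}
    for name, text in partitions.items():  # conceptually one processor each
        hits = find(text, build_suffix_array(text), pattern)
        if hits:
            results[name] = hits
    return results
```

In this arrangement no processor needs the text held by another one, in contrast to the term-partitioned variant whose bottleneck the slide points out.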
  • 17.
  • 18. Signature files
      • Document partitioning: straightforward
          • Create the query signature, distribute it to each processor
          • Merge results (using Boolean operations if needed)
      • Term partitioning: shorter signatures
          • Merging and eliminating false drops is slow
          • This method is not recommended
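
The document-partitioned scheme can be sketched as follows. The signature width, the number of bits set per term and the SHA-256-based hashing are assumptions chosen for the example (a superimposed-coding signature), not parameters taken from the slides.

```python
import hashlib
from typing import Dict, Iterable, List, Set, Tuple

SIGNATURE_BITS = 64  # assumed signature width
BITS_PER_TERM = 3    # assumed number of bits set per term


def term_signature(term: str) -> int:
    """Set a few pseudo-random bits for one term."""
    sig = 0
    for i in range(BITS_PER_TERM):
        digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % SIGNATURE_BITS)
    return sig


def signature(terms: Iterable[str]) -> int:
    """Superimpose (OR together) the signatures of all terms."""
    sig = 0
    for term in terms:
        sig |= term_signature(term)
    return sig


# A partition maps document id -> (document signature, actual terms).
Partition = Dict[int, Tuple[int, Set[str]]]


def make_partition(docs: Dict[int, Set[str]]) -> Partition:
    return {doc_id: (signature(terms), terms) for doc_id, terms in docs.items()}


def search_partition(partition: Partition, query_terms: Set[str], query_sig: int) -> Set[int]:
    """Bitwise signature test, then removal of false drops against the real terms."""
    candidates = [d for d, (sig, _) in partition.items() if sig & query_sig == query_sig]
    return {d for d in candidates if query_terms <= partition[d][1]}


def search(partitions: List[Partition], query_terms: Set[str]) -> Set[int]:
    """Create the query signature once, send it to every processor, unite the answers."""
    query_sig = signature(query_terms)
    result: Set[int] = set()
    for partition in partitions:  # conceptually one processor each
        result |= search_partition(partition, query_terms, query_sig)
    return result
```

False drops are eliminated inside each partition, before the results are merged; that local check is what becomes slow when the signatures themselves are split by term.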
  • 19. SIMD computers
      • Single Instruction, Multiple Data
      • Uncommon
      • Good for simple operations
          • Bit operations in signature files
          • Details in the book
      • Ranking is supported in hardware in some computers
      • If the signature file does not fit into memory, it can be processed in batches
          • I/O overhead
          • Use multiple queries with the same batch
          • This improves throughput, but not response time
  • 20. … SIMD computers
      • Inverted files are difficult to adapt to SIMD
      • The inverted file is restructured
      • Details in the book
  • 21. Distributed IR
      • MIMD with
          • Slow communication
          • Not all nodes are used for a given query
          • Encryption issues
      • Document partitioning is usually used
          • Term partitioning imposes greater communication overhead
      • Document clustering can be useful (to distribute docs among processors)
          • Index the clusters and then search only the best ones
          • Another approach: use training queries, then the similarity of the user query to these
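
Searching only the best clusters can be sketched with a very simple selection rule. The rule below (share of a node's term occurrences that match the query) and the per-node term-frequency summaries are assumptions made for illustration; the slides do not specify how clusters are scored.

```python
from collections import Counter
from typing import Dict, Iterable, List


def node_score(summary: Counter, query_terms: Iterable[str]) -> float:
    """Fraction of the node's term occurrences that match the query terms."""
    total = sum(summary.values())
    if total == 0:
        return 0.0
    return sum(summary[term] for term in query_terms) / total


def select_nodes(summaries: Dict[str, Counter], query_terms: List[str], m: int) -> List[str]:
    """Pick the m nodes whose cluster summaries look most relevant to the query."""
    ranked = sorted(
        summaries,
        key=lambda name: (-node_score(summaries[name], query_terms), name),
    )
    return ranked[:m]


if __name__ == "__main__":
    summaries = {
        "sports": Counter(goal=5, match=5),
        "science": Counter(atom=6, goal=1, lab=3),
        "cooking": Counter(salt=4, oven=6),
    }
    print(select_nodes(summaries, ["goal"], 2))
```

Only the selected nodes receive the query, which trades some recall for less communication over the slow network.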
  • 22. Research topics
      • How to evaluate the speedup
      • New algorithms
      • Adaptation of existing algorithms
      • Merging the results is a bottleneck
      • Meta search engines
      • Creating large collections with judgements
      • Is recall important?
  • 23. Conclusions
      • Parallel computing can improve
          • response time for each query and/or
          • throughput: the number of queries processed at the same speed
      • Document partitioning is simple
          • Good for distributed computing
      • Term partitioning is good for some data structures
      • Distributed computing is MIMD computing with slow communication
      • SIMD machines are good for signature files
          • Both are out of favor now
  • 24. Thank you! Till May 17? 18?, 6 pm