Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 7 (book chapter 9):
Parallel and Distributed IR
Alexander Gelbukh
www.Gelbukh.com
Previous Chapter: Conclusions
 How to accelerate search? Same results as sequential
 Ideas:
 Quick-and-dirty rejection of bad objects, 100% recall
 Fast data structure for search (based on clustering)
 Careful check of all found candidates
 Solution: mapping into fewer-D feature space
 Condition: lower-bounding of the distance
 Assumption: skewed spectrum distribution
 Few coefficients concentrate energy, rest are less important
Previous Chapter: Research topics
 Object detection (pattern and image recognition)
 Automatic feature selection
 Spatial indexing data structures (more than 1D)
 New types of data.
 What features to select? How to determine them?
 Mixed-type data (e.g., webpages, or images with
sound and description)
 What clustering/IR methods are better suited for
what features? (What features for what methods?)
 Similar methods in data mining, ...
The problem
 Very large document collections
 Google: 4,000,000,000 pages
 Slow response?
 Solution: parallel computing
 Google: 10,000 computers
Parallel architectures
                      Data stream
Instruction stream    Single              Multiple
Single                SISD (classical)    SIMD (simple)
Multiple              MISD (rare)         MIMD (many SISD)
MIMD architecture
 The most common
 Can be
 tightly coupled
 loosely coupled
 Distributed
 Many computers interacting via network
 PC Clusters
 Similar to MIMD computers, but greater cost of
communication
 very loosely coupled
 More coarse-grained programs
Performance improvement
Time: speedup S
 Ideally, N times (number of processors)
 In practice impossible
 The problem does not decompose into N equal parts
 Communication and control overhead
 S ≤ 1 / f, where f is the fraction of the problem that must be computed sequentially (Amdahl's law)
Cost
 Per processor (efficiency): S / N
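The speedup bound and the per-processor figure above can be tried out numerically. The following is a minimal sketch, assuming that f denotes the fraction of the work that must run sequentially (the usual statement of Amdahl's law) and that the remaining work divides evenly among N processors. The function names are illustrative only.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Upper bound on speedup with n processors when a fraction f is sequential.

    Assumption: the remaining (1 - f) of the work splits evenly over n processors
    and communication overhead is ignored.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be >= 1")
    return 1.0 / (f + (1.0 - f) / n)


def efficiency(speedup: float, n: int) -> float:
    """Speedup per processor, S / N."""
    return speedup / n
```

Under these assumptions, with f = 0.05 no number of processors yields a speedup of 20 or more, since 1 / f = 20, while S / N keeps falling as N grows.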
Two approaches to parallelism
 Build new algorithms
 E.g., neural nets
 Naturally parallel
 Problem: to define the retrieval task
 Adapt the existing techniques to parallelism
 Allows relying on well-studied approaches
 We will consider this option
Ways to use parallelism
 Multitasking
 N search engines
 Good for processing many queries
Problems:
 A single query is not sped up
 Bottleneck: disk access (index)
 Possible solution: replicating (part of) data. RAIDs
 Parallel algorithms
 IR = data. Main question: how to partition the data
 Document / index term matrix
(terms can be LSI dimensions, signature bits, etc.)
Possible partitionings
 Horizontal: document partitioning. Union of results
 Vertical: term partitioning. Basically, intersect results
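The two partitionings can be illustrated on a toy document/term matrix. This is only a sketch: the matrix, the round-robin assignment, and all function names are assumptions made for the example, and the loops over parts stand in for processors working in parallel.

```python
# Toy document/term matrix: rows are documents, columns are index terms.
MATRIX = [
    [1, 1, 0, 0],  # doc 0
    [0, 1, 1, 0],  # doc 1
    [1, 0, 1, 1],  # doc 2
    [0, 1, 0, 1],  # doc 3
]


def document_partition(matrix, n_procs):
    """Horizontal split: each processor gets whole rows (documents)."""
    parts = [[] for _ in range(n_procs)]
    for doc_id, row in enumerate(matrix):
        parts[doc_id % n_procs].append((doc_id, row))
    return parts


def term_partition(matrix, n_procs):
    """Vertical split: each processor gets whole columns (terms)."""
    parts = [{} for _ in range(n_procs)]
    for t in range(len(matrix[0])):
        parts[t % n_procs][t] = [row[t] for row in matrix]
    return parts


def and_query_docpart(parts, query_terms):
    """Each processor answers the full query on its documents; results are united."""
    hits = set()
    for part in parts:
        hits |= {d for d, row in part if all(row[t] for t in query_terms)}
    return hits


def and_query_termpart(parts, query_terms, n_docs):
    """Each processor answers for its own terms; results are intersected."""
    hits = set(range(n_docs))
    for part in parts:
        for t, column in part.items():
            if t in query_terms:
                hits &= {d for d, bit in enumerate(column) if bit}
    return hits
```

Both routes give the same answer for the same query; they differ in which processors take part and in how much data has to be combined at the end.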
Inverted files: Logical partitioning
 Logical vs. physical document partitioning
 Logical: for each term, use pointers into inverted file data for
each processor, to indicate its portion
Inverted files: Logical partitioning
Construction and updating
 Also parallel
Construction
 Assign docs to processors
 Order docs such that each processor has an interval
 Process in parallel
 Merge. Each piece is ordered already
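The construction steps above can be sketched as follows. This is an illustration under stated assumptions, not the book's algorithm: documents are plain strings, terms are whitespace-separated lowercase words, and threads from the standard library stand in for processors. Because each worker receives a contiguous interval of document ids, merging the partial posting lists is plain concatenation in worker order.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_intervals(docs, n_procs):
    """Give each processor a contiguous interval of document ids."""
    size = max(1, -(-len(docs) // n_procs))  # ceiling division, at least 1
    return [(start, docs[start:start + size]) for start in range(0, len(docs), size)]


def build_partial_index(interval):
    """Build an inverted index for one interval; postings come out sorted."""
    start, docs = interval
    index = {}
    for offset, text in enumerate(docs):
        doc_id = start + offset
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index


def build_index(docs, n_procs=2):
    """Process the intervals in parallel, then merge the ordered pieces."""
    intervals = split_into_intervals(docs, n_procs)
    with ThreadPoolExecutor(max_workers=n_procs) as pool:
        partials = list(pool.map(build_partial_index, intervals))  # keeps interval order
    merged = {}
    for partial in partials:
        for term, postings in partial.items():
            merged.setdefault(term, []).extend(postings)
    return merged
```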
Inverted files:
Physical document partitioning
 Several separate collections, one per processor
 Separate indices
 Then the lists are merged (they are already ordered)
 Priority queue is used
 The result is not sorted; Insertion is quick
 The maximal element can be found quickly
 First k elements can be found rather quickly
 Details in the book
 Consistent scores are needed
 Global statistics are needed; they can be computed at index time
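A minimal sketch of the merging step and of the global statistics, assuming each processor returns (score, doc_id) pairs already sorted by decreasing score, and assuming a plain logarithmic idf as the statistic that must be consistent across processors. `heapq.merge` from the standard library keeps a small heap over the heads of the lists, so the first k results are available without sorting everything.

```python
import heapq
import math
from itertools import islice


def global_idf(local_dfs, local_sizes):
    """Combine per-processor document frequencies into one idf value per term."""
    n_docs = sum(local_sizes)
    df = {}
    for local in local_dfs:
        for term, count in local.items():
            df[term] = df.get(term, 0) + count
    return {term: math.log(n_docs / count) for term, count in df.items()}


def merge_top_k(ranked_lists, k):
    """Return the k best hits from several lists sorted by decreasing score."""
    merged = heapq.merge(*ranked_lists, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))
```

If each processor scored with its own local idf instead, scores from different collections would not be comparable and the merged ranking could be misleading.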
Logical or physical partitioning?
 Logical requires less communication
 Faster
 Physical is more flexible. Simpler implementation
 Simpler conversion of existing systems
Inverted files: Term partitioning
 Each processor processes a part of the inverted file
 The results are intersected (for AND)
 (or as appropriate for Boolean operations, OR and NOT)
 When the term distribution in user queries is skewed, document partitioning is better
 When it is uniform, term partitioning is better:
 about twice as good for long queries, 5–10 times for short (Web-like) ones
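Term partitioning of an inverted file can be sketched as below. The assignment of terms to processors by a CRC32 hash is an assumption made for the example (any fixed mapping would do), and posting lists are held as sets so that AND and OR reduce to intersection and union.

```python
from zlib import crc32


def owner(term, n_procs):
    """Hypothetical rule for which processor holds a term's posting list."""
    return crc32(term.encode("utf-8")) % n_procs


def partition_by_term(inverted_file, n_procs):
    """Split the inverted file so each processor holds some of the terms."""
    parts = [{} for _ in range(n_procs)]
    for term, postings in inverted_file.items():
        parts[owner(term, n_procs)][term] = set(postings)
    return parts


def evaluate(parts, terms, op="AND"):
    """Fetch each term's postings from its owner and combine them."""
    n_procs = len(parts)
    lists = [parts[owner(t, n_procs)].get(t, set()) for t in terms]
    if not lists:
        return set()
    if op == "AND":
        return set.intersection(*lists)
    if op == "OR":
        return set.union(*lists)
    raise ValueError("op must be 'AND' or 'OR'")
```

Only the processors that own a query term do any work, which is why the balance of load depends on how query terms are distributed.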
Suffix arrays
 Array construction can be parallelized
 merges are parallel
 Document partitioning is applied straightforwardly
 Each processor maintains its own suffix array
 Term partitioning can be applied
 Each processor owns a branch of the tree (lexicographic
interval)
 Bottleneck: all processors need access to the entire text
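The lexicographic-interval idea and its bottleneck can be seen in a small sketch. Everything here is an assumption for illustration: the suffix array is built naively by sorting, it is cut into contiguous slices (one per processor), and every slice is searched in turn rather than routing the query to the owning interval. Note that each lookup reads from the shared `text`, which is the access problem the slide mentions.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    """Naive construction: positions sorted by the suffix starting there."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def split_by_interval(suffix_array, n_procs):
    """Each processor owns a contiguous lexicographic interval of suffixes."""
    size = max(1, -(-len(suffix_array) // n_procs))
    return [suffix_array[i:i + size] for i in range(0, len(suffix_array), size)]


def find_in_slice(text, sa_slice, pattern):
    """Binary search within one interval; needs access to the whole text."""
    prefixes = [text[i:i + len(pattern)] for i in sa_slice]  # sorted, as suffixes are
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sa_slice[lo:hi]


def find(text, slices, pattern):
    """Collect matching positions from all intervals."""
    hits = []
    for sa_slice in slices:
        hits.extend(find_in_slice(text, sa_slice, pattern))
    return sorted(hits)
```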
Signature files
 Document partitioning: straightforward
 Create query signature, distribute to each processor
 Merge results (using Boolean operations if needed)
 Term partitioning: shorter signatures
 Merging and eliminating false drops is slow
 This method is not recommended
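Document partitioning for signature files can be sketched as follows. The signature width, the number of bits set per word, and the CRC32-based hashing are assumptions chosen for the example. Each partition is scanned with bit operations, candidates are checked against the text to eliminate false drops, and the per-partition results are merged.

```python
from zlib import crc32

SIG_BITS = 64  # assumed signature width
HASHES = 3     # assumed number of bits set per word


def word_signature(word):
    """Set a few hash-selected bits for one word."""
    sig = 0
    for seed in range(HASHES):
        sig |= 1 << (crc32(f"{seed}:{word}".encode("utf-8")) % SIG_BITS)
    return sig


def text_signature(text):
    """Superimpose (OR together) the signatures of all words."""
    sig = 0
    for word in text.lower().split():
        sig |= word_signature(word)
    return sig


def build_partitions(docs, n_procs):
    """Round-robin the documents; each entry is (doc_id, text, signature)."""
    parts = [[] for _ in range(n_procs)]
    for doc_id, text in enumerate(docs):
        parts[doc_id % n_procs].append((doc_id, text, text_signature(text)))
    return parts


def search_partition(partition, query_sig, query_words):
    """Bitwise filter, then verify candidates to drop false matches."""
    hits = []
    for doc_id, text, sig in partition:
        if sig & query_sig == query_sig:  # candidate, possibly a false drop
            if query_words <= set(text.lower().split()):
                hits.append(doc_id)
    return hits


def search(partitions, query):
    """Distribute one query signature to every partition and merge results."""
    query_words = set(query.lower().split())
    query_sig = text_signature(query)
    results = []
    for partition in partitions:  # stands in for processors working in parallel
        results.extend(search_partition(partition, query_sig, query_words))
    return sorted(results)
```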
SIMD computers
 Single Instruction, Multiple Data
 Uncommon
 Good for simple operations
 Bit operations in signature files
 Details in the book
 Ranking is supported in hardware in some computers
 If the signature file does not fit into memory, it can be processed in batches
 I/O overhead
 Use multiple queries with the same batch
 This improves throughput, but not response time
… SIMD computers
 Inverted files are difficult to adapt to SIMD
 The inverted file is restructured
 Details in the book
Distributed IR
 MIMD with
 Slow communication
 Not all nodes are used for a given query
 Encryption issues
 Document partitioning is usually used
 Term partitioning imposes greater communication
overhead
 Document clustering can be useful (to distribute docs among processors)
 Index clusters and then search only the best ones
 Another approach: use training queries, then select by the similarity of the user query to these
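Searching only the best clusters can be sketched as a selection step that runs before the query is sent out. This is an assumed setup for illustration: each node is summarised by a centroid vector of term weights, and nodes are ranked by cosine similarity to the query vector. The names and the choice of cosine similarity are not taken from the lecture.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_nodes(centroids, query_vec, m):
    """Return the names of the m nodes whose centroids best match the query."""
    ranked = sorted(
        centroids,
        key=lambda node: cosine(centroids[node], query_vec),
        reverse=True,
    )
    return ranked[:m]
```

The query is then forwarded only to the selected nodes, which trades some recall for less communication.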
Research topics
 How to evaluate the speedup
 New algorithms
 Adaptation of existing algorithms
 Merging the results is a bottleneck
 Meta search engines
 Creating large collections with judgements
 Is recall important?
Conclusions
 Parallel computing can improve
 response time for each query and/or
 throughput: number of queries processed in the same time
 Document partitioning is simple
 good for distributed computing
 Term partitioning is good for some data structures
 Distributed computing is MIMD computing with slow
communication
 SIMD machines are good for Signature files
 Both (SIMD machines and signature files) are out of favor now
Thank you!
Till May 17? 18?, 6 pm
