Parallel and Distributed Information Retrieval System
1. Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 7 (book chapter 9):
Parallel and Distributed IR
Alexander Gelbukh
www.Gelbukh.com
2. Previous Chapter: Conclusions
How to accelerate search while returning the same results as sequential search?
Ideas:
Quick-and-dirty rejection of bad objects, 100% recall
Fast data structure for search (based on clustering)
Careful check of all found candidates
Solution: mapping into fewer-D feature space
Condition: lower-bounding of the distance
Assumption: skewed spectrum distribution
Few coefficients concentrate energy, rest are less important
3. Previous Chapter: Research topics
Object detection (pattern and image recognition)
Automatic feature selection
Spatial indexing data structures (more than 1D)
New types of data.
What features to select? How to determine them?
Mixed-type data (e.g., webpages, or images with sound and description)
What clustering/IR methods are better suited for what features? (What features for what methods?)
Similar methods in data mining, ...
4. The problem
Very large document collections
Google: 4,000,000,000 pages
Slow response?
Solution: parallel computing
Google: 10,000 computers
6. MIMD architecture
The most common
Can be
tightly coupled
loosely coupled
Distributed
Many computers interacting via network
PC Clusters
Similar to MIMD computers, but greater cost of communication
very loosely coupled
More coarse-grained programs
7. Performance improvement
Time: speedup S
Ideally, N times (number of processors)
In practice impossible
The problem does not decompose into N equal parts
Communication and control overhead
S ≤ 1 / f, where f is the fraction of the problem that must be computed sequentially
Cost
Efficiency (speedup per processor): S / N
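The bound above can be illustrated with a small calculation. This is a minimal sketch assuming f is the fraction of the work that must run sequentially (Amdahl's law); the function names are illustrative and not taken from the lecture.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Upper bound on speedup with n processors when a fraction f is sequential."""
    if not 0.0 <= f <= 1.0 or n < 1:
        raise ValueError("need 0 <= f <= 1 and n >= 1")
    return 1.0 / (f + (1.0 - f) / n)


def efficiency(speedup: float, n: int) -> float:
    """Speedup per processor, S / N."""
    return speedup / n


if __name__ == "__main__":
    # With 5% sequential work the speedup approaches, but never reaches, 1 / f = 20.
    for n in (1, 10, 100, 10_000):
        s = amdahl_speedup(0.05, n)
        print(f"N={n:>6}  S={s:6.2f}  S/N={efficiency(s, n):.4f}")
```

Adding processors raises S only up to the 1 / f ceiling, while S / N keeps falling, which is why the ideal N-fold speedup is not reached in practice.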
8. Two approaches to parallelism
Build new algorithms
E.g., neural nets
Naturally parallel
Problem: to define the retrieval task
Adapt the existing techniques to parallelism
Allows relying on well-studied approaches
We will consider this option
9. Ways to use parallelism
Multitasking
N search engines
Good for processing many queries
Problems:
A single query is not sped up
Bottleneck: disk access (index)
Possible solution: replicating (part of) data. RAIDs
Parallel algorithms
IR = data. Main question: how to partition the data
Document / index term matrix
(terms can be LSI dimensions, signature bits, etc)
11. Inverted files: Logical partitioning
Logical vs. physical document partitioning
Logical: for each term, use pointers into inverted file data for each processor, to indicate its portion
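Logical partitioning can be sketched as follows. The assumptions here are not from the lecture: postings are sorted by document id, each processor owns a contiguous interval of document ids, and the names `INVERTED_FILE`, `BOUNDS`, `build_pointers` and `postings_for` are hypothetical.

```python
from bisect import bisect_left

# Shared inverted file: term -> postings sorted by document id (toy data).
INVERTED_FILE = {
    "parallel": [1, 4, 7, 9, 12],
    "retrieval": [2, 4, 5, 12],
}

# Assumed assignment: processor p owns doc ids in [BOUNDS[p], BOUNDS[p + 1]).
BOUNDS = [0, 5, 10, 15]


def build_pointers(inverted_file, bounds):
    """For each term, record where each processor's slice of the postings starts."""
    return {
        term: [bisect_left(postings, b) for b in bounds]
        for term, postings in inverted_file.items()
    }


def postings_for(term, processor, inverted_file, pointers):
    """Return the portion of a term's postings that belongs to one processor."""
    start = pointers[term][processor]
    end = pointers[term][processor + 1]
    return inverted_file[term][start:end]
```

There is still a single inverted file; only the pointer table tells each processor which part of every posting list it should scan.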
12. Inverted files: Logical partitioning
Construction and updating
Also parallel
Construction
Assign docs to processors
Order docs such that each processor has an interval
Process in parallel
Merge. Each piece is ordered already
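The construction steps above can be sketched as below. This is an illustration under stated assumptions, not the algorithm from the book: documents are `(doc_id, text)` pairs, tokenization is a plain whitespace split, and threads stand in for processors only to show the structure (real speedup in Python would need separate processes or machines).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def index_interval(docs):
    """Build a partial inverted index for one interval of (doc_id, text) pairs."""
    partial = defaultdict(list)
    for doc_id, text in docs:  # docs arrive in increasing doc_id order
        for term in set(text.lower().split()):
            partial[term].append(doc_id)
    return partial


def build_index(docs, workers=2):
    """Order docs, give each worker a contiguous interval, then merge the pieces."""
    docs = sorted(docs)
    size = max(1, -(-len(docs) // workers))  # ceiling division, at least 1
    intervals = [docs[i:i + size] for i in range(0, len(docs), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(index_interval, intervals))
    terms = set().union(*partials)
    # Each partial posting list is already ordered, so a linear merge suffices.
    return {t: list(merge(*(p.get(t, []) for p in partials))) for t in terms}
```

Because every worker receives an interval of document ids, the partial posting lists never interleave and the final merge is cheap.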
13. Inverted files: Physical document partitioning
Several separate collections, one per processor
Separate indices
Then the lists are merged (they are already ordered)
Priority queue is used
The result is not sorted; Insertion is quick
The maximal element can be found quickly
First k elements can be found rather quickly
Details in the book
Consistent scores are needed
Global statistics are needed. They can be computed at index time
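A minimal sketch of the merge step, assuming each processor returns its hits as `(score, doc_id)` pairs sorted by descending score. The functions `global_idf` and `top_k` are hypothetical names, and the log(N / df) weighting is only one common choice of global statistic, not necessarily the one used in the book.

```python
import heapq
import math
from itertools import islice


def global_idf(doc_freqs, n_docs):
    """Combine per-collection document frequencies into one global idf table."""
    total = {}
    for df in doc_freqs:
        for term, count in df.items():
            total[term] = total.get(term, 0) + count
    return {t: math.log(n_docs / c) for t, c in total.items()}


def top_k(ranked_lists, k):
    """Merge per-processor hit lists lazily and stop after the first k hits."""
    # heapq.merge keeps one candidate per list in a heap (a priority queue).
    merged = heapq.merge(*ranked_lists, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))
```

The merged ranking is only meaningful if all processors score with the same global statistics; with purely local idf values, scores from different collections would not be comparable.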
14. Logical or physical partitioning?
Logical requires less communication
Faster
Physical is more flexible. Simpler implementation
Simpler conversion of existing systems
15. Inverted files: Term partitioning
Each processor processes a part of the inverted file
The results are intersected (for AND)
(or as appropriate for Boolean operations, OR and NOT)
When the term distribution in user queries is skewed, document partitioning is better
When it is uniform, term partitioning is better:
about twice as fast for long queries, 5-10 times faster for short (Web-like) queries
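Term partitioning for a Boolean AND query might look like the sketch below. The round-robin assignment of terms to processors and the names `partition_terms` and `boolean_and` are assumptions made for illustration.

```python
def partition_terms(inverted_file, n_processors):
    """Assign each term's complete posting list to exactly one processor."""
    shards = [{} for _ in range(n_processors)]
    for i, term in enumerate(sorted(inverted_file)):
        shards[i % n_processors][term] = inverted_file[term]
    return shards


def boolean_and(query_terms, shards):
    """Each shard supplies postings for the terms it owns; the broker intersects."""
    result = None
    for term in query_terms:
        postings = next((set(s[term]) for s in shards if term in s), set())
        result = postings if result is None else result & postings
    return sorted(result or [])
```

Only the processors that own a query term do any work, which helps explain why a skewed term distribution in queries leads to unbalanced load under this scheme.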
16. Suffix arrays
Array construction can be parallelized
Merges are parallel
Document partitioning is applied straightforwardly
Each processor maintains its own suffix array
Term partitioning can be applied
Each processor owns a branch of the tree (lexicographic interval)
Bottleneck: all processors need access to the entire text
17.
18. Signature files
Document partitioning: straightforward
Create query signature, distribute to each processor
Merge results (using Boolean operations if needed)
Term partitioning: shorter signatures
Merging and eliminating false drops is slow
This method is not recommended
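Document partitioning with signature files can be sketched as follows. The parameters are assumptions chosen for illustration (64-bit signatures, 3 bits per term, MD5 as the hash), and each document's term set is kept next to its signature so that false drops can be eliminated.

```python
import hashlib

SIG_BITS = 64       # signature width (assumed)
BITS_PER_TERM = 3   # bits set per term (assumed)


def term_bits(term):
    """Map a term to a bit mask with up to BITS_PER_TERM bits set."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    mask = 0
    for i in range(BITS_PER_TERM):
        mask |= 1 << (digest[i] % SIG_BITS)
    return mask


def signature(terms):
    """Superimpose (OR together) the bit masks of all terms."""
    sig = 0
    for t in terms:
        sig |= term_bits(t)
    return sig


def make_partition(docs):
    """docs: doc_id -> set of terms. Store (signature, terms) per document."""
    return {d: (signature(t), t) for d, t in docs.items()}


def search_partition(partition, query_terms):
    """One processor: signature test first, then verify to remove false drops."""
    q = signature(query_terms)
    candidates = [d for d, (sig, _) in partition.items() if (sig & q) == q]
    return sorted(d for d in candidates if set(query_terms) <= partition[d][1])
```

The query signature is computed once and sent to every processor; each one scans only its own signatures, and the per-processor answers are concatenated or combined with Boolean operations as needed.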
19. SIMD computers
Single Instruction, Multiple Data
Uncommon
Good for simple operations
Bit operations in signature files
Details in the book
Ranking is supported in hardware in some computers
If the signature file does not fit into memory, it can be processed in batches
I/O overhead
Use multiple queries with the same batch
This improves throughput, but not response time
20. … SIMD computers
Inverted files are difficult to adapt to SIMD
The inverted file is restructured
Details in the book
21. Distributed IR
MIMD with
Slow communication
Not all nodes are used for a given query
Encryption issues
Document partitioning is usually used
Term partitioning imposes greater communication overhead
Document clustering can be useful (to distribute docs by processors)
Index clusters and then search only the best ones
Another approach: use training queries, then choose where to search by the similarity of the user query to these
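Searching only the best clusters can be sketched as below, assuming each node holds one document cluster summarized by a term-frequency centroid and that cosine similarity is used to rank nodes. Both choices, and the name `select_nodes`, are assumptions for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_nodes(query, centroids, n_best):
    """Rank nodes by similarity of the query to each node's cluster centroid."""
    q = Counter(query.lower().split())
    ranked = sorted(centroids, key=lambda node: cosine(q, centroids[node]), reverse=True)
    return ranked[:n_best]
```

The query is then forwarded only to the selected nodes, trading some recall for less communication.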
22. Research topics
How to evaluate the speedup
New algorithms
Adaptation of existing algorithms
Merging the results is a bottleneck
Meta search engines
Creating large collections with judgements
Is recall important?
23. Conclusions
Parallel computing can improve
response time for each query and/or
throughput: number of queries processed at the same response time
Document partitioning is simple
good for distributed computing
Term partitioning is good for some data structures
Distributed computing is MIMD computing with slow communication
SIMD machines are good for signature files
Both (SIMD machines and signature files) are out of favor now