Linköping Electronic Articles in
Computer and Information Science
Vol. 2(1997): nr 07
Linköping University Electronic Press
Linköping, Sweden
http://www.ep.liu.se/ea/cis/1997/007/

Batched Range Searching on a Mesh-Connected SIMD Computer

Per-Olof Fjällström
Department of Computer and Information Science
Linköping University
Linköping, Sweden

Published on July 7, 1997 by
Linköping University Electronic Press
581 83 Linköping, Sweden

Linköping Electronic Articles in Computer and Information Science
ISSN 1401-9841
Series editor: Erik Sandewall
© 1997 Per-Olof Fjällström
Typeset by the author using LaTeX
Formatted using etendu style
Recommended citation:
<Author>. <Title>. Linköping Electronic Articles in
Computer and Information Science, Vol. 2(1997): nr 07.
http://www.ep.liu.se/ea/cis/1997/007/. July 7, 1997.
This URL will also contain a link to the author's home page.
The publishers will keep this article on-line on the Internet
(or its possible replacement network in the future)
for a period of 25 years from the date of publication,
barring exceptional circumstances as described separately.
The on-line availability of the article implies
a permanent permission for anyone to read the article on-line,
and to print out single copies of it for personal use.
This permission can not be revoked by subsequent
transfers of copyright. All other uses of the article,
including for making copies for classroom use,
are conditional on the consent of the copyright owner.
The publication of the article on the date stated above
included also the production of a limited number of copies
on paper, which were archived in Swedish university libraries
like all other written works published in Sweden.
The publisher has taken technical and administrative measures
to assure that the on-line version of the article will be
permanently accessible using the URL stated above,
unchanged, and permanently equal to the archived printed copies
at least until the expiration of the publication period.
For additional information about the Linköping University
Electronic Press and its procedures for publication and for
assurance of document integrity, please refer to
its WWW home page, http://www.ep.liu.se/,
or contact the publisher by conventional mail at the address stated above.
Abstract

Given a set of n points and hyperrectangles in d-dimensional space, the batched range-searching problem is to determine which points each hyperrectangle contains. We present two parallel algorithms for this problem on a $\sqrt{n} \times \sqrt{n}$ mesh-connected parallel computer: one average-case efficient algorithm based on cell division, and one worst-case efficient divide-and-conquer algorithm. Besides the asymptotic analysis of their running times, we present an experimental evaluation of the algorithms.

Keywords: Parallel algorithms, mesh-connected parallel computers, range searching.

The work presented here is funded by CENIIT (the Center for Industrial Information Technology) at Linköping University.

A shorter version of this report has been accepted for presentation at the Ninth IASTED International Conference on Parallel and Distributed Computing and Systems, October 13-16, 1997, Washington D.C., USA.
1 Introduction
The batched range-searching problem is as follows. Given a set P of points and a set Q of hyperrectangles in d-dimensional space, report, for each hyperrectangle, which points it contains. (A hyperrectangle is the Cartesian product of intervals on distinct coordinate axes.) In on-line range searching, the hyperrectangles are given one at a time. Several sequential range-searching algorithms have been proposed [3, 10, 11]. Both on-line and batched range searching have several important applications, for example in statistics, geographic data processing, and computer-aided engineering. More specifically, we have identified batched range searching as an important subproblem in computer simulation of mechanical deformation processes such as vehicle collisions [4].
A two-dimensional mesh-connected parallel computer of size $\sqrt{n} \times \sqrt{n}$ consists of n identical processors organized in a rectangular array of $\sqrt{n}$ rows and $\sqrt{n}$ columns. A bidirectional communication link connects each pair of adjacent processors along the same row or column. Due to the regular interconnection pattern, mesh-connected computers are inexpensive to build, and several such computers are on the market. In an SIMD (Single Instruction, Multiple Data) computer, the processors are synchronized and operate under the control of a single program. Throughout this paper we refer to a mesh-connected SIMD computer as a mesh. Many algorithms have been designed for the mesh. For a survey of mesh algorithms for geometric problems, see Atallah [2].
In this paper we describe and analyze two mesh algorithms for batched range searching. One algorithm is based on an average-case efficient sequential algorithm, whereas the other is a worst-case efficient divide-and-conquer algorithm. We have implemented and experimentally evaluated both of the algorithms. Our algorithms are based on well-known techniques such as divide-and-conquer, but we are not aware of any other mesh algorithms for range searching. (Oh and Suk [9] present a mesh algorithm for the on-line version of the range-counting problem. That is, their algorithm gives the number of points contained in a hyperrectangle.)
In our development of range-searching algorithms for the mesh,
we assume that P and Q together have at most n elements, and that
each processor initially has at most one point or hyperrectangle in
its local memory. At the end of execution, the points contained in
a hyperrectangle must reside in the local memory of the processor
that initially contained the hyperrectangle. We assume also that the
number of points and the number of hyperrectangles are of the same
order of magnitude, and that the number of points contained in a
hyperrectangle is independent of n. These assumptions are valid in
many applications.
We organize the rest of the paper as follows. In the next section we give some additional information concerning the mesh, and describe some basic operations used by our algorithms. In Sections 3 and 4, we describe our mesh algorithms for batched range searching. In Section 5, we describe how we implemented the algorithms on a MasPar MP-1, and report some experimental results. Section 6 offers some concluding remarks.
2 Preliminaries

As mentioned in the previous section, a single program controls the mesh, that is, it is a Single Instruction, Multiple Data computer. In its most rigid form, SIMD requires that all processors execute the same instruction, and access data from the same address in their respective memories. We relax these requirements as follows. First, a processor may be either active or inactive, and an instruction is executed only by active processors. Moreover, to be able to carry out operations that require all processors to be active, we assume that activating all processors temporarily is possible. Second, each processor can do its own address computation. More specifically, we assume that processors simultaneously can execute an array indexing instruction such as "A[i] = b", where the value of i may differ between processors. These features are all available in modern SIMD computers such as the MasPar MP-1 computer.
Each processor is identified by its pair of row and column indexes, $(i, j)$, where $0 \le i, j < \sqrt{n}$. In addition, processors are often indexed by some one-to-one mapping from $\{0, 1, \ldots, \sqrt{n}-1\} \times \{0, 1, \ldots, \sqrt{n}-1\}$ to $\{0, 1, \ldots, n-1\}$. Various indexing schemes are used, for example row-major, snake-like row-major, and shuffled row-major indexing. In this paper we use snake-like row-major and shuffled row-major indexing (see Figure 1). We assume that each processor knows its indexes. The local memory of each processor consists of a fixed number of memory cells (words). We assume that the size of a word is sufficiently large to contain a single coordinate value or processor index. The transfer of a word of data between adjacent processors and the standard arithmetic operations on the contents of a word can be done in O(1) time.
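To make the two indexing schemes concrete, here is a small sequential sketch (our Python code, not part of the original algorithms) that computes both indexes from a processor's row and column; the shuffled row-major index interleaves the bits of i and j. It reproduces the values shown in Figure 1 for side = 4.

def snake_index(i, j, side):
    # Snake-like row-major: even rows run left-to-right, odd rows right-to-left.
    return i * side + (j if i % 2 == 0 else side - 1 - j)

def shuffled_index(i, j, side):
    # Shuffled row-major: interleave the bits of i (odd positions) and j
    # (even positions); side is assumed to be a power of two.
    k = (side - 1).bit_length()  # bits per coordinate
    idx = 0
    for b in range(k):
        idx |= ((i >> b) & 1) << (2 * b + 1)
        idx |= ((j >> b) & 1) << (2 * b)
    return idx

# For the 4 x 4 mesh of Figure 1: snake_index(1, 0, 4) == 7 and
# shuffled_index(1, 0, 4) == 2.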
Sorting is one of the most important operations in parallel computation. In many situations we need to rearrange a set of n keys, one in each processor, such that the i-th smallest key is moved to the processor with index $i-1$, for all $i = 1, 2, \ldots, n$. Sorting can be done in $O(\sqrt{n})$ time [12, 7, 6].
Figure 1: Mesh with n = 16. The first integer within each processor is the snake-like row-major index of the processor and the second integer is the shuffled row-major index.

Two other important data movement operations are concurrent read and concurrent write. In a concurrent read operation, denoted q = s(i):p, each processor i holds an index s(i) in its local memory. The operation copies the data in memory cell p in the local memory of processor s(i) to memory cell q in the local memory of processor i. In the concurrent write operation, denoted d(i):q = p, each processor i holds a unique index d(i) in its local memory. The operation copies the content of memory location p in processor i's local memory to location q in processor d(i)'s local memory. Concurrent read and write can be done in $O(\sqrt{n})$ time [8].
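The semantics of the two operations (not the $O(\sqrt{n})$ routing algorithm of [8]) can be stated in a few lines of sequential Python; the array names are ours.

def concurrent_read(p, s):
    # Processor i receives the contents of memory cell p of processor s[i];
    # several processors may read from the same source.
    return [p[s[i]] for i in range(len(s))]

def concurrent_write(p, d):
    # Processor i writes the contents of its cell p to processor d[i];
    # the destination indexes d[i] are assumed to be unique.
    q = [None] * len(p)
    for i in range(len(p)):
        q[d[i]] = p[i]
    return q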
Another fundamental operation is the global sum operation, for which the input consists of an array $[a_0, a_1, \ldots, a_{n-1}]$, where $a_i$ is contained in processor i. The output consists of the value of $a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1}$ (where $\oplus$ represents some associative binary operator such as +, maximum, etc.), stored in the local memory of each processor. Closely related to the global sum operation is the prefix sum operation. It has the same input as the global sum operation, but the output consists of the value of $a_0 \oplus a_1 \oplus \cdots \oplus a_i$ stored in the local memory of processor i. We can compute both global and prefix sums in $O(\sqrt{n})$ time [1].
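The input/output behaviour of the two operations, for an arbitrary associative operator op, is captured by the following sequential sketch (ours); on the mesh, both are realized in $O(\sqrt{n})$ time by combining partial results along rows and then along a column [1].

from functools import reduce

def global_sum(a, op):
    # Every processor ends up holding op(a[0], op(a[1], ... a[n-1])).
    total = reduce(op, a)
    return [total] * len(a)

def prefix_sum(a, op):
    # Processor i ends up holding op(a[0], ..., a[i]).
    out, acc = [], None
    for x in a:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

# Example: prefix_sum([1, 2, 3, 4], lambda x, y: x + y) == [1, 3, 6, 10].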
3 An Average-Case Efficient Algorithm
The parallel algorithm presented in this section is based on the cell method, one of the simplest sequential methods for range searching. It consists of a preprocessing algorithm, the output of which is a data structure on the given point set P, and a query-processing algorithm, which determines which points are contained in a given hyperrectangle. The preprocessing algorithm is as follows. First, find the smallest hyperrectangular box that contains P. Partition this box into a number of identical hyperrectangular cells, and initialize a point list for each cell. Finally, for each point, determine which cell contains it and add it to that cell's point list. To determine which points are contained in a hyperrectangle q, determine which cells are intersected by q. For each intersected cell, find its point list and test each point in the list for inclusion in q.
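For concreteness, the following sequential Python sketch (our code; the function names and the choice of m are ours) implements this preprocessing and query processing for d-dimensional input, reporting the points strictly inside a query hyperrectangle.

from itertools import product
from math import floor

def cell_of(p, lo, hi, m):
    # d-tuple of the cell containing p, with m cells per coordinate direction.
    return tuple(floor((m - 1) * (p[k] - lo[k]) / ((hi[k] - lo[k]) or 1.0))
                 for k in range(len(p)))

def preprocess(points, m):
    # Bounding box B of P and a dictionary mapping cell tuples to point lists.
    d = len(points[0])
    lo = [min(p[k] for p in points) for k in range(d)]
    hi = [max(p[k] for p in points) for k in range(d)]
    cells = {}
    for p in points:
        cells.setdefault(cell_of(p, lo, hi, m), []).append(p)
    return lo, hi, cells

def cell_report(rect, lo, hi, cells, m):
    # rect[k] = (low_k, high_k); visit every intersected cell, then test
    # each point stored there for strict inclusion in rect.
    d = len(rect)
    lcell = cell_of([r[0] for r in rect], lo, hi, m)
    ucell = cell_of([r[1] for r in rect], lo, hi, m)
    found = []
    for t in product(*(range(max(0, lcell[k]), min(m - 1, ucell[k]) + 1)
                       for k in range(d))):
        for p in cells.get(t, []):
            if all(rect[k][0] < p[k] < rect[k][1] for k in range(d)):
                found.append(p)
    return found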
Although its worst-case performance is poor, the sequential cell method is quite efficient in practice. Intuitively, the reason for this is that the input points are often uniformly distributed over the smallest hyperrectangle containing the points. If, in addition, the hyperrectangles are "almost cubical", that is, not too long and thin, it can be shown that the number of point inclusion tests and the number of intersected cells are of the same order of magnitude as the number of points contained in a hyperrectangle [11].
Before we give our parallel version of the cell method, we introduce some notation used throughout this paper. The i-th coordinate, $i = 1, 2, \ldots, d$, of point p is denoted by $x_i(p)$; the minimum and maximum coordinate values of hyperrectangle q in the i-th coordinate direction are denoted by $x_i^l(q)$ and $x_i^u(q)$.
Algorithm: The Parallel Cell Method

Input: A set P of d-dimensional points and a set Q of d-dimensional hyperrectangles are distributed on a $\sqrt{n} \times \sqrt{n}$ mesh, at most one point or one hyperrectangle per processor. We index the mesh in snake-like row-major order.

Output: For each $q \in Q$, we store the points lying in the interior of q in the processor containing q.
1. Compute B, the smallest hyperrectangle containing P. B is divided into m cells along each coordinate direction, where $m = \lfloor n_P^{1/d} \rfloor$ and $n_P = |P|$. With each cell is associated a unique d-tuple $[i_1, i_2, \ldots, i_d]$ such that $0 \le i_k < m$, for $k = 1, 2, \ldots, d$. For each cell is also defined a unique processor index; the processor index of a cell with d-tuple $[i_1, i_2, \ldots, i_d]$ is $\sum_{k=1}^{d} i_k m^{k-1}$. We illustrate the cell subdivision in Figure 2.
2. For each processor, initialize the local variables first and last such that first > last.
3. For each point p, determine first $[i_1(p), i_2(p), \ldots, i_d(p)]$, the d-tuple of the cell that contains p. We have that
$$i_k(p) = \left\lfloor (m-1)\,\frac{x_k(p) - x_k^l(B)}{x_k^u(B) - x_k^l(B)} \right\rfloor, \quad \text{for } k = 1, 2, \ldots, d,$$
where $x_k^l(B)$ and $x_k^u(B)$ are the minimum and maximum coordinate values of B in the k-th coordinate direction. Next, compute c(p), the processor index corresponding to $[i_1(p), \ldots, i_d(p)]$.
4. For each point p, create the record $G(p) = [x_1(p), \ldots, x_d(p), c(p)]$. Sort the records into nondecreasing order with respect to their last component.
5. For each point p (from now on, "point" refers to the first d components of a G record), do
   (a) if $c(p_p) \ne c(p)$, then set c(p):first = i(p), and
   (b) if $c(p_s) \ne c(p)$, then set c(p):last = i(p),
   where $p_p$ ($p_s$) denotes the point that precedes (succeeds) p in snake-like row-major order, and i(p) is the index of the processor containing p.
6. For each hyperrectangle q, do as follows.
   (a) Determine the two d-tuples $[l_1(q), \ldots, l_d(q)]$ and $[u_1(q), \ldots, u_d(q)]$ such that q intersects each cell $[i_1, i_2, \ldots, i_d]$ for which $l_k(q) \le i_k \le u_k(q)$, for all $k = 1, 2, \ldots, d$. Compute $s(q) = \prod_{k=1}^{d} s_k(q)$, where $s_k(q) = u_k(q) - l_k(q) + 1$.
   (b) If $s(q) > 0$, then do as follows (a sketch of this cell-enumeration arithmetic is given after the algorithm).
       i(q) = 1;
       For $k = 1, 2, \ldots, d$, compute
           $i_k(q) = \lfloor (i(q)-1) / \prod_{l=1}^{k-1} s_l(q) \rfloor \bmod s_k(q) + l_k(q)$;
       Compute c(q), the processor index of cell $[i_1(q), \ldots, i_d(q)]$;
       j(q) = c(q):first;
       last(q) = c(q):last;
       L: if $j(q) \le last(q)$ then
           if the point in processor j(q) is contained in q, then store a copy of it in the processor containing q;
           j(q) = j(q) + 1;
       if $j(q) > last(q)$ and $i(q) < s(q)$ then
           i(q) = i(q) + 1;
           For $k = 1, 2, \ldots, d$, compute
               $i_k(q) = \lfloor (i(q)-1) / \prod_{l=1}^{k-1} s_l(q) \rfloor \bmod s_k(q) + l_k(q)$;
           Compute c(q), the processor index of cell $[i_1(q), \ldots, i_d(q)]$;
           j(q) = c(q):first;
           last(q) = c(q):last;
       if $j(q) \le last(q)$ then goto L;
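The index arithmetic in Steps 1 and 6(b) is mixed-radix encoding and decoding; the following sequential lines (ours) make it explicit. enumerate_cells yields the processor indexes of the cells intersected by a hyperrectangle with per-axis cell ranges $[l_k, u_k]$, in the same order as the loop above.

def cell_to_processor(t, m):
    # Step 1: processor index of the cell with d-tuple t, i.e. sum of t[k] * m**k.
    return sum(t[k] * m ** k for k in range(len(t)))

def enumerate_cells(l, u, m):
    # Step 6(b): decode i = 1, ..., s(q) into cell tuples and processor indexes.
    d = len(l)
    s = [u[k] - l[k] + 1 for k in range(d)]
    total = 1
    for sk in s:
        total *= sk
    for i in range(1, total + 1):
        t, div = [], 1
        for k in range(d):
            t.append((i - 1) // div % s[k] + l[k])
            div *= s[k]
        yield cell_to_processor(t, m)

# Example: list(enumerate_cells([1, 0], [2, 1], 4)) visits cells
# [1,0], [2,0], [1,1], [2,1], i.e. processors 1, 2, 5, 6.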
Theorem 1. The parallel cell method takes $O((d + r_{max})\sqrt{n})$ time, where $r_{max} = \max\{s(q) + d \cdot t(q) : q \in Q\}$, $s(q)$ is the number of cells intersected by the hyperrectangle q, and $t(q)$ is the number of points tested for inclusion in q.
Proof. Since we assume d to be much smaller than $\sqrt{n}$, we restrict our analysis to operations that require communication between processors. The first five steps of the algorithm correspond to the preprocessing algorithm of the sequential cell method. In Step 1, B and $n_P$ can be determined by the global sum operation; this takes a total of $O(d\sqrt{n})$ time. In Step 4, we use "dummy" records; that is, we create a record for every processor. If a processor does not contain a point, the last component of the record is set to $+\infty$. The effect of this is that, after sorting, all "real" points are contained in the processors indexed 0 through $n_P - 1$. Sorting the records requires $O(d\sqrt{n})$ time. A point can compare its processor index with the processor index of its predecessor and successor in O(1) time. The first and last variables can be set in $O(\sqrt{n})$ time. Therefore, Step 5 takes $O(\sqrt{n})$ time. In Step 6(b), let a round be the activities taking place between two consecutive executions of the first if-statement. Clearly, the number of rounds cannot exceed $\max\{s(q) + t(q) : q \in Q\}$. $\Box$

Figure 2: Two-dimensional example of a cell division for $n_P = 16$. Numbers within cells represent processor indexes, and dots represent points. Two hyperrectangles, $q_1$ and $q_2$, are included in the example.
Corollary 1. If the points are chosen uniformly and independently at random from the d-dimensional unit hypercube, and the hyperrectangles are cubical and fall completely within the unit hypercube, then the average-case time for the parallel cell method is $O((r+1)3^d\sqrt{n})$, where r is the average number of points contained in the largest hyperrectangle.

Proof. We assume that B is equal to the d-dimensional unit hypercube. For sufficiently large values of $n_P$, this is a reasonable assumption. Let w denote the width of the largest hyperrectangle. Then
$$s(q) \le (\lceil w m \rceil + 1)^d \le (\lceil r^{1/d} \rceil + 1)^d,$$
where $r = w^d n_P$ is the average number of points contained in the largest hyperrectangle. For $r^{1/d} \le 1$, $s(q) \le 2^d$. Otherwise,
$$s(q) < (r^{1/d} + 2)^d < r \cdot 3^d.$$
The average number of points per cell is O(1). $\Box$
4 A Worst-Case Efficient Algorithm
In this section, we present a worst-case efficient algorithm based on divide-and-conquer. Many mesh algorithms are based on divide-and-conquer; for example, Jeong and Lee [5] describe an algorithm to solve a two-dimensional multipoint location problem that is based on ideas similar to those used in our algorithm. Before we give the actual algorithm, let us briefly describe it for the two-dimensional case.

First, divide the input, P and Q, into two equal-sized parts, $P_1$ and $Q_1$, and $P_2$ and $Q_2$, such that each point in $P_1$ and the lower horizontal boundary of each hyperrectangle in $Q_1$ lies below every element in $P_2$ and $Q_2$. See Figure 3. Solve the corresponding subproblems recursively. We must now solve the problem for input $P_2$ and $Q_1$.

Figure 3: In this example, $P_1 = \{p_1\}$, $Q_1 = \{q_1, q_2, q_3\}$, $P_2 = \{p_2, p_3, p_4\}$, and $Q_2 = \{q_4\}$.

To this end, divide the input into two equal-sized parts, $P_1'$ and $Q_1'$, and $P_2'$ and $Q_2'$, such that each point in $P_2'$ and the upper horizontal boundary of each hyperrectangle in $Q_2'$ lies above every element in $P_1'$ and $Q_1'$. See Figure 4. Again, solve the corresponding subproblems recursively. It remains to solve the problem for input $P_1'$ and $Q_2'$. This problem is, however, of one dimension less than the original problem. If the problem is one-dimensional, solving it directly is easy; otherwise, we again apply divide-and-conquer.
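The recursion is perhaps easiest to see in a sequential rendering. The sketch below (our Python code, not the mesh implementation) follows the description above for arbitrary d: report splits on the d-th coordinate of the points and the lower bounds of the hyperrectangles, and report_upper splits on the points and the upper bounds. Coordinates are assumed pairwise distinct (general position), and the base case uses a double loop where the mesh version uses a sorted sweep.

def _split(entries):
    # Halve a combined sorted sequence of point (tag 0) and rectangle (tag 1)
    # entries; returns (P1, Q1, P2, Q2) in the notation of the text.
    half = len(entries) // 2
    low, high = entries[:half], entries[half:]
    return ([e[2] for e in low if e[1] == 0], [e[2] for e in low if e[1] == 1],
            [e[2] for e in high if e[1] == 0], [e[2] for e in high if e[1] == 1])

def report(pts, rects, dd, out):
    # Points are (coords, pid); rectangles are (lo, hi, qid). Dimensions above
    # dd are already known to be satisfied for every point/rectangle pair.
    if not pts or not rects:
        return
    if dd == 1:
        for lo, hi, qid in rects:
            out[qid] += [pid for c, pid in pts if lo[0] < c[0] < hi[0]]
        return
    entries = sorted([(c[dd - 1], 0, (c, pid)) for c, pid in pts] +
                     [(lo[dd - 1], 1, (lo, hi, qid)) for lo, hi, qid in rects])
    P1, Q1, P2, Q2 = _split(entries)
    report(P1, Q1, dd, out)
    report(P2, Q2, dd, out)
    report_upper(P2, Q1, dd, out)  # lower dd-bounds of Q1 lie below all of P2

def report_upper(pts, rects, dd, out):
    # Every rectangle's lower dd-bound already lies below every point.
    if not pts or not rects:
        return
    entries = sorted([(c[dd - 1], 0, (c, pid)) for c, pid in pts] +
                     [(hi[dd - 1], 1, (lo, hi, qid)) for lo, hi, qid in rects])
    P1, Q1, P2, Q2 = _split(entries)
    report_upper(P1, Q1, dd, out)
    report_upper(P2, Q2, dd, out)
    report(P1, Q2, dd - 1, out)    # dimension dd is now fully resolved

# Usage: out = collections.defaultdict(list); report(points, rects, d, out).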
Algorithm: Parallel Divide-and-Conquer

Input: A set P of d-dimensional points and a set Q of d-dimensional hyperrectangles are distributed on a $\sqrt{n} \times \sqrt{n}$ mesh, at most one point or one hyperrectangle per processor. We index the mesh in shuffled row-major order, and we assume that $\sqrt{n} = 2^k$ for some positive integer k.
Figure 4: In this example, $P_1' = \{p_2, p_3\}$, $Q_1' = \{q_2\}$, $P_2' = \{p_4\}$, and $Q_2' = \{q_1, q_3\}$.
Output: For each $q \in Q$, we store the points lying in the interior of q in the processor containing q.
1. Preprocessing:
For each point p, create the record $G_d(p) = [x_1(p), \ldots, x_d(p), a(p)]$, where a(p) is called the address of p, i.e., a(p) is equal to the index of the processor containing p. Next, for each hyperrectangle q, create the record
$$G_d(q) = [x_1^l(q), x_1^u(q), x_2^l(q), x_2^u(q), \ldots, x_d^l(q), x_d^u(q), i_d(q)],$$
where $i_d(q)$ is the index of the processor containing q. Finally, sort all records into nondecreasing order with respect to their $x_d$-coordinate ($G_d(q)$ records are sorted with respect to their $x_d^l(q)$-coordinate).
2. Call range_search($\sqrt{n}$, $\sqrt{n}$, d).
Procedure range_search (together with procedure range_search′) does the main part of the computations. These procedures are given below. The output from this step is, for each hyperrectangle q, a list of the addresses of the points contained in q. We store this list in the processor that contains the corresponding $G_d(q)$ record.
3. Postprocessing:
For each processor containing a $G_d(q)$ record, move the point addresses stored in the processor to the processor that contains q, that is, to the processor with index $i_d(q)$. Then, for each hyperrectangle q, process its list of point addresses. That is, for each address in the list, copy the point stored at that address to the processor containing q.
procedure range_search(r, c, d)
for each submesh of size $r \times c$ do in parallel

1. if r = c = 1 then return;

2. if d = 1 then (a sequential sketch of this base case follows the procedure)
   (a) For each processor, determine the index of the next processor (in shuffled row-major order) that contains a $G_1(p)$ record. Store the index in the local variable successor. (If no such processor exists, then successor = NIL.)
   (b) For each record $G_1(q) = [x_1^l(q), x_1^u(q), i_1(q)]$ do:
       k(q) = successor;
       while $k(q) \ne$ NIL do
           Let $G_1(p) = [x_1(p), a(p)]$ be the record in processor k(q);
           if $x_1^u(q) > x_1(p)$ then
               copy a(p) to the processor containing $G_1(q)$;
               k(q) = k(q):successor;
           else exit the while-loop;

3. if d > 1 then
   (a) if r = c then call range_search(r/2, c, d).
   (b) if r = c/2 then call range_search(r, c/2, d).
   (c) For each $G_d(p)$ record in the higher-indexed half of the submesh, create the record $G_d'(p) = G_d(p)$. For each $G_d(q)$ record in the lower-indexed half of the submesh, create the record
       $$G_d'(q) = [x_1^l(q), x_1^u(q), \ldots, x_{d-1}^l(q), x_{d-1}^u(q), x_d^u(q), j_d(q)],$$
       where $j_d(q)$ is the index of the processor containing the record. Finally, sort the $G_d'$ records into nondecreasing order with respect to $x_d$-coordinate ($G_d'(q)$ records are sorted with respect to their $x_d^u(q)$-coordinate).
   (d) Call range_search′(r, c, d).
   (e) For each processor containing a $G_d'(q)$ record, move the point addresses stored in the processor during the call to range_search′ to the processor with index $j_d(q)$.

return
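Sequentially, the d = 1 base case amounts to the following sweep (our code): each query record walks forward in the sorted order, collecting point addresses while $x_1^u(q)$ exceeds the next point's coordinate.

def base_case_1d(records):
    # records: ('p', x1, addr) and ('q', x1_lo, x1_hi, qid) tuples, sorted here
    # by their key (x1 for points, x1_lo for queries). Returns {qid: [addr, ...]}.
    records = sorted(records, key=lambda rec: rec[1])
    out = {}
    for pos, rec in enumerate(records):
        if rec[0] != 'q':
            continue
        x1_hi, qid = rec[2], rec[3]
        out[qid] = []
        for succ in records[pos + 1:]:       # the successor chain of Step 2(a)
            if succ[0] != 'p':
                continue                     # skip other query records
            if x1_hi > succ[1]:
                out[qid].append(succ[2])     # copy the point's address
            else:
                break                        # sorted: no later point qualifies
    return out

# Example: base_case_1d([('q', 0.1, 0.5, 7), ('p', 0.2, 3), ('p', 0.6, 4)])
#          == {7: [3]}.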
procedure range_search′(r, c, d)
for each submesh of size $r \times c$ do in parallel

1. if r = c = 1 then return;

2. if r = c then call range_search′(r/2, c, d).

3. if r = c/2 then call range_search′(r, c/2, d).

4. For each $G_d'(p)$ record in the lower-indexed half of the submesh, create the record
   $$G_{d-1}(p) = [x_1(p), x_2(p), \ldots, x_{d-1}(p), a(p)].$$
   For each $G_d'(q)$ record in the higher-indexed half of the submesh, create the record
   $$G_{d-1}(q) = [x_1^l(q), x_1^u(q), \ldots, x_{d-1}^l(q), x_{d-1}^u(q), i_{d-1}(q)],$$
   where $i_{d-1}(q)$ is the index of the processor containing the record. Finally, sort the $G_{d-1}$ records into nondecreasing order with respect to $x_{d-1}$-coordinate (the $G_{d-1}(q)$ records are sorted with respect to their $x_{d-1}^l(q)$-coordinate).

5. Call range_search(r, c, d−1).

6. For each processor containing a $G_{d-1}(q)$ record, move the point addresses stored in the processor during the call to range_search to the processor with index $i_{d-1}(q)$.

return
Theorem 2. The parallel divide-and-conquer method takes $O((r+1)16^d\sqrt{n})$ time, where r is the maximum number of points contained in any hyperrectangle.

Proof. It is sufficient to show that the number of routing steps (i.e., the transfer of one word of data between adjacent processors) is $O((r+1)16^d\sqrt{n})$. In the Preprocessing step, records containing $2d+1$ words of data are sorted; the number of routing steps is thus $O(d\sqrt{n})$. The Postprocessing step requires r concurrent write operations and $d \cdot r$ concurrent read operations, which gives a total of $O(d \cdot r\sqrt{n})$ routing steps.
To bound the number of routing steps done in procedures range_search and range_search′, we first consider the number of routing steps required when $r = 0$. Let the number of routing steps done by procedures range_search and range_search′ on a mesh of size $2^i \times 2^j$ be denoted by $R(i,j,d)$ and $R'(i,j,d)$, respectively. We can easily see that $R(k,k,1)$ is $O(2^k)$; Step 2(a) is a variant of the prefix sum operation, and in Step 2(b) at most one concurrent read operation is required. Suppose now that $d \ge 2$ and that $k > 0$; we then have the recurrence relations
$$R(k,k,d) \le R(k-1,k,d) + 2d\,R_s(k,k) + R'(k,k,d), \quad \text{and}$$
$$R'(k,k,d) \le R'(k-1,k,d) + (2d-1)\,R_s(k,k) + R(k,k,d-1),$$
where $R_s(i,j)$ denotes the number of routing steps required to sort numbers lying (one number per processor) in a mesh of size $2^i \times 2^j$. By expanding the first term on the right-hand side of each inequality, we get
$$R(k,k,d) \le 2d \sum_{i=1}^{k} (R_s(i-1,i) + R_s(i,i)) + \sum_{i=1}^{k} (R'(i-1,i,d) + R'(i,i,d)) \le 4d \sum_{i=1}^{k} R_s(i,i) + 2 \sum_{i=1}^{k} R'(i,i,d),$$
and
$$R'(k,k,d) \le 4d \sum_{i=1}^{k} R_s(i,i) + 2 \sum_{i=1}^{k} R(i,i,d-1).$$
By inserting the last inequality into the inequality for $R(k,k,d)$, we obtain
$$R(k,k,d) \le 4d \sum_{i=1}^{k} R_s(i,i) + 8d \sum_{i=1}^{k} (k+1-i)\,R_s(i,i) + 4 \sum_{i=1}^{k} (k+1-i)\,R(i,i,d-1) = R_{ss}(k,d) + 4 \sum_{i=1}^{k} (k+1-i)\,R(i,i,d-1),$$
where we introduce $R_{ss}(k,d)$ to denote the value of the sums that involve $R_s$. Expansion of this inequality gives us
$$R(k,k,d) \le R_{ss}(k,d) + 4 \sum_{i_1=1}^{k} (k+1-i_1)\,R_{ss}(i_1,d-1) + 4^2 \sum_{i_1=1}^{k} (k+1-i_1) \sum_{i_2=1}^{i_1} (i_1+1-i_2)\,R_{ss}(i_2,d-2) + \cdots + 4^{d-2} \sum_{i_1=1}^{k} (k+1-i_1) \sum_{i_2=1}^{i_1} (i_1+1-i_2) \cdots \sum_{i_{d-2}=1}^{i_{d-3}} (i_{d-3}+1-i_{d-2})\,R_{ss}(i_{d-2},2) + 4^{d-1} \sum_{i_1=1}^{k} (k+1-i_1) \sum_{i_2=1}^{i_1} (i_1+1-i_2) \cdots \sum_{i_{d-1}=1}^{i_{d-2}} (i_{d-2}+1-i_{d-1})\,R(i_{d-1},i_{d-1},1).$$
To evaluate the right-hand side of this inequality, we note the fact that
$$\sum_{i=1}^{k} (k+1-i)\,2^i \le 4 \cdot 2^k.$$
This implies that $R_{ss}(k,d)$ is $O(d\,2^k)$. Moreover, by repeated use of this fact we see that $R(k,k,d)$ is $O(2^k \sum_{i=0}^{d-1} (d-i)\,16^i)$, which is $O(16^d\,2^k)$.

Let us finally consider how many additional routing steps are required when $r > 0$. Let the number of additional routing steps done by procedures range_search and range_search′ on a mesh of size $2^i \times 2^j$ be denoted by $R_+(i,j,d)$ and $R'_+(i,j,d)$, respectively. It is then easy to see that $R_+(k,k,1)$ is $O(r\,2^k)$. For $d \ge 2$ and $k > 0$, we have the recurrence relations
$$R_+(k,k,d) \le R_+(k-1,k,d) + R'_+(k,k,d) + r\,R_m(k,k), \quad \text{and}$$
$$R'_+(k,k,d) \le R'_+(k-1,k,d) + R_+(k,k,d-1) + r\,R_m(k,k),$$
where $R_m(i,j)$ denotes the number of routing steps required to move numbers (one number per processor) from one set of processors to another set of processors in a mesh of size $2^i \times 2^j$. These recurrence relations are similar to the previous recurrence relations. Together with the fact that $R_m(k,k)$ is $O(2^k)$, this implies that $R_+(k,k,d)$ is $O(r\,16^d\,2^k)$. $\Box$
5 Experimental Evaluation
So far we have described and analyzed our algorithms at a theoretical
level. To understand better how the algorithms work in practice, we
have also implemented them on a MasPar MP-1 computer.
The MasPar MP-1 consists of an array control unit and a processor array. The array control unit controls the processor array, and the interaction between the front end computer and the processor array. In addition, the array control unit performs operations on scalar data. On the machine that we have access to, the processor array consists of a total of 16,384 processors arranged in a two-dimensional array of 128 rows and columns. Each processor in the processor array is a 1.8-MIPS processor, and has forty 32-bit registers, and 16 or 64 kilobytes of RAM. Communication between two processors in the processor array can be via X-net or Global Router, where X-net communications are restricted to be either horizontal, vertical, or diagonal. The Global Router allows communication between any pair of processors, but its efficiency is very data dependent: if many processors want to communicate with the same processor, the performance deteriorates dramatically.
The MasPar can be programmed in either MPL or Fortran, where MPL is based on ANSI C with extensions for data parallelism. Since MPL allows direct control over the machine, we have used MPL. As already mentioned in Section 2, MPL also offers some degree of flexibility such as addressing autonomy.

In our implementation we have as much as possible used the library functions provided with MPL. More specifically, library functions have been used for the global sum operations in the parallel cell method.

We have consistently used the Global Router for concurrent read/write operations. In only one case has this turned out to be problematic: in Step 6(b) of the parallel cell method it may happen that many processors first copy the first and last values from the same processor, and then continue to copy point coordinates from the same processors. We avoid this by using randomization. With each hyperrectangle q we associate a random number r(q), such that $0 \le r(q) \le 1$. This number is used to modify the order in which cells and points are processed by a hyperrectangle.
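The report does not spell out how r(q) modifies the processing order; one plausible reading (our assumption, not necessarily the implemented one) is to use r(q) as a cyclic starting offset into the enumeration, so that hyperrectangles covering the same cells start from different cells and points:

def randomized_order(count, rq):
    # Visit indexes 1..count cyclically, starting at an offset derived from
    # the random number rq in [0, 1].
    if count == 0:
        return []
    start = min(int(rq * count), count - 1)
    return [(start + i) % count + 1 for i in range(count)]

# Example: randomized_order(4, 0.6) == [3, 4, 1, 2].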
In both algorithms we need to sort records of data. There is no library function for this, but in the parallel cell method we can sort by using a ranking function, and then the Global Router to move each record to its correct location. For each active processor, the ranking function computes the rank of the value of a local variable. This approach does not work in the parallel divide-and-conquer method, since we then need to sort records within submeshes simultaneously. We have instead implemented sorting routines based on bitonic sort [12].
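Bitonic sort has a fixed, data-independent comparison pattern, which is what makes it implementable on a mesh and, importantly here, within each submesh independently. A compact sequential sketch of the algorithm of [12] (our Python rendering):

def bitonic_sort(a, ascending=True):
    # Sorts a sequence whose length is a power of two.
    if len(a) <= 1:
        return list(a)
    half = len(a) // 2
    first = bitonic_sort(a[:half], True)     # ascending half
    second = bitonic_sort(a[half:], False)   # descending half: together bitonic
    return _bitonic_merge(first + second, ascending)

def _bitonic_merge(a, ascending):
    # Recursively turns a bitonic sequence into a sorted one.
    if len(a) <= 1:
        return a
    half = len(a) // 2
    for i in range(half):
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return (_bitonic_merge(a[:half], ascending) +
            _bitonic_merge(a[half:], ascending))

# Example: bitonic_sort([3, 1, 4, 2]) == [1, 2, 3, 4].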
We have evaluated the algorithms for two kinds of two-dimensional input data. For both kinds of input there are 8192 points and 8192 equal-sized squares. For the uniform kind of input, points and squares are chosen at random from the unit square. For the diagonal kind of input, points and squares are chosen at random along the diagonal of the unit square (i.e., the diagonal of each square coincides with the diagonal of the unit square). The width of the squares is in each case chosen such that each square contains four points on average.
The running times (in milliseconds (ms)) are as follows. The
running time of the parallel cell method is 55 ms for uniform input,
and 205 ms for diagonal input. For the parallel divide-and-conquer
algorithm the corresponding running times are 1291 ms and 1207 ms.
The parallel cell method is thus much faster than the parallel divide-and-conquer method for both kinds of input. Although the diagonal kind of input is not a worst-case input for the cell method, it is still fairly "bad": on average we must test each square against at least 90 points.
A substantial part of the running time of the parallel divide-and-
conquer algorithm is used for sorting records. The sorting algorithm
that we have implemented is asymptotically optimal, but it is likely
that a more careful implementation could make it run faster. It is
thus possible that we can improve the running time of the parallel
divide-and-conquer algorithm considerably.
To compare our algorithms with sequential algorithms for range searching, we have implemented the sequential cell method on our front end machine, a DECstation 5000 (Model 200, 25 MHz). The parallel cell method is 13-15 times faster than the sequential cell method for both kinds of input. The speedup is thus not very impressive. Partly this is due to unavoidable communication costs. Another reason is that the front end computer is more powerful in terms of floating point operations: our measurements show that a point inclusion test (that is, a test of whether a point lies within a hyperrectangle) is about forty times faster on the front end computer than on the MasPar MP-1.
6 Conclusions

We have presented two algorithms for batched range searching on a mesh: one algorithm based on cell division, and another algorithm based on divide-and-conquer. The divide-and-conquer algorithm takes $O((r+1)16^d\sqrt{n})$ time, where r is the maximum number of points contained in any hyperrectangle. We can show that if r is bounded by some constant independent of n, and the points contained in a hyperrectangle must be stored in the processor that initially contained the hyperrectangle, then any algorithm must take $\Omega(d(r+1)\sqrt{n})$ time in the worst case. For a fixed dimension d, the divide-and-conquer algorithm is thus worst-case optimal (within a multiplicative constant). The cell method takes $O((d + w_{max})\sqrt{n})$ time, where $w_{max} = \max\{s(q) + d \cdot t(q) : q \in Q\}$, $s(q)$ is the number of cells intersected by the hyperrectangle q, and $t(q)$ is the number of points tested for inclusion in q. Thus, this method may take $\Omega(d\,n\sqrt{n})$ time even when $r = 0$. However, as shown by Corollary 1 and our experimental results, the cell method may outperform the divide-and-conquer method in practice.
We require that both algorithms store copies of the points contained within a hyperrectangle in the processor that initially contained the hyperrectangle. An alternative would have been to design load-balanced algorithms, which store the output evenly distributed over the processors. However, such algorithms would be forced to spend time on various load-balancing activities. At least for the applications that we consider, it is likely that such algorithms would be slower than the cell method.
References

[1] S.G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall International, London, UK, first edition, 1989.

[2] M.J. Atallah. Parallel techniques for computational geometry. Proc. IEEE, 80(9):1435-1448, 1992.

[3] J.L. Bentley and J.H. Friedman. Data structures for range searching. Computing Surveys, 11:397-409, 1979.

[4] P.-O. Fjällström, J. Petersson, L. Nilsson, and Z.-H. Zhong. Evaluation of range searching methods for contact searching in mechanical engineering. To appear in Int. J. Computational Geometry & Applications.

[5] C.S. Jeong and D.T. Lee. Parallel geometric algorithms on a mesh-connected computer. Algorithmica, 5:155-177, 1990.

[6] M. Kumar and D.S. Hirschberg. An efficient implementation of Batcher's odd-even merge algorithm and its application in parallel sorting schemes. IEEE Transactions on Computers, C-32(3):254-264, March 1983.

[7] D. Nassimi and S. Sahni. Bitonic sort on a mesh-connected parallel computer. IEEE Transactions on Computers, C-28(1):2-7, January 1979.

[8] D. Nassimi and S. Sahni. Data broadcasting in SIMD computers. IEEE Transactions on Computers, C-30(2):101-107, February 1981.

[9] S.-J. Oh and M. Suk. Parallel algorithms for geometric searching problems. In Proc. Supercomputing '89, pages 344-350, 1989.

[10] F.P. Preparata and M.I. Shamos. Computational Geometry: An Introduction. Springer-Verlag, New York, NY, second edition, 1985.

[11] R. Sedgewick. Algorithms. Addison-Wesley, Reading, MA, second edition, 1988.

[12] C.D. Thompson and H.T. Kung. Sorting on a mesh-connected parallel computer. Communications of the ACM, 20(4):263-271, 1977.
More Related Content

What's hot

Lecture 3 parallel programming platforms
Lecture 3   parallel programming platformsLecture 3   parallel programming platforms
Lecture 3 parallel programming platformsVajira Thambawita
 
Parallel Processing
Parallel ProcessingParallel Processing
Parallel ProcessingRTigger
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationGeoffrey Fox
 
Reversed-Trellis Tail-Biting Convolutional Code (RT-TBCC) Decoder Architectur...
Reversed-Trellis Tail-Biting Convolutional Code (RT-TBCC) Decoder Architectur...Reversed-Trellis Tail-Biting Convolutional Code (RT-TBCC) Decoder Architectur...
Reversed-Trellis Tail-Biting Convolutional Code (RT-TBCC) Decoder Architectur...IJECEIAES
 
INTRODUCTION TO PARALLEL PROCESSING
INTRODUCTION TO PARALLEL PROCESSINGINTRODUCTION TO PARALLEL PROCESSING
INTRODUCTION TO PARALLEL PROCESSINGGS Kosta
 
Communication costs in parallel machines
Communication costs in parallel machinesCommunication costs in parallel machines
Communication costs in parallel machinesSyed Zaid Irshad
 
P Systems and Distributed Computing
P Systems and Distributed ComputingP Systems and Distributed Computing
P Systems and Distributed ComputingApostolos Syropoulos
 
Lecture 3
Lecture 3Lecture 3
Lecture 3Mr SMAK
 
DESIGN AND VHDL IMPLEMENTATION OF 64-POINT FFT USING TWO STRUCTURE 8-POINT FF...
DESIGN AND VHDL IMPLEMENTATION OF 64-POINT FFT USING TWO STRUCTURE 8-POINT FF...DESIGN AND VHDL IMPLEMENTATION OF 64-POINT FFT USING TWO STRUCTURE 8-POINT FF...
DESIGN AND VHDL IMPLEMENTATION OF 64-POINT FFT USING TWO STRUCTURE 8-POINT FF...Journal For Research
 
IRJET- Review Paper on Study of Various Interleavers and their Significance
IRJET- Review Paper on Study of Various Interleavers and their SignificanceIRJET- Review Paper on Study of Various Interleavers and their Significance
IRJET- Review Paper on Study of Various Interleavers and their SignificanceIRJET Journal
 
computer system architecture
computer system architecturecomputer system architecture
computer system architecturedileesh E D
 
Parallel processing coa
Parallel processing coaParallel processing coa
Parallel processing coaBala Vignesh
 
Lecture 7 cuda execution model
Lecture 7   cuda execution modelLecture 7   cuda execution model
Lecture 7 cuda execution modelVajira Thambawita
 

What's hot (20)

Lecture 3 parallel programming platforms
Lecture 3   parallel programming platformsLecture 3   parallel programming platforms
Lecture 3 parallel programming platforms
 
Parallel Processing
Parallel ProcessingParallel Processing
Parallel Processing
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel application
 
Reversed-Trellis Tail-Biting Convolutional Code (RT-TBCC) Decoder Architectur...
Reversed-Trellis Tail-Biting Convolutional Code (RT-TBCC) Decoder Architectur...Reversed-Trellis Tail-Biting Convolutional Code (RT-TBCC) Decoder Architectur...
Reversed-Trellis Tail-Biting Convolutional Code (RT-TBCC) Decoder Architectur...
 
Aca2 08 new
Aca2 08 newAca2 08 new
Aca2 08 new
 
INTRODUCTION TO PARALLEL PROCESSING
INTRODUCTION TO PARALLEL PROCESSINGINTRODUCTION TO PARALLEL PROCESSING
INTRODUCTION TO PARALLEL PROCESSING
 
Communication costs in parallel machines
Communication costs in parallel machinesCommunication costs in parallel machines
Communication costs in parallel machines
 
Aca2 09 new
Aca2 09 newAca2 09 new
Aca2 09 new
 
P Systems and Distributed Computing
P Systems and Distributed ComputingP Systems and Distributed Computing
P Systems and Distributed Computing
 
Parallel Processing Concepts
Parallel Processing Concepts Parallel Processing Concepts
Parallel Processing Concepts
 
Lecture 3
Lecture 3Lecture 3
Lecture 3
 
Modern processors
Modern processorsModern processors
Modern processors
 
DESIGN AND VHDL IMPLEMENTATION OF 64-POINT FFT USING TWO STRUCTURE 8-POINT FF...
DESIGN AND VHDL IMPLEMENTATION OF 64-POINT FFT USING TWO STRUCTURE 8-POINT FF...DESIGN AND VHDL IMPLEMENTATION OF 64-POINT FFT USING TWO STRUCTURE 8-POINT FF...
DESIGN AND VHDL IMPLEMENTATION OF 64-POINT FFT USING TWO STRUCTURE 8-POINT FF...
 
IRJET- Review Paper on Study of Various Interleavers and their Significance
IRJET- Review Paper on Study of Various Interleavers and their SignificanceIRJET- Review Paper on Study of Various Interleavers and their Significance
IRJET- Review Paper on Study of Various Interleavers and their Significance
 
Gn3311521155
Gn3311521155Gn3311521155
Gn3311521155
 
Chapter 1 pc
Chapter 1 pcChapter 1 pc
Chapter 1 pc
 
computer system architecture
computer system architecturecomputer system architecture
computer system architecture
 
Parallel processing coa
Parallel processing coaParallel processing coa
Parallel processing coa
 
Lecture 7 cuda execution model
Lecture 7   cuda execution modelLecture 7   cuda execution model
Lecture 7 cuda execution model
 

Viewers also liked

Assessing the compactness and isolation of individual clusters
Assessing the compactness and isolation of individual clustersAssessing the compactness and isolation of individual clusters
Assessing the compactness and isolation of individual clustersperfj
 
Disaster Risk Reduction and Management and Earthquake Preparedness - Davao
Disaster Risk Reduction and Management and Earthquake Preparedness - DavaoDisaster Risk Reduction and Management and Earthquake Preparedness - Davao
Disaster Risk Reduction and Management and Earthquake Preparedness - Davaojhaymz02
 
The Barangay Disaster Risk Reduction Management Plan
The Barangay Disaster Risk Reduction Management PlanThe Barangay Disaster Risk Reduction Management Plan
The Barangay Disaster Risk Reduction Management PlanBarangay Hall
 
Disaster Risk Reduction and Management
Disaster Risk Reduction and ManagementDisaster Risk Reduction and Management
Disaster Risk Reduction and ManagementRyann Castro
 

Viewers also liked (7)

Assessing the compactness and isolation of individual clusters
Assessing the compactness and isolation of individual clustersAssessing the compactness and isolation of individual clusters
Assessing the compactness and isolation of individual clusters
 
Earthquakes
EarthquakesEarthquakes
Earthquakes
 
Disaster Risk Reduction and Management and Earthquake Preparedness - Davao
Disaster Risk Reduction and Management and Earthquake Preparedness - DavaoDisaster Risk Reduction and Management and Earthquake Preparedness - Davao
Disaster Risk Reduction and Management and Earthquake Preparedness - Davao
 
Earthquake and its hazards
Earthquake and its hazardsEarthquake and its hazards
Earthquake and its hazards
 
The Barangay Disaster Risk Reduction Management Plan
The Barangay Disaster Risk Reduction Management PlanThe Barangay Disaster Risk Reduction Management Plan
The Barangay Disaster Risk Reduction Management Plan
 
Seismic waveanimations braile copy
Seismic waveanimations braile copySeismic waveanimations braile copy
Seismic waveanimations braile copy
 
Disaster Risk Reduction and Management
Disaster Risk Reduction and ManagementDisaster Risk Reduction and Management
Disaster Risk Reduction and Management
 

Similar to cis97007

cis97003
cis97003cis97003
cis97003perfj
 
Complier design
Complier design Complier design
Complier design shreeuva
 
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIComprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIijtsrd
 
IRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CLIRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CLIRJET Journal
 
Parallel implementation of pulse compression method on a multi-core digital ...
Parallel implementation of pulse compression method on  a multi-core digital ...Parallel implementation of pulse compression method on  a multi-core digital ...
Parallel implementation of pulse compression method on a multi-core digital ...IJECEIAES
 
cis98006
cis98006cis98006
cis98006perfj
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentEricsson
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Algorithm selection for sorting in embedded and mobile systems
Algorithm selection for sorting in embedded and mobile systemsAlgorithm selection for sorting in embedded and mobile systems
Algorithm selection for sorting in embedded and mobile systemsJigisha Aryya
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
International Journal of Computational Engineering Research (IJCER)
International Journal of Computational Engineering Research (IJCER) International Journal of Computational Engineering Research (IJCER)
International Journal of Computational Engineering Research (IJCER) ijceronline
 
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP IJCSEIT Journal
 
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...Derryck Lamptey, MPhil, CISSP
 
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...IRJET Journal
 
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...RSIS International
 

Similar to cis97007 (20)

cis97003
cis97003cis97003
cis97003
 
Bh36352357
Bh36352357Bh36352357
Bh36352357
 
Complier design
Complier design Complier design
Complier design
 
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIComprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
 
IRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CLIRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CL
 
Parallel implementation of pulse compression method on a multi-core digital ...
Parallel implementation of pulse compression method on  a multi-core digital ...Parallel implementation of pulse compression method on  a multi-core digital ...
Parallel implementation of pulse compression method on a multi-core digital ...
 
cis98006
cis98006cis98006
cis98006
 
Solution(1)
Solution(1)Solution(1)
Solution(1)
 
Gk3611601162
Gk3611601162Gk3611601162
Gk3611601162
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Algorithm selection for sorting in embedded and mobile systems
Algorithm selection for sorting in embedded and mobile systemsAlgorithm selection for sorting in embedded and mobile systems
Algorithm selection for sorting in embedded and mobile systems
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
shashank_mascots1996_00501002
shashank_mascots1996_00501002shashank_mascots1996_00501002
shashank_mascots1996_00501002
 
International Journal of Computational Engineering Research (IJCER)
International Journal of Computational Engineering Research (IJCER) International Journal of Computational Engineering Research (IJCER)
International Journal of Computational Engineering Research (IJCER)
 
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
 
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
 
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
 
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
 
D031201021027
D031201021027D031201021027
D031201021027
 

cis97007

  • 1. Linkoping Electronic Articles in Computer and Information Science Vol. 2(1997): nr 07 Linkoping University Electronic Press Linkoping, Sweden http://www.ep.liu.se/ea/cis/1997/007/ Batched Range Searching on a Mesh-Connected SIMD Computer Per-Olof Fjallstrom Department of Computer and Information Science Linkoping University Linkoping, Sweden
  • 2. Published on July 7, 1997 by Linkoping University Electronic Press 581 83 Linkoping, Sweden Linkoping Electronic Articles in Computer and Information Science ISSN 1401-9841 Series editor: Erik Sandewall c 1997 Per-Olof Fjallstrom Typeset by the author using LATEX Formatted using etendu style Recommended citation: <Author>. <Title>. Linkoping Electronic Articles in Computer and Information Science, Vol. 2(1997): nr 07. http://www.ep.liu.se/ea/cis/1997/007/. July 7, 1997. This URL will also contain a link to the author's home page. The publishers will keep this article on-line on the Internet (or its possible replacement network in the future) for a period of 25 years from the date of publication, barring exceptional circumstances as described separately. The on-line availability of the article implies a permanent permission for anyone to read the article on-line, and to print out single copies of it for personal use. This permission can not be revoked by subsequent transfers of copyright. All other uses of the article, including for making copies for classroom use, are conditional on the consent of the copyright owner. The publication of the article on the date stated above included also the production of a limited number of copies on paper, which were archived in Swedish university libraries like all other written works published in Sweden. The publisher has taken technical and administrative measures to assure that the on-line version of the article will be permanently accessible using the URL stated above, unchanged, and permanently equal to the archived printed copies at least until the expiration of the publication period. For additional information about the Linkoping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/ or by conventional mail to the address stated above.
  • 3. Abstract Given a set of n points and hyperrectangles in d-dimensional space, the batched range-searching problem is to determine which points each hyperrectangle contains. We present two parallel algorithms for this problem on a pn pn mesh-connected paral- lel computer: one average-case e cient algorithm based on cell division, and one worst-case e cient divide-and-conquer algo- rithm. Besides the asymptotic analysis of their running times, we present an experimental evaluation of the algorithms. Keywords Parallel algorithms, mesh-connected parallel comput- ers, range searching. The work presented here is funded by CENIIT (the Center for Industrial Information Technology) at Linkoping University. A shorter version of this report has been accepted for presentation at the Ninth IASTED International Conference on Parallel and Distributed Computing and Systems, October 13-16, 1997, Washington D.C., USA.
  • 4. 1 1 Introduction The batched range-searching problem is as follows. Given a set P of points and a set Q of hyperrectangles in d-dimensional space, report, for each hyperrectangle, which points it contains. (A hyperrectan- gle is the Cartesian product of intervals on distinct coordinate axes.) In on-line range searching, the hyperrectangles are given one at a time. Several sequential range-searching algorithms have been pro- posed 3, 10, 11]. Both on-line and batched range searching have several important applications, for example in statistics, geographic data processing, and computer-aided engineering. More speci cally, we have identi ed batched range searching as an important subprob- lem in computer simulation of mechanical deformation processes such as vehicle collisions 4]. A two-dimensional mesh-connected parallel computer of size pnpn consists of n identical processors organized in a rectangular array of pn rows and pncolumns. A bidirectionalcommunication linkcon- nects each pair of adjacent processors along the same row or column. Due to the regular interconnection pattern, mesh-connected comput- ers are inexpensive to build, and several such computers are on the market. In an SIMD (Single Instruction, Multiple Data) computer, the processors are synchronized and operate under the control of a single program. Throughout this paper we refer to a mesh-connected SIMD computer as a mesh. Many algorithms have been designed for the mesh. For a survey of mesh algorithms for geometric problems, see Atallah 2]. In this paper we describe and analyze two mesh algorithms for batched range searching. One algorithm is based on an average- case e cient sequential algorithm, whereas the other is a worst-case e cient divide-and-conquer algorithm. We have implemented and experimentally evaluated both of the algorithms. Our algorithms are based on well-known techniques such as divide-and-conquer, but we are not aware of any other mesh algorithms for range searching. (Oh and Suk 9] present a mesh algorithm for the on-line version of the range-counting problem. That is, their algorithm gives the number of points contained in a hyperrectangle.) In our development of range-searching algorithms for the mesh, we assume that P and Q together have at most n elements, and that each processor initially has at most one point or hyperrectangle in its local memory. At the end of execution, the points contained in a hyperrectangle must reside in the local memory of the processor that initially contained the hyperrectangle. We assume also that the number of points and the number of hyperrectangles are of the same order of magnitude, and that the number of points contained in a hyperrectangle is independent of n. These assumptions are valid in many applications. We organize the rest of the paper as follows. In the next section we give some additional information concerning the mesh, and describe
  • 5. 2 some basic operations used by our algorithms. In Section 3 and 4, we describe our mesh algorithms for batched range searching. In Section 5, we describe how we implemented the algorithms on a MasPar MP-1, and report some experimental results. Section 6 o ers some concluding remarks. 2 Preliminaries As mentioned in the previous section, a single program controls the mesh, that is, it is a Single Instruction, Multiple Data computer. In its most rigid form, SIMD requires that all processors execute the same instruction, and access data from the same address in their re- spective memories. We relax these requirements as follows. First, a processor may be either active or inactive, and an instruction is executed only by active processors. Moreover, to be able to carry out operations that require all processors to be active, we assume that activating all processors temporarily is possible. Second, each processor can do its own address computation. More speci cally, we assume that processors simultaneously can execute an array index- ing instruction such as A i] = b", where the value of i may di er between processors. These features are all available in modern SIMD computers such as the MasPar MP-1 computer. Each processor is identi ed by its pair of row and column indexes, (i;j), where 0 i;j < pn. In addition, processors are often indexed by some one-to-one mapping from f0;1;:::;pn 1g f0;1;:::;pn 1g to f0;1;:::;n 1g. Various indexing schemes are used, for example row-major, snake-like row-major, and shu ed row-major indexing. In this paper we use snake-like row-major and shu ed row-major indexing (see Figure 1). We assume that each processor knows its indexes. The local memory of each processor consists of a xed num- ber of memory cells (words). We assume that the size of a word is su ciently large to contain a single coordinate value or processor in- dex. The transfer of a word of data between adjacent processors and the standard arithmetic operations on the contents of a word can be done in O(1) time. Sorting is one of the most important operations in parallel com- putation. In many situations we need to rearrange a set of n keys, one in each processor, such that the i-th smallest key is moved to the processor with index i 1, for all i = 1;2;:::;n. Sorting can be done in O( pn) time 12, 7, 6]. Two other important data movement operations are concurrent read and concurrent write. In a concurrent read operation, denoted q = s(i):p, each processor i holds an index s(i) in its local memory. The operation copies the data in memory cell p in the local memory of the processor s(i) to memory cell q in the local memory of processor i. In the concurrent write operation, denoted d(i):q = p, each processor i holds a unique index d(i) in its local memory. The operation copies
Figure 1: Mesh with n = 16. The first integer within each processor is the snake-like row-major index of the processor, and the second integer is the shuffled row-major index.

Another fundamental operation is the global sum operation, for which the input consists of an array [a_0, a_1, ..., a_{n−1}], where a_i is contained in processor i. The output consists of the value of a_0 ⊕ a_1 ⊕ ··· ⊕ a_{n−1} (where ⊕ represents some associative binary operator, such as + or maximum), stored in the local memory of each processor. Closely related to the global sum operation is the prefix sum operation. It has the same input as the global sum operation, but the output consists of the value of a_0 ⊕ a_1 ⊕ ··· ⊕ a_i stored in the local memory of processor i. We can compute both global and prefix sums in O(√n) time [1].

3 An Average-Case Efficient Algorithm

The parallel algorithm presented in this section is based on the cell method, one of the simplest sequential methods for range searching. It consists of a preprocessing algorithm, whose output is a data structure on the given point set P, and a query-processing algorithm, which determines which points are contained in a given hyperrectangle. The preprocessing algorithm is as follows. First, find the smallest hyperrectangular box that contains P. Partition this box into a number of identical hyperrectangular cells, and initialize a point list for each cell. Finally, for each point, determine which cell contains it and add it to that cell's point list. To determine which points are contained in a hyperrectangle q, determine which cells q intersects; for each intersected cell, find its point list and test each point in the list for inclusion in q. (A sequential sketch of this method follows below.)
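The following is a minimal sequential sketch of the method just described, added here for illustration rather than taken from the report. It assumes that B, the bounding box of the points, has positive extent in every coordinate direction, and it uses the same cell-index mapping as Step 3 of the parallel algorithm given below.

```python
# Hedged sketch (an addition): the sequential cell method described above.
from collections import defaultdict
from itertools import product

def cell_of(p, lo, hi, m):
    # The Step-3 mapping of the parallel algorithm below:
    # i_k = floor((m - 1) * (x_k - lo_k) / (hi_k - lo_k)).
    return tuple(int((m - 1) * (p[k] - lo[k]) / (hi[k] - lo[k]))
                 for k in range(len(p)))

def build_cells(points, m):
    d = len(points[0])
    lo = [min(p[k] for p in points) for k in range(d)]
    hi = [max(p[k] for p in points) for k in range(d)]
    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p, lo, hi, m)].append(p)
    return lo, hi, cells

def query(ql, qu, lo, hi, m, cells):
    # Clamp the query's corner cells to the grid, scan the covered cells,
    # and test each listed point for interior containment.
    lo_cell = [max(c, 0) for c in cell_of(ql, lo, hi, m)]
    up_cell = [min(c, m - 1) for c in cell_of(qu, lo, hi, m)]
    return [p
            for tup in product(*(range(lo_cell[k], up_cell[k] + 1)
                                 for k in range(len(lo))))
            for p in cells.get(tup, ())
            if all(ql[k] < p[k] < qu[k] for k in range(len(p)))]

pts = [(0.1, 0.2), (0.6, 0.7), (0.9, 0.9)]
lo, hi, cells = build_cells(pts, m=2)
print(query((0.0, 0.0), (0.7, 0.8), lo, hi, 2, cells))
# [(0.1, 0.2), (0.6, 0.7)]
```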
Although its worst-case performance is poor, the sequential cell method is quite efficient in practice. Intuitively, the reason is that the input points are often uniformly distributed over the smallest hyperrectangle containing them. If, in addition, the hyperrectangles are almost "cubical", that is, not too long and thin, it can be shown that the number of point-inclusion tests and the number of intersected cells are of the same order of magnitude as the number of points contained in a hyperrectangle [11].

Before we give our parallel version of the cell method, we introduce some notation used throughout this paper. The i-th coordinate, i = 1, 2, ..., d, of point p is denoted by x_i(p); the minimum and maximum coordinate values of hyperrectangle q in the i-th coordinate direction are denoted by x_i^l(q) and x_i^u(q).

Algorithm: The Parallel Cell Method

Input: A set P of d-dimensional points and a set Q of d-dimensional hyperrectangles, distributed on a √n × √n mesh with at most one point or one hyperrectangle per processor. We index the mesh in snake-like row-major order.

Output: For each q ∈ Q, we store the points lying in the interior of q in the processor containing q.

1. Compute B, the smallest hyperrectangle containing P. B is divided into m cells along each coordinate direction, where m = ⌊n_P^{1/d}⌋ and n_P = |P|. With each cell is associated a unique d-tuple [i_1, i_2, ..., i_d] such that 0 ≤ i_k < m for k = 1, 2, ..., d. For each cell a unique processor index is also defined; the processor index of the cell with d-tuple [i_1, i_2, ..., i_d] is Σ_{k=1}^{d} i_k·m^{k−1}. We illustrate the cell subdivision in Figure 2.

2. For each processor, initialize the local variables first and last such that first > last.

3. For each point p, first determine [i_1(p), i_2(p), ..., i_d(p)], the d-tuple of the cell that contains p. We have that

  i_k(p) = ⌊(m−1)·(x_k(p) − x_k^l(B)) / (x_k^u(B) − x_k^l(B))⌋

for k = 1, 2, ..., d, where x_k^l(B) and x_k^u(B) are the minimum and maximum coordinate values of B in the k-th coordinate direction. Next, compute c(p), the processor index corresponding to [i_1(p), ..., i_d(p)].

4. For each point p, create the record G(p) = [x_1(p), ..., x_d(p), c(p)]. Sort the records into nondecreasing order with respect to their last component.

5. For each point p (from now on, "point" refers to the first d components of a G record), do
(a) if c(p_p) ≠ c(p), then set c(p).first = i(p), and
(b) if c(p_s) ≠ c(p), then set c(p).last = i(p),

where p_p (p_s) denotes the point that precedes (succeeds) p in snake-like row-major order, and i(p) is the index of the processor containing p.

6. For each hyperrectangle q, do as follows.

(a) Determine the two d-tuples [l_1(q), ..., l_d(q)] and [u_1(q), ..., u_d(q)] such that q intersects exactly the cells [i_1, i_2, ..., i_d] for which l_k(q) ≤ i_k ≤ u_k(q) for all k = 1, 2, ..., d. Compute s(q) = Π_{k=1}^{d} s_k(q), where s_k(q) = u_k(q) − l_k(q) + 1.

(b) If s(q) > 0, then do as follows.

  i(q) = 1;
  for k = 1, 2, ..., d, compute i_k(q) = ⌊(i(q)−1) / Π_{l=1}^{k−1} s_l(q)⌋ mod s_k(q) + l_k(q);
  compute c(q), the processor index of cell [i_1(q), ..., i_d(q)];
  j(q) = c(q).first; last(q) = c(q).last;
  L: if j(q) ≤ last(q) then
       if the point in processor j(q) is contained in q, then store a copy of it in the processor containing q;
       j(q) = j(q) + 1;
  if j(q) > last(q) and i(q) < s(q) then
       i(q) = i(q) + 1;
       for k = 1, 2, ..., d, compute i_k(q) = ⌊(i(q)−1) / Π_{l=1}^{k−1} s_l(q)⌋ mod s_k(q) + l_k(q);
       compute c(q), the processor index of cell [i_1(q), ..., i_d(q)];
       j(q) = c(q).first; last(q) = c(q).last;
  if j(q) ≤ last(q) then goto L;

(A sketch of the mixed-radix decoding used here is given after the proof of Theorem 1.)

Theorem 1. The parallel cell method takes O((d + r_max)·√n) time, where r_max = max{s(q) + d·t(q) : q ∈ Q}, s(q) is the number of cells intersected by hyperrectangle q, and t(q) is the number of points tested for inclusion in q.

Proof. Since we assume d to be much smaller than √n, we restrict our analysis to operations that require communication between processors. The first five steps of the algorithm correspond to the preprocessing algorithm of the sequential cell method. In Step 1, B and n_P can be determined by the global sum operation; this takes a total of O(d√n) time. In Step 4, we use "dummy" records; that is, we create a record for every processor. If a processor does not contain a point, the last component of its record is set to +∞. The effect of this is that, after sorting, all "real" points are contained in the processors indexed 0 through n_P − 1. Sorting the records requires O(d√n) time. A point can compare its processor index with the processor indexes of its predecessor and successor in O(1) time. The first and last variables can be set in O(√n) time. Therefore, Step 5 takes O(√n) time. In Step 6(b), let a round be the activities taking place between two consecutive executions of the first if-statement. Clearly, the number of rounds cannot exceed max{s(q) + t(q) : q ∈ Q}; since each round performs a constant number of concurrent read operations, and a copied point consists of d words, Step 6 takes O(r_max·√n) time. □
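The counter-to-cell decoding in Step 6(b) is ordinary mixed-radix decoding. The following small sketch (an addition for illustration, not the report's code) shows that enumerating i(q) = 1, ..., s(q) visits each intersected cell exactly once:

```python
# Hedged sketch (an addition): the mixed-radix decoding used in Step 6(b).
# Cell counter i (1-based, up to s_1 * s_2 * ... * s_d) is mapped to the
# d-tuple [i_1, ..., i_d] of an intersected cell, offset by the lower
# corner l of the query's cell range.

def cell_tuple(i, s, l):
    """i: 1-based cell counter; s: per-axis counts s_k(q);
    l: per-axis lower cell indexes l_k(q)."""
    tup = []
    stride = 1
    for k in range(len(s)):
        tup.append(((i - 1) // stride) % s[k] + l[k])
        stride *= s[k]
    return tup

# A query spanning cells 4..5 in the first direction and 7..9 in the second:
s, l = [2, 3], [4, 7]
cells = [cell_tuple(i, s, l) for i in range(1, 2 * 3 + 1)]
assert cells == [[4, 7], [5, 7], [4, 8], [5, 8], [4, 9], [5, 9]]
```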
Figure 2: Two-dimensional example of a cell division for n_P = 16. Numbers within cells represent processor indexes, and dots represent points. Two hyperrectangles, q_1 and q_2, are included in the example.

Corollary 1. If the points are chosen uniformly and independently at random from the d-dimensional unit hypercube, and the hyperrectangles are cubical and fall completely within the unit hypercube, then the average-case time for the parallel cell method is O((r + 1)·3^d·√n), where r is the average number of points contained in the largest hyperrectangle.

Proof. We assume that B is equal to the d-dimensional unit hypercube; for sufficiently large values of n_P, this is a reasonable assumption. Let w denote the width of the largest hyperrectangle. Then

  s(q) ≤ (⌈wm⌉ + 1)^d ≤ (⌈r^{1/d}⌉ + 1)^d,

where r = w^d·n_P is the average number of points contained in the largest hyperrectangle. For r^{1/d} ≤ 1, s(q) ≤ 2^d; otherwise, s(q) < (r^{1/d} + 2)^d < r·3^d. The average number of points per cell is O(1). □
4 A Worst-Case Efficient Algorithm

In this section, we present a worst-case efficient algorithm based on divide-and-conquer. Many mesh algorithms are based on divide-and-conquer; for example, Jeong and Lee [5] describe an algorithm for a two-dimensional multipoint location problem that is based on ideas similar to those used in our algorithm. Before we give the actual algorithm, let us briefly describe it for the two-dimensional case. (A sequential sketch of the recursion is given after the two procedures below.)

First, divide the input, P and Q, into two equal-sized parts, P_1 and Q_1, and P_2 and Q_2, such that each point in P_1 and the lower horizontal boundary of each hyperrectangle in Q_1 lie below every element in P_2 and Q_2; see Figure 3. Solve the corresponding subproblems recursively. We must now solve the problem for input P_2 and Q_1.

Figure 3: In this example, P_1 = {p_1}, Q_1 = {q_1, q_2, q_3}, P_2 = {p_2, p_3, p_4}, and Q_2 = {q_4}.

To this end, divide the input into two equal-sized parts, P_1 and Q_1, and P_2 and Q_2, such that each point in P_2 and the upper horizontal boundary of each hyperrectangle in Q_2 lie above every element in P_1 and Q_1; see Figure 4. Again, solve the corresponding subproblems recursively. It remains to solve the problem for input P_1 and Q_2. The dimension of this problem is, however, one less than that of the original problem. If the problem is one-dimensional, solving it directly is easy; otherwise, we again apply divide-and-conquer.

Algorithm: Parallel Divide-and-Conquer

Input: A set P of d-dimensional points and a set Q of d-dimensional hyperrectangles, distributed on a √n × √n mesh with at most one point or one hyperrectangle per processor. We index the mesh in shuffled row-major order, and we assume that √n = 2^k for some positive integer k.
Figure 4: In this example, P_1 = {p_2, p_3}, Q_1 = {q_2}, P_2 = {p_4}, and Q_2 = {q_1, q_3}.

Output: For each q ∈ Q, we store the points lying in the interior of q in the processor containing q.

1. Preprocessing: For each point p, create the record

  G_d(p) = [x_1(p), ..., x_d(p), a(p)],

where a(p) is called the address of p, i.e., a(p) is equal to the index of the processor containing p. Next, for each hyperrectangle q, create the record

  G_d(q) = [x_1^l(q), x_1^u(q), x_2^l(q), x_2^u(q), ..., x_d^l(q), x_d^u(q), i_d(q)],

where i_d(q) is the index of the processor containing q. Finally, sort all records into nondecreasing order with respect to the x_d-coordinate (G_d(q) records are sorted with respect to their x_d^l(q)-coordinate).

2. Call range_search(√n, √n, d). Procedure range_search (together with procedure range_search*) does the main part of the computations; both procedures are given below. The output from this step is, for each hyperrectangle q, a list of the addresses of the points contained in q. We store this list in the processor that contains the corresponding G_d(q) record.

3. Postprocessing: For each processor containing a G_d(q) record, move the point addresses stored in the processor to the processor that contains q, that is, to the processor with index i_d(q). Then, for each hyperrectangle q, process its list of point addresses: for each address in the list, copy the point stored at that address to the processor containing q.
procedure range_search(r, c, d)
for each submesh of size r × c do in parallel

1. if r = c = 1 then return;

2. if d = 1 then

(a) For each processor, determine the index of the next processor (in shuffled row-major order) that contains a G_1(p) record. Store the index in the local variable successor. (If no such processor exists, then successor = NIL.)

(b) For each record G_1(q) = [x_1^l(q), x_1^u(q), i_1(q)] do:

  k(q) = successor;
  while k(q) ≠ NIL do
    let G_1(p) = [x_1(p), a(p)] be the record in processor k(q);
    if x_1^u(q) > x_1(p) then
      copy a(p) to the processor containing G_1(q);
      k(q) = k(q).successor;
    else exit the while-loop;

3. if d > 1 then

(a) if r = c then call range_search(r/2, c, d).

(b) if r = c/2 then call range_search(r, c/2, d).

(c) For each G_d(p) record in the higher-indexed half of the submesh, create the record Ḡ_d(p) = G_d(p). For each G_d(q) record in the lower-indexed half of the submesh, create the record

  Ḡ_d(q) = [x_1^l(q), x_1^u(q), ..., x_{d−1}^l(q), x_{d−1}^u(q), x_d^u(q), j_d(q)],

where j_d(q) is the index of the processor containing the record. Finally, sort the Ḡ_d records into nondecreasing order with respect to the x_d-coordinate (Ḡ_d(q) records are sorted with respect to their x_d^u(q)-coordinate).

(d) Call range_search*(r, c, d).

(e) For each processor containing a Ḡ_d(q) record, move the point addresses stored in the processor during the call to range_search* to the processor with index j_d(q).

return
procedure range_search*(r, c, d)
for each submesh of size r × c do in parallel

1. if r = c = 1 then return;

2. if r = c then call range_search*(r/2, c, d).

3. if r = c/2 then call range_search*(r, c/2, d).

4. For each Ḡ_d(p) record in the lower-indexed half of the submesh, create the record

  G_{d−1}(p) = [x_1(p), x_2(p), ..., x_{d−1}(p), a(p)].

For each Ḡ_d(q) record in the higher-indexed half of the submesh, create the record

  G_{d−1}(q) = [x_1^l(q), x_1^u(q), ..., x_{d−1}^l(q), x_{d−1}^u(q), i_{d−1}(q)],

where i_{d−1}(q) is the index of the processor containing the record. Finally, sort the G_{d−1} records into nondecreasing order with respect to the x_{d−1}-coordinate (the G_{d−1}(q) records are sorted with respect to their x_{d−1}^l(q)-coordinate).

5. Call range_search(r, c, d − 1).

6. For each processor containing a G_{d−1}(q) record, move the point addresses stored in the processor during the call to range_search to the processor with index i_{d−1}(q).

return
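To clarify the recursive structure, here is a sequential analogue of range_search and range_search*. It is a sketch added for illustration, with invented record conventions; tie-breaking at equal coordinates is glossed over, and a count-based split stands in for the mesh's sorted record layout.

```python
# Hedged sketch (an addition): a sequential analogue of the recursion.
# Points are (coords, address) pairs; hyperrectangles are (lo, hi, address)
# triples. out maps each hyperrectangle's address to the addresses of the
# points it contains.
from collections import defaultdict

def split_half(pts, qs, pkey, qkey):
    """Sort points and hyperrectangles together on the given keys and cut
    the combined sequence in half, mimicking the sorted record layout."""
    recs = sorted([(pkey(p), 0, i) for i, p in enumerate(pts)] +
                  [(qkey(q), 1, i) for i, q in enumerate(qs)])
    half = len(recs) // 2
    def pick(part, t):
        src = pts if t == 0 else qs
        return [src[i] for _, tt, i in part if tt == t]
    return (pick(recs[:half], 0), pick(recs[:half], 1),
            pick(recs[half:], 0), pick(recs[half:], 1))

def range_search(pts, qs, d, out):
    if not pts or not qs:
        return
    if d == 1:
        for lo, hi, qa in qs:              # base case: direct interval report
            out[qa].extend(pa for c, pa in pts if lo[0] < c[0] < hi[0])
        return
    p1, q1, p2, q2 = split_half(pts, qs,
                                lambda p: p[0][d - 1],     # x_d(p)
                                lambda q: q[0][d - 1])     # x_d^l(q)
    range_search(p1, q1, d, out)
    range_search(p2, q2, d, out)
    range_search_star(p2, q1, d, out)      # bottoms of q1 lie below all of p2

def range_search_star(pts, qs, d, out):
    # Invariant: x_d^l(q) lies below x_d(p) for every pair considered here.
    if not pts or not qs:
        return
    p1, q1, p2, q2 = split_half(pts, qs,
                                lambda p: p[0][d - 1],     # x_d(p)
                                lambda q: q[1][d - 1])     # x_d^u(q)
    range_search_star(p1, q1, d, out)
    range_search_star(p2, q2, d, out)
    range_search(p1, q2, d - 1, out)       # coordinate d is fully resolved

out = defaultdict(list)
range_search([((0.5, 0.5), 'p0'), ((0.9, 0.1), 'p1')],
             [((0.0, 0.0), (1.0, 1.0), 'q0')], 2, out)
print(dict(out))                           # {'q0': ['p0', 'p1']}
```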
Theorem 2. The parallel divide-and-conquer method takes O((r + 1)·16^d·√n) time, where r is the maximum number of points contained in any hyperrectangle.

Proof. It is sufficient to show that the number of routing steps (i.e., transfers of one word of data between adjacent processors) is O((r + 1)·16^d·√n). In the Preprocessing step, records containing 2d + 1 words of data are sorted; the number of routing steps is thus O(d√n). The Postprocessing step requires r concurrent write operations and d·r concurrent read operations, which gives a total of O(d·r·√n) routing steps.

To bound the number of routing steps done in procedures range_search and range_search*, we first consider the number of routing steps required when r = 0. Let the numbers of routing steps done by procedures range_search and range_search* on a mesh of size 2^i × 2^j be denoted by R(i,j,d) and R*(i,j,d), respectively. We can easily see that R(k,k,1) is O(2^k): Step 2(a) is a variant of the prefix sum operation, and in Step 2(b) at most one concurrent read operation is required. Suppose now that d ≥ 2 and that k > 0; we then have the recurrence relations

  R(k,k,d) ≤ R(k−1,k,d) + 2d·R_s(k,k) + R*(k,k,d)

and

  R*(k,k,d) ≤ R*(k−1,k,d) + (2d−1)·R_s(k,k) + R(k,k,d−1),

where R_s(i,j) denotes the number of routing steps required to sort numbers lying (one number per processor) in a mesh of size 2^i × 2^j. By expanding the first term on the right-hand side of each inequality, we get

  R(k,k,d) ≤ 2d·Σ_{i=1}^{k} (R_s(i−1,i) + R_s(i,i)) + Σ_{i=1}^{k} (R*(i−1,i,d) + R*(i,i,d))
           ≤ 4d·Σ_{i=1}^{k} R_s(i,i) + 2·Σ_{i=1}^{k} R*(i,i,d)

and

  R*(k,k,d) ≤ 4d·Σ_{i=1}^{k} R_s(i,i) + 2·Σ_{i=1}^{k} R(i,i,d−1).

By inserting the last inequality into the bound for R(k,k,d), we obtain

  R(k,k,d) ≤ 4d·Σ_{i=1}^{k} R_s(i,i) + 8d·Σ_{i=1}^{k} (k+1−i)·R_s(i,i) + 4·Σ_{i=1}^{k} (k+1−i)·R(i,i,d−1)
           = R_ss(k,d) + 4·Σ_{i=1}^{k} (k+1−i)·R(i,i,d−1),

where we introduce R_ss(k,d) to denote the value of the sums that involve R_s. Expansion of this inequality gives us

  R(k,k,d) ≤ R_ss(k,d) + 4·Σ_{i_1=1}^{k} (k+1−i_1)·R_ss(i_1, d−1)
           + 4^2·Σ_{i_1=1}^{k} (k+1−i_1)·Σ_{i_2=1}^{i_1} (i_1+1−i_2)·R_ss(i_2, d−2) + ···
           + 4^{d−2}·Σ_{i_1=1}^{k} (k+1−i_1)·Σ_{i_2=1}^{i_1} (i_1+1−i_2) ··· Σ_{i_{d−2}=1}^{i_{d−3}} (i_{d−3}+1−i_{d−2})·R_ss(i_{d−2}, 2)
           + 4^{d−1}·Σ_{i_1=1}^{k} (k+1−i_1)·Σ_{i_2=1}^{i_1} (i_1+1−i_2) ··· Σ_{i_{d−1}=1}^{i_{d−2}} (i_{d−2}+1−i_{d−1})·R(i_{d−1}, i_{d−1}, 1).

To evaluate the right-hand side of this inequality, we use the fact that

  Σ_{i=1}^{k} (k+1−i)·2^i ≤ 4·2^k.
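This fact can be verified by a standard geometric-series argument (a verification added here, not spelled out in the original):

$$\sum_{i=1}^{k} (k+1-i)\,2^i \;=\; \sum_{j=1}^{k} j\,2^{\,k+1-j} \;=\; 2^{k+1}\sum_{j=1}^{k} \frac{j}{2^j} \;\le\; 2^{k+1}\cdot 2 \;=\; 4\cdot 2^k,$$

where the substitution j = k+1−i is used together with the identity $\sum_{j\ge 1} j/2^j = 2$.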
This implies that R_ss(k,d) is O(d·2^k). Moreover, by repeated use of this fact we see that R(k,k,d) is O(2^k · Σ_{i=0}^{d−1} (d−i)·16^i), which is O(16^d·2^k).

Let us finally consider how many additional routing steps are required when r > 0. Let the numbers of additional routing steps done by procedures range_search and range_search* on a mesh of size 2^i × 2^j be denoted by R⁺(i,j,d) and R*⁺(i,j,d), respectively. It is then easy to see that R⁺(k,k,1) is O(r·2^k). For d ≥ 2 and k > 0, we have the recurrence relations

  R⁺(k,k,d) ≤ R⁺(k−1,k,d) + R*⁺(k,k,d) + r·R_m(k,k)

and

  R*⁺(k,k,d) ≤ R*⁺(k−1,k,d) + R⁺(k,k,d−1) + r·R_m(k,k),

where R_m(i,j) denotes the number of routing steps required to move numbers (one number per processor) from one set of processors to another set of processors in a mesh of size 2^i × 2^j. These recurrence relations are similar to the previous ones. Together with the fact that R_m(k,k) is O(2^k), this implies that R⁺(k,k,d) is O(r·16^d·2^k). □

5 Experimental Evaluation

So far we have described and analyzed our algorithms at a theoretical level. To understand better how the algorithms work in practice, we have also implemented them on a MasPar MP-1 computer.

The MasPar MP-1 consists of an array control unit and a processor array. The array control unit controls the processor array and the interaction between the front-end computer and the processor array. In addition, the array control unit performs operations on scalar data. On the machine that we have access to, the processor array consists of a total of 16,384 processors arranged in a two-dimensional array of 128 rows and columns. Each processor in the processor array is a 1.8-MIPS processor with forty 32-bit registers and 16 or 64 kilobytes of RAM. Communication between two processors in the processor array can be via X-net or the Global Router, where X-net communications are restricted to be either horizontal, vertical, or diagonal. The Global Router allows communication between any pair of processors, but its efficiency is very data dependent: if many processors want to communicate with the same processor, the performance deteriorates dramatically.

The MasPar can be programmed in either MPL or Fortran, where MPL is based on ANSI C with extensions for data parallelism. Since MPL allows direct control over the machine, we have used MPL. As already mentioned in Section 2, MPL also offers some degree of flexibility, such as addressing autonomy.

In our implementation we have, as much as possible, used the library functions provided with MPL. More specifically, library functions have been used for the global sum operations in the parallel cell method.
We have consistently used the Global Router for concurrent read/write operations. In only one case has this turned out to be problematic: in Step 6(b) of the parallel cell method it may happen that many processors first copy the first and last values from the same processor, and then continue to copy point coordinates from the same processors. We avoid this by using randomization: with each hyperrectangle q is associated a random number r(q), such that 0 ≤ r(q) ≤ 1. This number is used to modify the order in which cells and points are processed by a hyperrectangle. (A sketch of this idea is given at the end of this section.)

In both algorithms we need to sort records of data. There is no library function for this, but in the parallel cell method we can sort by using a ranking function, and then the Global Router to move each record to its correct location. For each active processor, the ranking function computes the rank of the value of a local variable. This approach does not work in the parallel divide-and-conquer method, since we then need to sort records within submeshes simultaneously. We have instead implemented sorting routines based on bitonic sort [12].

We have evaluated the algorithms for two kinds of two-dimensional input data. For both kinds of input there are 8192 points and 8192 equal-sized hypersquares. For the uniform kind of input, points and squares are chosen at random from the unit square. For the diagonal kind of input, points and squares are chosen at random along the diagonal of the unit square (i.e., the diagonal of each square coincides with the diagonal of the unit square). The width of the squares is in each case chosen such that each square contains four points on average.

The running times (in milliseconds) are as follows. The running time of the parallel cell method is 55 ms for uniform input and 205 ms for diagonal input. For the parallel divide-and-conquer algorithm the corresponding running times are 1291 ms and 1207 ms. The parallel cell method is thus much faster than the parallel divide-and-conquer method for both kinds of input. Although the diagonal kind of input is not a worst-case input for the cell method, it is still fairly bad: on average we must test each square against at least 90 points.

A substantial part of the running time of the parallel divide-and-conquer algorithm is used for sorting records. The sorting algorithm that we have implemented is asymptotically optimal, but it is likely that a more careful implementation could make it run faster. It is thus possible that the running time of the parallel divide-and-conquer algorithm can be improved considerably.

To compare our algorithms with sequential algorithms for range searching, we have implemented the sequential cell method on our front-end machine, a DECstation 5000 (Model 200, 25 MHz). The parallel cell method is 13-15 times faster than the sequential cell method for both kinds of input. The speedup is thus not very impressive. Partly this is due to unavoidable communication costs. Another reason is that the front-end computer is more powerful in terms of floating-point operations: our measurements show that a point-inclusion test (that is, testing whether a point lies within a hyperrectangle) is about forty times faster on the front-end computer than on the MasPar MP-1.
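One way to realize the randomized processing order is to let each hyperrectangle start its enumeration of the s(q) intersected cells in Step 6(b) at a rotation determined by r(q). The sketch below is a guess at such a scheme, added for illustration; it is not the authors' MPL code, and the exact way r(q) modifies the order is not specified in the report.

```python
# Hedged sketch (an addition, not the authors' MPL code): de-synchronize
# the queries so that they do not all read from the same cell's processor
# at the same time. Each query starts its enumeration of the s(q)
# intersected cells at an offset derived from its random number r(q).
import random

def randomized_cell_order(s_total, rq):
    """Visit cell counters 1..s_total starting at a random rotation."""
    start = int(rq * s_total)                # rq lies in [0, 1)
    return [(start + t) % s_total + 1 for t in range(s_total)]

rq = random.random()
print(randomized_cell_order(6, rq))          # e.g. [4, 5, 6, 1, 2, 3]
```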
6 Conclusions

We have presented two algorithms for batched range searching on a mesh: one algorithm based on cell division, and another based on divide-and-conquer. The divide-and-conquer algorithm takes O((r + 1)·16^d·√n) time, where r is the maximum number of points contained in any hyperrectangle. We can show that if r is bounded by some constant independent of n, and the points contained in a hyperrectangle must be stored in the processor that initially contained the hyperrectangle, then any algorithm must take Ω(d(r + 1)√n) time in the worst case. For a fixed dimension d, the divide-and-conquer algorithm is thus worst-case optimal (within a multiplicative constant). The cell method takes O((d + r_max)·√n) time, where r_max = max{s(q) + d·t(q) : q ∈ Q}, s(q) is the number of cells intersected by hyperrectangle q, and t(q) is the number of points tested for inclusion in q. Thus, this method may take Ω(d·n·√n) time even when r = 0. However, as shown by Corollary 1 and our experimental results, the cell method may outperform the divide-and-conquer method in practice.

We require that both algorithms store copies of the points contained within a hyperrectangle in the processor that initially contained the hyperrectangle. An alternative would have been to design load-balanced algorithms that store the output evenly distributed over the processors. However, such algorithms would be forced to spend time on various load-balancing activities. At least for the applications that we consider, it is likely that such algorithms would be slower than the cell method.

References

[1] S.G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall International, London, UK, first edition, 1989.

[2] M.J. Atallah. Parallel techniques for computational geometry. Proceedings of the IEEE, 80(9):1435-1448, 1992.

[3] J.L. Bentley and J.H. Friedman. Data structures for range searching. Computing Surveys, 11:397-409, 1979.

[4] P-O. Fjallstrom, J. Petersson, L. Nilsson, and Z-H. Zhong. Evaluation of range searching methods for contact searching in mechanical engineering. To appear in International Journal of Computational Geometry & Applications.
[5] C.S. Jeong and D.T. Lee. Parallel geometric algorithms on a mesh-connected computer. Algorithmica, 5:155-177, 1990.

[6] M. Kumar and D.S. Hirschberg. An efficient implementation of Batcher's odd-even merge algorithm and its application in parallel sorting schemes. IEEE Transactions on Computers, C-32(3):254-264, March 1983.

[7] D. Nassimi and S. Sahni. Bitonic sort on a mesh-connected parallel computer. IEEE Transactions on Computers, C-28(1):2-7, January 1979.

[8] D. Nassimi and S. Sahni. Data broadcasting in SIMD computers. IEEE Transactions on Computers, C-30(2):101-107, February 1981.

[9] S-J. Oh and M. Suk. Parallel algorithms for geometric searching problems. In Proceedings of Supercomputing '89, pages 344-350, 1989.

[10] F.P. Preparata and M.I. Shamos. Computational Geometry: An Introduction. Springer-Verlag, New York, NY, second edition, 1985.

[11] R. Sedgewick. Algorithms. Addison-Wesley, Reading, MA, second edition, 1988.

[12] C.D. Thompson and H.T. Kung. Sorting on a mesh-connected parallel computer. Communications of the ACM, 20(4):263-271, 1977.