Radar Processing with the IBM Cell Broadband Engine
A. Corsaro, E. Giaccari, S. Nave, E. La Rosa, J. Derby, F. Casadei, A. Perciante
PrismTech, FINMECCANICA, Galileo Avionica, SELEX-SI, IBM, Quadrics, SELEX Communications
Keywords: Radar Processing, Multi-Core Processors, IBM Cell Broadband Engine, Performance Evaluation.

ABSTRACT

Multicore processors are dominating the scene of computing, and have provided a way to keep improving performance while circumventing the power wall as well as the memory bandwidth wall. The IBM Cell Broadband Engine is a heterogeneous multicore which has been designed with high throughput in mind. This paper explores its applicability to data processing, and reports the performance results on some important RADAR processing algorithms. Initial results are very promising and highlight the disruptive potential of this technology.

1. INTRODUCTION

In the past few decades we have been experiencing the prophecy of Moore's Law [1], which has resulted in a steady increase in processor computing power. These improvements were boosted by the technological advances in microelectronics and miniaturization forecast by Gordon Moore. In the past few years, however, due to (1) the approaching limit on miniaturization, (2) the widening gap between microprocessor and memory speeds, and (3) the diminishing performance returns resulting from clock frequency increases, the steady growth in microprocessor performance seems to have reached a saturation point.

In order to overcome this performance wall, microprocessor architects have realized that instead of going faster, a sensible approach was to use the chip area to exploit coarser parallelism than what is already provided by instruction level parallelism and thread level parallelism. This has led to an architectural innovation in contemporary processor architectures which has resulted in the creation of multi-core microprocessors [3].

Multi-core architectures are creating the potential for a leap forward in the processing capability made available by a single chip. This has great potential for computationally hungry applications such as radar processing, image processing, computer vision, etc., for obtaining unprecedented real-time performance as well as implementing more elaborate algorithms.

The goal of this paper is to highlight the potential of multi-core architectures for radar processing as well as to report the performance results obtained for some key radar algorithms.

As discussed in the paper, among the multi-core processors currently available on the market, we decided to focus our attention on the IBM Cell Broadband Engine (CBE) [4], as (1) its architectural features closely match the needs of data processing algorithms, (2) its impressive peak performance is unmatched, and has the potential for enabling disruptive innovation in real-time data processing, and (3) it provides a very good performance per watt ratio.

CBE performance was evaluated by defining an application benchmark, as well as developing a series of synthetic micro-benchmarks. As benchmark definition is often controversial, our approach in defining the application benchmark to evaluate the CBE was rather pragmatic. We took under consideration two algorithms: the Rotational Motion Compensation (RMC), a fundamental building block for all airborne real-time Synthetic Aperture Radar (SAR) imaging, and the Space Time Adaptive Processing (STAP), the holy grail of radar analysts. Other than being extremely relevant for our application domain, both algorithms are computationally and memory bandwidth intensive, and are thus excellent candidates for stressing the strengths of a processor. For both algorithms we also had an existing implementation which helped us compare results as well as speedups.

Our initial experience, detailed in the remainder of this paper, has shown that multi-core processors such as the CBE can provide speedups, on radar processing algorithms, of between 20 and 30 times when compared with the technology typically used today, such as PPC G4 or TigerSHARC DSP, while at the same time allowing a reduction in volume and power of roughly an order of magnitude. Moreover, what really struck us was that these improvements, once the correct application partitioning is devised, are gained without too much programming effort, and in relatively little time.

The remainder of the paper is organized as follows: Section 2 provides an overview of the IBM CBE; Section 3 describes the application benchmarks; Section 4 reports the performance results of the selected application benchmarks; finally, Section 5 describes future work and concluding remarks.

2. IBM CELL BROADBAND ENGINE (CBE)

Figure 1 – IBM Cell Broadband Engine Architecture.

Architectural Overview. The IBM Cell Broadband Engine is a heterogeneous multicore processor that, as shown in Figure 1, is composed of eight Synergistic Processing Elements (SPEs) and one 64-bit Power Processing Element (PPE). SPEs are 128-bit processors with a SIMD-RISC [2] instruction set and a unified register file of 128 registers, each of which is 128 bits wide. The PPE is a 64-bit processor based on the PPC 970 architecture.

These elements are interconnected to each other and to the main memory by a ring-based bus, namely the Element Interconnect Bus (EIB), capable of carrying up to 96 bytes per cycle. The PPE is capable of addressing the main memory directly; all the other elements, i.e., the SPEs, access the main memory through DMA. SPEs are equipped with a Local Store (LS) of 256 KByte which is used to store data and code, and is under control of the programmer.

The CBE is able to deliver more than 200 GFLOPS when operating on single precision floating point, at a power of roughly 70 W, providing a remarkable GFLOPS/Watt index (roughly 3 GFLOPS per watt).

Programming the CBE. The CBE architecture has been driven by the requirements of a wide set of application domains such as computer gaming, multimedia stream processing, computer vision, data and signal processing, etc. As a result, although it might look at first harder to grasp and program, it fits applications typical of these domains very well, leading to natural application design, partitioning, and implementation. The architectural choices at its foundation have traded ease of programming for performance and power efficiency. As a result, the CBE is a multicore that, in order to exploit its maximum potential, has to be programmed like a distributed system rather than like a multi-threaded system, as is done with homogeneous multi-core processors.

However, as we will see in the remainder of the paper, the learning curve is not so steep, and it is possible to get up to speed with programming the CBE in a relatively short amount of time.

3. RADAR PROCESSING BENCHMARK

Benchmark definition is often controversial, as it takes a good blend of art and science to define an objective benchmark. Our approach in defining the application benchmark to evaluate the CBE is rather pragmatic. We took under consideration two algorithms: the Rotational Motion Compensation (RMC), a fundamental building block for all airborne radars, and the Space Time Adaptive Processing (STAP), the holy grail of radar analysts.

Other than being extremely relevant for our application domain, both algorithms are computationally and memory bandwidth intensive. For both algorithms we also had existing implementations, optimized and tuned over the years on top-of-the-class DSPs and microprocessors, which helped us compare results as well as speedups. In the remainder of this Section we provide a brief description of the two algorithms.

3.1. ROTATIONAL MOTION COMPENSATION

The Rotational Motion Compensation (RMC) algorithm is typically used to remove the distortion induced in the measurement by the movement of the airplane. Conceptually this algorithm is rather simple, as it can be decomposed into FFT, IFFT, FFT-SHIFT and IFFT-SHIFT operations performed over either the rows or the columns of a complex matrix. In detail, given an (N, M) complex matrix C, the algorithm is composed of the following steps:

1. For each row of C perform an FFT and SHIFT the elements by M/2
2. For every column k of C between 1 and M, extract the sub-matrix (N, 2K) centered on the kth column. Call this matrix S
3. For each row of S perform the following operations:
   i. IFFT
   ii. SHIFT by K/2 elements
4. For each of the columns of S perform the following operations:
   i. FFT
   ii. Vector Product with a fixed vector
   iii. IFFT
   iv. Accumulate all the S’ matrix columns
   v. Substitute the kth column of C with the sum obtained at the previous point
5. Multiply each column of the resulting (N, M) matrix with a constant scalar.
6. For each row of the resulting matrix perform the following operations:
   i. IFFT
   ii. SHIFT elements by M/2

The complexity of the algorithm depends on the size of the matrix, which is defined by the couple (N, M), and on the shift factor K.

3.2. SPACE TIME ADAPTIVE PROCESSING

The Space Time Adaptive Processing (STAP) is a signal processing technique commonly used in radar systems. It involves adaptive array processing algorithms to aid in target detection. Radar signal processing benefits from STAP in areas where interference is a problem (e.g., ground clutter, jamming, etc.). Through careful application of STAP, it is possible to achieve order-of-magnitude sensitivity improvements in target detection.

STAP involves a two-dimensional filtering technique using a phased array antenna with multiple spatial channels. Applying the statistics of the interference environment, an adaptive STAP weight vector is formed. This weight vector is applied to the coherent samples received by the radar. From a numerical perspective, determining the STAP filter vector requires, among other things, solving a linear system. For a fixed problem space, the linear system solution is the computation that dominates the execution time of the algorithm. In our STAP implementation we relied on the Cholesky decomposition for positive definite complex matrices, which efficiently computes a factorization that then requires only a forward and a backward substitution in order to find the linear system solution.

4. EMPIRICAL EVALUATION

Testbed Setup. The testbed on which we evaluated the performance of RMC and STAP consisted of:

• Dual Cell Blade QS20, 3.2GHz with 1GB SDR running Linux Fedora Core 5 with Cell SDK 2.0
• TigerSHARC DSP 500MHz
• MPC 7457 featuring a PPC Power G4

4.1. RMC RESULTS

Execution Time. The RMC was carefully coded to fully exploit data parallelism as well as processing parallelism. Then, the execution time was measured when running the algorithm for a 2048x1024 matrix with K=64.

Figure 2 – RMC execution time vs. SPU number.

Figure 2 shows how the RMC’s execution time depends on the number of SPUs on which the computation is parallelized. As can easily be seen from the figure, the measured execution time scales practically linearly within a single CBE chip, i.e., going from 1 to 8 SPUs. When relying on two CBEs, and thus using up to 16 SPUs, the scaling is sublinear, but still very good, especially if we consider that the application was not optimized for the dual cell configuration. By tuning the application for the dual CBE configuration, and minimizing the inter-CBE communication, we are confident that the latencies that lead to sub-linear scaling could be completely hidden.

Speedup. The execution time of an RMC algorithm coded and optimized for a TigerSHARC was evaluated and compared with that of the CBE counterpart. The speedup was measured and is reported in Figure 3. This figure shows how a single SPU is almost 4 times faster than a TigerSHARC DSP, while exploiting the full power of the CBE (8 SPUs) leads to a 26x speedup. The dual CBE configuration provides a 40x speedup, which, as discussed above, could be further improved by making the application dual-CBE aware.
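
To make the six RMC steps listed in Section 3.1 concrete, the following is a minimal, sequential Python sketch of the data flow. It is not the SPE implementation evaluated here. Several points are assumptions made only for illustration: the function and parameter names (`rmc_sketch`, `ref_vector`, `scale`) are hypothetical, the "vector product" of step 4 is read as an element-wise product, the sub-matrix of step 2 wraps around at the matrix edges, the results are written to a separate output matrix rather than in place, and a naive DFT stands in for an optimized FFT.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for j in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * j * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def shift(x, s):
    """Circularly shift a list to the right by s positions."""
    s %= len(x)
    return x[-s:] + x[:-s]


def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def rmc_sketch(C, K, ref_vector, scale=1.0):
    """Apply the six RMC steps to an (N, M) complex matrix C."""
    M = len(C[0])
    # Step 1: FFT of every row, then shift by M/2.
    A = [shift(dft(row), M // 2) for row in C]

    out_cols = []
    for k in range(M):
        # Step 2: (N, 2K) sub-matrix around column k.
        # Assumption: column indices wrap around at the edges.
        cols = [(k - K + j) % M for j in range(2 * K)]
        S = [[row[c] for c in cols] for row in A]
        # Step 3: IFFT of every row of S, then shift by K/2.
        S = [shift(dft(row, inverse=True), K // 2) for row in S]
        # Step 4 (i-iii): FFT, element-wise product, IFFT per column.
        processed = []
        for col in transpose(S):
            weighted = [a * b for a, b in zip(dft(col), ref_vector)]
            processed.append(dft(weighted, inverse=True))
        # Step 4 (iv-v): sum the processed columns; the sum becomes
        # column k of the output matrix.
        out_cols.append([sum(vals) for vals in zip(*processed)])

    out = transpose(out_cols)
    # Step 5: multiply by a constant scalar.
    out = [[scale * v for v in row] for row in out]
    # Step 6: IFFT of every row, then shift by M/2.
    return [shift(dft(row, inverse=True), M // 2) for row in out]
```

In this reading, each of the M iterations of the column loop touches only its own (N, 2K) window and writes only its own output column, which is the kind of independence that makes the algorithm a natural fit for partitioning across SPUs.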

Figure 3 – CBE/TigerSHARC speedup.

Figure 4 – STAP Normalized Execution Time

4.2. STAP RESULTS

Execution Time. As explained earlier in the paper, the dominant portion of the STAP algorithm is the linear system solution. The problem with this is that algorithms for solving linear systems have many control and data dependencies that limit the amount of computation that can be carried out in parallel. The implementation of the Cholesky decomposition we crafted for the CBE was very careful in extracting all the available parallelism, especially at the data level.

Figure 4 shows the normalized execution time for the STAP for covariance matrices of size ranging from (100, 100) to (1024, 1024). As can be seen from the graph, the execution time scales linearly when the number of SPUs does not exceed 3. With more than 3 SPUs the system saturates and the execution time scaling flattens out. Measurements of the used memory bandwidth revealed that the limited speedup when more than 3 SPUs are used was due to memory bandwidth saturation.

Figure 5 – STAP slowdown w.r.t. the ideal execution time

Figure 5 reports the slowdown percentage with respect to the linear scaling that would ideally be desirable. From Figure 5 it is easy to see how the highest slowdowns are experienced with large matrices and with more than 4 SPUs.

Figure 6 – CBE Speedup over PPC Power G4 (number of rows scaled by 100).
Figure 6 axes: performance ratio (Cell vs. MPC 7457) against number of rows; series for Cell with 1, 4, and 7 SPUs. Figure annotation: as the matrix size increases, the performance ratio between the Cell and the G4 board increases too, until memory bandwidth saturates.

Speedup. We evaluated the performance of the STAP algorithm over an MPC 7457 board featuring a PPC Power G4. Figure 6 shows the speedup experienced for matrices of sizes ranging from (100, 100) to (1200, 1200), when using 1, 4, and 7 SPUs respectively. The results show that the speedup consistently increases with the size of the matrix; as an example, a single SPU is 3x faster than a G4 for (100, 100) matrices, but 10x faster for (1200, 1200) matrices. Moreover, the speedup can be as much as 30x.
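
The linear system solution that dominates STAP follows the pattern described in Section 3.2: a Cholesky factorization of a positive definite complex matrix, followed by a forward and a backward substitution. The sketch below shows that pattern in plain, sequential Python. It is only an illustration of the numerical steps, not the data-parallel SPE code measured above; the function names `cholesky` and `solve` are hypothetical.

```python
import math


def cholesky(A):
    """Return lower-triangular L such that A = L * L^H.

    A must be Hermitian positive definite (list of lists).
    """
    n = len(A)
    L = [[0j] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: real and strictly positive for a valid input.
        d = A[j][j].real - sum(abs(L[j][k]) ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = complex(math.sqrt(d))
        # Entries below the diagonal in column j. Each depends on the
        # columns already computed, which is the data dependency that
        # limits parallelism.
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k].conjugate() for k in range(j))
            L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def solve(A, b):
    """Solve A x = b via Cholesky plus forward/backward substitution."""
    L = cholesky(A)
    n = len(b)
    # Forward substitution: L y = b.
    y = [0j] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    # Backward substitution: L^H x = y, where (L^H)[i][k] = conj(L[k][i]).
    x = [0j] * n
    for i in reversed(range(n)):
        s = sum(L[k][i].conjugate() * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x
```

Column j of the factor cannot be computed before columns 0 to j-1, and each substitution step needs all previously solved unknowns. This gives a small-scale view of the control and data dependencies that the Execution Time paragraph refers to.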

5. CONCLUDING REMARKS

The IBM Cell Broadband Engine is a new multicore processor that has been applied with great success in the context of game consoles such as the PlayStation 3. Its architecture fits very well with the kind of workloads, as well as the computational structure of problems, common in data and signal processing, thus making this processor an ideal solution for applications in this domain. Initial benchmarking results shown in this paper confirm that the level of performance that can be achieved with the CBE is disruptive when compared with the technology commonly used today.

REFERENCES

[1] G. E. Moore, “Cramming More Components onto Integrated Circuits”, Electronics, vol. 38, n. 8, April 1965.
[2] J. L. Hennessy, D. A. Patterson, “Computer Architecture: A Quantitative Approach”, 4th ed., Morgan Kaufmann, 2006.
[3] J. L. Hennessy, D. A. Patterson, “A Conversation with John Hennessy and David Patterson”, in ACM Queue vol. 4, n. 10, January 2007.
[4] IBM Cell Project, http://www.research.ibm.com/cell/