Radar Processing with the IBM Cell Broadband Engine



Multicore processors are dominating the computing scene, and have provided a way to keep improving performance while circumventing the power wall as well as the memory bandwidth wall. The IBM Cell Broadband Engine is a heterogeneous multicore processor which has been designed with high throughput in mind. This paper explores its applicability to data processing, and reports the performance results on some important RADAR processing algorithms. Initial results are very promising and highlight the disruptive potential of this technology.

Published in: Technology

RADAR PROCESSING WITH THE IBM CELL BROADBAND ENGINE

A. Corsaro, E. Giaccari, S. Nave, E. La Rosa, J. Derby, F. Casadei, A. Perciante
PrismTech, FINMECCANICA, Galileo Avionica, SELEX-SI, IBM, Quadrics, SELEX Communications

Keywords: Radar Processing, Multi-Core Processors, IBM Cell Broadband Engine, Performance Evaluation.

ABSTRACT

Multicore processors are dominating the computing scene, and have provided a way to keep improving performance while circumventing the power wall as well as the memory bandwidth wall. The IBM Cell Broadband Engine is a heterogeneous multicore processor which has been designed with high throughput in mind. This paper explores its applicability to data processing, and reports the performance results on some important RADAR processing algorithms. Initial results are very promising and highlight the disruptive potential of this technology.

1. INTRODUCTION

Over the past few decades we have witnessed the fulfilment of Moore's Law [1], which has resulted in a steady increase in processor computing power. These improvements were driven by the technological advances in microelectronics and miniaturization forecast by Gordon Moore. In the past few years, however, due to (1) the approaching limit on miniaturization, (2) the widening gap between microprocessor and memory speeds, and (3) the diminishing performance returns resulting from clock frequency increases, the steady growth in microprocessor performance seems to have reached a saturation point.

In order to overcome this performance wall, microprocessor architects have realized that, instead of going faster, a sensible approach is to use the chip area to exploit coarser-grained parallelism than that already provided by instruction-level and thread-level parallelism. This has led to an architectural innovation in contemporary processor architectures, which has resulted in the creation of multi-core microprocessors [3].

Multi-core architectures are creating the potential for a leap forward in the processing capability made available by a single chip. This has great potential for computationally hungry applications such as radar processing, image processing, computer vision, etc., both for obtaining unprecedented real-time performance and for implementing more elaborate algorithms.

The goal of this paper is to highlight the potential of multi-core architectures for radar processing as well as to report the performance results obtained for some key radar algorithms.

As discussed in the paper, among the multi-core processors currently available on the market, we decided to focus our attention on the IBM Cell Broadband Engine (CBE) [4], as (1) its architectural features closely match the needs of data processing algorithms, (2) its impressive peak performance is unmatched, and has the potential for enabling disruptive innovation in real-time data processing, and (3) it provides a very good performance per watt ratio.

CBE performance was evaluated by defining an application benchmark, as well as developing a series of synthetic micro-benchmarks. As benchmark definition is often controversial, our approach in defining the application benchmark to evaluate the CBE was rather pragmatic. We took into consideration two algorithms: the Rotational Motion Compensation (RMC), a fundamental building block for all airborne real-time Synthetic Aperture Radar (SAR) imaging; and the Space Time Adaptive Processing (STAP), the holy grail of radar analysts. Other than being extremely relevant for our application domain, both algorithms are computationally and memory bandwidth intensive, and are thus excellent candidates for stressing the strengths of a processor. For both algorithms we also had an existing implementation, which helped us compare results as well as speedups.

Our initial experience, detailed in the remainder of this paper, has shown that multi-core processors such as the CBE can provide speedups, on radar processing algorithms, of 20 to 30 times when compared with the technology typically used today, such as PPC G4 or TigerSHARC DSP, while at the same time allowing a reduction in volume and power
roughly by an order of magnitude. Moreover, what really struck us was that these improvements, once the correct application partitioning is devised, are gained without too much programming effort, and in relatively little time.

The remainder of the paper is organized as follows: Section 2 provides an overview of the IBM CBE; Section 3 describes the application benchmarks; Section 4 reports the performance results of the selected application benchmarks; finally, Section 5 describes future work and concluding remarks.

2. IBM CELL BROADBAND ENGINE (CBE)

Architectural Overview. The IBM Cell Broadband Engine is a heterogeneous multicore processor that, as shown in Figure 1, is composed of eight Synergistic Processing Elements (SPEs) and one 64-bit Power Processing Element (PPE). SPEs are 128-bit processors with a SIMD-RISC [2] instruction set and a unified register file of 128 registers, each 128 bits wide. The PPE is a 64-bit processor based on the PPC 970 architecture.

These elements are interconnected to each other and to the main memory by a ring-based bus, namely the Element Interconnect Bus (EIB), capable of carrying up to 96 bytes per cycle. The PPE is capable of addressing the main memory directly; all the other elements, i.e., the SPEs, access the main memory through DMA. Each SPE is equipped with a Local Store (LS) of 256 KByte, which is used to store data and code and is under the control of the programmer.

The CBE is able to deliver more than 200 GFLOPS when operating on single precision floating point, at a power of roughly 70 W, which works out to roughly 3 GFLOPS per watt.

Figure 1 – IBM Cell Broadband Engine Architecture.

Programming the CBE. The CBE architecture has been driven by the requirements of a wide set of application domains such as computer gaming, multimedia stream processing, computer vision, data and signal processing, etc. As a result, although it might at first look harder to grasp and program, it fits applications typical of these domains very well, leading to natural application design, partitioning, and implementation. The architectural choices at its foundation have traded ease of programming for performance and power efficiency. As a result, the CBE is a multicore processor that, to exploit its maximum potential, has to be programmed like a distributed system rather than like a multi-threaded system, as is done with homogeneous multi-core processors.

However, as we will see in the remainder of the paper, the learning curve is not so steep, and it is possible to get up to speed with programming the CBE in a relatively short amount of time.

3. RADAR PROCESSING BENCHMARK

Benchmark definition is often controversial, as it takes a good blend of art and science to define an objective benchmark. Our approach in defining the application benchmark to evaluate the CBE is rather pragmatic. We took into consideration two algorithms: the Rotational Motion Compensation (RMC), a fundamental building block for all airborne radars; and the Space Time Adaptive Processing (STAP), the holy grail of radar analysts. Other than being extremely relevant for our application domain, both algorithms are computationally and memory bandwidth intensive. For both algorithms we also had existing implementations, optimized and tuned over the years on top-of-the-class DSPs and microprocessors, which helped us compare results as well as speedups. In the remainder of this section we provide a brief description of the two algorithms.

3.1. ROTATIONAL MOTION COMPENSATION

The Rotational Motion Compensation (RMC) algorithm is typically used to remove the distortion that aircraft motion induces in the measurements. Conceptually this algorithm is rather simple, as it can be decomposed into FFT, IFFT, FFT-SHIFT and IFFT-SHIFT operations performed over either the rows or the columns of a complex matrix. In detail, given an (N, M) complex matrix C, the algorithm is composed of the following steps:

  1. For each row of C perform an FFT and SHIFT the elements by M/2.
  2. For every column k of C between 1 and M, extract the sub-matrix (N, 2K) centered on the kth
     column. Call this matrix S.
  3. For each row of S perform the following operations:
       i. IFFT
       ii. SHIFT by K/2 elements
  4. For each of the columns of S perform the following operations:
       i. FFT
       ii. Vector Product with a fixed vector
       iii. IFFT
       iv. Accumulate all the S' matrix columns
       v. Substitute the kth column of C with the sum obtained at the previous point
  5. Multiply each column of the resulting (N, M) matrix by a constant scalar.
  6. For each row of the resulting matrix perform the following operations:
       i. IFFT
       ii. SHIFT elements by M/2

The complexity of the algorithm depends on the size of the matrix, which is defined by the pair (N, M), and on the shift factor K.

3.2. SPACE TIME ADAPTIVE PROCESSING

The Space Time Adaptive Processing (STAP) is a signal processing technique commonly used in radar systems. It involves adaptive array processing algorithms to aid in target detection. Radar signal processing benefits from STAP in areas where interference is a problem (e.g., ground clutter, jamming, etc.). Through careful application of STAP, it is possible to achieve order-of-magnitude sensitivity improvements in target detection.

STAP involves a two-dimensional filtering technique using a phased array antenna with multiple spatial channels. By applying the statistics of the interference environment, an adaptive STAP weight vector is formed. This weight vector is applied to the coherent samples received by the radar.

From a numerical perspective, determining the STAP filter vector requires, among other things, solving a linear system. Once the problem space is fixed, the linear system solution is the computation that dominates the execution time of the algorithm. In our STAP implementation we relied on the Cholesky decomposition for positive definite complex matrices, which efficiently computes a factorization that then requires a forward and a backward substitution in order to find the linear system solution.

4. EMPIRICAL EVALUATION

Testbed Setup. The testbed on which we evaluated the performance of RMC and STAP consisted of:

  • Dual Cell Blade QS20, 3.2GHz with 1GB SDR running Linux Fedora Core 5 with Cell SDK 2.0
  • TigerSHARC DSP 500MHz
  • MPC 7457 featuring a PPC Power G4

4.1. RMC RESULTS

Execution Time. The RMC was carefully coded to fully exploit data parallelism as well as processing parallelism. Then, the execution time was measured when running the algorithm for a 2048x1024 matrix with K=64.

Figure 2 – RMC execution time vs. SPU number.

Figure 2 shows how the RMC's execution time depends on the number of SPUs on which the computation is parallelized. As can easily be seen from the figure, the measured execution time scales practically linearly within a single CBE chip, i.e., going from 1 to 8 SPUs. When relying on two CBEs, and thus using up to 16 SPUs, the scaling is sublinear but still very good, especially if we consider that the application was not optimized for the dual Cell configuration. By tuning the application for the dual CBE configuration and minimizing the inter-CBE communication, we are confident that the latencies that lead to sub-linear scaling could be completely hidden.

Speedup. The execution time of an RMC algorithm coded
and optimized for a TigerSHARC was evaluated and compared with that of the CBE counterpart. The speedup was measured and is reported in Figure 3. This figure shows how a single SPU is almost 4 times faster than a TigerSHARC DSP, while exploiting the full power of the CBE (8 SPUs) leads to a 26x speedup. The dual CBE configuration provides a 40x speedup, which, as discussed above, could be further improved by making the application dual-CBE aware.

Figure 3 – CBE/TigerSHARC speedup.

4.2. STAP RESULTS

Execution Time. As explained earlier in the paper, the dominant portion of the STAP algorithm is the linear system solution. The problem with this is that algorithms for solving linear systems have many control and data dependencies that limit the amount of computation that can be carried out in parallel. The implementation of the Cholesky decomposition we crafted for the CBE was very careful in extracting all the available parallelism, especially at the data level.

Figure 4 shows the normalized execution time of the STAP for covariance matrices of size ranging from (100, 100) to (1024, 1024). As can be seen from the graph, the execution time scales linearly when the number of SPUs does not exceed 3. With more than 3 SPUs the system saturates and the execution time scaling flattens out. Measurements of the used memory bandwidth revealed that the limited speedup when more than 3 SPUs are used was due to memory bandwidth saturation.

Figure 4 – STAP Normalized Execution Time.

Figure 5 reports the slowdown percentage with respect to the linear scaling that would ideally be desirable. From Figure 5 it is easy to see how the highest slowdowns are experienced with large matrices and with more than 4 SPUs.

Figure 5 – STAP slowdown w.r.t. the ideal execution time.

Speedup. We evaluated the performance of the STAP algorithm against an MPC 7457 board featuring a PPC Power G4. Figure 6 shows the speedup experienced for matrices of sizes ranging from (100, 100) to (1200, 1200), when using 1, 4, and 7 SPUs respectively. The results show that the speedup consistently increases with the size of the matrix; as an example, a single SPU is 3x faster than a G4 for (100, 100) matrices, but 10x faster for (1200, 1200). Moreover, the speedup can be as much as 30x.

Figure 6 – CBE Speedup over PPC Power G4 (number of rows scaled by 100). [Chart: performance ratio CELL / MPC 7457 versus number of rows, for CELL with 1, 4, and 7 SPUs.] When the matrix size increases, the performance ratio between the CELL and the MPC 7457 increases too, until the CELL saturates its memory bandwidth.
5. CONCLUDING REMARKS

The IBM Cell Broadband Engine is a new multicore processor that has been applied with great success in the context of game consoles such as the PlayStation 3. Its architecture fits very well with the kind of workloads, as well as the computational structure of problems, common in data and signal processing, thus making this processor an ideal solution for applications in this domain. Initial benchmarking results shown in this paper confirm that the level of performance that can be achieved with the CBE is disruptive when compared with the technology commonly used today.

REFERENCES

[1] G. E. Moore, "Cramming More Components onto Integrated Circuits", Electronics, vol. 38, n. 8, April 1965.
[2] J. L. Hennessy, D. A. Patterson, "Computer Architecture: A Quantitative Approach", 4th ed., Morgan Kaufmann, 2006.
[3] J. L. Hennessy, D. A. Patterson, "A Conversation with John Hennessy and David Patterson", ACM Queue, vol. 4, n. 10, January 2007.
[4] IBM Cell Project, http://www.research.ibm.com/cell/