RADAR PROCESSING WITH THE IBM CELL BROADBAND ENGINE




     A. Corsaro,        E. Giaccari,  S. Nave,               E. La Rosa    J. Derby    F. Casadei,        A. Perciante
     PrismTech       FINMECCANICA Galileo Avionica           SELEX-SI        IBM        Quadrics     SELEX Communications




Keywords: Radar Processing, Multi-Core Processors, IBM Cell Broadband
Engine, Performance Evaluation.

ABSTRACT
Multicore processors are dominating the computing scene, and have provided
a way to keep improving performance while circumventing the power wall as
well as the memory bandwidth wall. The IBM Cell Broadband Engine is a
heterogeneous multicore processor which has been designed with high
throughput in mind. This paper explores its applicability to data
processing, and reports the performance results on some important radar
processing algorithms. Initial results are very promising and highlight the
disruptive potential of this technology.

1.     INTRODUCTION
In the past few decades we have been experiencing the fulfilment of Moore's
Law [1], which has resulted in a steady increase in processor computing
power. These improvements were boosted by the technological advances in
microelectronics and miniaturization forecast by Gordon Moore. In the past
few years, however, due to (1) the approaching limit on miniaturization,
(2) the widening gap between microprocessor and memory speeds, and (3) the
diminishing performance returns resulting from clock frequency increases,
the steady growth in microprocessor performance seems to have reached a
saturation point.
In order to overcome this performance wall, microprocessor architects have
realized that, instead of going faster, a sensible approach is to use the
chip area to exploit coarser parallelism than what is already provided by
instruction level parallelism and thread level parallelism. This has led to
an architectural innovation in contemporary processor architectures which
has resulted in the creation of multi-core microprocessors [3].
Multi-core architectures are creating the potential for a leap forward in
the processing capability made available by a single chip. This has great
potential for computationally hungry applications such as radar processing,
image processing, computer vision, etc., both for obtaining unprecedented
real-time performance and for implementing more elaborate algorithms.
The goal of this paper is to highlight the potential of multi-core
architectures for radar processing as well as to report the performance
results obtained for some key radar algorithms.
As discussed in the paper, among the multi-core processors currently
available on the market, we decided to focus our attention on the IBM Cell
Broadband Engine (CBE) [4], as (1) its architectural features closely match
the needs of data processing algorithms, (2) its impressive peak
performance is unmatched, and has the potential for enabling disruptive
innovation in real-time data processing, and (3) it provides a very good
performance per watt ratio.
CBE performance was evaluated by defining an application benchmark, as well
as developing a series of synthetic micro-benchmarks. As benchmark
definition is often controversial, our approach in defining the application
benchmark to evaluate the CBE was rather pragmatic. We took into
consideration two algorithms: the Rotational Motion Compensation (RMC), a
fundamental building block for all airborne real-time Synthetic Aperture
Radar (SAR) imaging, and the Space Time Adaptive Processing (STAP), the
holy grail of radar analysts. Other than being extremely relevant for our
application domain, both algorithms are computationally and memory
bandwidth intensive, and are thus excellent candidates for stressing the
strengths of a processor. For both algorithms we also had an existing
implementation, which helped us compare results as well as speedups.
Our initial experience, detailed in the remainder of this paper, has shown
that multi-core processors such as the CBE can provide speedups, on radar
processing algorithms, of 20 to 30 times when compared with the technology
typically used today, such as the PPC G4 or the TigerSHARC DSP, while at
the same time allowing a reduction in volume and power
roughly by an order of magnitude. Moreover, what really struck us was that
these improvements, once the correct application partitioning is devised,
are gained without too much programming effort, and in relatively little
time.
The remainder of the paper is organized as follows: Section 2 provides an
overview of the IBM CBE; Section 3 describes the application benchmarks;
Section 4 reports the performance results of the selected application
benchmarks; finally, Section 5 describes future work and concluding
remarks.



2.     IBM CELL BROADBAND ENGINE (CBE)
Architectural Overview. The IBM Cell Broadband Engine is a heterogeneous
multicore processor that, as shown in Figure 1, is composed of eight
Synergistic Processing Elements (SPEs) and one 64-bit Power Processing
Element (PPE). SPEs are 128-bit processors with a SIMD-RISC [2] instruction
set and a unified register file of 128 registers, each of which is 128 bits
wide. The PPE is a 64-bit processor based on the PPC 970 architecture.

            Figure 1 – IBM Cell Broadband Engine Architecture.

These elements are interconnected to each other and to the main memory by a
ring-based bus, the Element Interconnect Bus (EIB), capable of carrying up
to 96 bytes per cycle. The PPE can address the main memory directly; all
the other elements, i.e., the SPEs, access the main memory through DMA.
Each SPE is equipped with a 256 KB Local Store (LS), which holds both data
and code and is under the control of the programmer.
The CBE is able to deliver more than 200 GFLOPS when operating on single
precision floating point, at a power of roughly 70 W, i.e., close to
3 GFLOPS per watt.
Programming the CBE. The CBE architecture has been driven by the
requirements of a wide set of application domains such as computer gaming,
multimedia stream processing, computer vision, data and signal processing,
etc. As a result, although it might at first look harder to grasp and
program, it fits applications typical of these domains very well, leading
to natural application design, partitioning, and implementation. The
architectural choices at its foundation have traded ease of programming for
performance and power efficiency. As a result, the CBE is a multicore
processor that, in order to exploit its maximum potential, has to be
programmed like a distributed system rather than like a multi-threaded
system, as is done for homogeneous multi-core processors.
However, as we will see in the remainder of the paper, the learning curve
is not so steep, and it is possible to get up to speed with programming the
CBE in a relatively short amount of time.

3.      RADAR PROCESSING BENCHMARK
Benchmark definition is often controversial, as it takes a good blend of
art and science to define an objective benchmark. Our approach in defining
the application benchmark to evaluate the CBE is rather pragmatic. We took
into consideration two algorithms: the Rotational Motion Compensation
(RMC), a fundamental building block for all airborne radars, and the Space
Time Adaptive Processing (STAP), the holy grail of radar analysts.
Other than being extremely relevant for our application domain, both
algorithms are computationally and memory bandwidth intensive. For both
algorithms we also had existing implementations, optimized and tuned over
the years on top-of-the-class DSPs and microprocessors, which helped us
compare results as well as speedups. In the remainder of this section we
provide a brief description of the two algorithms.

3.1.    ROTATIONAL MOTION COMPENSATION
The Rotational Motion Compensation (RMC) algorithm is typically used to
remove the distortion induced in the measurement by the airplane movement.
Conceptually this algorithm is rather simple, as it can be decomposed into
FFT, IFFT, FFT-SHIFT and IFFT-SHIFT operations performed over either the
rows or the columns of a complex matrix. In detail, given an (N, M) complex
matrix C, the algorithm is composed of the following steps:
        1. For each row of C perform an FFT and SHIFT the elements by M/2
        2. For every column k of C between 1 and M, extract the
        sub-matrix (N, 2K) centered on the kth
        column. Call this matrix S
        3. For each row of S perform the following operations:
                 i. IFFT
                 ii. SHIFT by K/2 elements
        4. For each of the columns of S perform the following operations:
                 i. FFT
                 ii. Vector Product with a fixed vector
                 iii. IFFT
                 iv. Accumulate all the S’ matrix columns
                 v. Substitute the kth column of C with the sum obtained
                 at the previous point
        5. Multiply each column of the resulting (N, M) matrix by a
        constant scalar.
        6. For each row of the resulting matrix perform the following
        operations:
                 i. IFFT
                 ii. SHIFT elements by M/2

The complexity of the algorithm depends on the size of the matrix, which is
defined by the pair (N, M), and on the shift factor K.

3.2.   SPACE TIME ADAPTIVE PROCESSING
The Space Time Adaptive Processing (STAP) is a signal processing technique
commonly used in radar systems. It involves adaptive array processing
algorithms to aid in target detection. Radar signal processing benefits
from STAP in areas where interference is a problem (e.g., ground clutter,
jamming). Through careful application of STAP, it is possible to achieve
order-of-magnitude sensitivity improvements in target detection.
STAP involves a two-dimensional filtering technique using a phased array
antenna with multiple spatial channels. Applying the statistics of the
interference environment, an adaptive STAP weight vector is formed. This
weight vector is applied to the coherent samples received by the radar.
From a numerical perspective, determining the STAP filter vector requires,
among other things, solving a linear system. Once the problem space is
fixed, the linear system solution is the computation that dominates the
execution time of the algorithm. In our STAP implementation we relied on
the Cholesky decomposition, for positive definite complex matrices, to
efficiently compute a decomposition which then requires a forward and a
backward substitution in order to find the linear system solution.

4.       EMPIRICAL EVALUATION
Testbed Setup. The testbed on which we evaluated the performance of RMC and
STAP consisted of:
           • Dual Cell Blade QS20, 3.2GHz with 1GB XDR running Linux
             Fedora Core 5 with Cell SDK 2.0
           • TigerSHARC DSP 500MHz
           • MPC 7457 featuring a PPC Power G4

4.1.     RMC RESULTS
Execution Time. The RMC was carefully coded to fully exploit data
parallelism as well as processing parallelism. Then, the execution time was
measured when running the algorithm for a 2048x1024 matrix with K=64.

            Figure 2 – RMC execution time vs. SPU number.

Figure 2 shows how the RMC’s execution time depends on the number of SPUs
on which the computation is parallelized. As can easily be seen from the
figure, the measured execution time scales practically linearly within a
single CBE chip, i.e., going from 1 to 8 SPUs. When relying on two CBEs,
and thus using up to 16 SPUs, the scaling is sublinear, but still very
good, especially if we consider that the application was not optimized for
the dual Cell configuration. By tuning the application for the dual CBE
configuration, and minimizing the inter-CBE communication, we are confident
that the latencies that lead to sub-linear scaling could be completely
hidden.
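As a reference for the workload timed in this section, the RMC steps of
Section 3.1 can be read as the following pure-Python sketch. It is only one
possible reading and not the tuned CBE code: we assume that the sub-matrix
wraps around at the matrix edges, that the "vector product" is an
element-wise product with a fixed vector w of length N, that the result is
written to a separate output matrix, and that N, M and 2K are powers of
two. The names fft, ifft, shift, rmc, k_factor, w and scale are
illustrative.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (inverse is unscaled)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(x):
    """Inverse FFT with the usual 1/n scaling."""
    n = len(x)
    return [v / n for v in fft(x, inverse=True)]


def shift(x, s):
    """Circular shift of a list by s positions."""
    s %= len(x)
    return x[-s:] + x[:-s]


def rmc(c, k_factor, w, scale):
    """One reading of RMC steps 1-6 for an (N, M) matrix given as a list of rows."""
    n, m = len(c), len(c[0])
    # Step 1: row-wise FFT followed by a shift by M/2.
    c = [shift(fft(row), m // 2) for row in c]
    out = [[0j] * m for _ in range(n)]
    for k in range(m):
        # Step 2: (N, 2K) sub-matrix centred on column k (assumption: wrap-around).
        idx = [(k + d) % m for d in range(-k_factor, k_factor)]
        # Step 3: row-wise IFFT followed by a shift by K/2.
        s = [shift(ifft([row[j] for j in idx]), k_factor // 2) for row in c]
        # Step 4: column-wise FFT, product with w, IFFT, then accumulate the columns.
        acc = [0j] * n
        for col in zip(*s):
            spec = [a * b for a, b in zip(fft(list(col)), w)]
            acc = [a + b for a, b in zip(acc, ifft(spec))]
        for i in range(n):
            out[i][k] = acc[i]  # step 4.v: the sum becomes column k
    # Steps 5 and 6: scale, then row-wise IFFT followed by a shift by M/2.
    return [shift(ifft([scale * v for v in row]), m // 2) for row in out]
```

Written this way, the loop over k makes the cost structure visible: for
each of the M columns, 2K column transforms of length N are performed,
which is why the execution time depends on (N, M) and on K. The iterations
over k are independent of each other under the assumptions above, which is
the kind of data parallelism that can be spread over the SPUs.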
is the computation that dominates the execution time of the   Speedup. The execution time of an RMC algorithm coded
and optimized for a TigerSHARC was evaluated and
compared with that of the CBE counterpart. The speedup
was measured and is reported in Figure 3. This figure shows
how a single SPU is almost 4 times faster than a
TigerSHARC DSP, while exploiting on the full power of
the CBE (8 SPUs) leads to a 26x speedup. The dual CBE
configuration provides a 40x speedup, which as discussed
above could be further improved by making the application
dual-CBE aware.
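The reported figures can be turned into a rough scaling estimate. The short
sketch below uses only the 26x (8 SPUs) and 40x (16 SPUs) speedups quoted
above; the single-SPU value is given only as "almost 4 times", so it is
left out of the arithmetic. The helper names are ours and purely
illustrative.

```python
# Speedups over one TigerSHARC DSP, as quoted in the text for the RMC benchmark.
REPORTED_SPEEDUP = {8: 26.0, 16: 40.0}  # number of SPUs -> speedup


def per_spu_speedup(spus, speedup):
    """Average speedup contributed by each SPU."""
    return speedup / spus


def scaling_efficiency(base_spus, base_speedup, spus, speedup):
    """Fraction of ideal linear scaling retained when growing from base_spus to spus."""
    ideal = base_speedup * (spus / base_spus)
    return speedup / ideal


for spus, speedup in REPORTED_SPEEDUP.items():
    print(f"{spus:2d} SPUs: {per_spu_speedup(spus, speedup):.2f}x per SPU")

dual = scaling_efficiency(8, 26.0, 16, 40.0)
print(f"dual-CBE efficiency relative to a single CBE: {dual:.0%}")
```

Under these numbers each SPU contributes on average 3.25x within a single
CBE and 2.5x in the dual configuration, i.e., the second chip retains about
77% of ideal linear scaling, consistent with the sub-linear behaviour
described for the non-optimized dual Cell run.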

        Figure 3 – CBE/TigerSHARC speedup.

        Figure 4 – STAP Normalized Execution Time

4.2.   STAP RESULTS
Execution Time. As explained earlier in the paper, the dominant portion of
the STAP algorithm is the linear system solution. The problem with this is
that algorithms for solving linear systems have many control and data
dependencies that limit the amount of computation that can be carried out
in parallel. The implementation of the Cholesky decomposition we crafted
for the CBE was very careful in extracting all the available parallelism,
especially at the data level.
Figure 4 shows the normalized execution time for the STAP for covariance
matrices of size ranging from (100, 100) to (1024, 1024). As can be seen
from the graph, the execution time scales linearly when the number of SPUs
does not exceed 3. With more than 3 SPUs the system saturates and the
execution time scaling flattens out. Measurements of the used memory
bandwidth revealed that the limited speedup when more than 3 SPUs are used
was due to memory bandwidth saturation.

        Figure 5 – STAP slowdown w.r.t. the ideal execution time

Figure 5 reports the slowdown percentage with respect to the linear scaling
that would ideally be desirable. From Figure 5 it is easy to see that the
highest slowdowns are experienced with large matrices and with more than
4 SPUs.

        Figure 6 – CBE Speedup over PPC Power G4 (number of rows scaled
        by 100).
        [Plot: performance ratio CELL / MPC 7457 versus number of rows,
        for CELL with 1, 4, and 7 SPUs. In-plot note: as the matrix size
        increases, the performance ratio increases too, until the memory
        bandwidth saturates.]

Speedup. We evaluated the performance of the STAP algorithm on an MPC 7457
board featuring a PPC Power G4. Figure 6 shows the speedup experienced for
matrices of sizes ranging from (100, 100) to (1200, 1200), when using 1, 4,
and 7 SPUs respectively. The results show that the speedup consistently
increases with the size of the matrix; as an example, a single SPU is 3x
faster than a G4 for (100, 100) matrices, but 10x faster for (1200, 1200)
matrices. Moreover, the speedup can be as much as 30x.
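To make the linear solve that dominates STAP concrete, the following
pure-Python sketch factors a Hermitian positive definite matrix with the
Cholesky decomposition and then applies a forward and a backward
substitution, as outlined in Section 3.2. It is a plain sequential
illustration, not the data-parallel CBE implementation; the function names
and the interpretation of a as a covariance matrix and b as a right-hand
side vector are assumptions made for the example.

```python
import math


def cholesky(a):
    """Return lower-triangular L with A = L * L^H for a Hermitian positive definite A."""
    n = len(a)
    low = [[0j] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: real and strictly positive for a positive definite matrix.
        d = a[j][j].real - sum(abs(low[j][k]) ** 2 for k in range(j))
        if d <= 0:
            raise ValueError("matrix is not positive definite")
        low[j][j] = complex(math.sqrt(d))
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(low[i][k] * low[j][k].conjugate() for k in range(j))
            low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def solve_hpd(a, b):
    """Solve A x = b through Cholesky, forward and backward substitution."""
    n = len(a)
    low = cholesky(a)
    # Forward substitution: L y = b.
    y = [0j] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    # Backward substitution: L^H x = y, where (L^H)[i][k] = conj(L[k][i]).
    x = [0j] * n
    for i in reversed(range(n)):
        s = sum(low[k][i].conjugate() * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / low[i][i]
    return x
```

Each entry of L depends on entries computed earlier in the same row and
column, and each substitution step depends on the previous ones. These are
the control and data dependencies that limit how much of the computation
can be carried out in parallel.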
5.    CONCLUDING REMARKS
The IBM Cell Broadband Engine is a new multicore processor that has been
applied with great success in the context of game consoles such as the
PlayStation 3. Its architecture fits very well with the kind of workloads,
as well as the computational structure of problems, common in data and
signal processing, thus making this processor an ideal solution for
applications in this domain. The initial benchmarking results shown in this
paper confirm that the level of performance that can be achieved with the
CBE is disruptive when compared with the technology commonly used today.

REFERENCES
[1]   G. E. Moore, “Cramming More Components onto Integrated Circuits”,
      Electronics, vol. 38, n. 8, April 1965.
[2]   J. L. Hennessy, D. A. Patterson, “Computer Architecture: A
      Quantitative Approach”, 4th ed., Morgan Kaufmann, 2006.
[3]   J. L. Hennessy, D. A. Patterson, “A Conversation with John Hennessy
      and David Patterson”, in ACM Queue vol. 4, n. 10, January 2007.
[4]   IBM Cell Project, http://www.research.ibm.com/cell/

More Related Content

Viewers also liked

Nanga 2009
Nanga 2009Nanga 2009
Nanga 2009
khan333
 
House For Rent in Montgomery Alabama 2009
House For Rent in Montgomery Alabama 2009House For Rent in Montgomery Alabama 2009
House For Rent in Montgomery Alabama 2009
sad asad
 
ikd312-04-aljabar-relasional
ikd312-04-aljabar-relasionalikd312-04-aljabar-relasional
ikd312-04-aljabar-relasional
Anung Ariwibowo
 
Microsoft history
Microsoft historyMicrosoft history
Microsoft history
Virus91
 
Cyberpolitics 2009 W2
Cyberpolitics 2009 W2Cyberpolitics 2009 W2
Cyberpolitics 2009 W2
oiwan
 
Osservatorio sul turismo Scolastico 2012
Osservatorio sul turismo Scolastico 2012 Osservatorio sul turismo Scolastico 2012
Osservatorio sul turismo Scolastico 2012
Jacopo Zurlo
 
before traveling
before travelingbefore traveling
before traveling
June Song
 

Viewers also liked (18)

Nanga 2009
Nanga 2009Nanga 2009
Nanga 2009
 
House For Rent in Montgomery Alabama 2009
House For Rent in Montgomery Alabama 2009House For Rent in Montgomery Alabama 2009
House For Rent in Montgomery Alabama 2009
 
ikd312-04-aljabar-relasional
ikd312-04-aljabar-relasionalikd312-04-aljabar-relasional
ikd312-04-aljabar-relasional
 
Microsoft history
Microsoft historyMicrosoft history
Microsoft history
 
Hibernating DDS
Hibernating DDSHibernating DDS
Hibernating DDS
 
Sph 107 Ch 7
Sph 107 Ch 7Sph 107 Ch 7
Sph 107 Ch 7
 
Vagrant
VagrantVagrant
Vagrant
 
Hr Managers Presentation
Hr Managers PresentationHr Managers Presentation
Hr Managers Presentation
 
Future Of Opt Outs
Future Of Opt OutsFuture Of Opt Outs
Future Of Opt Outs
 
Cyberpolitics 2009 W2
Cyberpolitics 2009 W2Cyberpolitics 2009 W2
Cyberpolitics 2009 W2
 
Osservatorio sul turismo Scolastico 2012
Osservatorio sul turismo Scolastico 2012 Osservatorio sul turismo Scolastico 2012
Osservatorio sul turismo Scolastico 2012
 
Archydro
ArchydroArchydro
Archydro
 
Work Samples
Work SamplesWork Samples
Work Samples
 
before traveling
before travelingbefore traveling
before traveling
 
Greetings
GreetingsGreetings
Greetings
 
PROYECTO WorkCentre
PROYECTO WorkCentrePROYECTO WorkCentre
PROYECTO WorkCentre
 
Rupert.Reading.Jan 2015
Rupert.Reading.Jan 2015 Rupert.Reading.Jan 2015
Rupert.Reading.Jan 2015
 
Bonsai
BonsaiBonsai
Bonsai
 

More from Angelo Corsaro

More from Angelo Corsaro (20)

Zenoh: The Genesis
Zenoh: The GenesisZenoh: The Genesis
Zenoh: The Genesis
 
zenoh: The Edge Data Fabric
zenoh: The Edge Data Fabriczenoh: The Edge Data Fabric
zenoh: The Edge Data Fabric
 
Zenoh Tutorial
Zenoh TutorialZenoh Tutorial
Zenoh Tutorial
 
Data Decentralisation: Efficiency, Privacy and Fair Monetisation
Data Decentralisation: Efficiency, Privacy and Fair MonetisationData Decentralisation: Efficiency, Privacy and Fair Monetisation
Data Decentralisation: Efficiency, Privacy and Fair Monetisation
 
zenoh: zero overhead pub/sub store/query compute
zenoh: zero overhead pub/sub store/query computezenoh: zero overhead pub/sub store/query compute
zenoh: zero overhead pub/sub store/query compute
 
zenoh -- the ZEro Network OverHead protocol
zenoh -- the ZEro Network OverHead protocolzenoh -- the ZEro Network OverHead protocol
zenoh -- the ZEro Network OverHead protocol
 
zenoh -- the ZEro Network OverHead protocol
zenoh -- the ZEro Network OverHead protocolzenoh -- the ZEro Network OverHead protocol
zenoh -- the ZEro Network OverHead protocol
 
Breaking the Edge -- A Journey Through Cloud, Edge and Fog Computing
Breaking the Edge -- A Journey Through Cloud, Edge and Fog ComputingBreaking the Edge -- A Journey Through Cloud, Edge and Fog Computing
Breaking the Edge -- A Journey Through Cloud, Edge and Fog Computing
 
Eastern Sicily
Eastern SicilyEastern Sicily
Eastern Sicily
 
fog05: The Fog Computing Infrastructure
fog05: The Fog Computing Infrastructurefog05: The Fog Computing Infrastructure
fog05: The Fog Computing Infrastructure
 
Cyclone DDS: Sharing Data in the IoT Age
Cyclone DDS: Sharing Data in the IoT AgeCyclone DDS: Sharing Data in the IoT Age
Cyclone DDS: Sharing Data in the IoT Age
 
fog05: The Fog Computing Platform
fog05: The Fog Computing Platformfog05: The Fog Computing Platform
fog05: The Fog Computing Platform
 
Programming in Scala - Lecture Four
Programming in Scala - Lecture FourProgramming in Scala - Lecture Four
Programming in Scala - Lecture Four
 
Programming in Scala - Lecture Three
Programming in Scala - Lecture ThreeProgramming in Scala - Lecture Three
Programming in Scala - Lecture Three
 
Programming in Scala - Lecture Two
Programming in Scala - Lecture TwoProgramming in Scala - Lecture Two
Programming in Scala - Lecture Two
 
Programming in Scala - Lecture One
Programming in Scala - Lecture OneProgramming in Scala - Lecture One
Programming in Scala - Lecture One
 
Data Sharing in Extremely Resource Constrained Envionrments
Data Sharing in Extremely Resource Constrained EnvionrmentsData Sharing in Extremely Resource Constrained Envionrments
Data Sharing in Extremely Resource Constrained Envionrments
 
The DDS Security Standard
The DDS Security StandardThe DDS Security Standard
The DDS Security Standard
 
The Data Distribution Service
The Data Distribution ServiceThe Data Distribution Service
The Data Distribution Service
 
RUSTing -- Partially Ordered Rust Programming Ruminations
RUSTing -- Partially Ordered Rust Programming RuminationsRUSTing -- Partially Ordered Rust Programming Ruminations
RUSTing -- Partially Ordered Rust Programming Ruminations
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
Radar Processing with the IBM Cell Broadband Engine

  • 1. RADAR PROCESSING WITH THE IBM CELL BROADBAND ENGINE

A. Corsaro, E. Giaccari, S. Nave, E. La Rosa, J. Derby, F. Casadei, A. Perciante
PrismTech, FINMECCANICA, Galileo Avionica, SELEX-SI, IBM, Quadrics, SELEX Communications

Keywords: Radar Processing, Multi-Core Processors, IBM Cell Broadband Engine, Performance Evaluation.

ABSTRACT

Multicore processors are dominating the computing scene, and have provided a way to keep improving performance while circumventing the power wall as well as the memory bandwidth wall. The IBM Cell Broadband Engine is a heterogeneous multicore which has been designed with high throughput in mind. This paper explores its applicability to data processing, and reports the performance results on some important radar processing algorithms. Initial results are very promising and highlight the disruptive potential of this technology.

1. INTRODUCTION

In the past few decades we have been experiencing the prophecy of Moore's Law [1], which has resulted in a steady increase in processor computing power. These improvements were boosted by the technological advances in microelectronics and miniaturization forecast by Gordon Moore. In the past few years, however, due to (1) the approaching limit on miniaturization, (2) the widening gap between microprocessor and memory speeds, and (3) the diminishing performance returns resulting from clock frequency increases, the steady growth in microprocessor performance seems to have reached a saturation point.

In order to overcome this performance wall, microprocessor architects have realized that, instead of going faster, a sensible approach is to use the chip area to exploit coarser parallelism than what is already provided by instruction level parallelism and thread level parallelism. This has led to an architectural innovation in contemporary processor architectures, which has resulted in the creation of multi-core microprocessors [3].

Multi-core architectures are creating the potential for a leap forward in the processing capability made available by a single chip. This has great potential for computationally hungry applications such as radar processing, image processing, computer vision, etc., both for obtaining unprecedented real-time performance and for implementing more elaborate algorithms.

The goal of this paper is to highlight the potential of multi-core architectures for radar processing, as well as to report the performance results obtained for some key radar algorithms.

As discussed in the paper, among the multi-core processors currently available on the market, we decided to focus our attention on the IBM Cell Broadband Engine (CBE) [4], as (1) its architectural features closely match the needs of data processing algorithms, (2) its impressive peak performance is unmatched, and has the potential for enabling disruptive innovation in real-time data processing, and (3) it provides a very good performance per watt ratio.

CBE performance was evaluated by defining an application benchmark, as well as by developing a series of synthetic micro-benchmarks. As benchmark definition is often controversial, our approach in defining the application benchmark to evaluate the CBE was rather pragmatic. We took under consideration two algorithms: the Rotational Motion Compensation (RMC), a fundamental building block for all airborne real-time Synthetic Aperture Radar (SAR) imaging; and the Space Time Adaptive Processing (STAP), the holy grail of radar analysts. Other than being extremely relevant for our application domain, both algorithms are computationally and memory bandwidth intensive, and are thus excellent candidates for stressing the strengths of a processor. For both algorithms we also had an existing implementation, which helped us compare results as well as speedups.

Our initial experience, detailed in the remainder of this paper, has shown that multi-core processors such as the CBE can provide speedups, on radar processing algorithms, of 20 and 30 times when compared with the technology typically used today, such as the PPC G4 or the TigerSHARC DSP, while at the same time allowing a reduction in volume and power
  • 2. roughly by an order of magnitude. Moreover, what really struck us was that these improvements, once the correct application partitioning is devised, are gained without too much programming effort, and in relatively little time.

The remainder of the paper is organized as follows: Section 2 provides an overview of the IBM CBE; Section 3 describes the application benchmarks; Section 4 reports the performance results of the selected application benchmarks; finally, Section 5 gives the concluding remarks.

2. IBM CELL BROADBAND ENGINE (CBE)

Architectural Overview. The IBM Cell Broadband Engine is a heterogeneous multicore processor that, as shown in Figure 1, is composed of eight Synergistic Processing Elements (SPEs) and one 64-bit Power Processing Element (PPE). SPEs are 128-bit processors with a SIMD-RISC [2] instruction set and a unified register file of 128 registers, each of which is 128 bits wide. The PPE is a 64-bit processor based on the PPC 970 architecture.

Figure 1 – IBM Cell Broadband Engine Architecture.

These elements are connected to each other and to the main memory by a ring-based bus, the Element Interconnect Bus (EIB), capable of carrying up to 96 bytes per cycle. The PPE can address the main memory directly; all the other elements, i.e., the SPEs, access the main memory through DMA. Each SPE is equipped with a Local Store (LS) of 256 KByte, which is used to store data and code and is under the control of the programmer.

The CBE is able to deliver more than 200 GFLOPs when operating on single precision floating point, at a power of roughly 70 W, providing a GFLOP/Watt index of nearly 3.

Programming the CBE. The CBE architecture has been driven by the requirements of a wide set of application domains such as computer gaming, multimedia stream processing, computer vision, data and signal processing, etc. As a result, although at first it might look harder to grasp and program, it fits applications typical of these domains very well, leading to natural application design, partitioning, and implementation. The architectural choices at its foundation have traded ease of programming for performance and power efficiency. As a result, to exploit its maximum potential the CBE has to be programmed like a distributed system rather than like a multi-threaded system, as is done with homogeneous multi-core processors.

However, as we will see in the remainder of the paper, the learning curve is not so steep, and it is possible to get up to speed with programming the CBE in a relatively short amount of time.

3. RADAR PROCESSING BENCHMARK

Benchmark definition is often controversial, as it takes a good blend of art and science to define an objective benchmark. Our approach in defining the application benchmark to evaluate the CBE is rather pragmatic. We took under consideration two algorithms: the Rotational Motion Compensation (RMC), a fundamental building block for all airborne radars; and the Space Time Adaptive Processing (STAP), the holy grail of radar analysts. Other than being extremely relevant for our application domain, both algorithms are computationally and memory bandwidth intensive. For both algorithms we also had existing implementations, optimized and tuned over the years on top-of-the-class DSPs and microprocessors, which helped us compare results as well as speedups. In the remainder of this Section we provide a brief description of the two algorithms.

3.1. ROTATIONAL MOTION COMPENSATION

The Rotational Motion Compensation (RMC) algorithm is typically used to remove the distortion induced in the measurement by the movement of the airplane. Conceptually this algorithm is rather simple, as it can be decomposed into FFT, IFFT, FFT-SHIFT and IFFT-SHIFT operations performed over either the rows or the columns of a complex matrix. In detail, given an (N, M) complex matrix C, the algorithm is composed of the following steps:

1. For each row of C perform an FFT and SHIFT the elements by M/2
2. For every column k of C between 1 and M, extract the sub-matrix (N, 2K) centered on the kth
  • 3. column. Call this matrix S
3. For each row of S perform the following operations:
   i. IFFT
   ii. SHIFT by K/2 elements
4. For each of the columns of S perform the following operations:
   i. FFT
   ii. Vector Product with a fixed vector
   iii. IFFT
   iv. Accumulate all the S' matrix columns
   v. Substitute the kth column of C with the sum obtained at the previous point
5. Multiply each column of the resulting (N, M) matrix by a constant scalar.
6. For each row of the resulting matrix perform the following operations:
   i. IFFT
   ii. SHIFT elements by M/2

The complexity of the algorithm depends on the size of the matrix, which is defined by the pair (N, M), and on the shift factor K.

3.2. SPACE TIME ADAPTIVE PROCESSING

Space Time Adaptive Processing (STAP) is a signal processing technique commonly used in radar systems. It involves adaptive array processing algorithms to aid in target detection. Radar signal processing benefits from STAP in areas where interference is a problem (e.g., ground clutter, jamming). Through careful application of STAP, it is possible to achieve order-of-magnitude sensitivity improvements in target detection.

STAP involves a two-dimensional filtering technique using a phased array antenna with multiple spatial channels. Applying the statistics of the interference environment, an adaptive STAP weight vector is formed. This weight vector is applied to the coherent samples received by the radar.

From a numerical perspective, determining the STAP filter vector requires, among other things, solving a linear system. Once the problem space is fixed, the linear system solution is the computation that dominates the execution time of the algorithm. In our STAP implementation we relied on the Cholesky decomposition for positive definite complex matrices, which efficiently computes a factorization after which a forward and a backward substitution are required to find the solution of the linear system.

4. EMPIRICAL EVALUATION

Testbed Setup. The testbed on which we evaluated the performance of RMC and STAP consisted of:
• Dual Cell Blade QS20, 3.2GHz with 1GB SDR running Linux Fedora Core 5 with Cell SDK 2.0
• TigerSHARC DSP 500MHz
• MPC 7457 featuring a PPC Power G4

4.1. RMC RESULTS

Execution Time. The RMC was carefully coded to fully exploit data parallelism as well as processing parallelism. Then, the execution time was measured when running the algorithm for a 2048x1024 matrix with K=64.

Figure 2 – RMC execution time vs. SPU number.

Figure 2 shows how the RMC's execution time depends on the number of SPUs on which the computation is parallelized. As can easily be seen from the figure, the measured execution time scales practically linearly within a single CBE chip, i.e., going from 1 to 8 SPUs. When relying on two CBEs, and thus using up to 16 SPUs, the scaling is sublinear but still very good, especially if we consider that the application was not optimized for the dual Cell configuration. By tuning the application for the dual CBE configuration and minimizing the inter-CBE communication, we are confident that the latencies that lead to sub-linear scaling could be completely hidden.

Speedup. The execution time of an RMC algorithm coded
  • 4. and optimized for a TigerSHARC was evaluated and compared with that of the CBE counterpart. The measured speedup is reported in Figure 3. This figure shows that a single SPU is almost 4 times faster than a TigerSHARC DSP, while exploiting the full power of the CBE (8 SPUs) leads to a 26x speedup. The dual CBE configuration provides a 40x speedup which, as discussed above, could be further improved by making the application dual-CBE aware.

Figure 3 – CBE/TigerSHARC speedup.

4.2. STAP RESULTS

Execution Time. As explained earlier in the paper, the dominant portion of the STAP algorithm is the linear system solution. The problem with this is that algorithms for solving linear systems have many control and data dependencies that limit the amount of computation that can be carried out in parallel. The implementation of the Cholesky decomposition we crafted for the CBE was very careful in extracting all the available parallelism, especially at the data level.

Figure 4 – STAP Normalized Execution Time.

Figure 4 shows the normalized execution time of the STAP for covariance matrices of size ranging from (100, 100) to (1024, 1024). As can be seen from the graphs, the execution time scales linearly when the number of SPUs does not exceed 3. With more than 3 SPUs the system saturates and the execution time scaling flattens out. Measurements of the memory bandwidth in use revealed that the limited speedup when more than 3 SPUs are used was due to memory bandwidth saturation.

Figure 5 – STAP slowdown w.r.t. the ideal execution time.

Figure 5 reports the slowdown, as a percentage, with respect to the linear scaling that would ideally be desirable. From Figure 5 it is easy to see that the highest slowdowns are experienced with large matrices and with more than 4 SPUs.

Speedup. We compared the performance of the STAP algorithm on the CBE with that on an MPC 7457 board featuring a PPC Power G4. Figure 6 shows the speedup experienced for matrices of sizes ranging from (100, 100) to (1200, 1200), when using 1, 4, and 7 SPUs respectively. The results show that the speedup consistently increases with the size of the matrix; as an example, a single SPU is 3x faster than a G4 for (100, 100) matrices, but 10x faster for (1200, 1200). Moreover, the speedup can be as much as 30x.

Figure 6 – CBE Speedup over PPC Power G4 (number of rows scaled by 100).
  • 5. 5. CONCLUDING REMARKS

The IBM Cell Broadband Engine is a new multicore processor that has been applied with great success in the context of game consoles such as the PlayStation 3. Its architecture fits very well with the kind of workloads, as well as the computational structure of problems, common in data and signal processing, thus making this processor an ideal solution for applications in this domain. The initial benchmarking results shown in this paper confirm that the level of performance that can be achieved with the CBE is disruptive when compared with the technology commonly used today.

REFERENCES

[1] G. E. Moore, “Cramming More Components onto Integrated Circuits”, Electronics, vol. 38, n. 8, April 1965.
[2] J. L. Hennessy, D. A. Patterson, “Computer Architecture: A Quantitative Approach”, 4th ed., Morgan Kaufmann, 2006.
[3] J. L. Hennessy, D. A. Patterson, “A Conversation with John Hennessy and David Patterson”, in ACM Queue, vol. 4, n. 10, January 2007.
[4] IBM Cell Project, http://www.research.ibm.com/cell/