March 25, 2006 Reconfigurable Computing Dr. Partha Pratim Das Head of Engineering, Interra Systems (India) Pvt. Ltd.   Emerging Architectures for Embedded Systems
In memory of … Ben Sloman, 1967 - 2002 Founder Vice President of  Corporate Development &  Software Engineering   Elixent Ltd., UK. In 2001, Ben introduced me to the wonderful world of Reconfigurable Computing.
Source & Disclaimer: Information about the Paradigms, Architectures, Applications, Tools, Capacity, Advantages & Scope of the various RC Architectures has been borrowed from the respective sites of the companies. The speaker bears no responsibility for its correctness. Neither does he promote or demote any specific company on the merit or otherwise of its technology.
Outline
Setting the Stage – Why Reconfigurable Computing?
Reconfigurable Computing – Leading Companies
Reconfigurable Computing – Case Study of New Computing Machines
  D-Fabrix – based on Reconfigurable Array Architecture
  ACM – based on SRGA
  XPP – based on Dataflow Computing
  PulseDSP™ – based on Systolic Computing
Programming RAP – Overview of Development Tools
Sum Up
Questions
Setting the Stage What are Embedded Systems? & Why Reconfigurable Computing for them?
What are Embedded Systems? Any device that  includes  a Computer but  is not itself  a General-Purpose Computer Computers as Components
A Perfect Example!
Other Examples
Personal Convenience: Calculator, Alarm (Talking) Clock Radio, CD / MP3 Player, Personal Digital Assistant (PDA), Photo Copier
Personal Communication: Cordless Phone, Answering Machine, Cell Phone, Fax
Public Utilities: Automatic Teller Machine, Electronic Voting Machine
Camera: Analog, Digital, Handy Cam
Automobile: Engine, Brake, Dash, Car Stereo
Aviation
Television: Analog TV (Channel Selection), Digital TV (Decompression), CAS (De-scrambling)
Household Appliances: Microwave Oven, Washing Machine, Air-conditioner
Surveillance Systems: Burglar Alarm, CCTV, Metal Detector, Biometric Identification System, Secure ID
Control & Automation: Railway Signaling, Steel Industry – Blast Furnace, Aluminum Extraction, Fire Alarm, Industrial Process Control
Medical Systems: Pace Maker, Laparoscopic Appliances, Monitors – ECG, EEG, PET
Computer Accessories: Printer, Plotter, Scanner
Networking: NIC Cards, N/w Components – HUB, Router, Switch, Modem
Global Positioning System: Navigation, Exploration
Characteristics of Embedded Systems. Universal & Market Driven: Real-Time Operation (always?), Low Manufacturing Cost, Low Power. Domain Specific & Technology Driven: Sophisticated Functionality, Application Dependent Processor, Restricted Memory, Fault Tolerant, Safe.
Embedded Systems Market Segments. Source: The Death of the DSP by Nick Tredennick, www.qstech.com, August 2000
The zero-cost segment: To a first approximation, represents almost all of the embedded systems market. The segment for which low cost is the overriding consideration. Consumer appliances that generally have minimal processing needs: microwave ovens, electric razors, blenders, toasters, washing machines, … Sells in high volumes (millions to tens of millions of units). Characterized by intense price competition; the ideal would be zero cost to implement.
The zero-power segment: To a first approximation, represents a few percent of the embedded systems market. The segment for which zero power dissipation represents the ideal. Consumer items that are expected to run on a single button-size battery or on weak ambient light: smoke detectors, basic cellular phones, pagers, pacemakers, hearing aids, MP3 players, pocket calculators, etc. Minimum product cost remains a concern.
The zero-delay segment: To a first approximation, represents a little more than zero percent of the embedded systems market. The segment for which zero delay from data in to result out represents the ideal. Consumer items: high-end printers, scanners, copiers, and fax machines. Processing power and throughput are important. Minimum product cost is still a criterion.
The zero-volume segment: To more than a first approximation, represents zero percent of the embedded systems market. The segment for which the application potential is nearly zero; production volumes and profits will also be close to zero. Why did Intel design the 80960MX microprocessor? The only known application was the YF-22 aircraft. Later the only prototype of the YF-22 crashed & the application volume for the ’960MX actually went to zero. Could Intel have expected to sell more than a few thousand ’960MX processors? There must be some other reason to capture the application. One motive is public relations.
YF-22
The Leading Edge Wedge
The Leading Edge Wedge: Handheld devices (digital cameras, mobile phones, GPS receivers, PDAs, etc.) drive more computing into portable devices. Being consumer devices, they fall into the zero-cost segment. Having high computing requirements, they fall into the zero-delay segment. Being portable, they fall into the zero-power segment. Target: cheap, highly capable devices that give us instant answers and that work on weak ambient light. The overlap of the zero-cost, zero-delay, and zero-power segments is the leading-edge wedge.
(Mobile) Technology Road Map
Approach to Computing. TASK: First we think of some task or function that we wish to perform. ALGORITHM: Next we define an algorithm that describes how to perform the task. MAP: Ultimately we map our algorithm into some kind of physical implementation that will execute the task.
Computing Models: ASIC – functionality fixed during fab (Custom SoC, Structured ASIC); FPGA – programmable functionality (Embedded Processor Core, Embedded Custom ASIC (peripherals)); µP – general-purpose programmable device; DSP – specialized µP
Limitations of ASICs: Prolonged design cycle; High NRE; Algorithm is frozen into h/w – no flexibility; Each function needs its own implementation in h/w – more area and more power; Design based on HDLs – not good at representing algorithms
Limitations of FPGAs: Reconfiguration is slow; Reconfiguration is power consuming; Inefficient use of available logic; Often leads to combinatorial explosion; Good for Rapid Prototyping; Design based on HDLs – not good at representing algorithms
Limitations of DSP / µP: Algorithm has to be artificially partitioned; Constrained to meet the physical bus width; Mapped on the instruction set of the target device; Inefficient utilization of available resources; Loss of inherent parallelism of the algorithm
Summary Observations
Resources | Hardware | Algorithms | Remarks
ASIC | Rigid | Rigid | The h/w is fixed. The algorithm is frozen in h/w. Design tools are not good for algorithms.
FPGA | Pseudo-Dynamic | Rigid | Power-hungry. Slow to re-configure. Inefficient design tools.
µP / DSP | Rigid | Pseudo-Dynamic | General-purpose h/w, fixed & inefficient. Algorithms are changeable but artificially partitioned & constrained to match the h/w.
RC | Dynamic | Dynamic | Algorithm-friendly design tools. Low power. Re-configurability 100,000 times / sec.
Search for New Machines Should be Low Power Should be Low Area Should keep pace with evolving standards Should fit the algorithms – Implement efficiently  Should be Low cost to design Should be Fast to market Dynamic algorithms  implemented on  dynamic resources
Reconfigurable Architecture Requirements Scalable hardware-like performance Massive instruction-level parallelism Balance between compute, memory and communication Tunable number and type of resources Software-like flexibility Support multiple applications Commit function any time after silicon fabrication Fast function changes for multi-mode products Silicon efficiency Price/performance overhead must be low
Alternate Nomenclature: Reconfigurable Computing; Configurable Computing; Reconfigurable Array Architecture; Self Reconfiguring Architecture; Adaptive Computing Machine; Reconfigurable Algorithmic Processor (RAP); Dataflow Computing
Reconfigurable Computing Leading Companies
FPGA Leaders: Xilinx (www.xilinx.com), Altera (www.altera.com), Actel (www.actel.com), Quicklogic (www.quicklogic.com), Lattice (www.latticesemi.com). In a way an FPGA may also be an RC architecture, but reconfiguration in an FPGA is slow, power-hungry and less flexible. In this lecture we will stay away from reviewing FPGA companies & technologies while we look for more dynamic architectural options.
RC Runners: Elixent (www.elixent.com); QuickSilver (www.qstech.com); Pact Corp (www.pactcorp.com); Systolix (www.systolix.co.uk; wholly owned subsidiary of RadioScape from Jan 2002); Xilinx (www.xilinx.com; acquired Triscend in Mar 2004); picoChip (www.picochip.com). Let us first review these companies before taking up case studies of some of their architectures.
Elixent Ltd.: Supplies Reconfigurable Algorithmic Processors (RAP), an emerging IP business space. Offers a significant competitive edge for 1st-tier OEMs, Application Specific Standard Part vendors and IC integrators. Elixent technology addresses the top three customer needs: increased functionality, reduced time to market, reduced design costs.
Elixent Ltd. UK based Company Offices in UK, US & Japan Spin-off from HP Research Laboratories Investors VC Firm 3i Hewlett Packard Actel Partners Interra – HDL entry Celoxica – Handel-C entry AccelChip – MATLAB entry Others in the DSP Design Space
QuickSilver Technology: Adaptive Computing Machine (ACM). Functionality adapts on the fly by downloading s/w applications. A single device can perform various media-rich applications. Has its own language and design tools (Silverware). Strongly suggests that HDLs are not the way to go; this could be a marketing stunt to cover the non-availability of an HDL solution.
QuickSilver: Mobile communication company. Founded 1998. San Jose based, with offices in San Diego, Seattle, U.K. and Japan. Received 13 million in new funding in April 2002. Investors: TechFund Capital, JP Morgan Partners, Portview Communication Partners, Selby Venture Partners, BellSouth Cellular, Kyocera Corp.
Pact Corp Technology: eXtreme Processing Platform (XPP). A set of ALU, RAM and I/O elements along with a configuration manager. Programmed in a mix of a C subset and Native Mapping Language (NML). Has a C compiler (XPP-VC). Simulation and development tools are built around C and NML. NML is a sort of structural language. High-level components such as counters are available.
Pact Corp Fabless semiconductor and IP vendor Offers IP and ASSPs based on its XPP architecture Simulation and Development tools available Corporate office in Germany Sales/Marketing office in San Jose  Recently teamed with Quicklogic Have sound funding
PulseDSP™: Systolix Highly Scaleable Architecture Multiple Data Widths, 8 to 64 bits internally Multiple Array Sizes, 32 to 14,000 processing elements  Multiple I/O Ports, up to 52 channels at 200MSPS each  Sustained performance of up to 200GMAC/S for 16bit operations  Real-time signal processing up to video rates  Supports multiple independent signal data streams  Integrated control and data processing functions  Supports Linear and Non-Linear systems  Dynamically or Statically programmed  Full application development environment
Systolix PulseDSP Ltd: A wholly owned subsidiary of RadioScape Ltd. Founded in 1998 and based in Liverpool, UK. Acquired by RadioScape in Jan 2002. Specializes in the development and commercial licensing of advanced DSP technologies and associated software tools. Introduced PulseDSP technology, a multiprocessor architecture that provides very low cost, high performance programmable DSP. PulseDSP technology is ideal for high MAC rates and rapid data throughput, for example digital IF, baseband signal processing and software radio solutions. Licensed the first PulseDSP technology to Analog Devices.
Reconfigurable Architectures Case Study of New Computing Machines
Case Study D-Fabrix Array ACM – Adaptive Computing Machine Based on SRGA – Self-Reconfigurable Gate Array Architecture XPP – eXtreme Processing Platform PulseDSP™ – Systolic Architecture
D-Fabrix Array Elixent Source : DFA1000 RISC Accelerator Data Sheet & Website  www.elixent.com
D-Fabrix Array Massive instruction-level-parallelism  Regular tiled structure:  low design cost,  configurable,  portable Rapid re-configuration  
D-Fabrix Array Components are: 4-bit ALUs,  registers and  the "switchbox".   
Basic D-Fabrix Array Element [Figure: ALU with 4-bit inputs A and B, a 4-bit INSTR input, a 1-bit CIN/CONTROL input, a 1-bit COUT output and a 4-bit output F that feeds an output register (REG)] Typical instructions: A + B, A - B, A & B, A | B, A xor B, Cin ? A : B, A == B, A > B, not A, not B. Output register options: Transparent; Reset: 0000, 1111; Clocked: always, when enabled, never.
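The instruction list above can be made concrete with a small software model. This is only a sketch under stated assumptions, not Elixent's actual encoding: the opcode names, the use of carry-in on ADD, and the choice to report comparison results on the carry output are all hypothetical. It also shows how two 4-bit ALUs could be chained through the carry signal to form an 8-bit adder.

```python
# Hypothetical software model of one 4-bit ALU. Opcode names and the
# placement of comparison results on cout are assumptions for illustration.
MASK = 0xF  # 4-bit datapath


def alu(instr, a, b, cin=0):
    """Return (f, cout) for 4-bit operands a and b."""
    a &= MASK
    b &= MASK
    if instr == "ADD":
        total = a + b + cin          # carry-in allows chaining (assumption)
        return total & MASK, total >> 4
    if instr == "SUB":
        total = a + (~b & MASK) + 1  # two's complement; cout=1 means no borrow
        return total & MASK, total >> 4
    if instr == "AND":
        return a & b, 0
    if instr == "OR":
        return a | b, 0
    if instr == "XOR":
        return a ^ b, 0
    if instr == "NOT_A":
        return ~a & MASK, 0
    if instr == "NOT_B":
        return ~b & MASK, 0
    if instr == "MUX":               # Cin ? A : B
        return (a if cin else b), 0
    if instr == "EQ":                # A == B, flag reported on cout
        return 0, int(a == b)
    if instr == "GT":                # A > B, flag reported on cout
        return 0, int(a > b)
    raise ValueError(f"unknown instruction: {instr}")


def add8(x, y):
    """Chain two 4-bit ALUs through the carry to add 8-bit values."""
    lo, carry = alu("ADD", x & MASK, y & MASK)
    hi, carry = alu("ADD", x >> 4, y >> 4, carry)
    return (hi << 4) | lo, carry


if __name__ == "__main__":
    print(alu("ADD", 9, 8))
    print(add8(0x3C, 0x5A))
```

Wider datapaths would follow the same pattern, with one more ALU per additional nibble.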
D-Fabrix Array: Tile Combine two of each into the "tile".  
D-Fabrix Array: Tiling & Memory Combine 100’s or 1000’s of tiles to create the D-Fabrix array.  Memory is distributed to give fast, local storage with massive bandwidth.   
D-Fabrix: Routing. Sixteen 4-bit busses cross each ALU horizontally and vertically, for short and long connections. 4-bit connections are made by setting a configuration bit to ‘1’. Each ALU connects to 8 others via just one switch delay. Under 128 configuration bits per ALU + switchbox => can configure 512 ALUs (64 kbits) in ~20 µs.
D-Fabrix Array: Virtual Hardware. Once the math units are in place, the switchboxes link them together. They are part of a rich interconnect, providing both local and global connectivity. The algorithm is implemented in "Virtual Hardware": it is processed on a hardware implementation, but it is software. A new set of "virtual hardware" can be loaded at any time. In microseconds you can switch to the next hardware configuration. Some applications have one configuration, and only alter it when the standards change or the specification creeps. Others re-use the silicon dynamically, switching modes, or even "folding" algorithms to use smaller arrays.
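The configuration figures on this slide can be checked with a few lines of arithmetic. The sketch below takes 128 bits per ALU as the upper bound quoted on the slide and reads the load time as roughly 20 microseconds (an assumption, consistent with the "in microseconds" claim on the Virtual Hardware slide); the variable names are illustrative.

```python
# Back-of-the-envelope check of the configuration figures on the slide.
ALUS = 512
BITS_PER_ALU = 128        # upper bound ("under 128") from the slide
LOAD_TIME_S = 20e-6       # assumed reading of the quoted load time: ~20 microseconds

total_bits = ALUS * BITS_PER_ALU         # size of one full configuration
bandwidth = total_bits / LOAD_TIME_S     # configuration bits per second
reloads_per_sec = 1 / LOAD_TIME_S        # full-array reloads per second

if __name__ == "__main__":
    print(f"configuration size: {total_bits} bits ({total_bits // 1024} kbit)")
    print(f"implied configuration bandwidth: {bandwidth / 1e9:.2f} Gbit/s")
    print(f"full reloads per second: {reloads_per_sec:,.0f}")
```

Under these assumptions a complete 512-ALU configuration is 64 kbit, and reloading the whole array back to back would be possible on the order of tens of thousands of times per second.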
DFA 1000: D-Fabrix based RISC Accelerator
DFA 1000: D-Fabrix based RISC Accelerator. A pre-configured D-Fabrix implementation for accelerating RISC systems, with a peripheral set to facilitate its integration into SoC designs: High-speed data interfaces to the D-Fabrix core array (low latency and no overhead on the system bus); The AMBA bus interface, for programming the array and transferring data to and from the host RISC (a much lower bandwidth control and configuration path); Local high-speed RAMs, directly accessible by the array or by the RISC; The D-Fabrix array itself.
DFA 1000: Advantages. Over FPGAs: an order of magnitude better in die size for a given performance; retains a homogeneous architecture, so the whole array can be utilized for any given task. Over DSPs: an order of magnitude improvement in most algorithms, simply by matching the computing to the algorithm. Over RISCs: RISC architectures are not typically optimized for the dataflow algorithms common in DSP and media processing.
DFA 1000: Applications & Benchmarks
Application | Performance
JPEG Encoder | 200 Mpixel/sec (two macro-blocks in parallel)
8x8 DCT | 400 Mpixels/sec (four 8x8 DCTs in parallel)
Floyd-Steinberg Color Dither | 400 Mpixels/sec (16 image lines in parallel)
UMTS Viterbi | ~1024 voice channels
5th Order CIC Filter | 400 Msample/sec
DFA 1000: Imaging Application one port captures data,  the second displays it, the AHB is used for control
DFA 1000:  Software Defined Radio Application I/O ports used to transfer data to the antenna AHB used for data output to the host RISC.
ACM – Adaptive Computing Machine Based on   SRGA – Self-Reconfigurable Gate Array Architecture QuickSilver Source:  A Self-Reconfigurable Gate Array Architecture ,  Reetinder Sidhu et al,  10th International Workshop on Field Programmable Logic and Applications, August 2000 . A look into QuickSilver's ACM architecture,  Paul Master, CTO QuickSilver, EE Times, September 12, 2002 (4:39 p.m. EST) Website  www.qstech.com
Self Reconfigurable Gate Array Architecture: Logic adapts itself as the computation proceeds, based on input and intermediate results. The device needs to store multiple configuration contexts and context switch between them. Self reconfiguration works by modifying the configuration memory.
SRGA Architecture: A self-reconfigurable device is characterized by the following features: fast context switching; fast random access of configuration memory. An efficient architecture should allow single-cycle context switching as well as single-cycle random memory access.
SRGA Architecture: Consists of a rectangular array of PEs. Each PE consists of a logic cell and a memory block. The logic cell contains a LUT and a flip-flop. Each PE is connected to neighboring PEs and switches.
SRGA Architecture: A configuration context contains the bits that configure all the logic cells and switches in the mesh-of-trees network. Configuration contexts are stored in the memory blocks. A memory access operation transfers data between rows and columns of PEs. Each memory block is implemented as a random access memory that can read or write a single bit every clock cycle.
SRGA: Context Switch Operation. For a context switch to occur, some logic in the currently active context needs to write the address of the context to switch to into a specified memory. In each memory block, the configuration bits are loaded in the first half of the next clock cycle. During the second half of the next clock cycle the current context is saved.
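The PE structure and context switch described on the SRGA slides can be sketched as a toy model. Everything here is a simplification under stated assumptions: the class and method names are hypothetical, the LUT has only two inputs, the two half-cycle phases are collapsed into a single clock step, and saving of the outgoing context is not modeled.

```python
class ProcessingElement:
    """Toy PE: a 2-input LUT plus flip-flop, with a memory block of contexts.

    Hypothetical model for illustration; not the SRGA paper's exact design.
    """

    def __init__(self, contexts):
        # Each context is a 4-entry truth table indexed by (a << 1) | b.
        self.memory = [list(table) for table in contexts]
        self.active = 0           # context currently driving the LUT
        self.next_context = None  # written by logic to request a switch
        self.ff = 0               # flip-flop holding the last LUT output

    def request_switch(self, index):
        """Logic in the active context writes the address of the next context."""
        self.next_context = index

    def write_config_bit(self, context, address, bit):
        """Self-reconfiguration: modify one configuration bit in memory."""
        self.memory[context][address] = bit

    def clock(self, a, b):
        """Advance one cycle; a pending switch takes effect before evaluation."""
        if self.next_context is not None:
            self.active = self.next_context
            self.next_context = None
        lut = self.memory[self.active]
        self.ff = lut[(a << 1) | b]
        return self.ff


AND_TABLE = [0, 0, 0, 1]
XOR_TABLE = [0, 1, 1, 0]

if __name__ == "__main__":
    pe = ProcessingElement([AND_TABLE, XOR_TABLE])
    print(pe.clock(1, 1))   # evaluated with the AND context
    pe.request_switch(1)
    print(pe.clock(1, 1))   # evaluated with the XOR context
```

The same inputs give different results before and after the switch, which is the point of holding several contexts on chip.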
ACM: Adaptive Computing Machine. An SRGA-based heterogeneous architecture with five types of nodes: Arithmetic – different, variable-width, linear arithmetic functions such as a FIR filter, a DCT or an FFT; Bit manipulation – different, variable-width bit-manipulation functions such as an LFSR, Walsh code generator, Gold code generator or TCP/IP packet discriminator; Finite state machine; Scalar – executes legacy code; Configurable input/output – I/O in the form of a UART or bus interfaces such as PCI, USB and Firewire, and other I/O-intensive actions.
ACM: Fractal Architecture
ACM: Advantages. Any node can be adapted to perform a new function, clock cycle by clock cycle. Rather than passing data from function to function, the data can remain resident in a node while the function of the node changes on a clock cycle-by-clock cycle basis. Adaptable hundreds of thousands of times a second. Only the portions of an algorithm that are actually being executed need to be resident in the chip at any one time. Heavy silicon reuse. Tremendous reductions in silicon area & power consumption.
XPP: eXtreme Processing Platform Pact Corp Source:  The XPP White Paper www.pactcorp.com
XPP: eXtreme Processing Platform The XPP idea consists of: Data stream processing Configurable ALUs communicating via a packet oriented, automatically synchronized communication network User transparent configuration management Works in the Dataflow Computing Paradigm
XPP: How it Works
XPP: Configurations Configurations are basic parallel calculation modules which are derived from a data flow graph of an algorithm.  Nodes of the data flow graph are mapped to the fundamental machine operations such as multiplication, addition etc.  Graph’s edges are the connections between the nodes.  As long as data packets stream through a single configuration, the graph remains static - no opcodes and connections are changed.
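The idea of a configuration as a static data flow graph can be sketched in a few lines. This is an illustrative model only, not PACT's NML or tool output: the `CONFIG` dictionary, the node names and the `run` / `stream` helpers are hypothetical. The graph computes a two-term dot product, and the same graph is reused unchanged for every packet that streams through it.

```python
import operator

# Hypothetical configuration: node name -> (machine operation, input names).
# The graph is a0*b0 + a1*b1; its edges are the names listed as inputs.
CONFIG = {
    "prod0": (operator.mul, ("a0", "b0")),
    "prod1": (operator.mul, ("a1", "b1")),
    "sum": (operator.add, ("prod0", "prod1")),
}


def run(config, packet, output):
    """Push one data packet through the graph and return the output node."""
    values = dict(packet)  # input ports are pre-filled from the packet

    def evaluate(name):
        if name not in values:
            op, inputs = config[name]
            values[name] = op(*(evaluate(i) for i in inputs))
        return values[name]

    return evaluate(output)


def stream(config, packets, output):
    """The configuration stays fixed while packets stream through it."""
    return [run(config, packet, output) for packet in packets]


if __name__ == "__main__":
    packets = [
        {"a0": 1, "b0": 2, "a1": 3, "b1": 4},
        {"a0": 2, "b0": 2, "a1": 2, "b1": 2},
    ]
    print(stream(CONFIG, packets, "sum"))
```

Only the packet contents change between iterations; no opcode or connection in `CONFIG` is touched, which mirrors the "graph remains static" statement on the slide.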
XPP: Configurations – Vector * Matrix
XPP: Configuration Flow. Decoupling of data processing & configuration. Replace the von Neumann instruction stream by a configuration stream. Process streams of data instead of single machine words.
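The contrast between an instruction stream and a configuration stream can be illustrated with a small sketch. It is a loose analogy under stated assumptions, not the XPP configuration manager: each "configuration" is modeled as a hypothetical Python function over a whole data stream, and the array is "reconfigured" only between streams, never per word.

```python
from itertools import accumulate


def scale_by(factor):
    """Hypothetical configuration: multiply every word in the stream."""
    return lambda stream: (factor * x for x in stream)


def running_sum(stream):
    """Hypothetical configuration: emit the running total of the stream."""
    return accumulate(stream)


def run_configurations(config_stream, data):
    """Apply each configuration in turn to the whole data stream."""
    stream = iter(data)
    for configure in config_stream:
        # Loading a configuration happens once; the entire stream then
        # flows through it before the next configuration is loaded.
        stream = list(configure(stream))
    return stream


if __name__ == "__main__":
    print(run_configurations([scale_by(2), running_sum], [1, 2, 3]))
```

A conventional processor would fetch and decode instructions for every word; here the control overhead is paid once per configuration and amortized over the stream.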
PulseDSP™ Systolix Source : Website  www.systolix.com
PulseDSP™: A radical new approach to implementing programmable signal processing functions, from Systolix. Extraordinary performance. High level of flexibility. Very low manufacturing cost.
PulseDSP™: Architecture A large number of highly efficient processors arranged in a "systolic array"  Numerical data is pumped around the processors in the same way that blood is pumped around the body.  Massive Parallelism helps exploit the parallelism in most DSP algorithms.  Many billions of multiply accumulate calculations can be performed every second. 
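The "data pumped around the processors" picture can be illustrated with a generic textbook systolic FIR filter. This sketch is not the PulseDSP cell design, which the slides do not detail; it assumes a linear array in which each cell holds one weight, samples pass through two registers per cell, and partial sums pass through one register per cell, so that every cell performs one multiply-accumulate per clock using only its neighbour's registers.

```python
def systolic_fir(weights, samples):
    """Simulate a linear systolic array computing y[n] = sum_k w[k] * x[n-k].

    Generic textbook arrangement (an assumption, not PulseDSP's design):
    x moves through two registers per cell, partial sums through one.
    """
    k = len(weights)
    x1 = [0] * k  # first sample register in each cell
    x2 = [0] * k  # second sample register; drives the next cell
    y = [0] * k   # partial-sum register in each cell
    outputs = []
    for cycle in range(len(samples) + k - 1):
        x_feed = samples[cycle] if cycle < len(samples) else 0
        # Each cell reads only its left neighbour's registers (no broadcast).
        x_in = [x_feed] + x2[:-1]
        y_in = [0] + y[:-1]
        # All cells fire in lockstep: one multiply-accumulate each.
        y = [y_in[i] + weights[i] * x_in[i] for i in range(k)]
        x2 = x1
        x1 = x_in
        if cycle >= k - 1:  # pipeline is full; the last cell holds a result
            outputs.append(y[-1])
    return outputs


if __name__ == "__main__":
    print(systolic_fir([1, 2, 3], [1, 0, 0, 0]))
```

Feeding a unit impulse reproduces the weights at the output, and throughput stays at one result per clock regardless of the number of cells, which is where the parallelism comes from.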
PulseDSP™: Cells Instructions and data are all held locally - hence no memory bottlenecks.  Inherently synchronous Guaranteed Performance Application independent.  Simplifies and speeds development Enables the designer to make full use of the processing power available.
PulseDSP™: Tools Directly maps the original algorithm signal flow to the PulseDSP array User does not need to have any understanding of the underlying architecture.  The compiler takes full advantage of the inherent parallelism in DSP algorithms to provide true parallel processing solution.  The PulseDSP architecture doesn't just provide two or three parallel MAC units – it provides thousands, all fully utilized! 
PulseDSP™: Summary Highly Scaleable Architecture       - Multiple Data Widths, 8 to 64 bits internally       - Multiple Array Sizes, 32 to 14,000 processing elements        - Multiple I/O Ports, up to 52 channels at 200MSPS each  Sustained performance of up to 200GMAC/S for 16bit operations  Real-time signal processing up to video rates  Supports multiple independent signal data streams  Integrated control and data processing functions  Supports Linear and Non-Linear systems  Dynamically or Statically programmed  Full application development environment
Programming RAP Overview of Development Tools Source : Websites of Respective Companies
Summary of Tools
Company | Tools | Entry | Target
Elixent | D-Sign | Verilog / VHDL / Handel-C / MATLAB | D-Fabrix
Systolix | Systolix Design System (SDS) | Schematic Capture | PulseDSP cores
Triscend | FastChip Development System | VHDL / Verilog / Schematic Capture | A7S CSoC
QuickSilver | Silverware | C + temporal & spatial extensions | ACM
Pact Corp | XDS – Software Development Suite, XPP-VC – Vectorizing C-Compiler | C, NML (Native Mapping Language) | XPP Core / NML / System C
picoChip | picoTools | Signal Flow Description / Embedded C | picoArray
Cognigine | Intelligent Network Processor Software Development Environment (SDE) | Data Flow Description / C | CGN16XXX
D-Sign: Elixent [Figure: D-Sign tool flow. Block labels: Verilog / VHDL Design; De-Compiler; Intermediate Design in Verilog / VHDL / XML; NOM Generator; NOM2EOM Converter; Synthesis; Physical; RDA Generator; Optimisations; Bit-twiddling; Nibble align; Library Maker; Arch-I Adaptor; AIM Lib; ADM Lib; IDM Lib; Concorde Macros; AIM; ADM; IDM; PRO rules]
XPP:  Applications Development The XPP development suite provides program development and debugging support.  XPP-VC can perform  instruction level parallelism pipelining  automatic resource management  multi-threading
XDS – XPP Dev. Tool: Pact Corp
XPP-VC – XPP C Compiler: Pact Corp
Sum Up Discussed the issues in Architectural Choice for Embedded Algorithm Implementation Reviewed a few Reconfigurable Architectures Data Flow Paradigm – Petri Nets Systolic Arrays Configuration Streaming SIMD Looked at Tools Availability
Questions ?
Thank You

Reconfigurable Computing

  • 1.
    March 25, 2006Reconfigurable Computing Dr. Partha Pratim Das Head of Engineering, Interra Systems (India) Pvt. Ltd. Emerging Architectures for Embedded Systems
  • 2.
    In memory of… Ben Sloman, 1967 - 2002 Founder Vice President of Corporate Development & Software Engineering Elixent Ltd., UK. In 2001, Ben introduced me to the wonderful world of Reconfigurable Computing.
  • 3.
    Source & DisclaimerInformation about the Paradigms, Architectures, Applications, Tools, Capacity, Advantages & Scope of various RC Architectures have been borrowed from the respective sites of the companies. The speaker bears no responsibility for their correctness. Neither does he promote or demote any specific company on the merit or otherwise of their technology.
  • 4.
    Outline Settings theStage – Why Reconfigurable Computing? Reconfigurable Computing – Leading Companies Reconfigurable Computing – Case Study of New Computing Machines D-Fabrix – based on Reconfigurable Array Architecture ACM – based on SRGA XPP – based on Dataflow Computing PulseDSP™ – based on Systolic Computing Programming RAP – Overview of Development Tools Sum Up Questions
  • 5.
    Setting the StageWhat are Embedded Systems? & Why Reconfigurable Computing for them?
  • 6.
    What are EmbeddedSystems? Any device that includes a Computer but is not itself a General-Purpose Computer Computers as Components
  • 7.
  • 8.
    Other Examples PersonalConvenience Calculator Alarm (Talking) Clock Radio CD / MP3 Player Personal Digital Assistance (PDA) Photo Copier Personal Communication Cordless Phone Answering Machine Cell Phone Fax Public Utilities Automatic Teller Machine Electronic Voting Machine Camera Analog Digital Handy Cam Automobile Engine Brake Dash Car Stereo Aviation Television Analog TV: Channel Selection Digital TV: Decompression CAS: De-scrambling Household Appliances Microwave Oven Washing Machine Air-conditioner Surveillance Systems Burglar Alarm CCTV Metal Detector Biometric Identification System Secure ID Control & Automation Railway Signaling Steel Industry – Blast Furnace Aluminum Extraction Fire Alarm Industrial Process Control Medical Systems Pace Maker Laparoscopic Appliances Monitors – ECG, EEG, PET Computer Accessories Printer Plotter Scanner Networking NIC Cards N/w Components – HUB, Router, Switch Modem Global Positioning System Navigation Exploration
  • 9.
    Characteristics of EmbeddedSystems Real-Time Operation (always ?) Low Manufacturing Cost Low Power Universal & Market Driven Sophisticated Functionality Application Dependent Processor Restricted Memory Fault Tolerant Safe Domain Specific & Technology Driven
  • 10.
    Embedded Systems MarketSegments Source: The Death of the DSP by Nick Tredennick www.qstech.com , August 2000
  • 11.
    The zero-cost segmentTo a first approximation represents almost all of the embedded systems market The segment for which low cost is the overriding consideration. Consumer appliances that generally have minimal processing needs microwave ovens, electric razors, blenders, toasters, washing machines, … Sells in high volumes (millions of units to tens of millions of units). Characterized by intense price competition - ideal would be zero cost to implement.
  • 12.
    The zero-power segmentTo a first approximation represents a few percent of the embedded systems market The segment for which zero power dissipation represents the ideal. Consumer items that are expected to run on a single button-size battery or on weak ambient light smoke detectors, basic cellular phones, pagers, pacemakers, hearing aids, MP3 players, pocket calculators, etc. Minimum product cost remains a concern.
  • 13.
    The zero-delay segmentTo a first approximation represents a little more than zero percent of the embedded systems market The segment for which zero delay from data in to result out represents the ideal. Consumer items high-end printers, scanners, copiers, and fax machines, Processing power and throughput are important Minimum product cost is still the criteria
  • 14.
    The zero-volume segmentTo more than a first approximation, represents zero percent of the embedded systems market The segment for which the application potential is nearly zero.  production volumes and profits will also be close to zero Why Intel did design 80960MX microprocessor? The only known application was the YF-22 aircraft. Later the only prototype of the YF-22 crashed & the application volume for the ’960MX actually went to zero. Could Intel have expected to sell more than a few thousand ’960MX processors? There must be some other reason to capture the application. One motive is public relations.
  • 15.
  • 16.
  • 17.
    The Leading EdgeWedge Handheld devices digital cameras, mobile phones, GPS receivers, PDAs, etc. drive more computing into portable devices. Being consumer devices, they fall into the zero-cost segment . Having high computing requirements, they fall into the zero-delay segment . Being portable, they fall into the zero-power segment . Target: cheap, highly capable devices that give us instant answers and that work on weak ambient light. The overlap of the zero-cost, zero-delay, and zero-power segments is the leading-edge wedge .
  • 18.
  • 19.
    Approach to ComputingTASK : First we think of some task or function that we wish to perform ALGORITHM : Next we define an algorithm that describes how to perform the task MAP : Ultimately we map our algorithm into some kind of physical implementation that will execute the task
  • 20.
    Computing Models ASIC –Functionality fixed during fab Custom SoC Structured ASIC FPGA – Programmable functionality Embedded Processor Core Embedded Custom ASIC (peripherals)  P – General Purpose programmable device DSP – Specialized  P
  • 21.
    Limitations Of ASICsProlonged design cycle High NRE Algorithm is frozen into h/w – no flexibility Each function needs its own implementation in h/w – more area and more power Design based on HDLs – not good at representing algorithms
  • 22.
    Limitations of FPGAsReconfiguration is slow Reconfiguration is power consuming Inefficient use of available logic Often leads to combinatorial explosion Good for Rapid Prototyping Design based on HDLs – not good at representing algorithms
  • 23.
    Limitations Of DSP/ P Algorithm has to be artificially partitioned Constrained to meet the physical bus width Mapped on the instruction set of the target device Inefficient utilization of available resources Loss of inherent parallelism of the algorithm
  • 24.
    Summary Observations Algorithm-friendlydesign tools Low Power Re-configurability 100,000 times / sec Dynamic Dynamic RC General purpose h/w – fixed & inefficient Algorithms changeable but artificially partitioned & constrained to match h/w Pseudo-Dynamic Rigid  P / DSP Power-hungry Slow to re-configure Inefficient design tools Rigid Pseudo-Dynamic FPGA The h/w is fixed. The algorithm frozen in h/w. Design tools not good for algorithms. Rigid Rigid ASIC Remarks Algorithms Hardware Resources
  • 25.
    Search for NewMachines Should be Low Power Should be Low Area Should keep pace with evolving standards Should fit the algorithms – Implement efficiently Should be Low cost to design Should be Fast to market Dynamic algorithms implemented on dynamic resources
  • 26.
    Reconfigurable Architecture RequirementsScalable hardware-like performance Massive instruction-level parallelism Balance between compute, memory and communication Tunable number and type of resources Software-like flexibility Support multiple applications Commit function any time after silicon fabrication Fast function changes for multi-mode products Silicon efficiency Price/performance overhead must be low
  • 27.
    Alternate Nomenclature ReconfigurableComputing Configurable Computing Reconfigurable Array Architecture Self Reconfiguring Architecture Adaptive Computing Machine Reconfigurable Algorithmic Process (RAP) Dataflow Computing
  • 28.
  • 29.
    FPGA Leaders Xilinx www.xilinx.com Altera www.altera.com Actel www.actel.com Quicklogic www.quicklogic.com Lattice www.latticesemi.com In a way an FPGA may also be an RC architecture. Reconfiguration in FPGA is Slow, Power-hungry and Less flexible. In this lecture we would stay away from reviewing FPGA companies & technologies while we look for more dynamic architectural options.
  • 30.
    RC Runners Elixent www.elixent.com QuickSilver www.qstech.com Pact Corp www.pactcorp.com Systolix www.systolix.co.uk (WOS of RadioScape from Jan, 2002) Xilinx www.xilinx.com (acquired Triscend in Mar, 2004) Pico Chip www.picochip.com Let us first review these companies before taking up case studies for some of their architectures
  • 31.
    Elixent Ltd. SuppliesReconfigurable Algorithmic Processors (RAP) emerging IP business space significant competitive edge for 1st tier OEMs, Application Specific Standard Part vendors and IC integrators. Elixent technology addresses the top three customer needs: Increased functionality Reduced Time To Market Reduced Design Costs
  • 32.
    Elixent Ltd. UKbased Company Offices in UK, US & Japan Spin-off from HP Research Laboratories Investors VC Firm 3i Hewlett Packard Actel Partners Interra – HDL entry Celoxica – Handel-C entry AccelChip – MATLAB entry Others in the DSP Design Space
  • 33.
    Quick Silver TechnologyAdaptive Computing Machine (ACM) Functionality will adapt – on-the-fly – by downloading s/w applications Single device can perform various media rich apps Has its own language/design tools Silverware Strongly suggests that HDLs are not the way to go could be marketing stunt for non-availability of HDL solution
  • 34.
    QuickSilver Mobile Communication company Founded 1998 San Jose-based company Offices in San Diego, Seattle, U.K. and Japan Received 13 million in new funding in April, 2002 Investors: TechFund Capital JP Morgan Partners Portview Communication Partners Selby Venture Partners BellSouth Cellular Kyocera Corp
  • 35.
    Pact Corp Technology eXtreme Processing Platform (XPP) Set of ALU, RAM and I/O elements along with a Configuration Manager Mix of a C subset and Native Mapping Language (NML) Has a C compiler (XPP-VC) Simulation and development tools built around C and NML NML is a kind of structural language High-level components such as counters are available
  • 36.
    Pact Corp Fabless semiconductor and IP vendor Offers IP and ASSPs based on its XPP architecture Simulation and development tools available Corporate office in Germany Sales/Marketing office in San Jose Recently teamed with QuickLogic Has sound funding
  • 37.
    PulseDSP™: Systolix Highly Scalable Architecture Multiple Data Widths, 8 to 64 bits internally Multiple Array Sizes, 32 to 14,000 processing elements Multiple I/O Ports, up to 52 channels at 200 MSPS each Sustained performance of up to 200 GMAC/s for 16-bit operations Real-time signal processing up to video rates Supports multiple independent signal data streams Integrated control and data processing functions Supports Linear and Non-Linear systems Dynamically or Statically programmed Full application development environment
  • 38.
    Systolix PulseDSP Ltd A wholly owned subsidiary of RadioScape Ltd. Founded in 1998 and based in Liverpool, UK. Acquired by RadioScape in Jan 2002. Specializes in the development and commercial licensing of advanced DSP technologies and associated software tools. Introduced PulseDSP technology: a multiprocessor architecture that provides very low cost, high performance programmable DSP. PulseDSP technology is ideal for high MAC rates and rapid data throughput. Examples: digital IF, baseband signal processing, software radio solutions. Licensed the first PulseDSP technology to Analog Devices.
  • 39.
    Reconfigurable Architectures Case Study of New Computing Machines
  • 40.
    Case Study D-Fabrix Array ACM – Adaptive Computing Machine Based on SRGA – Self-Reconfigurable Gate Array Architecture XPP – eXtreme Processing Platform PulseDSP™ – Systolic Architecture
  • 41.
    D-Fabrix Array Elixent Source: DFA1000 RISC Accelerator Data Sheet & Website www.elixent.com
  • 42.
    D-Fabrix Array Massive instruction-level parallelism Regular tiled structure: low design cost, configurable, portable Rapid re-configuration
  • 43.
    D-Fabrix Array Components are: 4-bit ALUs, registers and the "switchbox".
  • 44.
    Basic D-Fabrix Array Element [Diagram: ALU with 4-bit inputs A and B, a 4-bit INSTR input, a 1-bit C IN / CONTROL input, a 1-bit C OUT output and a 4-bit output F feeding a register (REG)] Typical instructions: A + B, A - B, A & B, A | B, A xor B, not A, not B, A == B, A > B, Cin ? A:B Output register options: Transparent; Reset: 0000, 1111; Clocked: always, when enabled, never
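The instruction list on this slide can be sketched as a small behavioural model. This is only an illustration, assuming that comparison results appear on the carry-out line and that subtraction is done in two's complement; the function name `alu4` and the instruction strings are hypothetical and are not taken from the D-Fabrix data sheet.

```python
MASK = 0xF  # the ALU datapath is 4 bits wide


def alu4(instr, a, b, cin=0):
    """Return (f, cout) for one operation of a hypothetical 4-bit ALU."""
    a &= MASK
    b &= MASK
    cin &= 1
    if instr == "A+B":
        total = a + b + cin
        return total & MASK, total >> 4
    if instr == "A-B":
        # Two's-complement subtract: A + not(B) + 1; cout = 1 means no borrow.
        total = a + (~b & MASK) + 1
        return total & MASK, total >> 4
    if instr == "A&B":
        return a & b, 0
    if instr == "A|B":
        return a | b, 0
    if instr == "AxorB":
        return a ^ b, 0
    if instr == "notA":
        return ~a & MASK, 0
    if instr == "notB":
        return ~b & MASK, 0
    if instr == "A==B":
        return 0, int(a == b)  # assumption: flag is reported on cout
    if instr == "A>B":
        return 0, int(a > b)   # assumption: flag is reported on cout
    if instr == "Cin?A:B":
        return (a if cin else b), 0
    raise ValueError(f"unknown instruction: {instr}")
```

For example, `alu4("A+B", 9, 8)` overflows the 4-bit range, so the low nibble is returned together with a carry-out of 1.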
  • 45.
    D-Fabrix Array: Tile Combine two of each component (ALU, register and switchbox) into the "tile".
  • 46.
    D-Fabrix Array: Tiling & Memory Combine hundreds or thousands of tiles to create the D-Fabrix array. Memory is distributed to give fast, local storage with massive bandwidth.
  • 47.
    D-Fabrix: Routing 16 4-bit busses cross each ALU horizontally and vertically for short and long connections 4-bit connections are made by setting a configuration bit to ‘1’ Each ALU connects to 8 others via just one switch delay Under 128 configuration bits per ALU+switchbox => Can configure 512 ALUs (64 kbits) in ~20 µs
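The configuration figures on this slide can be checked with a little arithmetic. The sketch below assumes the upper bound of 128 configuration bits per ALU+switchbox and a reload time of roughly 20 microseconds, in line with the "in microseconds" claim on the Virtual Hardware slide; the function name and the derived bandwidth figure are illustrative and do not come from Elixent.

```python
def config_load_estimate(alus=512, bits_per_alu=128, load_time_s=20e-6):
    """Estimate configuration size and the bandwidth needed to load it."""
    total_bits = alus * bits_per_alu
    return {
        "total_bits": total_bits,
        "kbits": total_bits / 1024,                     # 1 kbit taken as 1024 bits
        "gbit_per_s": total_bits / load_time_s / 1e9,   # implied load bandwidth
    }


estimate = config_load_estimate()
print(estimate)
```

With these assumptions the 512-ALU array needs 65,536 bits (64 kbits), which implies a configuration path of a little over 3 Gbit/s to finish in 20 microseconds.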
  • 48.
    D-Fabrix Array: Virtual Hardware Once the math units are in place, the switchboxes link them together. They are part of a rich interconnect, providing both local and global connectivity. The algorithm is implemented in "Virtual Hardware": it is processed on a hardware implementation, but that implementation is software. A new set of "virtual hardware" can be loaded at any time. In microseconds you can switch to the next hardware configuration. Some applications have one configuration, and only alter it when the standards change or the specification creeps. Others re-use the silicon dynamically, switching modes, or even "folding" algorithms to use smaller arrays.
  • 49.
    DFA 1000: D-Fabrix based RISC Accelerator
  • 50.
    DFA 1000: D-Fabrix based RISC Accelerator Pre-configured D-Fabrix implementation for accelerating RISC systems A peripheral set to facilitate its integration into SoC designs High-speed data interfaces to the D-Fabrix core array: low latency and no overhead on the system bus. The AMBA bus interface: programming the array, transferring data to and from the host RISC; a much lower-bandwidth control and configuration path. Local high-speed RAMs, directly accessible by the array or by the RISC. The D-Fabrix array itself.
  • 51.
    DFA 1000: Advantages Over FPGAs: An order of magnitude better in die size for a given performance Retains homogeneous architecture – the whole array can be utilized for any given task Over DSPs: An order of magnitude improvement in most algorithms, simply by matching the computing to the algorithm Over RISCs: RISC architectures are not typically optimized for the dataflow algorithms common in DSP and media processing
  • 52.
    DFA 1000: Applications & Benchmarks
    Application | Benchmark
    JPEG Encoder | 200 Mpixels/sec (two macro-blocks in parallel)
    8x8 DCT | 400 Mpixels/sec (four 8x8 DCTs in parallel)
    Floyd-Steinberg Color Dither | 400 Mpixels/sec (16 image lines in parallel)
    UMTS Viterbi | ~1024 voice channels
    5th Order CIC Filter | 400 Msamples/sec
  • 53.
    DFA 1000: Imaging Application One port captures data, the second displays it, and the AHB is used for control
  • 54.
    DFA 1000: Software Defined Radio Application I/O ports are used to transfer data to the antenna; the AHB is used for data output to the host RISC.
  • 55.
    ACM – Adaptive Computing Machine Based on SRGA – Self-Reconfigurable Gate Array Architecture QuickSilver Source: A Self-Reconfigurable Gate Array Architecture, Reetinder Sidhu et al., 10th International Workshop on Field Programmable Logic and Applications, August 2000. A look into QuickSilver's ACM architecture, Paul Master, CTO QuickSilver, EE Times, September 12, 2002 (4:39 p.m. EST). Website www.qstech.com
  • 56.
    Self Reconfigurable Gate Array Architecture Logic adapts itself as the computation proceeds, based on input and intermediate results Device needs to store multiple configuration contexts and context switch between them Self-reconfiguration is done by modifying the configuration memory
  • 57.
    SRGA Architecture A Self Reconfigurable device is characterized by the following features: Fast Context Switching Fast Random Access of Configuration Memory An efficient architecture should allow single-cycle context switching as well as single-cycle random memory access
  • 58.
    SRGA Architecture Consists of a rectangular gate array of PEs Each PE consists of a logic cell and a memory block Logic cell contains a LUT and a flip-flop Each PE is connected to neighboring PEs and switches
  • 59.
    SRGA Architecture A configuration context contains the bits that configure all the logic cells and switches in the mesh-of-trees network Configuration context is stored in memory blocks Memory access operation transfers data between rows and columns of PEs Each memory block is implemented as a random access memory that can read/write a single bit every clock cycle
  • 60.
    SRGA: Context Switch Operation For a context switch to occur, some logic in the currently active context needs to write the address of the context to switch to into a specified memory In each memory block, the configuration bits are loaded in the first half of the next clock cycle During the second half of the next clock cycle the current context is saved
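The context switch sequence above can be sketched as a toy software model. This is a minimal illustration, not the SRGA hardware: the class name `SRGASketch`, the representation of a context as a list of bits and the single `clock()` step standing for both half-cycles are all assumptions made for the example.

```python
class SRGASketch:
    """Toy model of a device holding several configuration contexts."""

    def __init__(self, contexts):
        self.memory = [list(c) for c in contexts]  # stored configuration contexts
        self.active_index = 0
        self.active = list(self.memory[0])         # bits currently driving the logic
        self.switch_request = None                 # address written by active logic

    def request_switch(self, target):
        """Active logic writes the address of the context to switch to."""
        if not 0 <= target < len(self.memory):
            raise IndexError("no such context")
        self.switch_request = target

    def clock(self):
        """One clock cycle: load the requested context, then save the old one."""
        if self.switch_request is None:
            return
        target = self.switch_request
        incoming = list(self.memory[target])                # first half: load new bits
        self.memory[self.active_index] = list(self.active)  # second half: save current
        self.active, self.active_index = incoming, target
        self.switch_request = None
```

Because the outgoing context is saved on the switch, any bits the logic modified while it was active (self-reconfiguration) are still there when the device later switches back.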
  • 61.
    ACM: Adaptive Computing Machine An SRGA-based Heterogeneous Architecture Five types of nodes: Arithmetic – different, variable-width, linear arithmetic functions like a FIR filter, a DCT, an FFT Bit manipulation – different, variable-width bit-manipulation functions like LFSR, Walsh code generator, Gold code generator, TCP/IP packet discriminator Finite state machine Scalar – execute legacy code Configurable input/output – I/O in the form of a UART or bus interfaces such as PCI, USB, FireWire and other I/O-intensive actions
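As an illustration of the kind of function a bit-manipulation node might host, here is a minimal sketch of a 4-bit linear feedback shift register (LFSR). The register width, the tap positions and the function name are assumptions chosen for the example; they do not describe QuickSilver's implementation.

```python
def lfsr4_sequence(seed=0b0001, steps=15):
    """Yield successive states of a 4-bit right-shifting Fibonacci LFSR."""
    if not 0 < seed < 16:
        raise ValueError("seed must be a non-zero 4-bit value")
    state = seed
    for _ in range(steps):
        feedback = (state ^ (state >> 1)) & 1   # XOR of the two lowest bits
        state = (state >> 1) | (feedback << 3)  # shift right, feed back into bit 3
        yield state


states = list(lfsr4_sequence())
print([format(s, "04b") for s in states])
```

With these taps the register cycles through all 15 non-zero states before returning to the seed, which is the longest sequence a 4-bit LFSR can produce.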
  • 62.
  • 63.
    ACM: Advantages Any node can be adapted to perform a new function, clock cycle by clock cycle. Rather than passing data from function to function, the data can remain resident in a node while the function of the node changes on a clock cycle-by-clock cycle basis. Adaptable hundreds of thousands of times a second Only the portions of an algorithm that are actually being executed need to be resident in the chip at any one time. Heavy silicon reuse Tremendous reductions in silicon area & power consumption.
  • 64.
    XPP: eXtreme Processing Platform Pact Corp Source: The XPP White Paper www.pactcorp.com
  • 65.
    XPP: eXtreme Processing Platform The XPP idea consists of: Data stream processing Configurable ALUs communicating via a packet-oriented, automatically synchronized communication network User-transparent configuration management Works in the Dataflow Computing Paradigm
  • 66.
  • 67.
    XPP: Configurations Configurations are basic parallel calculation modules which are derived from a data flow graph of an algorithm. Nodes of the data flow graph are mapped to the fundamental machine operations such as multiplication, addition etc. The graph's edges are the connections between the nodes. As long as data packets stream through a single configuration, the graph remains static - no opcodes and connections are changed.
  • 68.
    XPP: Configurations – Vector * Matrix
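The vector-times-matrix configuration can be sketched in software as a fixed graph of multiply and add nodes through which data packets stream. This is a hedged illustration of the dataflow idea only: the function name and the use of a Python generator for the packet stream are assumptions, and the real XPP maps the nodes onto hardware ALUs rather than executing them in sequence.

```python
def vector_times_matrix(vectors, matrix):
    """Stream row vectors through a static multiply-add graph, yielding v * M."""
    rows, cols = len(matrix), len(matrix[0])
    for v in vectors:  # each packet flows through the same, unchanged graph
        if len(v) != rows:
            raise ValueError("vector length must match the number of matrix rows")
        # One multiply node per matrix element.
        products = [[v[i] * matrix[i][j] for j in range(cols)] for i in range(rows)]
        # One adder chain per output column.
        yield [sum(products[i][j] for i in range(rows)) for j in range(cols)]


stream = [[1, 1], [2, 0]]
print(list(vector_times_matrix(stream, [[1, 2], [3, 4]])))
```

The matrix plays the role of the configuration: it is set once, and only the vectors change from packet to packet.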
  • 69.
    XPP: Configuration Flow Decoupling of data processing & configuration Replace Von Neumann instruction stream by a configuration stream Process streams of data instead of single machine words.
  • 70.
    PulseDSP™ Systolix Source: Website www.systolix.com
  • 71.
    PulseDSP™ A radical new approach to implementing programmable signal processing functions, developed by Systolix. Extraordinary performance High level of flexibility Very low manufacturing cost.
  • 72.
    PulseDSP™: Architecture A large number of highly efficient processors arranged in a "systolic array" Numerical data is pumped around the processors in the same way that blood is pumped around the body. Massive parallelism helps exploit the parallelism in most DSP algorithms. Many billions of multiply-accumulate calculations can be performed every second.
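To make the "pumping" idea concrete, here is a minimal sketch of a textbook systolic FIR filter: each cell holds one fixed weight, input samples move right through two registers per cell, and partial sums move right through one register per cell. It is a generic systolic design chosen for illustration, not the PulseDSP cell architecture; the function name and register layout are assumptions.

```python
def systolic_fir(samples, weights):
    """Compute y[n] = sum_j weights[j] * x[n - j] with a simulated systolic array."""
    k = len(weights)
    x_line = [0] * (2 * k - 1)  # x delay line: cell c sees x_line[2 * c]
    y_regs = [0] * k            # registered partial sum at the output of each cell
    outputs = []
    # Feed k - 1 extra zeros so the last results drain out of the pipeline.
    for x in list(samples) + [0] * (k - 1):
        x_line = [x] + x_line[:-1]           # pump the x stream one step to the right
        new_y = []
        for cell in range(k):                # all cells work in the same clock cycle
            y_in = y_regs[cell - 1] if cell > 0 else 0
            new_y.append(y_in + weights[cell] * x_line[2 * cell])
        y_regs = new_y
        outputs.append(y_regs[-1])
    return outputs[k - 1:]                   # drop the k - 1 cycles of pipeline latency


print(systolic_fir([1, 0, 0, 0], [1, 2, 3]))
```

Every cell performs one multiply-accumulate per clock using only its own weight and its neighbours' registers, which is why the throughput scales with the number of cells.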
  • 73.
    PulseDSP™: Cells Instructions and data are all held locally - hence no memory bottlenecks. Inherently synchronous Guaranteed performance Application independent. Simplifies and speeds development Enables the designer to make full use of the processing power available.
  • 74.
    PulseDSP™: Tools Directly maps the original algorithm signal flow to the PulseDSP array User does not need to have any understanding of the underlying architecture. The compiler takes full advantage of the inherent parallelism in DSP algorithms to provide a true parallel processing solution. The PulseDSP architecture doesn't just provide two or three parallel MAC units – it provides thousands, all fully utilized!
  • 75.
    PulseDSP™: Summary Highly Scalable Architecture - Multiple Data Widths, 8 to 64 bits internally - Multiple Array Sizes, 32 to 14,000 processing elements - Multiple I/O Ports, up to 52 channels at 200 MSPS each Sustained performance of up to 200 GMAC/s for 16-bit operations Real-time signal processing up to video rates Supports multiple independent signal data streams Integrated control and data processing functions Supports Linear and Non-Linear systems Dynamically or Statically programmed Full application development environment
  • 76.
    Programming RAP Overview of Development Tools Source: Websites of Respective Companies
  • 77.
    Summary of Tools
    Target | Entry | Tools | Company
    CGN16XXX | Data Flow Description / C | Intelligent Network Processor Software Development Environment (SDE) | Cognigine
    picoArray | Signal Flow Description / Embedded C | picoTools | picoChip
    XPP Core / NML / System C | C, NML (Native Mapping Language) | XDS – Software Development Suite, XPP-VC – Vectorizing C-Compiler | Pact Corp
    ACM | C + temporal & spatial extensions | Silverware | QuickSilver
    A7S CSoC | VHDL / Verilog / Schematic Capture | FastChip Development System | Triscend
    PulseDSP cores | Schematic Capture | Systolix Design System (SDS) | Systolix
    D-Fabrix | Verilog / VHDL / Handel-C / MATLAB | D-Sign | Elixent
  • 78.
    D-Sign: Elixent [Tool-flow diagram; block labels: NOM Generator, NOM2EOM Converter, Synthesis, Physical, RDA Generator, AIM Lib, ADM Lib, Concorde Macros, Optimisations, Verilog / VHDL Design, De-Compiler, Intermediate Design in Verilog / VHDL / XML, IDM Lib, Library Maker, Arch-I Adaptor, Bit-twiddling, Nibble align, AIM, ADM, IDM, PRO rules]
  • 79.
    XPP: Applications Development The XPP development suite provides program development and debugging support. XPP-VC can perform: instruction-level parallelism, pipelining, automatic resource management, multi-threading
  • 80.
    XDS – XPP Dev. Tool: Pact Corp
  • 81.
    XPP-VC – XPP C Compiler: Pact Corp
  • 82.
    Sum Up Discussed the issues in Architectural Choice for Embedded Algorithm Implementation Reviewed a few Reconfigurable Architectures Data Flow Paradigm – Petri Nets Systolic Arrays Configuration Streaming SIMD Looked at Tools Availability
  • 83.
  • 84.

Editor's Notes

  • #31 [20060325] Changed: Systolix www.systolix.co.uk RadioScape (www.radioscape.com) acquired Systolix: http://www.electronicstalk.com/news/rab/rab109.html [20051023] Dropped: Triscend www.triscend.com It was acquired by Xilinx in Mar, 2004. Refer: http://www.xilinx.com/prs_rls/xil_corp/0435_triscend_acquisition.htm Xilinx’s solutions in the reconfigurable space are given in: http://www.xilinx.com/products/design_resources/config_sol/grouping/fpga_config.htm [20051024] Dropped: Cognigine www.cognigine.com Cannot find this website on the Internet. There are many references to their work, but all date to the early 2000s only.
  • #39 RadioScape uses Systolix's DSP expertise to expand its licensable intellectual property portfolio for Layer-1 wireless baseband development. http://www.electronicstalk.com/news/rab/rab109.html
  • #47 With this platform in place, algorithms are mapped onto the array. This is done by drawing the signal flow across the array, or by describing it in an HDL such as Verilog or in a higher-level language like Handel-C or MATLAB. Need an 8-bit adder? Use two ALUs. 32-bit adder? 8 ALUs. Perhaps an Add/Compare/Select (ACS) unit? Again, just a few ALUs.
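The ALU counts in this note follow from chaining 4-bit slices through their carry lines. The sketch below shows that idea as a ripple-carry adder; the function names and the software representation are hypothetical and only illustrate why an 8-bit add takes two ALUs and a 32-bit add takes eight.

```python
def add4(a, b, cin):
    """One 4-bit ALU slice performing A + B + Cin; returns (sum_nibble, cout)."""
    total = (a & 0xF) + (b & 0xF) + (cin & 1)
    return total & 0xF, total >> 4


def wide_add(a, b, width):
    """Add two width-bit values with width // 4 chained slices.

    Returns (sum, carry_out, alus_used).
    """
    if width <= 0 or width % 4:
        raise ValueError("width must be a positive multiple of 4")
    slices = width // 4
    result, carry = 0, 0
    for i in range(slices):
        nibble_a = (a >> (4 * i)) & 0xF
        nibble_b = (b >> (4 * i)) & 0xF
        nibble_sum, carry = add4(nibble_a, nibble_b, carry)  # carry ripples upward
        result |= nibble_sum << (4 * i)
    return result, carry, slices


print(wide_add(200, 100, 8))
print(wide_add(0xFFFFFFFF, 1, 32))
```

In the array the carry-out of one ALU would feed the carry-in of its neighbour, so the chain in the loop corresponds to a row of adjacent ALUs.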