Reconfigurable Computing

March 25, 2006 Reconfigurable Computing Dr. Partha Pratim Das Head of Engineering, Interra Systems (India) Pvt. Ltd. Emerging Architectures for Embedded Systems

In memory of … Ben Sloman, 1967 - 2002 Founder Vice President of Corporate Development & Software Engineering Elixent Ltd., UK. In 2001, Ben introduced me to the wonderful world of Reconfigurable Computing.

Source & Disclaimer Information about the Paradigms, Architectures, Applications, Tools, Capacity, Advantages & Scope of various RC Architectures has been borrowed from the respective sites of the companies. The speaker bears no responsibility for its correctness. Neither does he promote or demote any specific company on the merit or otherwise of its technology.

Outline
Setting the Stage – Why Reconfigurable Computing?
Reconfigurable Computing – Leading Companies
Reconfigurable Computing – Case Study of New Computing Machines
  D-Fabrix – based on Reconfigurable Array Architecture
  ACM – based on SRGA
  XPP – based on Dataflow Computing
  PulseDSP™ – based on Systolic Computing
Programming RAP – Overview of Development Tools
Sum Up
Questions

Setting the Stage What are Embedded Systems? & Why Reconfigurable Computing for them?

What are Embedded Systems? Any device that includes a Computer but is not itself a General-Purpose Computer Computers as Components

Other Examples
Personal Convenience: Calculator, Alarm (Talking) Clock, Radio, CD / MP3 Player, Personal Digital Assistant (PDA), Photo Copier
Personal Communication: Cordless Phone, Answering Machine, Cell Phone, Fax
Public Utilities: Automatic Teller Machine, Electronic Voting Machine
Camera: Analog, Digital, Handy Cam
Automobile: Engine, Brake, Dash, Car Stereo
Aviation
Television: Analog TV (Channel Selection), Digital TV (Decompression), CAS (De-scrambling)
Household Appliances: Microwave Oven, Washing Machine, Air-conditioner
Surveillance Systems: Burglar Alarm, CCTV, Metal Detector, Biometric Identification System, Secure ID
Control & Automation: Railway Signaling, Steel Industry – Blast Furnace, Aluminum Extraction, Fire Alarm, Industrial Process Control
Medical Systems: Pace Maker, Laparoscopic Appliances, Monitors – ECG, EEG, PET
Computer Accessories: Printer, Plotter, Scanner
Networking: NIC Cards, N/w Components – HUB, Router, Switch, Modem
Global Positioning System: Navigation, Exploration

Characteristics of Embedded Systems
Universal & Market Driven: Real-Time Operation (always?), Low Manufacturing Cost, Low Power
Domain Specific & Technology Driven: Sophisticated Functionality, Application Dependent Processor, Restricted Memory, Fault Tolerant, Safe

Embedded Systems Market Segments Source: The Death of the DSP by Nick Tredennick www.qstech.com , August 2000

The zero-cost segment
To a first approximation, represents almost all of the embedded systems market.
The segment for which low cost is the overriding consideration.
Consumer appliances that generally have minimal processing needs: microwave ovens, electric razors, blenders, toasters, washing machines, …
Sells in high volumes (millions to tens of millions of units).
Characterized by intense price competition; the ideal would be zero cost to implement.

The zero-power segment
To a first approximation, represents a few percent of the embedded systems market.
The segment for which zero power dissipation represents the ideal.
Consumer items that are expected to run on a single button-size battery or on weak ambient light: smoke detectors, basic cellular phones, pagers, pacemakers, hearing aids, MP3 players, pocket calculators, etc.
Minimum product cost remains a concern.

The zero-delay segment
To a first approximation, represents a little more than zero percent of the embedded systems market.
The segment for which zero delay from data in to result out represents the ideal.
Consumer items: high-end printers, scanners, copiers, and fax machines.
Processing power and throughput are important.
Minimum product cost is still a criterion.

The zero-volume segment
To more than a first approximation, represents zero percent of the embedded systems market.
The segment for which the application potential is nearly zero, so production volumes and profits will also be close to zero.
Why did Intel design the 80960MX microprocessor? The only known application was the YF-22 aircraft. Later the only prototype of the YF-22 crashed and the application volume for the ’960MX actually went to zero.
Could Intel have expected to sell more than a few thousand ’960MX processors? There must be some other reason to capture the application. One motive is public relations.

The Leading Edge Wedge
Handheld devices (digital cameras, mobile phones, GPS receivers, PDAs, etc.) drive more computing into portable devices.
Being consumer devices, they fall into the zero-cost segment.
Having high computing requirements, they fall into the zero-delay segment.
Being portable, they fall into the zero-power segment.
Target: cheap, highly capable devices that give us instant answers and that work on weak ambient light.
The overlap of the zero-cost, zero-delay, and zero-power segments is the leading-edge wedge.

Approach to Computing TASK : First we think of some task or function that we wish to perform ALGORITHM : Next we define an algorithm that describes how to perform the task MAP : Ultimately we map our algorithm into some kind of physical implementation that will execute the task

Computing Models
ASIC – Functionality fixed during fab
  Custom SoC
  Structured ASIC
FPGA – Programmable functionality
  Embedded Processor Core
  Embedded Custom ASIC (peripherals)
µP – General Purpose programmable device
DSP – Specialized µP

Limitations Of ASICs Prolonged design cycle High NRE Algorithm is frozen into h/w – no flexibility Each function needs its own implementation in h/w – more area and more power Design based on HDLs – not good at representing algorithms

Limitations of FPGAs Reconfiguration is slow Reconfiguration is power consuming Inefficient use of available logic Often leads to combinatorial explosion Good for Rapid Prototyping Design based on HDLs – not good at representing algorithms

Limitations Of DSP/µP
Algorithm has to be artificially partitioned
Constrained to meet the physical bus width
Mapped on the instruction set of the target device
Inefficient utilization of available resources
Loss of inherent parallelism of the algorithm

Summary Observations
Computing Model | Hardware Resources | Algorithms | Remarks
ASIC | Rigid | Rigid | The h/w is fixed. The algorithm is frozen in h/w. Design tools not good for algorithms.
FPGA | Pseudo-Dynamic | Rigid | Power-hungry. Slow to re-configure. Inefficient design tools.
µP / DSP | Rigid | Pseudo-Dynamic | General purpose h/w – fixed & inefficient. Algorithms changeable but artificially partitioned & constrained to match h/w.
RC | Dynamic | Dynamic | Algorithm-friendly design tools. Low power. Re-configurability 100,000 times / sec.

Search for New Machines Should be Low Power Should be Low Area Should keep pace with evolving standards Should fit the algorithms – Implement efficiently Should be Low cost to design Should be Fast to market Dynamic algorithms implemented on dynamic resources

Reconfigurable Architecture Requirements Scalable hardware-like performance Massive instruction-level parallelism Balance between compute, memory and communication Tunable number and type of resources Software-like flexibility Support multiple applications Commit function any time after silicon fabrication Fast function changes for multi-mode products Silicon efficiency Price/performance overhead must be low

Alternate Nomenclature Reconfigurable Computing Configurable Computing Reconfigurable Array Architecture Self Reconfiguring Architecture Adaptive Computing Machine Reconfigurable Algorithmic Process (RAP) Dataflow Computing

Reconfigurable Computing Leading Companies

FPGA Leaders
Xilinx www.xilinx.com
Altera www.altera.com
Actel www.actel.com
Quicklogic www.quicklogic.com
Lattice www.latticesemi.com
In a way, an FPGA may also be regarded as an RC architecture. However, reconfiguration in an FPGA is slow, power-hungry and less flexible. In this lecture we will stay away from reviewing FPGA companies & technologies and look instead for more dynamic architectural options.

RC Runners Elixent www.elixent.com QuickSilver www.qstech.com Pact Corp www.pactcorp.com Systolix www.systolix.co.uk (WOS of RadioScape from Jan, 2002) Xilinx www.xilinx.com (acquired Triscend in Mar, 2004) Pico Chip www.picochip.com Let us first review these companies before taking up case studies for some of their architectures

Elixent Ltd.
Supplies Reconfigurable Algorithmic Processors (RAP), an emerging IP business space.
Offers a significant competitive edge for 1st tier OEMs, Application Specific Standard Part vendors and IC integrators.
Elixent technology addresses the top three customer needs: Increased functionality, Reduced Time To Market, Reduced Design Costs.

Elixent Ltd. UK based Company Offices in UK, US & Japan Spin-off from HP Research Laboratories Investors VC Firm 3i Hewlett Packard Actel Partners Interra – HDL entry Celoxica – Handel-C entry AccelChip – MATLAB entry Others in the DSP Design Space

Quick Silver Technology
Adaptive Computing Machine (ACM)
Functionality adapts on the fly by downloading s/w applications
A single device can perform various media-rich apps
Has its own language/design tools: Silverware
Strongly suggests that HDLs are not the way to go (this could be a marketing stance covering for the non-availability of an HDL solution)

Quick Silver
Mobile Communication company
Founded 1998
San Jose based company
Offices in San Diego, Seattle, U.K. and Japan
Got 13 Million in new funding in April, 2002
Investors: TechFund Capital, JP Morgan Partners, Portview Communication Partners, Selby Venture Partners, BellSouth Cellular, Kyocera Corp

Pact Corp Technology
eXtreme Processing Platform (XPP)
Set of ALU, RAM and I/O elements along with a Configuration manager
Mix of a C subset and Native Mapping Language (NML)
Has a C compiler (XPP-VC)
Simulation and development tools built around C and NML
NML is a sort of structural language
High-level components such as counters are available

Pact Corp Fabless semiconductor and IP vendor Offers IP and ASSPs based on its XPP architecture Simulation and Development tools available Corporate office in Germany Sales/Marketing office in San Jose Recently teamed with Quicklogic Have sound funding

PulseDSP™: Systolix Highly Scaleable Architecture Multiple Data Widths, 8 to 64 bits internally Multiple Array Sizes, 32 to 14,000 processing elements Multiple I/O Ports, up to 52 channels at 200MSPS each Sustained performance of up to 200GMAC/S for 16bit operations Real-time signal processing up to video rates Supports multiple independent signal data streams Integrated control and data processing functions Supports Linear and Non-Linear systems Dynamically or Statically programmed Full application development environment

Systolix PulseDSP Ltd A wholly owned subsidiary of RadioScape Ltd . Founded in 1998 and is based in Liverpool, UK. Acquired by RadioScape in Jan 2002. Specializes in the development and commercial licensing of advanced DSP technologies and associated software tools. Introduced PulseDSP technology A multiprocessor architecture that provides very low cost, high performance programmable DSP. PulseDSP technology is ideal for high MAC rates and rapid data throughput Example – digital IF, baseband signal processing, software radio solutions. Licensed the first PulseDSP technology to Analog Devices.

Reconfigurable Architectures Case Study of New Computing Machines

Case Study D-Fabrix Array ACM – Adaptive Computing Machine Based on SRGA – Self-Reconfigurable Gate Array Architecture XPP – eXtreme Processing Platform PulseDSP™ – Systolic Architecture

D-Fabrix Array Elixent Source : DFA1000 RISC Accelerator Data Sheet & Website www.elixent.com

D-Fabrix Array Massive instruction-level-parallelism Regular tiled structure: low design cost, configurable, portable Rapid re-configuration

D-Fabrix Array Components are: 4-bit ALUs, registers and the "switchbox".

Basic D-Fabrix Array Element
[Figure: ALU followed by an output register (REG). Inputs: A (4 bits), B (4 bits), INSTR (4 bits), C IN / CONTROL (1 bit). Outputs: F (4 bits), C OUT (1 bit).]
Typical instructions: A + B, A - B, A & B, A | B, A xor B, Cin ? A : B, A == B, A > B, not A, not B
Output register options: Transparent; Reset: 0000, 1111; Clocked: always, when enabled, never
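The element described on this slide can be modelled in a few lines. The following is only a behavioural sketch, not Elixent's implementation: the instruction names, the choice to report the A == B and A > B results on the carry output, and the helper `add8` are all assumptions made for illustration. It also shows how two 4-bit elements could be chained through the carry to build wider arithmetic.

```python
MASK = 0xF  # 4-bit datapath, as on the slide


def alu4(instr, a, b, cin=0):
    """Model one 4-bit ALU step; returns (f, cout). Instruction names are hypothetical."""
    a &= MASK
    b &= MASK
    if instr == "add":                      # A + B (+ carry in)
        total = a + b + (cin & 1)
        return total & MASK, total >> 4
    if instr == "sub":                      # A - B computed as A + ~B + 1
        total = a + (~b & MASK) + 1
        return total & MASK, total >> 4     # cout = 1 means "no borrow"
    if instr == "mux":                      # Cin ? A : B
        return (a if cin else b), 0
    if instr == "eq":                       # A == B, flag on cout (assumption)
        return 0, int(a == b)
    if instr == "gt":                       # A > B, flag on cout (assumption)
        return 0, int(a > b)
    bitwise = {
        "and": a & b,
        "or": a | b,
        "xor": a ^ b,
        "not_a": ~a & MASK,
        "not_b": ~b & MASK,
    }
    return bitwise[instr], 0


def add8(a, b):
    """Chain two 4-bit ALUs through the carry to add two 8-bit values."""
    lo, carry = alu4("add", a & MASK, b & MASK)
    hi, cout = alu4("add", a >> 4, b >> 4, carry)
    return (hi << 4) | lo, cout


print(add8(0x3C, 0x5A))  # low nibble carries into the high nibble
```

Building wide operators from nibble-wide elements is one plausible reading of why the array uses a 4-bit granularity; the slide itself does not state the rationale.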

D-Fabrix Array: Tile Combine two of each into the "tile".

D-Fabrix Array: Tiling & Memory Combine 100’s or 1000’s of tiles to create the D-Fabrix array. Memory is distributed to give fast, local storage with massive bandwidth.

D-Fabrix: Routing
16 4-bit busses cross each ALU horizontally and vertically for short and long connections.
4-bit connections are made by setting a configuration bit to ‘1’.
Each ALU connects to 8 others via just one switch delay.
Under 128 configuration bits per ALU+switchbox => can configure 512 ALUs (64 kbits) in ~20 µs.
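The reconfiguration figure on this slide can be checked with simple arithmetic: 512 ALUs at 128 bits each is 65,536 bits, i.e. 64 kbits, and loading that in about 20 µs implies a configuration path of roughly 3.3 Gbit/s and at most about 50,000 full-array reloads per second. The sketch below just reproduces that calculation; the function names are illustrative, and since the slide says "under 128" bits, the results are upper bounds.

```python
def config_bits(num_alus, bits_per_alu=128):
    """Upper bound on configuration bits for an array of ALU+switchbox pairs."""
    return num_alus * bits_per_alu


def config_bandwidth_bps(num_alus, load_time_s, bits_per_alu=128):
    """Configuration bandwidth implied by loading the whole array in load_time_s."""
    return config_bits(num_alus, bits_per_alu) / load_time_s


def max_reloads_per_second(load_time_s):
    """Full-array reloads per second if the array did nothing but reconfigure."""
    return 1.0 / load_time_s


bits = config_bits(512)
print(bits, "bits =", bits // 1024, "kbits")
print(round(config_bandwidth_bps(512, 20e-6) / 1e9, 2), "Gbit/s configuration bandwidth")
print(round(max_reloads_per_second(20e-6)), "full reloads per second at most")
```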

D-Fabrix Array: Virtual Hardware
Once the math units are in place, the switchboxes link them together. They are part of a rich interconnect, providing both local and global connectivity.
The algorithm is implemented in "Virtual Hardware": it is processed on a hardware implementation, but it is software. A new set of "virtual hardware" can be loaded at any time. In microseconds you can switch to the next hardware configuration.
Some applications have one configuration, and only alter it when the standards change or the specification creeps. Others re-use the silicon dynamically, switching modes, or even "folding" algorithms to use smaller arrays.

DFA 1000: D-Fabrix based RISC Accelerator

DFA 1000: D-Fabrix based RISC Accelerator
Pre-configured D-Fabrix implementation for accelerating RISC systems, with a peripheral set to facilitate its integration into SoC designs:
High-speed data interfaces to the D-Fabrix core array: low latency and no overhead on the system bus.
The AMBA bus interface: used for programming the array and for transferring data to and from the host RISC; a much lower bandwidth control and configuration path.
Local high-speed RAMs, directly accessible by the array or by the RISC.
The D-Fabrix array itself.

DFA 1000: Advantages
Over FPGAs: An order of magnitude better in die size for a given performance. Retains homogeneous architecture – the whole array can be utilized for any given task.
Over DSPs: An order of magnitude improvement in most algorithms, simply by matching the computing to the algorithm.
Over RISCs: RISC architectures are not typically optimized for the dataflow algorithms common in DSP and media processing.

DFA 1000: Applications & Benchmarks
Application | Performance
JPEG Encoder | 200 Mpixels/sec (two macro-blocks in parallel)
8x8 DCT | 400 Mpixels/sec (four 8x8 DCTs in parallel)
Floyd-Steinberg Color Dither | 400 Mpixels/sec (16 image lines in parallel)
UMTS Viterbi | ~1024 voice channels
5th Order CIC Filter | 400 Msamples/sec

DFA 1000: Imaging Application one port captures data, the second displays it, the AHB is used for control

DFA 1000: Software Defined Radio Application I/O ports used to transfer data to the antenna AHB used for data output to the host RISC.

ACM – Adaptive Computing Machine Based on SRGA – Self-Reconfigurable Gate Array Architecture QuickSilver Source: A Self-Reconfigurable Gate Array Architecture , Reetinder Sidhu et al, 10th International Workshop on Field Programmable Logic and Applications, August 2000 . A look into QuickSilver's ACM architecture, Paul Master, CTO QuickSilver, EE Times, September 12, 2002 (4:39 p.m. EST) Website www.qstech.com

Self Reconfigurable Gate Array Architecture
Logic adapts itself as computation proceeds, based on input and intermediate results.
The device needs to store multiple configuration contexts and context switch between them.
Self-reconfiguration is done by modifying the configuration memory.

SRGA Architecture
A self-reconfigurable device is characterized by the following features:
Fast context switching
Fast random access of configuration memory
An efficient architecture should allow single-cycle context switching as well as single-cycle random memory access.

SRGA Architecture
Consists of a rectangular gate array of PEs
Each PE consists of a logic cell and a memory block
Logic cell contains a LUT and a flip-flop
Each PE is connected to neighboring PEs and switches
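To make the "LUT plus flip-flop" logic cell concrete, here is a minimal behavioural sketch. It is not taken from the SRGA paper: the class name `LogicCell`, the most-significant-bit-first input ordering and the `stream_parity` example are assumptions for illustration. The point it shows is that the cell's function is just the contents of its truth table, so rewriting the table changes the logic.

```python
class LogicCell:
    """A k-input lookup table (LUT) followed by a flip-flop (hypothetical model)."""

    def __init__(self, truth_table):
        self.table = list(truth_table)   # 2**k entries of 0/1
        self.state = 0                   # flip-flop contents

    def evaluate(self, inputs):
        """Combinational LUT output; inputs[0] is the most significant index bit."""
        index = 0
        for bit in inputs:
            index = (index << 1) | (bit & 1)
        return self.table[index]

    def clock(self, inputs):
        """Latch the LUT output into the flip-flop on a clock edge."""
        self.state = self.evaluate(inputs)
        return self.state

    def reconfigure(self, truth_table):
        """Changing the function means rewriting the configuration bits."""
        self.table = list(truth_table)


XOR = [0, 1, 1, 0]
AND = [0, 0, 0, 1]


def stream_parity(bits):
    """Use one XOR-configured cell with feedback to compute the parity of a bit stream."""
    cell = LogicCell(XOR)
    for bit in bits:
        cell.clock([cell.state, bit])
    return cell.state


print(stream_parity([1, 0, 1, 1]))
```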

SRGA Architecture
A configuration context contains the bits that configure all the logic cells and switches in the mesh of trees network.
Configuration contexts are stored in the memory blocks.
A memory access operation transfers data between rows and columns of PEs.
Each memory block is implemented as a random access memory that can read/write a single bit every clock cycle.

SRGA: Context Switch Operation
For a context switch to occur, some logic on the currently active context needs to write the address of the context to switch to into a specified memory.
In each memory block, the configuration bits are loaded in the first half of the next clock cycle.
During the second half of the next clock cycle, the current context is saved.
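The context switch sequence above can be sketched as a small state machine. This is a simplified model under stated assumptions, not the published SRGA design: one cell stands in for the whole array, the class and method names are hypothetical, and the two half-cycles are modelled as two consecutive statements inside `clock`. It shows the request being written by the running logic, the incoming context being loaded, and the outgoing context (including any self-modifications) being saved.

```python
class MultiContextCell:
    """One PE with several stored configuration contexts (illustrative model)."""

    def __init__(self, contexts):
        self.memory = [list(bits) for bits in contexts]  # per-context config bits
        self.active_index = 0
        self.active = list(self.memory[0])               # bits driving the logic now
        self.next_index = None                           # pending switch request

    def request_switch(self, index):
        """Logic in the active context writes the address of the next context."""
        if not 0 <= index < len(self.memory):
            raise IndexError("no such context")
        self.next_index = index

    def write_active(self, position, bit):
        """Self-reconfiguration: the running logic edits its own configuration."""
        self.active[position] = bit & 1

    def clock(self):
        """Perform a pending context switch during the next clock cycle."""
        if self.next_index is None or self.next_index == self.active_index:
            self.next_index = None
            return
        incoming = list(self.memory[self.next_index])       # first half: load
        self.memory[self.active_index] = list(self.active)  # second half: save
        self.active = incoming
        self.active_index = self.next_index
        self.next_index = None


cell = MultiContextCell([[0, 1, 1, 0], [0, 0, 0, 1]])
cell.write_active(0, 1)      # modify context 0 while it is running
cell.request_switch(1)
cell.clock()
print(cell.active_index, cell.active, cell.memory[0])
```

Saving the outgoing context is what lets self-modifications survive a round trip through another context.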

ACM: Adaptive Computing Machine
An SRGA based Heterogeneous Architecture. Five types of nodes:
Arithmetic – different, variable-width, linear arithmetic functions like a FIR filter, a DCT, an FFT
Bit manipulation – different, variable-width bit-manipulation functions like LFSR, Walsh code generator, GOLD code generator, TCP/IP packet discriminator
Finite state machine
Scalar – execute legacy code
Configurable input/output – I/O in the form of a UART or bus interfaces such as PCI, USB, Firewire and other I/O-intensive actions
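As an example of the kind of variable-width function a bit-manipulation node is said to host, here is a generic Fibonacci LFSR in software. It is a textbook construction, not QuickSilver code; the function names and the 1-indexed tap convention are choices made for this sketch. With a 4-bit register and taps at bits 4 and 3, the register steps through all 15 non-zero states before repeating.

```python
def lfsr_sequence(seed, taps, width):
    """Yield successive states of a Fibonacci LFSR.

    taps are 1-indexed bit positions whose values are XORed to form the
    feedback bit, which is shifted in at the least significant end.
    """
    mask = (1 << width) - 1
    state = seed & mask
    if state == 0:
        raise ValueError("an all-zero state never changes")
    while True:
        yield state
        feedback = 0
        for tap in taps:
            feedback ^= (state >> (tap - 1)) & 1
        state = ((state << 1) | feedback) & mask


def lfsr_period(seed, taps, width):
    """Number of steps until the seed state recurs, or None if it does not."""
    states = lfsr_sequence(seed, taps, width)
    first = next(states)
    for count in range(1, 1 << width):
        if next(states) == first:
            return count
    return None


print(lfsr_period(0b0001, (4, 3), 4))
```

Making `width` and `taps` parameters mirrors the "variable-width" wording on the slide: the same node could be reloaded with a different register length or tap set.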

ACM: Advantages
Any node can be adapted to perform a new function, clock cycle by clock cycle.
Rather than passing data from function to function, the data can remain resident in a node while the function of the node changes on a clock cycle-by-clock cycle basis.
Adaptable hundreds of thousands of times a second.
Only the portions of an algorithm that are actually being executed need to be resident in the chip at any one time.
Heavy silicon reuse.
Tremendous reductions in silicon area & power consumption.
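The idea of data staying put while the node's function changes can be pictured with a toy model. This is a sketch only, assuming a node that applies one element-wise function per clock; `AdaptiveNode`, `run_schedule` and the three example functions are hypothetical and say nothing about how the ACM actually adapts its nodes.

```python
class AdaptiveNode:
    """Holds data locally and applies whichever function is currently loaded."""

    def __init__(self, data):
        self.data = list(data)   # data stays resident in the node
        self.function = None

    def adapt(self, function):
        """Load a new function into the node (stands in for reconfiguration)."""
        self.function = function

    def clock(self):
        """Apply the currently loaded function to the resident data."""
        self.data = [self.function(x) for x in self.data]


def run_schedule(node, schedule):
    """Change the node's function every clock instead of moving the data."""
    for function in schedule:
        node.adapt(function)
        node.clock()
    return node.data


schedule = [
    lambda x: x * 3,      # clock 1: scale
    lambda x: x + 1,      # clock 2: offset
    lambda x: x & 0xF,    # clock 3: keep the low 4 bits
]
print(run_schedule(AdaptiveNode([1, 5, 7]), schedule))
```

Only one of the three functions occupies the node at a time, which is the silicon-reuse argument the slide makes.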

XPP: eXtreme Processing Platform Pact Corp Source: The XPP White Paper www.pactcorp.com

XPP: eXtreme Processing Platform The XPP idea consists of: Data stream processing Configurable ALUs communicating via a packet oriented, automatically synchronized communication network User transparent configuration management Works in the Dataflow Computing Paradigm

XPP: Configurations Configurations are basic parallel calculation modules which are derived from a data flow graph of an algorithm. Nodes of the data flow graph are mapped to the fundamental machine operations such as multiplication, addition etc. Graph’s edges are the connections between the nodes. As long as data packets stream through a single configuration, the graph remains static - no opcodes and connections are changed.
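A configuration in this sense can be sketched as a fixed dataflow graph through which packets stream. The representation below (a dictionary of nodes, an explicit evaluation order, and the function `run_configuration`) is an assumption for illustration and is unrelated to PACT's NML or tools. The graph computes a * b + c and stays unchanged while every packet passes through it.

```python
import operator

# Hypothetical graph encoding: node name -> (operation, names of its inputs).
GRAPH = {
    "prod": (operator.mul, ("a", "b")),
    "sum": (operator.add, ("prod", "c")),
}
ORDER = ["prod", "sum"]  # nodes listed so that inputs are computed before use


def run_configuration(graph, order, streams, output):
    """Stream packets through one static configuration and collect the output."""
    names = list(streams)
    results = []
    for packet in zip(*(streams[name] for name in names)):
        values = dict(zip(names, packet))
        for node in order:
            op, inputs = graph[node]
            values[node] = op(*(values[i] for i in inputs))
        results.append(values[output])
    return results


streams = {"a": [1, 2, 3], "b": [4, 5, 6], "c": [10, 20, 30]}
print(run_configuration(GRAPH, ORDER, streams, "sum"))
```

In hardware the two nodes would operate concurrently on successive packets; the sequential loop here only reproduces the values, not the timing.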

XPP: Configurations – Vector * Matrix

XPP: Configuration Flow: Decoupling of data processing & configuration Replace Von-Neumann instruction stream by a configuration stream Process streams of data instead of single machine words .

PulseDSP™ Systolix Source : Website www.systolix.com

PulseDSP™
A radical new approach to implementing programmable signal processing functions, from Systolix.
Extraordinary performance
High level of flexibility
Very low manufacturing cost

PulseDSP™: Architecture A large number of highly efficient processors arranged in a "systolic array" Numerical data is pumped around the processors in the same way that blood is pumped around the body. Massive Parallelism helps exploit the parallelism in most DSP algorithms. Many billions of multiply accumulate calculations can be performed every second.
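The "pumping" of data through a systolic array can be illustrated with a classic textbook design for an FIR filter. This sketch is not the PulseDSP architecture: it assumes one multiply-accumulate cell per filter tap, samples that move right through two registers per cell, and partial sums that move right through one register per cell, with only neighbour-to-neighbour connections. The function names are illustrative.

```python
def systolic_fir(weights, samples):
    """Simulate a linear systolic array computing y[n] = sum_k weights[k] * x[n-k]."""
    n = len(weights)
    x_reg1 = [0] * n   # first sample register of each cell
    x_reg2 = [0] * n   # second sample register of each cell
    y_reg = [0] * n    # partial-sum register of each cell
    outputs = []
    # Extra zeros flush the pipeline so the last results leave the array.
    for x in list(samples) + [0] * (n - 1):
        x_in = [x] + x_reg2[:-1]   # each cell sees its left neighbour's sample
        y_in = [0] + y_reg[:-1]    # and its left neighbour's partial sum
        y_out = [y_in[k] + weights[k] * x_in[k] for k in range(n)]
        # All registers update together on the clock edge.
        x_reg2 = x_reg1
        x_reg1 = x_in
        y_reg = y_out
        outputs.append(y_out[-1])
    return outputs[n - 1:]         # the first n-1 cycles are pipeline latency


def direct_fir(weights, samples):
    """Reference implementation used to check the array."""
    return [
        sum(w * samples[i - k] for k, w in enumerate(weights) if i - k >= 0)
        for i in range(len(samples))
    ]


print(systolic_fir([1, 2, 3], [1, 2, 3, 4]))
print(direct_fir([1, 2, 3], [1, 2, 3, 4]))
```

Every cell performs one multiply-accumulate per clock and holds its own coefficient locally, which is the property the slide credits for the high aggregate MAC rate.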

PulseDSP™: Cells Instructions and data are all held locally - hence no memory bottlenecks. Inherently synchronous Guaranteed Performance Application independent. Simplifies and speeds development Enables the designer to make full use of the processing power available.

PulseDSP™: Tools Directly maps the original algorithm signal flow to the PulseDSP array User does not need to have any understanding of the underlying architecture. The compiler takes full advantage of the inherent parallelism in DSP algorithms to provide true parallel processing solution. The PulseDSP architecture doesn't just provide two or three parallel MAC units – it provides thousands, all fully utilized!

PulseDSP™: Summary Highly Scaleable Architecture - Multiple Data Widths, 8 to 64 bits internally - Multiple Array Sizes, 32 to 14,000 processing elements - Multiple I/O Ports, up to 52 channels at 200MSPS each Sustained performance of up to 200GMAC/S for 16bit operations Real-time signal processing up to video rates Supports multiple independent signal data streams Integrated control and data processing functions Supports Linear and Non-Linear systems Dynamically or Statically programmed Full application development environment

Programming RAP Overview of Development Tools Source : Websites of Respective Companies

Summary of Tools
Company | Tools | Entry | Target
Elixent | D-Sign | Verilog / VHDL / Handel-C / MATLAB | D-Fabrix
Systolix | Systolix Design System (SDS) | Schematic Capture | PulseDSP cores
Triscend | FastChip Development System | VHDL / Verilog / Schematic Capture | A7S CSoC
QuickSilver | Silverware | C + temporal & spatial extensions | ACM
Pact Corp | XDS – Software Development Suite, XPP-VC – Vectorizing C-Compiler | C, NML (Native Mapping Language) | XPP Core / NML / System C
picoChip | picoTools | Signal Flow Description / Embedded C | picoArray
Cognigine | Intelligent Network Processor Software Development Environment (SDE) | Data Flow Description / C | CGN16XXX

D-Sign: Elixent
[Figure: D-Sign tool flow. Block labels recoverable from the diagram: Verilog / VHDL Design, De-Compiler, Intermediate Design in Verilog / VHDL / XML, NOM Generator, NOM2EOM Converter, Optimisations, Bit-twiddling, Nibble align, Synthesis, Physical, RDA Generator, Library Maker, Arch-I Adaptor, AIM Lib, ADM Lib, IDM Lib, Concorde Macros, PRO rules]

XPP: Applications Development
The XPP development suite provides program development and debugging support.
XPP-VC can perform: instruction level parallelism, pipelining, automatic resource management, multi-threading

XDS – XPP Dev. Tool: Pact Corp

XPP-VC – XPP C Compiler: Pact Corp

Sum Up
Discussed the issues in Architectural Choice for Embedded Algorithm Implementation
Reviewed a few Reconfigurable Architectures
  Data Flow Paradigm – Petri Nets
  Systolic Arrays
  Configuration Streaming
  SIMD
Looked at Tools Availability
