High-Level Implementation of VSIPL on FPGA-based Reconfigurable Computers
Malachy Devlin, Robin Bruce, Stephen Marshall (Submission 186)
m.devlin@nallatech.com, robin.bruce@sli-institute.ac.uk, stephen.marshall@eee.strath.ac.uk

VSIPL: The Vector, Signal and Image API

This poster examines architectures for FPGA-based high-level language implementations of the Vector, Signal and Image Processing Library (VSIPL). VSIPL was defined by a consortium of industry, government and academic representatives and has become a de facto standard in the embedded signal processing world. It is a C-based application programming interface (API) and was initially targeted at single microprocessors. The scope of the full API is extensive, offering a vast array of signal processing and linear algebra functionality for many data types; implementations are generally a reduced subset, or profile, of the full API. The purpose of VSIPL is to provide developers with a standard, portable application development environment. Successful VSIPL implementations can have radically different underlying hardware and memory management systems and still allow applications to be ported between them with minimal modification of the code.

Implementing the VSIPL API on FPGA Fabric

Developing an FPGA-based implementation of VSIPL offers as many exciting possibilities as it presents challenges. The VSIPL specification was in no way developed with FPGAs in mind; this research is being carried out in the expectation that there will one day be a profile of VSIPL that considers the advantages and limitations of FPGAs. Over the course of this research we expect to find solutions to several key difficulties that impede the development of FPGA VSIPL, paying particular attention to:

1.) The finite size of hardware-implemented algorithms vs. conceptually boundless vectors.
2.) The runtime reconfiguration options offered by microprocessor VSIPL.
3.) The vast scope of the VSIPL API vs. limited fabric resource.
4.) The communication bottleneck arising from the software-hardware partition.
5.) The difficulty of leveraging the parallelism inherent in expressions presented as discrete operations.

As well as the technical challenges this research poses, there is a great logistical challenge: in the absence of a large research team, the sheer quantity of functions requiring hardware implementation seems overwhelming. This project therefore has as a secondary research goal the development of an environment that allows for rapid development and deployment of signal processing algorithms in FPGA fabric.

Implementation of a Single-Precision Floating-Point Boundless Convolver

VSIPL is fundamentally based around floating-point arithmetic and offers no fixed-point support; both single-precision and double-precision arithmetic are available in the full API. Any viable VSIPL implementation must guarantee the user the dynamic range and precision that floating point offers, for repeatability of results.

FPGA Floating Point Comes of Age

FPGAs have for some time rivalled microprocessors when used to implement bit-level and integer-based algorithms, and many fixed-point techniques have been developed that let hardware developers attain the results of floating-point arithmetic: with careful fixed-point design the same results can be obtained using a fraction of the hardware that a floating-point implementation would require. Floating-point implementations are consequently seen as bloated, a waste of valuable resource. Nevertheless, as the reconfigurable-computing user base grows, the same pressures that led to floating-point arithmetic logic units (ALUs) becoming standard in the microprocessor world are now being felt in the field of programmable logic design.
Floating-point performance in FPGAs now rivals that of microprocessors: certainly in the case of single precision, and with double precision gaining ground. Up to 25 GFLOPS is possible in single precision on the latest generation of FPGAs, and approximately a quarter of that figure in double precision. Unlike CPUs, for many key applications FPGAs are limited by peak FLOPS rather than by memory bandwidth.

Fully floating-point design solutions offer invaluable benefits to an enlarged reconfigurable-computing market. Reduced design time and far simpler verification are just two benefits that go a long way towards addressing ever-shrinking design schedules and a growing FPGA design skills shortage. All implementations presented here are single-precision floating point.

Fig. 4: System Diagram of Convolver System

Results & Conclusions

The floating-point convolver was clocked at 120 MHz; a 16k-by-32k convolution took 16.96 Mcycles, or 0.14 seconds, to complete. Due to the compute intensity of this operation, host-fabric data transfer time is negligible. This is a 10x speed increase on the 1.4 seconds the same operation took in software alone.

FPGA-based VSIPL would offer significant performance benefits over processor-based implementations. It has been shown how the boundless nature of VSIPL vectors can be accommodated through a combination of fabric-level and software-managed reuse of pipelined, fabric-implemented kernels. The resource demands of floating-point algorithms have been highlighted, and the need for resource reuse, or even run-time fabric reconfiguration, is stressed. The foundations have been laid for a development environment that will allow all VSIPL functions, as well as custom user algorithms, to be implemented.
Hardware Development Platform

Figure 1: Nallatech 'BenNUEY' PCI Motherboard
Figure 2: DIMETalk Network
Figure 3: BenNUEY System Diagram

The design was implemented on the user FPGA of a BenNUEY motherboard (figures 1 and 3). Host-fabric communication was handled by a packet-based DIMETalk network (figure 2), providing scalability over multiple FPGAs. SRAM storage used two banks of ZBT RAM. The FPGA on which the design was implemented was a Xilinx Virtex-II XC2V6000. The design used 144 16k BlockRAMs (100%), 25,679 slices (75%) and 128 18x18 multipliers (88%).

FPGA Fabric Convolver Implemented in C

The inner pipelined kernel and its reuse at the fabric level consist of a C-to-VHDL generated component created using a floating-point C compiler tool from Nallatech. The C code used to describe the necessary algorithms is a true subset of C: any code that compiles in the C-to-VHDL compiler can be compiled with any ANSI-compatible C compiler without modification.

The inner convolution kernel convolves the input vector A with 32 coefficients taken from the input vector B. The results are added to the relevant entries of the temporary-result BlockRAM (initialised to zero). This kernel is reused as many times as is necessary to convolve all of the 32-word chunks of input vector B with the entire vector A. The fabric convolver thus carries out the largest convolution possible with the memory resources available on the device: a 16k-by-32k convolution producing a 48k-word result. This fabric convolver can in turn be reused by a C program on the host as many times as is necessary to realise a convolution or FIR operation of the desired dimensions. This is all wrapped up in a single C function that is indistinguishable to the user from a fully software-implemented function. The pipelined inner kernel of the convolver, as seen in figure 7, is implemented using 32 floating-point multipliers and 31 floating-point adders.
These are Nallatech single-precision cores, assembled from a C description of the algorithm using a C-to-VHDL converter. The C function-call arguments and return values are managed over the DIMETalk FPGA communications infrastructure. To show how the VSIPL requirement for unlimited input and output vector lengths can be met, a boundless convolver is presented. Suitable for 1-D convolution and FIR filtering, this component is implemented using single-precision floating-point arithmetic throughout.

Figure 7: Boundless Convolver System

Architecture Models

In implementing VSIPL on FPGAs, several potential architectures are being explored. The three primary architectures are as follows, in increasing order of potential performance and design complexity:

1.) Function 'stubs': each function selected for implementation on the fabric transfers data from the host to the FPGA before each operation and brings it back to the host before the next operation. This model (figure 4) is the simplest to implement, but excessive data transfer greatly hinders performance. The work presented here is based on this first model.

2.) Full application algorithm implementation: rather than using the FPGA to process functions separately and independently, high-level compilers can implement the application's complete algorithm in C, making calls to FPGA VSIPL C functions. This significantly reduces the overhead of data communication and FPGA configuration, and enables cross-function optimisations.

3.) FPGA master implementation: traditional systems are processor-centric, but FPGAs now have enough capability to create FPGA-centric systems. In this approach the complete application, including program initialisation and the VSIPL application algorithm, resides on the FPGA.
Depending on the rest of the system architecture, the FPGA VSIPL-based application can use the host computer as an I/O engine, or the FPGA can have direct I/O attachments, bypassing operating-system latencies.

Figure 4: Functional 'Stubs' Implementation (host runs initialisation, function stubs and termination; functions 1-3 execute on the fabric)
Figure 5: Full Application Algorithm Implementation (host handles initialisation, data transfer and termination; functions 1-3 run back-to-back on the fabric)
Figure 6: FPGA Master Implementation (initialisation, functions 1-3 and termination all reside on the fabric, with direct data I/O)
Figure 7 components: pipelined convolution kernel with fabric-level reuse; C_storage[32] in registers; InputB[16k] and Temp_res[48k] in BlockRAM; InputA[32k] and Output[48k] in SRAM; DIMETalk network on the Xilinx Virtex-II XC2V6000 FPGA of the 'BenNUEY' motherboard, linked over PCI to the host PC, which performs software reuse of the fabric convolver.
