Automatic generation of platform architectures using open cl and fpga roadmap

172 views

Published on

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Automatic generation of platform architectures using open cl and fpga roadmap

  1. 1. Automatic generation of platform architectures using OpenCL FPGA roadmap Department of Electrical and Computer Engineering University of Thessaly Volos, Greece Nikolaos Bellas
  2. 2. What is an FPGA? • Field Programmable Gate Array (FPGA) is the best known example of Reconfigurable Logic • Hardware can be modified post chip fabrication • Tailor the Hardware to the application – Fixed logic processors (CPUs/GPUs) only modify their software (via programming) • FPGAs can offer superior performance, performance/power, or performance/cost compared to CPUs and GPUs. 2
  3. 3. FPGA architecture • A generic island-style FPGA fabric • Configurable Logic Blocks (CLB) and Programmable Switch Matrices (PSM) • Bitstream configures functionality of each CLB and interconnection between logic blocks 3
  4. 4. The Xilinx Slice • Xilinx slice features – LUTs – MUXF5, MUXF6, MUXF7, MUXF8 (only the F5 and F6 MUX are shown in this diagram) – Carry Logic – MULT_ANDs – Sequential Elements •Detailed Structure
  5. 5. LUTLUT Example 2-input LUT • Lookup table: a b out 0 0 0 1 1 0 1 1 a b out 0 0 0 1 0 0 0 1 1 0 0 1 1 0 0 1 5 configuration input
  6. 6. Modern FPGA architecture Xilinx Virtex family 6 •Columns of on-chips SRAMs, hard IP cores (PPC 405), and •DSP slices (Multiply-Accumulate) units
  7. 7. FPGA discussion • Advantages – Potential for (near) optimal performance for a given application – Various forms of parallelisms can be exploited • Disadvantages – Programmable mainly at the hardware level using Hardware Description Languages (BUT, this can change) – Lower clock frequency (200-300 MHz) compared to CPUs (~ 3GHz) and GPUs (~1.5 GHz) 7
  8. 8. MATENVMED Silicon OpenCL: Automatic generation of platform architectures using OpenCL 8
  9. 9. 18-19/7/2013 MATENVMED Plenary Meeting Introduction • Automatic generation of hardware at the research forefront in the last 10 years. • Variety of High Level Programming Models: C/C++, C-like Languages, MATLAB • Obstacles: – Parallelism Extraction for larger applications – Extensive Compiler Transformations & Optimizations • Parallel Programming Models to the Rescue: – CUDA, OpenCL. 9
  10. 10. 18-19/7/2013 MATENVMED Plenary Meeting Motivation • Parallel programming models are for reconfigurable platforms. • A major shift of Computing industry toward many-core computing systems. • Reconfigurable fabrics bear a strong resemblance to many core systems. 10
  11. 11. 18-19/7/2013 MATENVMED Plenary Meeting Vision • Provide the tools and methodology to enable the large pool of software developers and domain experts, who do not necessarily have expertise on hardware design, to architect whole accelerator-based systems – Borrowed from advances in massively parallel programming models 11 FPGA PCIexpress GPU CPU PCIexpress
  12. 12. 18-19/7/2013 MATENVMED Plenary Meeting Silicon OpenCL • Silicon-OpenCL “SOpenCL”. • A tool flow to convert an unmodified OpenCL application into a SoC design with HW/SW components.
  13. 13. 18-19/7/2013 Contribution • Architectural Synthesis methodology: – Code Transformations. – Architectural Template. 13MATENVMED Plenary Meeting
  14. 14. 18-19/7/2013 MATENVMED Plenary Meeting OpenCL for Heterogeneous Systems • OpenCL (Open Computing Language) : A unified programming model aims at letting a programmer write a portable program once and deploy it on any heterogeneous system with CPUs and GPUs. • Became an important industry standard after release due to substantial industry support. 14
  15. 15. 18-19/7/2013 MATENVMED Plenary Meeting OpenCL Platform Model One host and one or more Compute Devices (CD) Each CD consists of one or more Compute Units (CU) Each CU is further divided into one or more Processing Elements (PE) 15 Main Program Computations Kernels
  16. 16. 18-19/7/2013 MATENVMED Plenary Meeting OpenCL Kernel Execution Geometry • OpenCL defines a geometric partitioning of grid of computations • Grid consists of N dimensional space of work-groups • Each work-group consists of N dimensional space of work-items. work-group grid work-item 16
  17. 17. 18-19/7/2013 MATENVMED Plenary Meeting OpenCL Simple Example __kernel void vadd( __global int* a, __global int* b, __global int* c) { int idx= get_global_id(0); c[idx] = a[idx] + b[idx]; } • OpenCL kernel describes the computation of a work- item • Finest parallelism granularity • e.g. add two integer vectors (N=1) void add(int* a, int* b, int* c) { for (int idx=0; idx<sizeof(a); idx++) c[idx] = a[idx] + b[idx]; } C code OpenCL kernel code Run-time call Used to differentiate execution for each work-item 17
  18. 18. 18-19/7/2013 MATENVMED Plenary Meeting Why OpenCL as an HDL? • OpenCL exposes parallelism at the finest granularity – Allows easy hardware generation at different levels of granularity – One accelerator per work-item, one accelerator per work- group, one accelerator per multiple work-groups, etc. • OpenCL exposes data communication – Critical to transfer and stage data across platforms • We target unmodified OpenCL to enable hardware design to software engineers – No need for hardware/architectural expertise 18
  19. 19. 18-19/7/2013 MATENVMED Plenary Meeting SOpenCL Tool Flow
  20. 20. Granularity Management work-group FPGA Optimal thread granularity depends on hardware platform GPU CPU We select a hardware accelerator to process one work-group per invocation. Smaller invocation overhead18-19/7/2013 MATENVMED Plenary Meeting
  21. 21. 18-19/7/2013 MATENVMED Plenary Meeting Granularity Coarsening Work-item thread Work-group thread
  22. 22. 18-19/7/2013 MATENVMED Plenary Meeting Serialization of Work Items __kernel void vadd(…) { int idx = get_global_id(0); c[idx] = a[idx] + b[idx]; } __kernel void Vadd(…) { int idx; for( i = 0; i < get_local_size(2); i++) for( j = 0; j < get_local_size(1); j++) for( k = 0; k < get_local_size(0); k++) { idx = get_item_gid(0); c[idx] = a[idx] + b[idx]; } } OpenCL code C code 22 idx = (global_id2(0) + i) * Grid_Width * Grid_Height + (global_id1(0) + j) * Grid_Width + (global_id0(0) + k);
  23. 23. 18-19/7/2013 MATENVMED Plenary Meeting Architectural Synthesis • Exploit available parallelism and application specific features. • Apply a series of transformations to generate customized hardware accelerators. 23 • Uses LLVM Compiler Infrastructure. • Generate synthesizable Verilog & Test bench.
  24. 24. Feed Data in Order Write Data in Order •FU types, •Bitwidths, •I/O Bandwidth 2407/27/13 Verilog Generation: PE Architecture •Predication •Code •slicing •SMS mod •scheduling •Verilog •generation MATENVMED Kickc Off Meeting
  25. 25. Roadmap for FPGA implementation
  26. 26. FPGA Implementation • Our plan is to use the same code base (e.g. OpenCL) to explore different architectures – OpenCL used for multicore CPU, GPU, FPGA (SOpenCL) • Fast exploration based on area, performance and power requirements 18-19/7/2013 MATENVMED Plenary Meeting
  27. 27. FPGA Implementation • Monte-Carlo simulations can exploit multi-level parallelism of FPGAs – Multiple MC simulations per point – Multiple points simultaneously – Double precision Trigonometric, Log, Additions, Multiplications functions for each walk – FP operations with double precision are not FPGAs strong point, but still SOpenCL can handle it. 18-19/7/2013 MATENVMED Plenary Meeting

×