TRACK F: OpenCL for ALTERA FPGAs, Accelerating performance and design productivity/ Liad Weinberger

  1. OpenCL for ALTERA FPGAs: Accelerating performance and design productivity
     Liad Weinberger, Appilo
     May 1st, 2013
  2. Technology trends
     • Over the past years
       – Technology scaling favors programmability and parallelism
     [Diagram: a spectrum of architectures, from coarse-grained single cores (CPUs, DSPs) through multi-cores and coarse-grained massively parallel processor arrays (GPGPUs) to fine-grained massively parallel arrays (FPGAs)]
  3. Technology trends
     [Chart: process node (nm) vs. year, 2000-2022]
     • Moore's law still in effect
       – More FPGA real-estate
         • More potential for parallelism: an extremely good thing!
         • Designs that utilize this real-estate become harder to manage and maintain: this is not so good...
  4. Technology trends
     [Chart: Google Trends, worldwide interest over the years (2007-2013), Verilog + VHDL]
     • Decreased interest
       – Number of Google searches for VHDL or Verilog is in decline
  5. Technology trends
     [Chart: Google Trends, interest over the years (2007-2013), Verilog + VHDL vs. Python]
     • Software development keeps its momentum
       – Number of Google searches for Python (as a representative language)
  6. FPGA (hardware) development
     • Design (programming) is complex
       – Define state machines, data-paths, arbitration, IP interfaces, etc.
       – Sophisticated iterative compilation process
         • Synthesis, technology mapping, clustering, placement and routing, timing closure
         • Leads to long compilation times (hours vs. minutes in software)
       – Debug process is also very time-consuming
     • Code is not portable
       – Written in Verilog / VHDL
         • Can't re-target for CPUs, GPUs, DSPs, etc.
         • Not scalable
     [Diagram: HDL → set constraints → compilation → timing closure loop]
  7. Software development
     • Programming is straight-forward
       – Ideas are expressed in languages such as C/C++/Python/etc.
         • Typically, start with a simple sequential implementation
         • Use parallel APIs / language extensions in order to exploit multi-core architectures for additional performance
       – Compilation times are usually reasonably short
         • Simple straight-forward compilation/linking process
       – Immediate feedback when debugging/profiling
         • An assortment of tools available for both debugging and profiling
     • Portability is still an issue
       – Possible, but requires pre-planning
     [Diagram: C/C++/Python/etc. sources → compiler & linker]
  8. Product development point-of-view
     • Product producers want:
       – Lower development and maintenance costs
       – Competitive edge
         • Higher performance
         • Short time-in-market, and short time-to-market
       – Agile development methods are becoming more and more popular
         • Can't afford long development cycles
       – Trained developers with established experience
         • Or a cost-effective path for training new developers
       – Flexibility
         • No vendor lock-in is preferred
         • Ability to rapidly adapt the product to market requirement changes
  9. Our challenge
     • How do we bring the FPGA design process closer to the software development model?
       – Need to make FPGAs more accessible to the software development community
         • Change in mind-set: look at FPGAs as massively multi-core devices that could be used to accelerate parallel applications
         • A programming model that allows that
         • Shorter compilation times and faster feedback for debugging and profiling the design
  10. An ideal programming environment...
      • Based on a standard programming model
        – Rather than something which is FPGA-specific
      • Abstracts away the underlying details of the hardware
        – VHDL / Verilog are similar to "assembly language" programming
        – Useful in rare circumstances where the highest possible efficiency is needed
      • The price of abstraction is not too high
        – Still need to efficiently use the FPGA's resources to achieve high throughput / low area
      • Allows for software-like compilation & debug cycles
        – Faster compile times
        – Profiling & user feedback
  11. Introducing OpenCL
      Parallel heterogeneous computing
  12. A case for OpenCL
      • What is OpenCL?
        – An open, royalty-free standard for cross-platform parallel software programming of heterogeneous systems
          • CPU + DSPs
          • CPU + GPUs
          • CPU + FPGAs
          • Or maybe all together
        – Maintained by the KHRONOS Group
          • An industry consortium creating open, royalty-free standards
          • Comprised of hardware and software vendors
        – Enables software to leverage silicon acceleration
      • Consists of two major parts:
        – Application Programming Interface (API) for device management
        – Device programming language based on C99, with some restrictions and extensions to support explicit parallelism
  13. Benefits of OpenCL
      • Cross-vendor software portability
        – Functional portability: the same code would normally execute on different hardware, by different vendors
        – Not performance portable: code still needs to be optimized for a specific device (at least a device class)
      • Allows for the management of available computational resources under a single framework
        – Views CPUs, GPUs, FPGAs, and other accelerators as devices that could carry the computational needs of the application
  14. OpenCL program structure
      • Separation between managerial and computational code bases
        – Managerial code executes on a host CPU
          • Any type of conventional micro-processor
          • Written in any language that has bindings for the OpenCL API
            – The API is in ANSI-C
            – There is a formal C++ binding
            – Other bindings may exist
        – Computational code executes on the compute devices (accelerators)
          • Written in a language called OpenCL C
            – Based on C99
            – Adds restrictions and extensions for explicit parallelism
          • Can be compiled either offline or online, depending on the implementation
          • Will most likely consist only of those portions of the application we want to accelerate
  15. OpenCL program structure
      [Diagram: host with global memory, attached to a compute device containing accelerators / compute units, each with local memory]

      Kernel program:

          __kernel void
          sum(__global float *a,
              __global float *b,
              __global float *y)
          {
              int gid = get_global_id(0);
              y[gid] = a[gid] + b[gid];
          }

      Host program:

          main() {
              read_data( … );
              manipulate( … );
              clEnqueueWriteBuffer( … );
              clEnqueueNDRangeKernel(…, sum, …);
              clEnqueueReadBuffer( … );
              display_result( … );
          }
  16. OpenCL host application
      • Communicates with the accelerator device via a set of library routines
        – Abstracts away host-processor-to-HW-accelerator communication via a set of API calls

          main() {
              read_data( … );
              manipulate( … );
              clEnqueueWriteBuffer( … );         // copy data host → FPGA
              clEnqueueNDRangeKernel(…, sum, …); // ask the FPGA to run a particular kernel
              clEnqueueReadBuffer( … );          // copy data FPGA → host
              display_result( … );
          }
  17. OpenCL kernels
      • Data-parallel function
        – Executed by many parallel threads
          • Each thread has an identifier which can be obtained with a call to the get_global_id() built-in function
          • Uses qualifiers to define where memory buffers reside
      • Executed by a compute device
        – CPU
        – GPU
        – FPGA
        – Other accelerators

          __kernel void
          sum(__global float *a,
              __global float *b,
              __global float *y)
          {
              int gid = get_global_id(0);
              y[gid] = a[gid] + b[gid];
          }

      Example data:
          a = 0 1 2 3 4 5 6 7
          b = 7 6 5 4 3 2 1 0
          y = 7 7 7 7 7 7 7 7
  18. OpenCL on FPGAs
      How does it map?
  19. Compiling OpenCL to FPGAs
      [Diagram: the host program goes through a standard C compiler to an x86 binary; the kernel programs go through the ACL compiler to an FPGA programming file (SOF); host and FPGA communicate over PCIe. The same sum kernel and host program as on slide 15 are shown.]
  20. Compiling OpenCL to FPGAs
      • Custom hardware for your kernels
      [Diagram: the sum kernel compiled into replicated load/load/store pipelines, connected to PCIe and DDRx memory]
  21. FPGA architecture for OpenCL
      [Diagram: kernel system on the FPGA with multiple kernel pipelines; a local memory interconnect to on-chip memories; a global memory interconnect to external memory controllers & PHYs (DDR); and PCIe to an x86 / external processor]
  22. Mapping multithreaded kernels to FPGAs
      • The simplest way of mapping kernel functions to FPGAs is to replicate hardware for each thread
        – Inefficient and wasteful
      • Technique: deep pipeline parallelism
        – Attempt to create a deeply pipelined representation of a kernel
        – On each clock cycle, we attempt to send in input data for a new thread
        – A method of mapping coarse-grained thread parallelism to fine-grained FPGA parallelism
  23.-27. Example pipeline for vector add (animation, five frames)
      • On each cycle, the portions of the pipeline are processing different threads
      • While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored
      [Diagram: a load/load → add → store pipeline; the 8 thread IDs (0-7) of the vector-add example advance through it one stage per cycle]
  28. Some examples
      Using ALTERA's OpenCL solution
  29. AES encryption
      • Counter (CTR) based encryption/decryption
        – 256-bit key
      • Advantage FPGA
        – Integer arithmetic
        – Coarse-grained bit operations
        – Complex decision making
      • Results

          Platform                 Throughput (GB/s)
          E5503 Xeon Processor     0.01 (single core)
          AMD Radeon HD 7970       0.33
          PCIe385 A7 Accelerator   5.20

        – 42% FPGA utilization (2 kernels): conserve power, or fill up for even higher performance
  30. Multi-asset barrier option pricing
      • Monte-Carlo simulation
        – Heston model:

            dS_t = μ S_t dt + √ν_t S_t dW_t^S
            dν_t = κ(θ − ν_t) dt + ξ √ν_t dW_t^ν

        – ND range: assets x paths (64 x 1,000,000)
      • Advantage FPGA
        – Complex control flow
      • Results

          Platform                 Power (W)   Performance (Msims/s)   Msims/W
          W3690 Xeon Processor     130         32                      0.25
          nVidia Tesla C2075       225         63                      0.28
          PCIe385 D5 Accelerator   23          170                     7.40
  31. Document filtering
      • Unstructured data analytics
        – Bloom filter
      • Advantage FPGA
        – Integer arithmetic
        – Flexible memory configuration
      • Results

          Platform                         Power (W)   Performance (MTs)   MTs/W
          W3690 Xeon Processor             130         2070                15.92
          nVidia Tesla C2075               215         3240                15.07
          DE4 Stratix IV-530 Accelerator   21          1755                83.57
          PCIe385 A7 Accelerator           25          3602                144.08
  32. Fractal video compression
      • Best matching codebook
        – Correlation with SAD (sum of absolute differences)
      • Advantage FPGA
        – Integer arithmetic
      • Results

          Platform                         Power (W)   Performance (FPS)   FPS/W
          W3690 Xeon Processor             130         4.6                 0.035
          nVidia Tesla C2075               215         53.1                0.247
          DE4 Stratix IV-530 Accelerator   21          70.9                3.376
          PCIe385 A7 Accelerator           25          74.4                2.976
