
Programming Models for Heterogeneous Chips


Talk by Professor Rafael Asenjo Plaza, given on October 24 at the Facultad de Informática.



  1. 1. Programming Models for Heterogeneous Chips Rafael Asenjo Dept. of Computer Architecture University of Malaga, Spain.
  2. 2. Agenda • Motivation • Hardware – Heterogeneous chips – Integrated GPUs – Advantages • Software – Programming models for heterogeneous systems – Programming models for heterogeneous chips – Our approach based on TBB 2
  3. 3. Motivation • A new mantra: Power and Energy saving • In all domains 3
  4. 4. Motivation • GPUs came to the rescue: – Massive data-parallel code at a low price in terms of power – Supercomputers and servers: NVIDIA • GREEN500 Top 15: • TOP500: – 45 systems w. NVIDIA – 19 systems w. Xeon Phi 4
  5. 5. Motivation • There is (parallel) life beyond supercomputers: 5
  6. 6. Motivation • Plenty of GPUs elsewhere: – Integrated GPUs on more than 90% of shipped processors 6
  7. 7. Motivation • Plenty of GPUs on desktops and laptops: – Desktops (35–130W) and laptops (15–57W): 7 Intel Haswell AMD APU Kaveri http://techguru3d.com/4th-gen-intel-haswell-processors-architecture-and-lineup/ http://www.techspot.com/photos/article/770-amd-a8-7600-kaveri/
  8. 8. Motivation 8
  9. 9. Motivation • Plenty of integrated GPUs in mobile devices. Samsung Galaxy S5 SM-G900H 9 Samsung Exynos 5 Octa (2 - 6 W) Samsung Galaxy Note Pro 12 http://www.samsung.com/us/showcase/galaxy-smartphones-and-tablets/
  10. 10. Motivation • Plenty of integrated GPUs in mobile devices. 10 Qualcomm Snapdragon 800 (2 - 6 W) https://www.qualcomm.com/products/snapdragon/processors/800 Nexus 5 Nokia Lumia Sony Xperia
  11. 11. Motivation • Plenty of room for improvements – Want to make the most out of the CPU and the GPU – Lack of programming models – “Heterogeneous exec., but homogeneous programming” – Huge potential impact • Servers and supercomputing market – Google: porting the search engine for ARM and PowerPC – AMD Seattle Server-on-a-Chip based on Cortex-A57 (v8) – Mont Blanc project: supercomputer made of ARM • Once commodity processors took over • Be prepared for when mobile processors do so – E4’s EK003 Servers: X-Gene ARM A57 (8 cores) + K20 11
  12. 12. Agenda • Motivation • Hardware – Heterogeneous chips – Integrated GPUs – Advantages • Software – Programming models for heterogeneous systems – Programming models for heterogeneous chips – Our approach based on TBB 12
  13. 13. Hardware 13 Intel Haswell AMD Kaveri Samsung Exynos 5 Octa Qualcomm Snapdragon 800
  14. 14. Intel Haswell • Modular design 14 – 2 or 4 cores – GPU • GT-1: 10 EU • GT-2: 20 EU • GT-3: 40 EU • TSX: HW transactional memory – HLE (HW lock elision) • XACQUIRE • XRELEASE – RTM (Restricted Transactional Memory) • XBEGIN • XEND http://www.anandtech.com/show/6355/intels-haswell-architecture
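  RTM is exposed to C/C++ through compiler intrinsics rather than raw XBEGIN/XEND instructions. A minimal sketch (not from the slides) of a transactional update with a lock fallback, compiled with -mrtm:

    #include <immintrin.h>   /* RTM intrinsics: _xbegin, _xend */
    #include <pthread.h>

    /* Sketch: try the update inside a hardware transaction; if it
       aborts for any reason, retry under a conventional lock.
       (A production version would also read the lock's state inside
       the transaction to stay correct w.r.t. the fallback path.) */
    void counter_inc(long *counter, pthread_mutex_t *fallback) {
        if (_xbegin() == _XBEGIN_STARTED) {
            (*counter)++;                /* runs transactionally */
            _xend();                     /* commit */
        } else {                         /* abort path */
            pthread_mutex_lock(fallback);
            (*counter)++;
            pthread_mutex_unlock(fallback);
        }
    }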
  15. 15. Intel Haswell • Three frequency domains – Cores – GPU – LLC and Ring • On the older Ivy Bridge – Only 2 domains – Cores and LLC shared one domain – GPU-only activity still drove the CPU/LLC frequency up • OpenCL driver only for Windows • PCM as power monitor 15 http://www.anandtech.com/show/7744/intel-reveals-new-haswell-details-at-isscc-2014
  16. 16. Intel Iris Graphics https://software.intel.com/en-us/articles/opencl-fall-webinar-series 16
  17. 17. Intel Iris Graphics • GPU slice 17 – 2 sub-slices – 20 EU (GPU cores) – Local L3 cache (256KB) – 16 barriers per sub-slice – 2 x 64KB local memory • 2 GPU slices = 40 EU • Up to 7 in-flight EU-threads • 8-, 16- or 32-wide SIMD per EU-thread • In flight: 7 x 40 x 32 = 8960 work-items • Each EU → 2 x 4-wide FPUs – 40 x 8 x 2 (fmadd) = 640 ops per cycle – At 1.3GHz → 832 GFLOPS
  18. 18. Intel Iris GPU 18 (matrix example) Terminology mapping: work-group ≈ block; EU-thread (SIMD16) ≈ warp ≈ wavefront
  19. 19. AMD Kaveri • Steamroller microarch (2 – 4 “Cores”) + 8 GCN Cores. 19 http://wccftech.com/
  20. 20. AMD Kaveri • Steamroller microarchitecture – Each module → 2 “Cores” – 2 threads, each with • 4x superscalar INT • 2x SIMD4 FP – 3.7GHz • Max GFLOPS: 3.7 GHz x 4 threads x 4-wide x 2 (fmadd) = 118 GFLOPS 20
  21. 21. AMD Graphics Core Next (GCN) • In Kaveri, GCN takes 47% of the die – 8 Compute Units (CU) – Each CU: 4 SIMD16 – Each SIMD16: 16 lanes – Total: 512 FPUs – 720 MHz • Max GFLOPS = 0.72 GHz x 512 FPUs x 2 (fmadd) = 737 GFLOPS • CPU+GPU → 855 GFLOPS 21
  22. 22. OpenCL execution on GCN • Work-groups are split into wavefronts (64 work-items), which are assigned to pools: each CU has 4 SIMD16 units with 4 pools each, so up to 4 wavefronts are in flight per SIMD, and each wavefront needs 4 cycles to execute (64 work-items over 16 lanes) 22
  23. 23. HSA (Heterogeneous System Architecture) • HSA Foundation’s goal: Productivity on heterogeneous HW – CPU, GPU, DSPs… • Scheduled in three phases → • Second phase: Kaveri – hUMA – Same pointers used on CPU and GPU – Cache coherency 23
  24. 24. Kaveri’s main HSA features • hUMA – Shared and coherent view of up to 32GB • Heterogeneous queuing (hQ) – CPU and GPU can create and dispatch work 24
  25. 25. HSA Motivation • Too many steps to get the job done 25 (dispatch sequence spanning Application, OS and GPU: Get Buffer, Copy/Map Memory, Queue Job, Schedule Job, Transfer buffer to GPU, Start Job, Finish Job, Schedule Application, Copy/Map Memory back) http://www.hsafoundation.com/hot-chips-2013-hsa-foundation-presented-deeper-detail-hsa-hsail/
  26. 26. Requirements • Enabling lower-overhead job dispatch requires four mechanisms: – Shared Virtual Memory • Send pointers (not data) back and forth between HSA agents. – System Coherency • Data accesses to the global memory segment from all HSA agents shall be coherent without the need for explicit cache maintenance. – Signaling • HSA agents can directly create/access signal objects. – Signaling a signal object (this wakes up HSA agents waiting on the object) – Query the current value – Wait on the current value (various conditions supported). – User-mode queueing • Enables user-space applications to directly, without OS intervention, enqueue jobs (“Dispatch Packets”) for HSA agents — a signaling sketch follows. 26
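  A minimal sketch of the signaling mechanism, assuming the HSA 1.0 runtime API (hsa_signal_create / hsa_signal_wait_acquire; this is illustrative, not code from the deck):

    #include <stdint.h>
    #include "hsa.h"    /* HSA runtime headers (assumed 1.0 API) */

    /* Sketch: wait until the GPU decrements the dispatch packet's
       completion signal from 1 to 0, sleeping instead of spinning. */
    void wait_for_dispatch(void) {
        hsa_signal_t done;
        hsa_signal_create(1, 0, NULL, &done);   /* initial value 1 */
        /* ... enqueue a dispatch packet with completion_signal = done ... */
        hsa_signal_wait_acquire(done, HSA_SIGNAL_CONDITION_EQ, 0,
                                UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
        hsa_signal_destroy(done);
    }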
  27. 27. Non-HSA Shared Virtual Memory • Multiple virtual memory address spaces: CPU0 maps VA1→PA1 in its own virtual memory while the GPU maps VA2→PA1 in a separate one, so the same physical page has two different virtual addresses 27
  28. 28. HSA Shared Virtual Memory • Common virtual memory for all HSA agents: CPU0 and the GPU share the same VA→PA mapping, so one pointer is valid on both 28
  29. 29. After adding SVM • With SVM we get rid of copying/mapping memory back and forth (those steps drop out of the dispatch sequence shown earlier) 29 http://www.hsafoundation.com/hot-chips-2013-hsa-foundation-presented-deeper-detail-hsa-hsail/
  30. 30. After adding coherency • If the CPU allocates a global pointer, the GPU sees that value without explicit cache maintenance 30 http://www.hsafoundation.com/hot-chips-2013-hsa-foundation-presented-deeper-detail-hsa-hsail/
  31. 31. After adding signaling • The CPU can wait on a signal object 31 http://www.hsafoundation.com/hot-chips-2013-hsa-foundation-presented-deeper-detail-hsa-hsail/
  32. 32. After adding user-level enqueuing • The user directly enqueues the job without OS intervention 32 http://www.hsafoundation.com/hot-chips-2013-hsa-foundation-presented-deeper-detail-hsa-hsail/
  33. 33. Success!! • That’s definitely way simpler and with less overhead: the application queues the job, and the GPU starts and finishes it 33
  34. 34. OpenCL 2.0 • OpenCL 2.0 will contain most of the features of HSA – Intel’s version supports HSA for Core M (Broadwell); Windows only. – AMD’s version does not support fine-grain SVM. • AMD 1.2 beta driver – http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-1-2-beta-driver/ – Only for Windows 8.1 – Example: allocating “Coherent Host Memory” on Kaveri: 34

    #include <CL/cl_ext.h>   // implements the SVM extensions
    #include "hsa_helper.h"  // AMD helper functions
    …
    cl_svm_mem_flags_amd flags = CL_MEM_READ_WRITE |
                                 CL_MEM_SVM_FINE_GRAIN_BUFFER_AMD |
                                 CL_MEM_SVM_ATOMICS_AMD;
    volatile std::atomic_int *data;
    data = (volatile std::atomic_int *) clSVMAlloc(context, flags,
               MAX_DATA * sizeof(volatile std::atomic_int), 0);
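  Once allocated, such a pointer can be handed straight to a kernel. A short sketch using the standard OpenCL 2.0 calls (the kernel and queue are assumed to be already created; not code from the deck):

    // No clEnqueueWriteBuffer is needed: CPU and GPU share the pointer.
    clSetKernelArgSVMPointer(kernel, 0, (void *)data);
    size_t global_size = MAX_DATA;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                           NULL, 0, NULL, NULL);
    clFinish(queue);
    clSVMFree(context, (void *)data);   // release the SVM allocation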
  35. 35. Samsung Exynos 5 • Odroid XU-E and XU3 bare boards (~$180) • Sport an Exynos 5 Octa 35 – big.LITTLE architecture – big: Cortex-A15 quad – LITTLE: Cortex-A7 quad • Exynos 5 Octa 5410 – Only 4 CPU cores active at a time – GPU: PowerVR SGX544MP3 (Imagination Technologies) • 3 GPU cores at 533MHz → 51 GFLOPS • Exynos 5 Octa 5422 – All 8 CPU cores can work simultaneously – GPU: ARM Mali-T628 MP6 • 6 GPU cores at 533MHz → 102 GFLOPS
  36. 36. PowerVR SGX544MP3 • OpenCL 1.1 for Android • Some limitations: – Compute units: 1 – Max WG size: 1 – Local mem: 1KB – Peak MFLOPS: • 12 ops per cycle • 3 SIMD ALUs x 4-wide • Power monitor: – 4 x INA231 monitors • A15, A7, GPU, Mem. • Instant power readings • Every 260ms 36
  37. 37. Texas Instrument INA231 37
  38. 38. ARM Mali-T628 MP6 • Supporting: – OpenGL® ES 3.0 – OpenCL™ 1.1 – DirectX® 11 – Renderscript™ • L2 cache size – 32–256KB per core • 6 cores – 16 FP units – 2 x SIMD4 each • Other – Built-in MMU – Standard ARM bus • AMBA 4 ACE-Lite 38
  39. 39. Qualcomm Snapdragon 39
  40. 40. Snapdragon 800 40
  41. 41. Snapdragon 800 • CPU: Quad-core Krait 400 up to 2.26GHz (ARMv7 ISA) – Similar to Cortex-A15: 11-stage integer pipeline with 3-way decode and 4-way out-of-order speculative-issue superscalar execution – Pipelined VFPv4 and 128-bit wide NEON (SIMD) – 4 KB + 4 KB direct-mapped L0 cache – 16 KB + 16 KB 4-way set-associative L1 cache – 2 MB (quad-core) L2 cache • GPU: Adreno 330, 450MHz – OpenGL ES 3.0, DirectX, OpenCL 1.2, RenderScript – 32 execution units, each with 2 x SIMD4 units • DSP: Hexagon, 600MHz 41
  42. 42. Measuring power • Snapdragon Performance Visualizer • Trepn Profiler • Power Tutor – Tuned for the Nexus One – Power model accurate to within ~5% – Open source 42
  43. 43. More development boards • Jetson TK1 board – Tegra K1 – Kepler GPU with 192 CUDA cores – 4-Plus-1 quad-core ARM Cortex-A15 – Linux + CUDA – $180 • Arndale – Exynos 5420 – big.LITTLE (A15 + A7) – GPU Mali-T628 MP6 – Linux + OpenCL – $200 • … 43
  44. 44. Advantages of integrated GPUs • Discrete and integrated GPUs: different goals – NVIDIA Kepler: 2880 CUDA cores, 235W, 4.3 TFLOPS – Intel Iris 5200: 40 EU x 8 SIMD, 15-28W, 0.83 TFLOPS – PowerVR: 3 EU x 16 SIMD, < 1W, 0.051 TFLOPS • Higher bandwidth between CPU and GPU. – Shared DRAM • Avoid PCI data transfer – Shared LLC (Last Level Cache) • Data coherence in some cases… • CPU and GPU may have similar performance – It’s more likely that they can collaborate • Cheaper! 44
  45. 45. Integrated GPUs are also improving 45
  46. 46. Agenda • Motivation • Hardware – Heterogeneous chips – Integrated GPUs – Advantages • Software – Programming models for heterogeneous systems – Programming models for heterogeneous chips – Our approach based on TBB 46
  47. 47. Programming models for heterogeneous • Targeted at a single device – CUDA (NVIDIA) – OpenCL (Khronos Group standard) – OpenACC (C, C++ or Fortran + directives → OpenMP 4.0) – C++AMP (Microsoft’s extension of C++; HSA recently announced its own version) – RenderScript (Google’s Java API for Android) – ParallDroid (Java + directives, from ULL, Spain) – Many more (SYCL, Numba Python, IBM Java, Matlab, R, JavaScript, …) • Targeted at several devices (discrete GPUs) – Qilin (C++ and Qilin API compiled to TBB+CUDA) – OmpSs (OpenMP-like directives + Nanos++ runtime + Mercurium compiler) – XKaapi – StarPU • Targeted at several devices (integrated GPUs) – Qualcomm MARE – Intel Concord 47
  48. 48. OpenCL on mobile devices 48 http://streamcomputing.eu/blog/2014-06-30/opencl-support-recent-android-smartphones/
  49. 49. OpenCL running on CPU 49 [bar chart: execution time (ms) of the Base, Auto, T-Auto, SSE, AVX-SSE, AVX and OpenCL code versions on a 3.3 GHz Ivy Bridge CPU] • The AVX code version is – 1.8x faster than OpenCL – 1.8x more Halstead effort “Easy, Fast and Energy Efficient Object Detection on Heterogeneous On-Chip Architectures”, E. Totoni, M. Dikmen, M. J. Garzaran, ACM Transactions on Architecture and Code Optimization (TACO), 10(4), December 2013.
  50. 50. Complexities of AVX Intrinsics 50

    // load
    __m256 image_cache0 = _mm256_broadcast_ss(&fr_ptr[pixel_offsets[0]]);
    curr_filter = _mm256_load_ps(&fb_array[fi]);
    // multiply-add
    temp_sum = _mm256_add_ps(_mm256_mul_ps(image_cache7, curr_filter), temp_sum);
    // copy high half to low half
    temp_sum2 = _mm256_insertf128_ps(temp_sum,
                    _mm256_extractf128_ps(temp_sum, 1), 0);
    // compare
    cpm = _mm256_cmp_ps(temp_sum2, max_fil, _CMP_GT_OS);
    r = _mm256_movemask_ps(cpm);

    if (r & (1<<1)) {
        // store index
        best_ind = filter_ind + 2;
        int control = 1 | (1<<2) | (1<<4) | (1<<6);
        // store max
        max_fil = _mm256_permute_ps(temp_sum2, control);
        r = _mm256_movemask_ps(_mm256_cmp_ps(temp_sum2, max_fil, _CMP_GT_OS));
    }
  51. 51. OpenCL doesn’t have to be tough Courtesy: Khronos Group 51
  52. 52. Libraries and languages using OpenCL Courtesy: AMD 52
  53. 53. Libraries and languages using OpenCL Courtesy: AMD 53
  54. 54. Libraries and languages using OpenCL (cont.) Courtesy: AMD 54
  55. 55. Libraries and languages using OpenCL (cont.) Courtesy: AMD 55
  56. 56. C++AMP • C++ Accelerated Massive Parallelism • Pioneered by Microsoft – Requirements: Windows 7 + Visual Studio 2012 • Followed by Intel’s experimental implementation – C++ AMP on Clang/LLVM and OpenCL (AWOL since 2013) • Now the HSA Foundation is taking the lead • Keywords: restrict(amp), array_view, parallel_for_each, … – Example: SUM = A + B; // (2D arrays) — see the sketch below 56
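  The example code on the slide is an image; a minimal sketch of that 2D SUM = A + B in C++ AMP (row-major std::vector storage is an assumption):

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    // Minimal sketch of SUM = A + B over 2D arrays in C++ AMP.
    void sum_2d(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& sum, int rows, int cols) {
        array_view<const float, 2> A(rows, cols, a);
        array_view<const float, 2> B(rows, cols, b);
        array_view<float, 2> SUM(rows, cols, sum);
        SUM.discard_data();                   // no need to copy SUM in
        parallel_for_each(SUM.extent,         // one thread per element
            [=](index<2> idx) restrict(amp) {
                SUM[idx] = A[idx] + B[idx];
            });
        SUM.synchronize();                    // copy results back to sum
    }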
  57. 57. OpenCL Ecosystem Courtesy: Khronos Group 57
  58. 58. SYCL’s flavour: A[i] = B[i]*2 — see the sketch below 58 Work-in-progress implementations: – AMD: triSYCL → https://github.com/amd/triSYCL – Codeplay: http://www.codeplay.com/ Advantages: 1. Easy to understand the concept of work-groups 2. Performance-portable between CPU and GPU 3. Barriers are automatically deduced
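  The slide's code is an image; a minimal sketch of A[i] = B[i]*2 in SYCL 1.2-era style (function and kernel names are assumptions):

    #include <CL/sycl.hpp>
    using namespace cl::sycl;

    // Minimal sketch: buffers wrap host memory; the runtime deduces
    // the dependencies and copies data back when the buffers die.
    void scale(float* A, float* B, size_t N) {
        queue q;                              // default device
        buffer<float, 1> bufA(A, range<1>(N));
        buffer<float, 1> bufB(B, range<1>(N));
        q.submit([&](handler& cgh) {
            auto a = bufA.get_access<access::mode::write>(cgh);
            auto b = bufB.get_access<access::mode::read>(cgh);
            cgh.parallel_for<class scale_k>(range<1>(N),
                [=](id<1> i) { a[i] = b[i] * 2; });
        });
    }   // bufA synchronizes back to A on destruction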
  59. 59. StarPU • A runtime system for heterogeneous architectures • Dynamically schedules tasks on all processing units – Sees a pool of heterogeneous cores • Avoids unnecessary data transfers between accelerators – Software SVM for heterogeneous machines 59 (diagram: A = A+B executed across CPUs and GPUs, each with its own memory M)
  60. 60. Overview of StarPU • Maximizing PU occupancy, minimizing data transfers • Ideas: – Accept tasks that may have multiple implementations • Together with potential inter-dependencies – Leads to a dynamic acyclic graph of tasks – Provide a high-level data management layer (Virtual Shared Memory, VSM) • The application should only describe – which data may be accessed by tasks – how data may be divided 60 (software stack: Applications / Parallel Compilers / Parallel Libraries → StarPU → Drivers (CUDA, OpenCL) → CPU, GPU, …)
  61. 61. Task scheduling • Dealing with heterogeneous hardware accelerators • Tasks = 61 – Data input & output – Dependencies with other tasks – Multiple implementations • E.g. CUDA + CPU • Scheduling hints • StarPU provides an open scheduling platform – Scheduling algorithms are plug-ins – Predefined set of popular policies (diagram: a task f(A_RW, B_R) with cpu/gpu/spu implementations)
  62. 62. Task scheduling • Predefined set of popular policies • Eager Scheduler 62 – First come, first served policy – Only one queue • Work-Stealing Scheduler – Load-balancing policy – One queue per worker • Priority Scheduler – Describes the relative importance of tasks – One queue per priority
  63. 63. Task scheduling • Predefined set of popular policies • Dequeue Model (DM) Scheduler 63 – Uses codelet performance models • Kernel calibration on each available computing device – Raw history model of kernels’ past execution times – Refined models using regression on kernels’ execution-time history • Dequeue Model Data Aware (DMDA) Scheduler – Data transfer cost vs. kernel offload benefit – Transfer cost modelling – Bus calibration
  64. 64. Some results (MxV, 4 CPUs, 1 GPU) • StarPU configs compared: Eager vs. DMDA scheduler, each with 3 CPUs + 1 GPU and 4 CPUs + 1 GPU 64
  65. 65. Terminology • A Codelet. . . – . . . relates an abstract computation kernel to its implementation(s) – . . . can be instantiated into one or more tasks – . . . defines characteristics common to a set of tasks • A Task. . . – . . . is an instantiation of a Codelet – . . . atomically executes a kernel from its beginning to its end – . . . receives some input – . . . produces some output • A Data Handle. . . – . . . designates a piece of data managed by StarPU – . . . is typed (vector, matrix, etc.) – . . . can be passed as input/output for a Task 65
  66. 66. Basic Example: Scaling a Vector 66

    /* Declaring a codelet: kernel functions, number of data pieces,
       and the data access mode */
    struct starpu_codelet scal_cl = {
        .cpu_funcs = { scal_cpu_f, NULL },
        .cuda_funcs = { scal_cuda_f, NULL },
        .nbuffers = 1,
        .modes = { STARPU_RW },
    };

    /* Kernel functions */
    void scal_cpu_f(void *buffers[], void *cl_arg) {
        /* retrieve the data handle */
        struct starpu_vector_interface *vector_handle = buffers[0];
        /* get a pointer from the data handle */
        float *vector = STARPU_VECTOR_GET_PTR(vector_handle);
        /* get the small-size inline data */
        float *ptr_factor = cl_arg;
        /* do the computation */
        for (unsigned i = 0; i < NX; i++)
            vector[i] *= *ptr_factor;
    }
    void scal_cuda_f(void *buffers[], void *cl_arg) { … }
  67. 67. Basic Example: Scaling a Vector 67

    /* Main code */
    float factor = 3.14;
    float vector1[NX];
    float vector2[NX];
    /* declare data handles */
    starpu_data_handle_t vector_handle1;
    starpu_data_handle_t vector_handle2;
    /* ..... */
    /* register pieces of data and get the handles
       (now under StarPU control) */
    starpu_vector_data_register(&vector_handle1, 0, (uintptr_t)vector1,
                                NX, sizeof(vector1[0]));
    starpu_vector_data_register(&vector_handle2, 0, (uintptr_t)vector2,
                                NX, sizeof(vector2[0]));
    /* non-blocking task submits (params: codelet, StarPU-managed data,
       small-size inline data) */
    starpu_task_insert(&scal_cl, STARPU_RW, vector_handle1,
                       STARPU_VALUE, &factor, sizeof(factor), 0);
    starpu_task_insert(&scal_cl, STARPU_RW, vector_handle2,
                       STARPU_VALUE, &factor, sizeof(factor), 0);
    /* wait for all tasks submitted so far */
    starpu_task_wait_for_all();
    /* unregister the pieces of data (the handles are destroyed;
       the vectors are back under user control) */
    starpu_data_unregister(vector_handle1);
    starpu_data_unregister(vector_handle2);
    /* ..... */
  68. 68. Qualcomm MARE • MARE is a programming model and a runtime system that provides simple yet powerful abstractions for parallel, power-efficient software – Simple C++ API allows developers to express concurrency – User-level library that runs on any Android device, and on Linux, Mac OS X, and Windows platforms • The goal of MARE is to reduce the effort required to write apps that fully utilize heterogeneous SoCs • Concepts: – Tasks are units of work that can be asynchronously executed – Groups are sets of tasks that can be canceled or waited on 68
  69. 69. Basic Example: Hello World 69
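  The slide's code is an image; below is a minimal hello-world sketch assuming the task API described in MARE's documentation of the time (mare::create_task, mare::launch, mare::wait_for; the exact runtime init/shutdown calls are assumptions):

    #include <mare/mare.h>
    #include <iostream>

    int main() {
        mare::runtime::init();           /* assumed runtime start-up call */
        auto hello = mare::create_task([] {
            std::cout << "Hello World!" << std::endl;
        });
        mare::launch(hello);             /* asynchronous dispatch */
        mare::wait_for(hello);           /* block until the task has run */
        mare::runtime::shutdown();
        return 0;
    }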
  70. 70. More complex example: C=A+B on GPU 70
  71. 71. More complex example: C=A+B on GPU 71
  72. 72. MARE departures • Similarities with TBB – Based on tasks and a 2-level API (task level and templates) • pfor_each, ptransform, pscan, … • Synchronous Dataflow classes ≈ TBB’s Flow Graphs – Concurrent data structures: queue, stack, … • Departures – Expression of dependencies is first class – Flexible group membership and work or group cancellation – Optimized for some Qualcomm chips • Power classes: – Static: mare::power::mode {efficient, saver, …} – Dynamic: mare::power::set_goal(desired, tolerance) • Aware of the mobile architecture: aggressive power management – Cores can be shut down or affected by DVFS 72
  73. 73. MARE results • Zoomm web browser implemented on top of MARE C. Cascaval, et al.. ZOOMM: a parallel web browser engine for multicore mobile devices. In Symposium on Principles and practice of parallel programming, PPoPP ’13, pages 271–280, 2013. 73
  74. 74. MARE results • Bullet Physics parallelized with MARE Courtesy: Calin Cascaval 74
  75. 75. Intel Concord • C++ heterogeneous programming framework for integrated CPU and GPU processors – Shared Virtual Memory (SVM) in software – Adapts existing data-parallel C++ constructs to heterogeneous computing using TBB – Available open source as Intel Heterogeneous Research Compiler (iHRC) at https://github.com/IntelLabs/iHRC/ • Papers: – Rajkishore Barik, Tatiana Shpeisman, et al. Efficient mapping of irregular C++ applications to integrated GPUs. CGO 2014. – Rashid Kaleem, Rajkishore Barik, Tatiana Shpeisman, Brian Lewis, Chunling Hu, and Keshav Pingali. Adaptive heterogeneous scheduling on integrated GPUs. PACT 2014. 75
  76. 76. Intel Concord • Extends the TBB API: – parallel_for_hetero(int numiters, const Body &B, bool device); – parallel_reduce_hetero(int numiters, const Body &B, bool device); Courtesy: Intel 76
  77. 77. Example: parallel_for_hetero • The Concord compiler generates the OpenCL version – Automatically takes care of the data thanks to SVM — see the sketch below Courtesy: Intel 77
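  The slide's code is an image; a hypothetical sketch against the API on the previous slide (the Body shape with operator()(int) per iteration is an assumption, not Intel's exact interface):

    // Hypothetical Body class for Concord's parallel_for_hetero.
    class VecAdd {
        float *a, *b, *c;
    public:
        VecAdd(float *a_, float *b_, float *c_) : a(a_), b(b_), c(c_) {}
        void operator()(int i) const { c[i] = a[i] + b[i]; }
    };

    void run(float *a, float *b, float *c, int n) {
        // true → offload to the GPU; thanks to SVM the same pointers
        // are valid there, with no explicit buffer transfers.
        parallel_for_hetero(n, VecAdd(a, b, c), true);
    }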
  78. 78. Concord framework Courtesy: Intel 78
  79. 79. SVM SW implementation on Haswell 79
  80. 80. SVM translation in OpenCL code • svm_const is a runtime constant and is computed once • Every CPU pointer is converted into the GPU address space with AS_GPU_PTR before it is dereferenced on the GPU — see the sketch below 80
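  The slide's code is an image; a hypothetical sketch of what the generated OpenCL looks like (AS_GPU_PTR and svm_const are named on the slide, but the macro body and kernel are assumptions):

    /* Hypothetical sketch: svm_const is the runtime offset between the
       CPU and GPU views of the shared heap, computed once per launch. */
    #define AS_GPU_PTR(T, p) ((__global T *)((ulong)(p) + svm_const))

    __kernel void walk(ulong svm_const, ulong cpu_head) {
        /* convert the CPU pointer before dereferencing it on the GPU */
        __global int *node = AS_GPU_PTR(int, cpu_head);
        int value = *node;
        (void)value;   /* use of value elided */
    }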
  81. 81. Concord results 81
  82. 82. Speedup & Energy savings vs multicore CPU 82
  83. 83. Heterogeneous execution on both devices • Iteration space distributed among the available devices • Problem: find the best data partition • Example: Barnes-Hut and Facedetect relative execution time – Varying the amount of work offloaded to the GPU – For BH the optimum is 40% of the work carried out on the GPU – For FD the optimum is 0% of the work carried out on the GPU 83
  84. 84. Partitioning based on on-line profiling 84 • Naïve profiling: assign a profiling chunk to both the CPU and the GPU; compute the chunk on each; barrier; then partition the rest of the iteration space according to the relative speeds. • Asymmetric profiling: assign a profiling chunk just to the GPU while the CPU computes regular chunks; when the GPU is done, partition the rest of the iteration space according to the relative speeds (avoiding the barrier of the naïve scheme) — a sketch of the split follows.
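  A minimal sketch (not the paper's code) of the final step: once the profiling chunks have produced per-device throughputs, split the remaining iterations proportionally so both devices should finish together:

    #include <cstddef>

    struct Partition { std::size_t gpu; std::size_t cpu; };

    // Split the remaining iteration space by the measured relative
    // speeds (throughputs in items per millisecond, from profiling).
    Partition split_rest(std::size_t remaining,
                         double cpu_throughput, double gpu_throughput) {
        double total = cpu_throughput + gpu_throughput;
        std::size_t gpu_share =
            static_cast<std::size_t>(remaining * (gpu_throughput / total));
        return { gpu_share, remaining - gpu_share };
    }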
  85. 85. Agenda • Motivation • Hardware – Heterogeneous chips – Integrated GPUs – Advantages • Software – Programming models for heterogeneous systems – Programming models for heterogeneous chips – Our approach based on TBB 85
  86. 86. Our heterogeneous parallel_for Angeles Navarro, Antonio Vilches, Francisco Corbera and Rafael Asenjo Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures, The Journal of Supercomputing, May 2014 86
  87. 87. Comparison with StarPU • MxV benchmark – Three schedulers tested: greedy, work-stealing, HEFT – Static chunk size: 2000, 200 and 20 matrix rows 91
  88. 88. Choosing the GPU block size • Belviranli, M. E., Bhuyan, L. N., & Gupta, R. (2013). A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures. ACM Trans. Archit. Code Optim., 9(4), 57:1–57:20. 92
  89. 89. GPU block size for irregular codes • Adapt between time-steps and inside the time-step 93 [plot: Barnes-Hut average throughput per chunk size (0 to 81960), comparing static vs. adaptive chunk sizes at time steps 0 and 30]
  90. 90. GPU block size for irregular codes • Throughput variation along the iteration space – For two different time-steps – Different GPU chunk sizes 94 [plots: Barnes-Hut throughput variation over the iteration space at time steps 0 and 5, for chunk sizes 320, 640, 1280 and 2560]
  91. 91. Adapting the GPU chunk-size • Assumption: – Irregular behavior can be seen as a sequence of regimes of regular behavior 95 • LogFit fits the GPU throughput λG to a·ln(x) + b on-line: while λG keeps rising the chunk size is doubled (G(t−1)·2), when λG drops it is halved (G(t−1)/2), and once the fit flattens the chunk settles near G = a/thld (the point where the fitted slope a/x falls to the threshold thld) — see the sketch below [plots: GPU throughput and LogFit chunk size along the iteration space]
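  A sketch of a LogFit-style chunk update (names, thresholds and structure are assumptions, not the paper's exact code):

    // Grow the GPU chunk while throughput still climbs, shrink it on a
    // regime change, and settle once the logarithmic fit flattens.
    double next_chunk(double prev_chunk, double prev_thr, double cur_thr,
                      double a /* slope of fitted a*ln(x)+b */, double thld) {
        if (cur_thr > prev_thr * 1.05) return prev_chunk * 2;  // climbing
        if (cur_thr < prev_thr * 0.95) return prev_chunk / 2;  // regime change
        // plateau: derivative of a*ln(x)+b is a/x; a/x == thld  =>  x = a/thld
        return a / thld;
    }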
  92. 92. Preliminary results: Energy-Performance on Haswell 96 • Static: oracle-like static partition of the work based on profiling • Concord: Intel’s approach: GPU chunk size computed once • HDSS: Belviranli et al.’s approach: GPU chunk size computed once • LogFit: our dynamic CPU and GPU chunk-size partitioner [plots: energy per iteration vs. performance for Barnes-Hut, and the off-line search for the static partition as a function of the percentage of the iteration space offloaded to the GPU]
  93. 93. Preliminary results: Energy-Performance 97 • Energy savings w.r.t. Static: up to 52% (18% on average) • w.r.t. Concord and HDSS: up to 94% and 69% (28% and 27% on average) [plots: energy per iteration vs. performance for CFD, SpMV, Nbody and one further benchmark, each comparing Static, Concord, HDSS and LogFit]
  94. 94. Our heterogeneous pipeline • ViVid, an object detection application • Contains three main kernels that form a pipeline: Input frame → Stage 1: Filter (response matrix) → Stage 2: Histogram (index matrix, histograms) → Stage 3: Classifier (detection response) → Output • Would like to answer the following questions: – Granularity: coarse- or fine-grained parallelism? – Mapping of stages: where do we run them (CPU/GPU)? – Number of cores: how many of them when running on the CPU? – Optimum: what metric do we optimize (time, energy, both)? 98 — a TBB pipeline sketch follows
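  The deck's approach is TBB-based; a minimal sketch (not the authors' code) of the three-stage structure with tbb::parallel_pipeline, the kernels stubbed out:

    #include <cstdio>
    #include "tbb/pipeline.h"

    // Serial input stage, parallel middle stage, serial output stage.
    struct Frame { int id; };

    void run_pipeline(int nframes) {
        int next = 0;
        tbb::parallel_pipeline(/*max_tokens=*/8,
            tbb::make_filter<void, Frame*>(tbb::filter::serial_in_order,
                [&](tbb::flow_control& fc) -> Frame* {
                    if (next == nframes) { fc.stop(); return nullptr; }
                    return new Frame{next++};             // input stage
                })
          & tbb::make_filter<Frame*, Frame*>(tbb::filter::parallel,
                [](Frame* f) {
                    /* filter + histogram + classifier would run here,
                       on the CPU cores or offloaded to the GPU */
                    return f;
                })
          & tbb::make_filter<Frame*, void>(tbb::filter::serial_in_order,
                [](Frame* f) {
                    std::printf("frame %d done\n", f->id);  // output stage
                    delete f;
                }));
    }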
  95. 95. Granularity 99 • Coarse grain: CG • Medium grain: MG • Fine grain: FG – Fine grain also in the CPU via AVX intrinsics [diagrams: CG, MG and FG mappings of the input stage, Stages 1-3 and the output stage onto CPU cores (C) and the GPU]
  96. 96. More mappings 100 [diagrams: further mappings of the pipeline stages, each stage assigned to either CPU cores or the GPU]
  97. 97. Accounting for all alternatives • In general, with nC CPU cores, 1 GPU and p pipeline stages: # alternatives = 2^p × (nC + 2) • For Rodinia’s SRAD benchmark (p=6, nC=4) → 2^6 × 6 = 384 alternatives 101
  98. 98. Framework and Model • Key idea: 1. Run only on the GPU 2. Run only on the CPU (collecting λ and E, the homogeneous values) 3. Analytically extrapolate for heterogeneous execution 4. Find the best configuration → RUN 102
  99. 99. Environmental Setup: Benchmarks • Four benchmarks 103 – ViVid (Low and High Definition inputs): Input → Filter → Histogram → Classifier → Output – SRAD: Input → Extract → Prep. → Reduct. → Comp.1 → Comp.2 → Statist. → Output – Tracking: Input → Track. → Output – Scene Recognition: Input → Feature → SVM → Output
  100. 100. ViVid: throughput/energy on Ivy Bridge 106 [plots: throughput/energy vs. number of threads (CG) for the LD (600x416) and HD (1920x1080) inputs; diagrams: the CP-CG mapping (GPU-CPU path and CPU path) and the CP-MG mapping]
  101. 101. ViVid: throughput/energy on Haswell 107 [plots: throughput/energy vs. number of threads (CG) for the LD and HD inputs; diagram: the CP-MG mapping]
  102. 102. Does Higher Throughput imply Lower Energy? 108 [plots on Haswell, HD input: throughput, energy per frame, and throughput/energy vs. number of threads (CG)]
  103. 103. SRAD on Ivy Bridge 109 [plots: throughput/energy vs. number of threads (CG); diagrams: the CP-MG and DP-MG mappings of SRAD’s six stages onto CPU cores and the GPU]
  104. 104. SRAD on Haswell 110 [plots: throughput/energy vs. number of threads (CG); diagram: the DP-CG mapping of SRAD’s six stages]
  105. 105. On-going work • Test the model on other heterogeneous chips • ViVid LD running on an Odroid XU-E – Ivy Bridge: 65 fps, 0.7 J/frame and 92 fps/J with CP-CG(5) – Exynos 5: 0.7 fps, 11 J/frame and 0.06 fps/J with CP-CG(4) 112 [plots: estimated vs. measured throughput and throughput/energy for CP-CG, DP-CG and CPU-CG with 1-4 threads]
  106. 106. Future Work • Consider energy for parallel loop scheduling/partitioning • Consider MARE as an alternative to our TBB-based implementation • Apply DVFS to reduce energy consumption – For video applications: no need to go faster than 33 fps • Also explore other parallel patterns: – Reduce – Parallel_do – … 113
  107. 107. Conclusions • Plenty of heterogeneous on-chip architectures out there • It is important to use both devices – Need to find the best mapping/distribution/scheduling out of the many possible alternatives • Programming models and runtimes aimed at this goal are in their infancy: they may have a huge impact on the mobile market • Challenges: – Hide the hardware complexity – Consider energy in the partition/scheduling decisions – Minimize the overhead of the adaptation policies 114
  108. 108. Collaborators • Mª Ángeles González Navarro (UMA, Spain) • Francisco Corbera (UMA, Spain) • Antonio Vilches (UMA, Spain) • Andrés Rodríguez (UMA, Spain) • Alejandro Villegas (UMA, Spain) • Rubén Gran (U. Zaragoza, Spain) • Maria Jesús Garzarán (UIUC, USA) • Mert Dikmen (UIUC, USA) • Kurt Fellows (UIUC, USA) • Ehsan Totoni (UIUC, USA) 115
  109. 109. Questions asenjo@uma.es http://www.ac.uma.es/~asenjo 116
