HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng
 

Presentation HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng at the AMD Developer Summit (APU13) November 11-13, 2013.



    Presentation Transcript

    • Towards an Ecosystem for Heterogeneous Parallel Computing. Wu Feng, Dept. of Computer Science, Dept. of Electrical & Computer Engineering, Virginia Bioinformatics Institute. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Japanese ‘Computnik’ Earth Simulator Shatters U.S. Supercomputer Hegemony © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Importance of High-Performance Computing (HPC) in the LARGE and in the Small. [Chart: Competitive Risk From Not Having Access to HPC; responses: could exist and compete (3%), could not exist as a business, could not compete on quality & testing issues, could not compete on time to market & cost (remaining shares of 34%, 47%, and 16%).] Data from Council of Competitiveness; sponsored survey conducted by IDC. Only 3% of companies could exist and compete without HPC. 200+ participating companies, including many Fortune 500 (Proctor & Gamble and biological and chemical companies). © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Computnik 2.0? The Second Coming of Computnik? No … "only" 43% faster than the previous #1 supercomputer, but → $20M cheaper than the previous #1 supercomputer and → 42% less power consumption. The Second Coming of the "Beowulf Cluster" for HPC: the further commoditization of HPC. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • The First Coming of the "Beowulf Cluster": utilize commodity PCs (with commodity CPUs) to build a supercomputer. The Second Coming of the "Beowulf Cluster": utilize commodity PCs (with commodity CPUs) + commodity graphics processing units (GPUs) to build a supercomputer. Issue: extracting performance with programming ease and portability → productivity. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • "Holy Grail" Vision: an ecosystem for heterogeneous parallel computing, enabling software that tunes parameters of hardware devices … with respect to performance, programmability, and portability … via a benchmark suite of dwarfs (i.e., motifs) and mini-apps. (Pictured: highest-ranked commodity supercomputer in the U.S. on the Green500, 11/11.) © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Design of Composite Structures. Software Ecosystem. Heterogeneous Parallel Computing (HPC) Platform. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • An Ecosystem for Heterogeneous Parallel Computing ^ Applications Sequence Alignment Molecular Dynamics Earthquake Modeling Neuroinformatics CFD for Mini-Drones Software Ecosystem Heterogeneous Parallel Computing (HPC) Platform … inter-node (in brief) with application to BIG DATA (StreamMR) synergy.cs.vt.edu  
    • Performance, Programmability, Portability Roadmap © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Programming GPUs CUDA •  NVIDIA’s proprietary framework OpenCL •  Open standard for heterogeneous parallel computing (Khronos Group) •  Vendor-neutral environment for CPUs, GPUs, APUs, and even FPGAs © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
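    As a rough side-by-side (illustrative code, not from the slides; the kernel name vec_add is made up), the same vector-add kernel in the two models:

        /* CUDA (NVIDIA's proprietary framework) */
        __global__ void vec_add(const float *a, const float *b, float *c, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
            if (i < n) c[i] = a[i] + b[i];
        }
        /* launched with: vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n); */

        /* OpenCL (vendor-neutral: CPUs, GPUs, APUs, even FPGAs) */
        __kernel void vec_add(__global const float *a, __global const float *b,
                              __global float *c, int n) {
            int i = get_global_id(0);                        /* global work-item index */
            if (i < n) c[i] = a[i] + b[i];
        }
        /* launched with clEnqueueNDRangeKernel() on a command queue */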
    • Prevalence. CUDA: first public release February 2007. OpenCL: first public release December 2008. Search date: 2013-03-25. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • CU2CL: CUDA-to-OpenCL Source-to-Source Translator†. Works as a Clang plug-in to leverage its production-quality compiler framework. Covers primary CUDA constructs found in CUDA C and the CUDA run-time API. Performs as well as codes manually ported from CUDA to OpenCL. See APU'13 Talk (PT-4057) from Tuesday, November 12: "Automated CUDA-to-OpenCL Translation with CU2CL: What's Next?" † "CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-core Architectures," 17th IEEE Int'l Conf. on Parallel & Distributed Systems (ICPADS), Dec. 2011. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • OpenCL: Write Once, Run Anywhere OpenCL Program CUDA Program CU2CL (“cuticle”) OpenCL-supported CPUs, GPUs, FPGAs © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 NVIDIA GPUs synergy.cs.vt.edu  
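    To give a feel for what such a translation touches on the host side (an illustrative mapping, not CU2CL's actual output; ctx, queue, kernel, and the buffer names are assumed to be set up elsewhere):

        /* CUDA run-time API */
        float *d_a;
        cudaMalloc((void **)&d_a, n * sizeof(float));
        cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
        vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
        cudaMemcpy(h_c, d_c, n * sizeof(float), cudaMemcpyDeviceToHost);

        /* Rough OpenCL equivalent */
        cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY, n * sizeof(float), NULL, &err);
        clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0, n * sizeof(float), h_a, 0, NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);     /* ...and the remaining args */
        size_t global = blocks * threads, local = threads;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, n * sizeof(float), h_c, 0, NULL, NULL);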
    • CU2CL translation results (Application: CUDA lines / lines manually changed / % auto-translated): bandwidthTest: 891 / 5 / 98.9; BlackScholes: 347 / 14 / 96.0; matrixMul: 351 / 9 / 97.4; vectorAdd: 147 / 0 / 100.0; Back Propagation: 313 / 24 / 92.3; Hotspot: 328 / 2 / 99.4; Needleman-Wunsch: 430 / 3 / 99.3; SRAD: 541 / 0 / 100.0; Fen Zi (Molecular Dynamics): 17,768 / 1,796 / 89.9; GEM (Molecular Modeling): 524 / 15 / 97.1; IZ PS (Neural Network): 8,402 / 166 / 98.0. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Why CU2CL? A much larger number of apps are implemented in CUDA than in OpenCL. Idea: leverage scientists' investment in CUDA to drive OpenCL adoption. Issues (from the perspective of domain scientists): writing from scratch has a steep learning curve (OpenCL is an even lower-level API than CUDA, which is itself low level), and porting from CUDA is tedious and error-prone. Significant demand from major stakeholders (and Apple …). Why Not CU2CL? Just start with OpenCL?! CU2CL only does source-to-source translation at present: functional (code) portability, yes; performance portability, mostly no. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Performance, Programmability, Portability Roadmap IEEE ICPADS ’11 P2S2 ’12 Parallel Computing ‘13 © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 What about performance portability? synergy.cs.vt.edu  
    • Performance, Programmability, Portability Roadmap © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Traditional Approach to an Optimizing Compiler: Architecture-Independent Optimization; Architecture-Aware Optimization. C. del Mundo, W. Feng, "Towards a Performance-Portable FFT Library for Heterogeneous Computing," in IEEE IPDPS '14, Phoenix, AZ, USA, May 2014 (under review). C. del Mundo, W. Feng, "Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture," SC|13, Denver, CO, Nov. 2013. C. del Mundo, V. Adhinarayanan, W. Feng, "Accelerating FFT for Wideband Channelization," in IEEE ICC '13, Budapest, Hungary, June 2013. synergy.cs.vt.edu
    • Computational Units Not Created Equal: "AMD CPU ≠ Intel CPU" and "AMD GPU ≠ NVIDIA GPU". Initial performance of a CUDA-optimized N-body dwarf: -32% when run "architecture-unaware" (or "architecture-independent"). © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Basic & Architecture-Unaware Execution. [Chart: speedup over hand-tuned SSE for basic, architecture-unaware, and architecture-aware implementations on the NVIDIA GTX280 and AMD 5870; values range from 88x to 371x, with the architecture-aware AMD 5870 result (371x) 12% above the NVIDIA GTX280 (328x).] © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Optimization for N-body Molecular Modeling. Optimization techniques on AMD GPUs (isolated): removing conditions → kernel splitting; local staging; using vector types; using image memory. Speedup over the basic OpenCL GPU implementation: isolated optimizations and combined optimizations. Legend: MT: Max Threads; KS: Kernel Splitting; RA: Register Accumulator; RP: Register Preloading; LM: Local Memory; IM: Image Memory; LU{2,4}: Loop Unrolling {2x,4x}; VASM{2,4}: Vectorized Access & Scalar Math {float2, float4}; VAVM{2,4}: Vectorized Access & Vector Math {float2, float4}. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
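    To make one of those acronyms concrete, a minimal OpenCL C sketch (my illustration, not the authors' kernel; buffer names are hypothetical) of VASM4, vectorized access with scalar math, versus plain scalar access:

        /* Scalar access and scalar math: one float per load */
        __kernel void scale_scalar(__global const float *in, __global float *out,
                                   float s, int n) {
            int i = get_global_id(0);
            if (i < n) out[i] = s * in[i];
        }

        /* VASM4: vectorized access (float4 loads) with scalar math on the components */
        __kernel void scale_vasm4(__global const float4 *in, __global float4 *out,
                                  float s, int n4) {
            int i = get_global_id(0);
            if (i < n4) {
                float4 v = in[i];                /* one wide load instead of four */
                out[i] = (float4)(s * v.x, s * v.y, s * v.z, s * v.w);
            }
        }
        /* VAVM4 would instead use vector math: out[i] = s * v; */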
    • AMD-Optimized N-body Dwarf: 371x speed-up, 12% better than the NVIDIA GTX 280 (compared with -32% before architecture-aware optimization). © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • FFT: Fast Fourier Transform. A spectral method that is a critical building block across many disciplines. synergy.cs.vt.edu
    • Peak SP Performance & Peak Memory Bandwidth © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Summary of Optimizations © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Optimizations. RP: Register Preloading: all data elements are first preloaded into the register file before use; computation is facilitated solely on registers. LM-CM: Local Memory (Communication Only): data elements are loaded into local memory only for communication; threads swap data elements solely in local memory. CGAP: Coalesced Global Access Pattern: threads access memory contiguously. VASM{2|4}: Vector Access, Scalar Math, float{2|4}: data elements are loaded as the listed vector type; arithmetic operations are scalar (float × float). CM-K: Constant Memory for Kernel Arguments: the twiddle-multiplication stage of the FFT is precomputed on the CPU and stored in GPU constant memory for fast look-up. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
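    A simplified sketch (assumed code, not the paper's FFT kernel; a work-group size of 64 is assumed) of how RP, LM-CM, and CGAP fit together:

        __kernel void exchange_step(__global float2 *data) {
            __local float2 tile[64];            /* local memory used only for communication */
            int lid = get_local_id(0);
            int gid = get_global_id(0);

            float2 v = data[gid];               /* RP: preload into a register; CGAP: contiguous load */
            /* ... butterfly arithmetic would happen here, entirely in registers ... */

            tile[lid] = v;                      /* LM-CM: stage solely to swap with a neighbor */
            barrier(CLK_LOCAL_MEM_FENCE);
            float2 partner = tile[lid ^ 1];     /* exchange data elements between work-items */

            data[gid] = partner;                /* CGAP: contiguous (coalesced) store */
        }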
    • Architecture-Optimized FFT © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Architecture-Optimized FFT © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Performance, Programmability, Portability Roadmap GPU Computing Gems J. Molecular Graphics & Modeling IEEE HPCC ’12 IEEE HPDC ‘13 IEEE ICPADS ’11 IEEE ICC ‘13 © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Performance, Programmability, Portability Roadmap FUTURE  WORK   © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Performance, Programmability, Portability Roadmap © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • The Berkeley View†. Traditional approach: applications that target existing hardware and programming models. Berkeley approach: hardware design that keeps future applications in mind. Basis for future applications? 13 computational dwarfs. A computational dwarf is a pattern of communication & computation that is common across a set of applications. The dwarfs: Dense Linear Algebra, Sparse Linear Algebra, Spectral Methods, N-Body Methods, Structured Grids, Unstructured Grids, Monte Carlo (→ MapReduce), Combinational Logic, Graph Traversal, Dynamic Programming, Backtrack & Branch+Bound, Graphical Models, Finite State Machine. † Asanovic, K., et al., The Landscape of Parallel Computing Research: A View from Berkeley, Tech. Rep. UCB/EECS-2006-183, University of California, Berkeley, Dec. 2006. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Example of a Computational Dwarf: N-Body •  Computational Dwarf: Pattern of computation & communication … that is common across a set of applications •  N-Body problems are studied in –  Cosmology, particle physics, biology, and engineering •  All have similar structures •  An N-Body benchmark can provide meaningful insight to people in all these fields •  Optimizations may be generally applicable as well RoadRunner Universe: Astrophysics © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 GEM: Molecular Modeling synergy.cs.vt.edu  
    • First Instantiation: OpenDwarfs (formerly “OpenCL and the 13 Dwarfs”)   •  Goal –  Provide common algorithmic methods, i.e., dwarfs, in a language that is “write once, run anywhere” (CPU, GPU, or even FPGA), i.e., OpenCL •  Part of a larger umbrella project (2008-2018), funded by the NSF Center for High-Performance Reconfigurable Computing (CHREC) © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Performance, Programmability, Portability ACM ICPE ’12: OpenCL & the 13 Dwarfs Roadmap IEEE Cluster ’11, FPGA ’11, IEEE ICPADS ’11, SAAHPC ’11 © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Performance, Programmability, Portability Roadmap © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Performance & Power Modeling. Goals: robust framework; very high accuracy (target: < 5% prediction error); identification of portable predictors for performance and power; multi-dimensional characterization (performance → sequential, intra-node parallel, inter-node parallel; power → component level, node level, cluster level). © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Problem Formulation: LP-Based Energy-Optimal DVFS Schedule. Definitions: a DVFS system exports n settings {(f_i, P_i)}; T_i is the total execution time of a program running at setting i. Given a program with deadline D, find a DVFS schedule (t_1*, …, t_n*) such that, if the program is executed for t_i seconds at setting i, the total energy usage E is minimized, the deadline D is met, and the required work is completed. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
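    One consistent way to write this linear program from those definitions (a reconstruction, not copied from the slide):

        \min_{t_1,\dots,t_n} \; E = \sum_{i=1}^{n} P_i \, t_i
        \quad \text{s.t.} \quad
        \sum_{i=1}^{n} t_i \le D, \qquad
        \sum_{i=1}^{n} \frac{t_i}{T_i} \ge 1, \qquad
        t_i \ge 0 .

    The second constraint says that the fractions of the program's work completed at each setting must add up to the whole job.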
    • Single-Coefficient β Performance Model. Our formulation: define the relative performance slowdown δ as T(f) / T(f_MAX) − 1, and re-formulate the two-coefficient model as a single-coefficient model. The coefficient β is computed at run-time using a regression method on the past MIPS rates reported by the built-in PMU. C. Hsu and W. Feng, "A Power-Aware Run-Time System for High-Performance Computing," SC|05, Nov. 2005. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
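    Restating the single-coefficient model of Hsu and Feng (SC|05) in formula form:

        \delta(f) = \frac{T(f)}{T(f_{\max})} - 1 = \beta\left(\frac{f_{\max}}{f} - 1\right),
        \qquad\text{i.e.}\qquad
        T(f) = T(f_{\max})\left[\,\beta\,\frac{f_{\max}}{f} + (1-\beta)\right].

    A compute-bound code has β ≈ 1 (it slows in proportion to frequency), while a memory-bound code has β ≈ 0 (it barely slows), which is what makes β a useful run-time knob for DVFS.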
    • β – Adaptation on NAS Parallel Benchmarks C. Hsu and W. Feng. “A Power-Aware Run-Time System for High-Performance Computing,” SC|05, Nov. 2005. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Need for Better Performance & Power Modeling. [Chart source: Virginia Tech.] © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Search for Reliable Predictors. Re-visit performance counters used for prediction: applicability of performance counters across generations of architecture. Performance counters monitored (NPB and PARSEC benchmarks; platform: Intel Xeon E5645 CPU, Westmere-EP): context switches (CS); instructions issued (IIS); L3 data cache misses (L3 DCM); L2 data cache misses (L2 DCM); L2 instruction cache misses (L2 ICM); instructions completed (TOTINS); outstanding bus request cycles (OutReq); instruction queue write cycles (InsQW). © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
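    Counters like these are typically gathered with a library such as PAPI; a minimal sketch using PAPI's classic high-level counter interface (my example, not from the talk; the two events chosen are only illustrative):

        #include <stdio.h>
        #include <papi.h>

        int main(void) {
            int events[2] = { PAPI_TOT_INS, PAPI_L2_DCM };   /* instructions completed, L2 data cache misses */
            long long counts[2];

            PAPI_library_init(PAPI_VER_CURRENT);
            if (PAPI_start_counters(events, 2) != PAPI_OK) return 1;

            /* ... region of interest: the benchmark kernel being characterized ... */

            PAPI_stop_counters(counts, 2);
            printf("TOT_INS = %lld, L2_DCM = %lld\n", counts[0], counts[1]);
            return 0;
        }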
    • synergy.cs.vt.edu  
    • Performance, Programmability, Portability Roadmap ACM/IEEE SC ’05 ICPP ’07 IEEE/ACM CCGrid ‘09 IEEE GreenCom ’10 ACM CF ’11 IEEE Cluster ‘11 ISC ’12 IEEE ICPE ‘13 © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Performance, Programmability, Portability Roadmap Performance Portability Performance Portability © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • What is Heterogeneous Task Scheduling? Automatically spreading tasks across heterogeneous compute resources: CPUs, GPUs, APUs, FPGAs, DSPs, and so on. Specify tasks at a higher level (currently OpenMP extensions) and run them across the available resources automatically. Goal: a run-time system that intelligently uses whatever resources are available and optimizes for performance portability; each user should not have to implement this for themselves! © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Heterogeneous Task Scheduling © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • How to Heterogeneous Task Schedule (HTS) •  Accelerated OpenMP offers heterogeneous task scheduling with –  Programmability –  Functional portability, given underlying compilers –  Performance portability •  How? –  A simple extension to Accelerated OpenMP syntax for programmability –  Automatically dividing parallel tasks across arbitrary heterogeneous compute resources for functional portability §  CPUs §  GPUs §  APUs –  Intelligent runtime task scheduling for performance portability © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Programmability: Why Accelerated OpenMP?

      Traditional OpenMP:
        #pragma omp parallel for shared(in1,in2,out,pow)
        for (i = 0; i < end; i++) {
            out[i] = in1[i] * in2[i];
            pow[i] = pow[i] * pow[i];
        }

      OpenMP Accelerator Directives:
        #pragma acc_region_loop acc_copyin(in1[0:end],in2[0:end]) acc_copyout(out[0:end]) acc_copy(pow[0:end])
        for (i = 0; i < end; i++) {
            out[i] = in1[i] * in2[i];
            pow[i] = pow[i] * pow[i];
        }

      Our Proposed Extension:
        #pragma acc_region_loop acc_copyin(in1[0:end],in2[0:end]) acc_copyout(out[0:end]) acc_copy(pow[0:end]) hetero(<cond>[,<scheduler>[,<ratio>[,<div>]]])
        for (i = 0; i < end; i++) {
            out[i] = in1[i] * in2[i];
            pow[i] = pow[i] * pow[i];
        }

      © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • OpenMP Accelerator Behavior Original/Master thread Worker threads Parallel region Accelerated region #pragma omp acc_region … Implicit barrier #pragma omp parallel … Kernels © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • DESIRED OpenMP Accelerator Behavior Original/Master thread Worker threads Parallel region Accelerated region #pragma omp acc_region … #pragma omp parallel num_threads(2) #pragma omp parallel … © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Work-share a Region Across the Whole System #pragma omp acc_region … OR #pragma omp parallel … Original/Master thread Worker threads Parallel region Accelerated region synergy.cs.vt.edu  
    • Scheduling and Load-Balancing by Adaptation (eTSAR: Task-Size Adapting Runtime). Measure computational suitability at runtime; compute a new distribution of work through a linear-optimization approach; re-distribute work before each pass of the program phase. Linear-program variables: I = total iterations available; n = number of compute devices; i_j = iterations for compute unit j; f_j = fraction of iterations for compute unit j; p_j = recent time/iteration for compute unit j; t_j^+ (t_j^-) = time over (under) equal. The objective minimizes the total over/under time, subject to the fractions summing to one and the iterations summing to I. [Fig. 5: Linear program optimization and performance. (a) Percentage of time spent on computation vs. number of GPUs for the original, adaptive, split, and GPU back-off variants. (b) Modified objective and constraints.] synergy.cs.vt.edu
    • Heterogeneous Scheduling: Issues and Solutions •  Issue: Launching overhead is high on GPUs –  Using a work-queue, many GPU kernels may need to be run •  Solution: Schedule only at the beginning of a region –  The overhead is only paid once or a small number of times •  Issue: Without a work-queue, how do we balance load? –  The performance of each device must be predicted •  Solution: Allocate different amounts of work –  For the first pass, predict a reasonable initial division of work. We use a ratio between the number of CPU and GPU cores for this. –  For subsequent passes, use the performance of previous passes to predict following passes. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
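    A minimal sketch of that pass-to-pass prediction for the two-device (CPU + GPU) case, with hypothetical names; the actual runtime solves a small linear program for the general case:

        /* Split the next pass's iterations so CPU and GPU finish at the same time,
           using measured seconds-per-iteration (cpu_tpi, gpu_tpi) from the last pass. */
        static long cpu_iterations(long total_iters, double cpu_tpi, double gpu_tpi) {
            /* equal predicted time: i_c * cpu_tpi == (total_iters - i_c) * gpu_tpi */
            return (long)(total_iters * gpu_tpi / (gpu_tpi + cpu_tpi));
        }
        /* First pass: seed with a static ratio (e.g., CPU cores vs. GPU cores);
           later passes: re-measure and recompute the split. */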
    • Dynamic Ratios. All dynamic schedulers share a common basis: at the end of a pass, a version of the linear program from the previous slide is solved, assuming iterations take the same amount of time, on average, from one pass to the next. For the simplified two-device case this reduces to i'_c * p_c = (i_t − i'_c) * p_g, which can be solved with only a few instructions and yields i'_c = (i_t * p_g) / (p_g + p_c) (Equation 6), where i'_c is the number of iterations to assign the CPU, i_t is the total iterations of the next pass, and p_c and p_g are the measured time per iteration on the CPU and GPU from the last pass. Expressed in words, the linear program calculates iteration counts whose predicted times are as close to identical across all devices as possible. Profiling and testing indicated that neither of these schedulers is entirely appropriate in certain cases, motivating the split and quick schedulers: dynamic scheduling needs several executions of the parallel region to compute a ratio, which works well for applications with multiple passes, but some parallel regions run only a few times, or even just once; split scheduling handles these regions by iterating div times within each pass, each time executing total tasks/div tasks. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Dynamic Schedulers •  Schedulers –  Dynamic –  Split –  Quick •  Arguments –  Ratio §  Initial split to use, i.e., ratio between CPU and GPU performance –  Div §  How far to subdivide the region © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
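    Putting the scheduler and its arguments into the proposed clause (an assumed usage example based on the syntax shown earlier; the condition, arrays, and values are illustrative):

        /* Run heterogeneously when the loop is big enough; use the split scheduler,
           an initial CPU:GPU ratio of 0.5, and subdivide each pass into 4 chunks. */
        #pragma acc_region_loop acc_copyin(a[0:n]) acc_copyout(b[0:n]) \
                hetero(n > 100000, split, 0.5, 4)
        for (i = 0; i < n; i++) {
            b[i] = a[i] * a[i];
        }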
    • Static Scheduler Ratio: 0.5 250 it 500 it 500 it 1,000 it 250 it 500 it Pass 1 Original/Master thread Pass 2 Worker threads Parallel region © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 Accelerated region synergy.cs.vt.edu  
    • Dynamic Scheduler Ratio: 0.5 250 it 900 it 500 it 1,000 it 250 it 100 it Pass 1 Original/Master thread Pass 2 Worker threads Parallel region © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 Accelerated region synergy.cs.vt.edu  
    • Split Scheduler Ratio: 0.5 Div: 4 125 it 125 it 125 it 125 it 500 it Pass 1 Reschedule Original/Master thread Worker threads Parallel region Accelerated region © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Arguments: Ratio=0.5 Div=4 Dynamic vs. Split Scheduler Dynamic 500 it 900 it 1,000 it 250 it 250 it 100 it Pass 1 Split 63 it 113 it Pass 2 113 it 113 it 113 it 113 it 113 it 113 it 12 it 12 it 1,000 it 500 it 62 it 12 it 12 it 12 it 12 it 12 it Reschedule Original/Master thread Worker threads Parallel region Accelerated region synergy.cs.vt.edu
    • Quick Scheduler Ratio: 0.5 Div: 4 375 it 125 it 500 it Pass 1 Pass 2 Reschedule Original/Master thread Worker threads Parallel region Accelerated region © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Arguments: Ratio=0.5 Div=4 Dynamic vs. Quick Scheduler 900 it Dynamic 500 it 1,000 it 250 it 250 it 100 it Pass 1 Quick 63 it Pass 2 338 it 900 it 1,000 it 500 it 62 it 37 it 100 it Reschedule Original/Master thread Worker threads Parallel region Accelerated region synergy.cs.vt.edu
    • Benchmarks •  NAS CG – Many passes (1,800 for C class) –  The Conjugate Gradient benchmark from the NAS Parallel Benchmarks •  GEM – One pass –  Molecular Modeling, computes the electrostatic potential along the surface of a macromolecule •  K-Means – Few passes –  Iterative clustering of points •  Helmholtz – Few passes, GPU Unsuitable –  Jacobi iterative method implementing the Helmholtz equation © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Experimental Setup. System: 12-core AMD Opteron 6174 CPU; NVIDIA Tesla C2050 GPU; Linux 2.6.27.19; CCE compiler with OpenMP accelerator extensions. Procedures: all parameters default, unless otherwise specified; results represent 5 or more runs. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Results across Schedulers for all Benchmarks. [Chart: speedup over a 12-core CPU for the cg, gem, helmholtz, and kmeans applications under CPU-only, GPU-only, static, dynamic, split, and quick schedulers.] © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Performance, Programmability, Portability Roadmap IEEE IPDPS ’12 © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • An Ecosystem for Heterogeneous Parallel Computing: Roadmap. Much work still to be done. OpenCL and the 13 Dwarfs: beta release pending. Source-to-Source Translation: CU2CL only & no optimization. Architecture-Aware Optimization: only manual optimizations. Performance & Power Modeling: preliminary & pre-multicore. Affinity-Based Cost Modeling: empirical results; modeling in progress. Heterogeneous Task Scheduling: preliminary, with OpenMP. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Intra-Node An Ecosystem for Heterogeneous Parallel Computing ^ Applications Sequence Alignment Molecular Dynamics Earthquake Modeling Neuroinformatics Avionic Composites Software Ecosystem Heterogeneous Parallel Computing (HPC) Platform … from Intra-Node to Inter-Node synergy.cs.vt.edu  
    • Programming CPU-GPU Clusters (e.g., MPI+CUDA). [Diagram: Node 1 (Rank = 0) and Node 2 (Rank = 1), each with a CPU (main memory) and a GPU (device memory), connected by a network.]

        if (rank == 0) {
            cudaMemcpy(host_buf, dev_buf, D2H);
            MPI_Send(host_buf, .. ..);
        }
        if (rank == 1) {
            MPI_Recv(host_buf, .. ..);
            cudaMemcpy(dev_buf, host_buf, H2D);
        }

      © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Goal of Programming CPU-GPU Clusters (MPI + Any Acc). IEEE HPCC '12, ACM HPDC '13. [Diagram: two nodes, each with a CPU (main memory) and a GPU (device memory), connected by a network.]

        if (rank == 0) {
            MPI_Send(any_buf, .. ..);
        }
        if (rank == 1) {
            MPI_Recv(any_buf, .. ..);
        }

      © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • "Virtualizing" GPUs. Traditional model: the application on a compute node calls the OpenCL API of the native OpenCL library, which drives that node's physical GPU. Our model: the application calls the OpenCL API of the VOCL library, which exposes virtual GPUs and forwards the calls over MPI to VOCL proxies on other compute nodes, where the native OpenCL library drives each physical GPU. © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • Conclusion: an ecosystem for heterogeneous parallel computing, enabling software that tunes parameters of hardware devices … with respect to performance, programmability, and portability … via a benchmark suite of dwarfs (i.e., motifs) and mini-apps. (Pictured: highest-ranked commodity supercomputer in the U.S. on the Green500, 11/11.) © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu
    • People © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Funding Acknowledgements Microsoft featured our “Big Data in the Cloud” grant in U.S. Congressional Testimony © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu  
    • Wu Feng, wfeng@vt.edu, 540-231-1192. http://synergy.cs.vt.edu/ http://www.chrec.org/ http://www.green500.org/ http://www.mpiblast.org/ http://sss.cs.vt.edu/ "Accelerators 'R Us"! http://accel.cs.vt.edu/ http://myvice.cs.vt.edu/ © W. Feng, November 2013 wfeng@vt.edu, 540.231.1192 synergy.cs.vt.edu