A Survey on in-a-box parallel computing and its implications on system software research
Motivation   Technology ratios matter, Jim Gray   In the face of such "10X" forces, you can lose control of your    dest...
Survey Scope and Strategy        Parallel                   Parallel       Application               Middleware          P...
Contents   Background   Parallel Programming Model and Productivity Tools   Optimization of System Software   Supporti...
Background
Why multicore?   Multicore CPU       Power wall       ILP(instruction level parallelism) wall       Memory wall      ...
Architecture of GPGPU core
Parallel Programming Model and               Productivity Tools
OpenMP   Parallel Programming API for shared memory    multiprocessing programming in C, C++, Fortran   Use language ext...
OpenMP (cont’d)   Fork-and-join model       Bounded parallel loop, reduction   Task-creation-and-join model       Unbo...
Intel TBB                (Threading Building Block)   Similar to OpenMP       API for shared memory multiprocessing    ...
Nvidia CUDA                   (Compute Unified Device Architecture)   CUDA       Computing engine in Nvidia GPU       P...
Nvidia CUDA (cont’d)     Execution Model   Kernel Memory Access
OpenCL          (Open Compute Language)   CPU/GPU heterogeneous computing framework    standardized by Khronous group    ...
Lithe: Enabling Efficient Composition ofParallel Libraries   Who?       ParLab, UC Berkeley, HotPar’09   Problem      ...
Lithe: Enabling Efficient Composition ofParallel Libraries (cont’d)   Solution       Virtualized thread are bad for para...
Concurrency bug detection: DataCollider   Who?       Microsoft Research, OSDI’10   Problem       Detecting concurrency...
Concurrency bug detection: SyncFinder   Who?       UC San Diego, OSDI ’10   Problem       How to find ad-hoc synchroni...
Optimization of System Software
Memory Allocation: Hoard   Who?       UT, ASPLOS’00   Problem       Memory allocator is performance bottleneck in mult...
Memory Allocation: Hoard (cont’d)   Solution       Per-processor heap to reduce        lock contention and false        ...
Memory Allocation: Xmalloc   Who?       UIUC, ICCIT’10   Problem       Scalable malloc for CUDA whereby hundreds of th...
System Call: FlexSC   Who?       University of Toronto, OSDI’10   Problem       Negative performance impact of system ...
Revisiting OS Architecture
Multikernel   Who?       ETH Zurich, Microsoft Research Cambridge, SOSP’09   Problem       System diversity          ...
Multikernel (cont’d)   Problem (cont’d)                                                                           SHM:sta...
Multikernel (cont’d)   Solution       Today’s computer is already a distributed system. Why isn’t        your OS?      ...
An Analysis of Linux Scalability to ManyCores   Who?       MIT CSAIL, OSDI’10   Problem       If so, is Linux scalable...
Supporting GPU in a virtualized                  environment
HyVM             (Hybrid Virtual Machines)   Who?       Georgia Tech   Problem       Asymmetries in performance, memor...
HyVM (cont’d)   Solution (cont’d)           HyVM Architecture       GViM: GPU Virtualization Architecture       Memory ma...
VMGL          (Virtualizing OpenGL)   Who?       University of Toronto, VEE’07   Problem       How to support OpenGL i...
Utilizing GPU in Middleware
StoreGPU   Who?       University of British Columbia, HDPC’10   Problem       In CAS(Contents Addressable Storage),   ...
PacketShader   Who?       KAIST, SIGCOMM’10, NSDI’11   Problem       How to boot up performance of software router   ...
Conclusion
S
Upcoming SlideShare
Loading in …5
×

A Survey on in-a-box parallel computing and its implications on system software research

1,316 views

Published on

0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,316
On SlideShare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
19
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

A Survey on in-a-box parallel computing and its implications on system software research

1. A Survey on in-a-box parallel computing and its implications on system software research
   Changwoo Min (multics69@gmail.com)
2. Motivation
   - "Technology ratios matter." - Jim Gray
   - "In the face of such '10X' forces, you can lose control of your destiny." - Andrew S. Grove
   - What are the implications of the multicore evolution for system software researchers?
3. Survey Scope and Strategy
   [Figure: the surveyed software/hardware stack - parallel applications, parallel middleware, parallel programming model, system library, operating system, and virtual machine monitor, running on multicore CPUs and GPGPUs]
4. Contents
   - Background
   - Parallel Programming Model and Productivity Tools
   - Optimization of System Software
   - Supporting GPU in a Virtualized Environment
   - Utilizing GPU in Middleware
   - Conclusion
5. Background
6. Why multicore?
   - Multicore CPU
     - Power wall
     - ILP (instruction-level parallelism) wall
     - Memory wall
     - Wire delay
   - GPGPU (General-Purpose computing on a Graphics Processing Unit)
     - A GPU typically handles computation only for computer graphics.
     - Adds the following to the rendering pipeline:
       - programmable stages
       - higher-precision arithmetic
     - Uses stream processing on non-graphics data.
7. Architecture of a GPGPU core
8. Parallel Programming Model and Productivity Tools
9. OpenMP
   - Parallel programming API for shared-memory multiprocessing in C, C++, and Fortran
   - Uses a language extension - "#pragma omp"
     - Needs compiler support
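The pragma-based style can be sketched in a few lines of C. This example is illustrative and not from the deck; it shows a parallel sum with a reduction clause. Because OpenMP is a pragma extension, a compiler without `-fopenmp` simply ignores the directive and the loop runs serially with the same result.

```c
#include <stddef.h>

/* A shared-memory parallel sum using "#pragma omp".  The pragma asks
 * the compiler to split the loop across threads and combine the
 * per-thread partial sums (the reduction clause).  Compiled without
 * OpenMP support, the pragma is ignored and the loop runs serially. */
long parallel_sum(const long *a, size_t n) {
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```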
10. OpenMP (cont'd)
   - Fork-and-join model
     - Bounded parallel loops, reduction
   - Task-creation-and-join model
     - Unbounded loops, recursive algorithms, producer/consumer
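The task model above suits recursive work that a bounded parallel loop cannot express. An illustrative C sketch (not from the deck): each recursive call is spawned as a task and `taskwait` joins the children. As before, a non-OpenMP build just runs it serially.

```c
/* Recursive Fibonacci in OpenMP's task-creation-and-join style:
 * "#pragma omp task" spawns each recursive call as a schedulable
 * task, and "taskwait" joins the two children before combining. */
long fib(int n) {
    long x, y;
    if (n < 2)
        return n;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait
    return x + y;
}
```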
11. Intel TBB (Threading Building Blocks)
   - Similar to OpenMP
     - API for shared-memory multiprocessing
     - Fork-and-join: parallel_for, parallel_reduce
     - Task-creation-and-join: task scheduler
   - Different from OpenMP
     - C++ template library
     - Concurrent container classes: hash map, vector, queue
     - Various synchronization mechanisms: mutex, spin lock, ...
     - Atomic types, atomic operations
     - Scalable memory allocator
12. Nvidia CUDA (Compute Unified Device Architecture)
   - CUDA
     - Computing engine in Nvidia GPUs
     - Programming framework for Nvidia GPUs
   - Uses CUDA-extended C
     - declspecs, keywords, intrinsics, runtime API, function launch, ...
   [Figures: CUDA-extended C, compiling CUDA code, processing flow on CUDA]
13. Nvidia CUDA (cont'd)
   [Figures: execution model, kernel memory access]
14. OpenCL (Open Computing Language)
   - CPU/GPU heterogeneous computing framework standardized by the Khronos Group
   [Figures: OpenCL memory model; CUDA and OpenCL examples]
15. Lithe: Enabling Efficient Composition of Parallel Libraries
   - Who?
     - ParLab, UC Berkeley, HotPar'09
   - Problem
     - Composing parallel libraries causes performance anomalies
16. Lithe: Enabling Efficient Composition of Parallel Libraries (cont'd)
   - Solution
     - Virtualized threads are bad for parallel libraries.
     - Harts
       - Unvirtualized hardware thread contexts
       - Shared between libraries
     - Lithe
       - Cooperative hierarchical scheduler framework for harts
17. Concurrency bug detection: DataCollider
   - Who?
     - Microsoft Research, OSDI'10
   - Problem
     - Detecting concurrency data-race bugs is difficult.
     - For a large system such as the Windows kernel, runtime overhead is critical.
   - Solution
     - Sampling using code breakpoints
     - When a code breakpoint is trapped:
       - Set a data breakpoint on its operand
       - Sleep for a while
       - If the data has changed, it could be a data race.
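The detection step can be mimicked in a user-space C sketch. This is illustrative only: the real DataCollider plants hardware code and data breakpoints inside the Windows kernel, whereas this sketch just snapshots a value, pauses, and re-reads it; a changed value means another thread wrote it concurrently, i.e. a potential race.

```c
#include <pthread.h>
#include <unistd.h>

int shared_counter = 0;

/* A thread that writes the shared variable while the sampler sleeps,
 * standing in for racy code elsewhere in the system. */
static void *racing_writer(void *arg) {
    (void)arg;
    usleep(20 * 1000);
    shared_counter = 42;
    return NULL;
}

/* The "data breakpoint": snapshot the operand, delay so other
 * threads can run, then re-read.  Returns 1 if a conflicting
 * write was observed. */
int sample_for_race(void) {
    int before = shared_counter;   /* code breakpoint fired */
    usleep(200 * 1000);            /* sleep for a while */
    return shared_counter != before;
}

int run_demo(void) {
    pthread_t t;
    pthread_create(&t, NULL, racing_writer, NULL);
    int raced = sample_for_race();
    pthread_join(t, NULL);
    return raced;
}
```

The appeal of the scheme is that the runtime cost scales with the sampling rate, not with the amount of shared data, which is what makes it viable inside a kernel.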
18. Concurrency bug detection: SyncFinder
   - Who?
     - UC San Diego, OSDI'10
   - Problem
     - How to find ad-hoc synchronization
   - Solution
     - Formalize patterns of ad-hoc synchronization
     - Detect such patterns using LLVM
19. Optimization of System Software
20. Memory Allocation: Hoard
   - Who?
     - UT, ASPLOS'00
   - Problem
     - The memory allocator is a performance bottleneck in multiprocessor environments.
     - Lock contention, false sharing, blowup
   [Figure: allocator-induced false sharing]
21. Memory Allocation: Hoard (cont'd)
   - Solution
     - Per-processor heaps to reduce lock contention and false sharing
     - Global heap
       - Borrow memory from the global heap to grow a per-processor heap
       - Return memory to the global heap when a per-processor heap has too much free memory
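The two-level heap idea can be sketched in C. This is far simpler than real Hoard and every name and size here is hypothetical: each thread keeps a private free list of one size class, so the common case takes no lock, and only refills from the shared "global heap" (plain malloc behind a mutex in this sketch) are serialized. Unlike Hoard, this sketch never returns surplus memory to the global heap, so it does not bound blowup.

```c
#include <stdlib.h>
#include <pthread.h>

#define BLOCK_SIZE 64   /* one allocation size class */
#define BATCH       8   /* blocks borrowed per refill */

static pthread_mutex_t global_heap_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread void *local_free_list = NULL;  /* per-thread heap */

void *hoard_like_alloc(void) {
    if (local_free_list == NULL) {
        /* Local heap empty: borrow a batch from the global heap. */
        pthread_mutex_lock(&global_heap_lock);
        for (int i = 0; i < BATCH; i++) {
            void *b = malloc(BLOCK_SIZE);
            *(void **)b = local_free_list;  /* push onto free list */
            local_free_list = b;
        }
        pthread_mutex_unlock(&global_heap_lock);
    }
    void *b = local_free_list;              /* lock-free fast path */
    local_free_list = *(void **)b;
    return b;
}

void hoard_like_free(void *b) {
    /* Freeing to the local heap keeps blocks from migrating between
     * threads, which is what avoids allocator-induced false sharing. */
    *(void **)b = local_free_list;
    local_free_list = b;
}
```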
22. Memory Allocation: XMalloc
   - Who?
     - UIUC, ICCIT'10
   - Problem
     - Scalable malloc for CUDA, where hundreds of threads run concurrently
   - Solution
     - Memory allocation coalescing
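The coalescing idea, stripped of CUDA, fits in a few lines of illustrative C with a hypothetical interface: when many SIMD threads call malloc at once, one leader performs a single large allocation and hands each thread a slice, turning N contended heap operations into one. Real XMalloc adds per-slice headers so slices can be freed individually, which this sketch omits.

```c
#include <stdlib.h>

/* One leader allocation serves a whole group of "threads". */
void coalesced_alloc(size_t per_thread_size, int nthreads, void **out) {
    /* The leader allocates a single block for the whole group... */
    char *block = malloc(per_thread_size * (size_t)nthreads);
    /* ...and each thread receives its own disjoint slice of it. */
    for (int t = 0; t < nthreads; t++)
        out[t] = block + (size_t)t * per_thread_size;
}
```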
23. System Call: FlexSC
   - Who?
     - University of Toronto, OSDI'10
   - Problem
     - The negative performance impact of system calls is huge.
     - Direct cost + indirect cost
   - Solution
     - Batching; asynchronous system calls
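The batching structure can be modeled in user-space C. Illustrative only, with hypothetical types and names: FlexSC actually posts requests into shared "syscall pages" that kernel threads drain asynchronously, while this sketch just accumulates requests in a table and executes them together on flush, amortizing the (here imaginary) mode-switch cost over the whole batch.

```c
#define MAX_BATCH 64

struct sc_entry {
    long (*fn)(long *);   /* stand-in for a system-call handler */
    long args[6];
    long ret;
};

static struct sc_entry batch[MAX_BATCH];
static int batch_len = 0;

/* Post a request: record it, but do not enter the "kernel" yet. */
void sc_post(long (*fn)(long *), long a0) {
    batch[batch_len].fn = fn;
    batch[batch_len].args[0] = a0;
    batch_len++;
}

/* One "kernel visit" executes every pending request in turn. */
void sc_flush(void) {
    for (int i = 0; i < batch_len; i++)
        batch[i].ret = batch[i].fn(batch[i].args);
    batch_len = 0;
}

/* Example "system call": doubles its first argument. */
long sys_double(long *args) { return args[0] * 2; }
```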
24. Revisiting OS Architecture
25. Multikernel
   - Who?
     - ETH Zurich and Microsoft Research Cambridge, SOSP'09
   - Problem
     - System diversity
       - It is no longer acceptable (or useful) to tune a general-purpose OS design for a particular hardware model.
26. Multikernel (cont'd)
   - Problem (cont'd)
     - The interconnect matters
     - Core diversity
       - Programmable NICs
       - GPUs
       - FPGAs in CPU sockets
   [Figures: 8-socket Nehalem; on-chip interconnects; shared memory (stalled cycles, no locking!) vs. message passing]
27. Multikernel (cont'd)
   - Solution
     - Today's computer is already a distributed system. Why isn't your OS?
     - Barrelfish
       - Implementation of the multikernel approach
       - Message passing, shared-nothing, replica maintenance
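The shared-nothing discipline can be shrunk to a tiny C sketch with hypothetical structures, far removed from Barrelfish's real inter-core channels: each "core" holds a private replica of some OS state, updates travel as explicit messages rather than through shared memory, and each core applies messages only to its own replica.

```c
#define QCAP 16

struct msg { int new_value; };

struct core {
    int replica;            /* this core's private copy of the state */
    struct msg queue[QCAP]; /* incoming message channel */
    int head, tail;
};

/* Another core announces a state change by enqueuing a message;
 * it never writes the destination's replica directly. */
void send_msg(struct core *dst, int new_value) {
    dst->queue[dst->tail % QCAP].new_value = new_value;
    dst->tail++;
}

/* Replica maintenance: drain the channel, applying each update
 * to the local copy. */
void process_msgs(struct core *c) {
    while (c->head != c->tail) {
        c->replica = c->queue[c->head % QCAP].new_value;
        c->head++;
    }
}
```

The design trade is explicit: state is eventually consistent between cores, but no core ever stalls on a remote cache line.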
28. An Analysis of Linux Scalability to Many Cores
   - Who?
     - MIT CSAIL, OSDI'10
   - Problem
     - If so, is Linux scalable enough?
   - Solution
     - Tested Linux scalability with 7 applications on 48 Intel cores
     - No kernel problems up to 48 cores
     - 3002 lines of patches
     - Sloppy counter: a replicated reference counter
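The sloppy-counter idea is small enough to sketch in C, with a hypothetical fixed core count: each core increments a private counter, so the hot path touches no shared cache line, and a precise read must sum all the per-core values. The real patches additionally pad each counter to its own cache line and reconcile with a central counter, which this sketch omits.

```c
#define NCORES 4

static long per_core[NCORES];  /* one replica per core */

/* Fast path: no locks, no writes to shared cache lines. */
void sloppy_inc(int core) {
    per_core[core]++;
}

/* Slow path: sum the replicas for an exact value. */
long sloppy_read(void) {
    long total = 0;
    for (int i = 0; i < NCORES; i++)
        total += per_core[i];
    return total;
}
```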
29. Supporting GPU in a Virtualized Environment
30. HyVM (Hybrid Virtual Machines)
   - Who?
     - Georgia Tech
   - Problem
     - Asymmetries in performance, memory, and cache
     - Functional differences
     - Multiple accelerators
       - Vector processors
       - Floating point
       - Additional instructions for acceleration
   - Solution
     - Heterogeneity- and asymmetry-aware hypervisors
31. HyVM (cont'd)
   - Solution (cont'd)
   [Figures: HyVM architecture; GViM: GPU virtualization architecture; memory management in GViM; Harmony CPU/GPU co-scheduling]
32. VMGL (Virtualizing OpenGL)
   - Who?
     - University of Toronto, VEE'07
   - Problem
     - How to support OpenGL in a virtual machine environment
   - Solution
     - Forward OpenGL commands to the driver domain
33. Utilizing GPU in Middleware
34. StoreGPU
   - Who?
     - University of British Columbia, HPDC'10
   - Problem
     - In CAS (Content-Addressable Storage), how to minimize hash computation cost
   - Solution
     - Offloading to the GPU
   [Figure: StoreGPU architecture]
35. PacketShader
   - Who?
     - KAIST, SIGCOMM'10, NSDI'11
   - Problem
     - How to boost the performance of a software router
   - Solution
     - Offload stateless (parallelizable) packet processing to the GPU
   [Figures: PacketShader architecture; basic workflow of PacketShader]
36. Conclusion
37. S
