A Survey on In-a-Box Parallel Computing and Its Implications on System Software Research

        Changwoo Min (multics69@gmail.com)
Motivation
   "Technology ratios matter." – Jim Gray

   "In the face of such '10X' forces, you can lose control of your destiny." – Andrew S. Grove

   What are the implications of the multicore evolution for system software researchers?
Survey Scope and Strategy


   [Figure: survey scope as a layered stack. Parallel Applications and Parallel
    Middleware sit on a Parallel Programming Model, System Library, Operating
    System, and Virtual Machine Monitor, which in turn run on multicore CPUs
    and GPGPUs.]
Contents
   Background

   Parallel Programming Model and Productivity Tools

   Optimization of System Software

   Supporting GPU in a Virtualized Environment

   Utilizing GPU in Middleware

   Conclusion
Background
Why multicore?
   Multicore CPU
       Power wall
       ILP (instruction-level parallelism) wall
       Memory wall
       Wire delay

   GPGPU (General-Purpose computing on a Graphics Processing Unit)
       A GPU traditionally handles computation only for computer graphics.
       Add the following to the rendering pipeline:
           programmable stages
           higher-precision arithmetic
       Use stream processing on non-graphics data.
Architecture of a GPGPU Core
   [Figure: architecture of a GPGPU core]
Parallel Programming Model and Productivity Tools
OpenMP
   Parallel programming API for shared-memory multiprocessing in C, C++, and Fortran

   Uses a language extension – “#pragma omp”
       Needs compiler support
OpenMP (cont’d)
   Fork-and-join model
       Bounded parallel loops, reductions

   Task-creation-and-join model
       Unbounded loops, recursive algorithms, producer/consumer

   (both models are sketched in the code below)
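A minimal sketch of both models, assuming a compiler with OpenMP support (the functions and values are illustrative):

#include <cstdio>
#include <omp.h>

// Fork-and-join: a bounded parallel loop with a reduction.
double sum_squares(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * a[i];      // iterations are divided among threads
    return sum;                  // implicit join: all threads finish here
}

// Task-creation-and-join: a recursive algorithm expressed as tasks.
long fib(int n) {
    if (n < 2) return n;
    long x;
    #pragma omp task shared(x)
    x = fib(n - 1);              // child task; may run on another thread
    long y = fib(n - 2);         // computed by the current task
    #pragma omp taskwait         // join: wait for the child task
    return x + y;
}

int main() {
    double a[4] = {1, 2, 3, 4};
    printf("sum = %f\n", sum_squares(a, 4));
    long f = 0;
    #pragma omp parallel
    #pragma omp single           // one thread creates the root task
    f = fib(10);
    printf("fib = %ld\n", f);
    return 0;
}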
Intel TBB (Threading Building Blocks)
   Similar to OpenMP
       API for shared memory multiprocessing
       Fork-and-join
           parallel-for, parallel-reduce
       Task-creation-and-join
           Task scheduler

   Different from OpenMP
       C++ template library
       Concurrent container classes
           Hash map, vector, queue
       Various synchronization mechanisms
           mutex, spin lock, …
       Atomic types, atomic operations
       Scalable memory allocator
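A short sketch of the TBB style (assumes the TBB headers and library are available; the numbers are illustrative): a parallel-reduce over a blocked range, plus one of the concurrent containers.

#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <tbb/concurrent_hash_map.h>
#include <cstdio>

int main() {
    double a[1000];
    for (int i = 0; i < 1000; i++) a[i] = 0.5 * i;

    // parallel-reduce: the task scheduler splits the range across
    // cores and combines the partial sums.
    double sum = tbb::parallel_reduce(
        tbb::blocked_range<int>(0, 1000), 0.0,
        [&](const tbb::blocked_range<int> &r, double acc) {
            for (int i = r.begin(); i != r.end(); ++i) acc += a[i];
            return acc;
        },
        [](double x, double y) { return x + y; });

    // Concurrent container: safe to insert from many threads at once.
    tbb::concurrent_hash_map<int, double> table;
    table.insert({42, sum});

    printf("sum = %f\n", sum);
    return 0;
}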
Nvidia CUDA (Compute Unified Device Architecture)

   CUDA
       Computing engine in Nvidia GPUs
       Programming framework for Nvidia GPUs
       Uses CUDA-extended C
           declspecs, keywords, intrinsics, runtime API, function launch, …




   [Figures: CUDA-extended C; compiling CUDA code; processing flow on CUDA]
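A minimal CUDA-extended C sketch of that processing flow (copy input to device memory, launch the kernel, copy the result back); the kernel and sizes are illustrative:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-in variables
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    static float ha[n], hb[n], hc[n];
    for (int i = 0; i < n; i++) { ha[i] = i; hb[i] = 2.0f * i; }

    float *da, *db, *dc;                       // 1. copy input to the device
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));
    cudaMemcpy(da, ha, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(float), cudaMemcpyHostToDevice);

    vec_add<<<n / 256, 256>>>(da, db, dc, n);  // 2. launch a grid of blocks

    cudaMemcpy(hc, dc, n * sizeof(float),      // 3. copy the result back
               cudaMemcpyDeviceToHost);
    printf("c[7] = %f\n", hc[7]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}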
Nvidia CUDA (cont’d)




   [Figures: execution model; kernel memory access]
OpenCL (Open Computing Language)

   CPU/GPU heterogeneous computing framework standardized by the Khronos Group




   [Figures: OpenCL memory model; CUDA and OpenCL example code]
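For comparison with the CUDA sketch above, the same vector add in OpenCL (assumes an OpenCL runtime is installed; error checks omitted for brevity). The kernel is compiled from source at run time, so one program targets CPUs and GPUs alike:

#include <CL/cl.h>
#include <cstdio>

static const char *src =
    "__kernel void vec_add(__global const float *a, __global const float *b,\n"
    "                      __global float *c) {\n"
    "    int i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main() {
    const int n = 1024;
    float ha[n], hb[n], hc[n];
    for (int i = 0; i < n; i++) { ha[i] = i; hb[i] = 2.0f * i; }

    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    // Build the kernel from source for whatever device was found.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vec_add", NULL);

    size_t bytes = n * sizeof(float);
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, ha, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, hb, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    size_t global = n;                      // NDRange of n work-items
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, bytes, hc, 0, NULL, NULL);

    printf("c[7] = %f\n", hc[7]);
    return 0;
}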
Lithe: Enabling Efficient Composition of
Parallel Libraries
   Who?
       ParLab, UC Berkeley, HotPar’09

   Problem
       Composing parallel libraries causes performance anomalies (each library assumes it owns the whole machine and oversubscribes cores).
Lithe: Enabling Efficient Composition of
Parallel Libraries (cont’d)
   Solution
       Virtualized threads are bad for parallel libraries.
       Harts
           Unvirtualized hardware thread contexts
           Shared between libraries instead of oversubscribed
       Lithe
           Cooperative hierarchical scheduler framework for harts (interface sketched below)
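A hedged sketch of the shape of such a framework (our illustration, not Lithe's actual API): each parallel library implements cooperative callbacks, and a parent scheduler grants harts to its children rather than letting each library spawn its own OS threads.

// Illustrative only: the real Lithe interface differs in detail.
struct Scheduler {
    virtual void enter() = 0;             // a granted hart begins running here
    virtual void yield() = 0;             // hand the hart back to the parent
    virtual void request(int nharts) = 0; // ask the parent for more harts
    virtual ~Scheduler() {}
};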
Concurrency bug detection: DataCollider
   Who?
       Microsoft Research, OSDI’10
   Problem
       Detecting data-race bugs is difficult.
       For a large system such as the Windows kernel, runtime overhead is critical.
   Solution
       Sampling using code breakpoints
       When a code breakpoint traps:
           Set a data breakpoint on its operand
           Sleep for a while
           If the data changed, it may be a data race (see the sketch below)
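A hedged sketch of that detection window (our pseudocode in C++; the platform helpers are hypothetical, not MSR's code):

#include <cstddef>
#include <cstdint>

// Hypothetical platform helpers, assumed for illustration.
uint64_t read_bytes(void *addr, size_t size);
void set_data_breakpoint(void *addr, size_t size);
bool clear_data_breakpoint();           // true if another thread hit it
void sleep_ms(int ms);
int  random_pause_ms();
void report_possible_race(void *addr);

// Fires on a randomly sampled memory access in the monitored code.
void on_code_breakpoint(void *addr, size_t size) {
    uint64_t before = read_bytes(addr, size);   // snapshot the operand

    set_data_breakpoint(addr, size);            // trap conflicting accesses
    sleep_ms(random_pause_ms());                // give a racing thread time
    bool trapped = clear_data_breakpoint();

    uint64_t after = read_bytes(addr, size);
    if (trapped || before != after)             // someone touched it while
        report_possible_race(addr);             // we were paused
}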
Concurrency bug detection: SyncFinder
   Who?
        UC San Diego, OSDI’10
   Problem
        How to find ad-hoc synchronization (hand-rolled sync such as the spin loop sketched below)
   Solution
       Formalize patterns of ad-hoc synchronization
       Detect such patterns using LLVM
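For illustration, the kind of ad-hoc synchronization SyncFinder formalizes: a sync loop whose exit condition depends on a shared variable set by another thread (produce/consume bodies elided).

volatile int done = 0;      // shared flag used instead of a lock/condvar

void producer() {
    /* ... produce data ... */
    done = 1;               // ad-hoc "signal"
}

void consumer() {
    while (!done) { }       // ad-hoc "wait": spin until the flag flips
    /* ... consume data ... */
}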
Optimization of System Software
Memory Allocation: Hoard
   Who?
        UT Austin, ASPLOS’00
   Problem
       The memory allocator is a performance bottleneck in multiprocessor environments.
       Lock contention, false sharing, blowup

   [Figure: allocator-induced false sharing]
Memory Allocation: Hoard (cont’d)
   Solution
       Per-processor heaps to reduce lock contention and false sharing
       Global heap
           Borrow memory from the global heap to grow a per-processor heap
           Return memory to the global heap when a per-processor heap has too much free memory
       (policy sketched below)
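A hedged sketch of that policy (greatly simplified: one size class and made-up thresholds; Hoard itself manages superblocks per size class):

#include <cstdlib>
#include <mutex>
#include <vector>

constexpr size_t BLOCK = 256;            // single size class, for brevity

struct Heap {
    std::mutex lock;
    std::vector<void *> free_blocks;     // stand-in for superblock lists
};

Heap global_heap;
Heap cpu_heap[64];                       // one heap per processor

void *alloc_on(int cpu) {
    Heap &h = cpu_heap[cpu];
    std::lock_guard<std::mutex> g(h.lock);       // threads on other CPUs
    if (h.free_blocks.empty()) {                 // never contend on this lock
        std::lock_guard<std::mutex> gg(global_heap.lock);
        if (!global_heap.free_blocks.empty()) {  // borrow from the global heap
            h.free_blocks.push_back(global_heap.free_blocks.back());
            global_heap.free_blocks.pop_back();
        } else {
            h.free_blocks.push_back(std::malloc(BLOCK));
        }
    }
    void *p = h.free_blocks.back();
    h.free_blocks.pop_back();
    return p;
}

void free_on(void *p, int cpu) {
    Heap &h = cpu_heap[cpu];
    std::lock_guard<std::mutex> g(h.lock);
    h.free_blocks.push_back(p);
    if (h.free_blocks.size() > 64) {             // too much free memory:
        std::lock_guard<std::mutex> gg(global_heap.lock);
        global_heap.free_blocks.push_back(h.free_blocks.back());
        h.free_blocks.pop_back();                // return some to the global
    }                                            // heap, bounding blowup
}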
Memory Allocation: Xmalloc
   Who?
       UIUC, ICCIT’10
   Problem
       A scalable malloc for CUDA, where hundreds of threads allocate concurrently
   Solution
       Memory allocation coalescing (sketched below)
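A hedged sketch of coalescing (the idea behind XMalloc, not its code; assumes a uniform request size per warp to keep it short): simultaneous requests from one warp are combined so only the leader pays for the underlying allocation.

#include <cuda_runtime.h>

__device__ void *coalesced_malloc(size_t sz) {
    unsigned mask = __activemask();               // lanes allocating right now
    int lane   = threadIdx.x & 31;
    int leader = __ffs(mask) - 1;                 // lowest active lane

    size_t total = __popc(mask) * sz;             // one request for the warp

    unsigned long long base = 0;
    if (lane == leader)                           // only the leader allocates
        base = (unsigned long long)malloc(total); // device-side malloc
    base = __shfl_sync(mask, base, leader);       // broadcast the base address

    // Each thread carves its own slice; note that only the leader's base
    // pointer may later be passed to free().
    int my_slot = __popc(mask & ((1u << lane) - 1));
    return (void *)(base + (unsigned long long)my_slot * sz);
}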
System Call: FlexSC
   Who?
       University of Toronto, OSDI’10
   Problem
       System calls have a large negative performance impact.
           Direct mode-switch cost + indirect cost (cache/TLB pollution)
   Solution
       System call batching and asynchronous, exception-less system calls (sketched below)
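A hedged sketch of the exception-less mechanism (names and layout are ours, not FlexSC's): user threads post requests into a shared syscall page and keep running; kernel threads execute batched entries and post results, so no per-call mode switch occurs.

#include <atomic>
#include <cstdint>

enum class Status : uint32_t { FREE, SUBMITTED, DONE };

struct SyscallEntry {
    std::atomic<Status> status{Status::FREE};
    uint32_t number;                // which system call
    uint64_t args[6];
    int64_t  ret;
};

SyscallEntry syscall_page[64];      // shared between user and kernel threads

// User side: submit without trapping into the kernel.
int64_t posted_syscall(uint32_t nr, uint64_t a0) {
    SyscallEntry &e = syscall_page[0];        // picking a FREE slot is elided
    e.number = nr;
    e.args[0] = a0;
    e.status.store(Status::SUBMITTED, std::memory_order_release);

    // FlexSC's M-on-N thread library runs other user threads here instead
    // of spinning; we busy-wait only to keep the sketch small.
    while (e.status.load(std::memory_order_acquire) != Status::DONE) { }
    e.status.store(Status::FREE, std::memory_order_relaxed);
    return e.ret;
}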
Revisiting OS Architecture
Multikernel
   Who?
       ETH Zurich, Microsoft Research Cambridge, SOSP’09
   Problem
       System diversity
           It is no longer acceptable (or useful) to tune a general-purpose OS
            design for a particular hardware model.
Multikernel (cont’d)
   Problem (cont’d)
       The interconnect matters

       [Figures: 8-socket Nehalem topology; on-chip interconnects; shared
        memory vs. message passing (SHM bars are stalled cycles, with no
        locking involved)]

       Core diversity
           Programmable NICs
           GPUs
           FPGAs in CPU sockets
Multikernel (cont’d)
   Solution
       “Today’s computer is already a distributed system. Why isn’t your OS?”
       Barrelfish
           Implementation of the multikernel approach
           Message passing, shared nothing, replica maintenance (a message-channel sketch follows)
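A hedged sketch of the flavor of inter-core communication involved (our illustration of cache-line message passing over shared memory, not Barrelfish's UMP code): each message fills one cache line, so a transfer costs roughly one coherence transaction.

#include <atomic>
#include <cstdint>

struct alignas(64) Message {         // one cache line per message
    uint64_t payload[7];
    std::atomic<uint64_t> seq{0};    // written last: marks the slot valid
};

Message channel[256];                // single-producer/single-consumer ring

void send(uint64_t n, const uint64_t *data) {       // n = sequence number
    Message &m = channel[n % 256];
    for (int i = 0; i < 7; i++) m.payload[i] = data[i];
    m.seq.store(n + 1, std::memory_order_release);  // publish
}

bool try_recv(uint64_t n, uint64_t *out) {
    Message &m = channel[n % 256];
    if (m.seq.load(std::memory_order_acquire) != n + 1)
        return false;                // not yet arrived; poll again
    for (int i = 0; i < 7; i++) out[i] = m.payload[i];
    return true;
}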
An Analysis of Linux Scalability to Many Cores
   Who?
       MIT CSAIL, OSDI’10
   Problem
       Is a traditional kernel such as Linux actually unscalable on many cores?
   Solution
       Tested Linux scalability with 7 applications on a 48-core AMD machine
       No kernel scalability problems up to 48 cores
           3,002 lines of patches
       Sloppy counter: a replicated reference counter (sketched below)
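A hedged sketch of a sloppy counter (simplified; in the kernel the local counts are per core, not per thread): each core counts locally and spills to the shared counter only past a threshold, avoiding cache-line ping-pong on every increment.

#include <atomic>

constexpr long THRESHOLD = 32;

std::atomic<long> global_count{0};
thread_local long local_count = 0;   // per core in the kernel

void inc() {
    if (++local_count >= THRESHOLD) {         // spill: one shared-counter
        global_count.fetch_add(local_count);  // update per THRESHOLD local
        local_count = 0;                      // increments
    }
}

long read_approx() {            // may lag the true value by up to
    return global_count.load(); // (#cores * THRESHOLD); sum the local
}                               // counts when an exact value is needed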
Supporting GPU in a Virtualized Environment
HyVM (Hybrid Virtual Machines)

   Who?
       Georgia Tech
   Problem
       Asymmetries in performance, memory and cache
       Functional differences
           Multiple accelerators
           Vector processor
           Floating point
            Additional instructions for acceleration
   Solution
        Heterogeneity- and asymmetry-aware hypervisors
HyVM (cont’d)
   Solution (cont’d)

       [Figures: HyVM architecture; GViM GPU virtualization architecture;
        memory management in GViM; Harmony CPU/GPU co-scheduling]
VMGL (Virtualizing OpenGL)

   Who?
       University of Toronto, VEE’07
   Problem
       How to support OpenGL in a virtual machine environment
   Solution
        Forward OpenGL commands to the driver domain
Utilizing GPU in Middleware
StoreGPU
   Who?
        University of British Columbia, HPDC’10
   Problem
       In CAS (content-addressable storage), how to minimize hash computation cost
   Solution
       Offloading hashing to the GPU (sketched below)

   [Figure: StoreGPU architecture]
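A hedged sketch of that offload (one GPU thread hashes one chunk; FNV-1a stands in for the cryptographic hashes StoreGPU actually computes):

#include <cuda_runtime.h>
#include <cstdint>

__global__ void hash_chunks(const uint8_t *data, uint64_t *digests,
                            int chunk_size, int nchunks) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= nchunks) return;
    const uint8_t *p = data + (size_t)c * chunk_size;

    uint64_t h = 1469598103934665603ull;   // FNV-1a offset basis
    for (int i = 0; i < chunk_size; i++)
        h = (h ^ p[i]) * 1099511628211ull; // FNV prime
    digests[c] = h;                        // one fingerprint per chunk
}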
PacketShader
   Who?
       KAIST, SIGCOMM’10, NSDI’11
   Problem
        How to boost the performance of a software router
    Solution
        Offload stateless (parallelizable) packet processing to the GPU (sketched below)

   [Figures: PacketShader architecture; basic workflow of PacketShader]
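A hedged sketch of that offload pattern (our illustration, not PacketShader's code; a toy lookup stands in for the real IPv4 forwarding table): the host batches many packet headers into device memory, and each GPU thread handles one packet.

#include <cuda_runtime.h>
#include <cstdint>

struct Ipv4Hdr { uint8_t ttl; uint32_t dst; /* other fields elided */ };

__global__ void forward(Ipv4Hdr *pkts, int *out_port, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;                 // one thread per packet: processing
    pkts[i].ttl -= 1;                   // is stateless, so packets proceed
    out_port[i] = pkts[i].dst & 0xff;   // in parallel (toy "table lookup")
}

// Host side: launch over a batch already copied to device memory.
void process_batch(Ipv4Hdr *d_pkts, int *d_port, int n) {
    forward<<<(n + 255) / 256, 256>>>(d_pkts, d_port, n);
    cudaDeviceSynchronize();
}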
Conclusion