A Survey on In-a-Box Parallel Computing and Its Implications on System Software Research

        Changwoo Min (multics69@gmail.com)
Motivation
   "Technology ratios matter." – Jim Gray

   "In the face of such '10X' forces, you can lose control of your destiny." – Andrew S. Grove

   What are the implications of the multicore evolution for system software researchers?
Survey Scope and Strategy


   [Figure: survey scope as a layered stack. Parallel Applications and Parallel
    Middleware sit on a Parallel Programming Model, System Library, Operating
    System, and Virtual Machine Monitor, which in turn run on multicore CPUs
    and GPGPUs.]
Contents
   Background

   Parallel Programming Model and Productivity Tools

   Optimization of System Software

   Supporting GPU in a Virtualized Environment

   Utilizing GPU in Middleware

   Conclusion
Background
Why multicore?
   Multicore CPU
       Power wall
       ILP (instruction-level parallelism) wall
       Memory wall
       Wire delay

   GPGPU (General-Purpose computing on a Graphics Processing Unit)
       A GPU traditionally handles computation only for computer graphics.
       Add the following to the rendering pipeline:
           programmable stages
           higher-precision arithmetic
       Use stream processing on non-graphics data.
Architecture of a GPGPU Core
   [Figure: architecture of a GPGPU core]
Parallel Programming Model and Productivity Tools
OpenMP
   Parallel programming API for shared-memory multiprocessing in C, C++, and Fortran

   Uses a language extension – “#pragma omp”
       Needs compiler support
OpenMP (cont’d)
   Fork-and-join model
       Bounded parallel loops, reductions

   Task-creation-and-join model
       Unbounded loops, recursive algorithms, producer/consumer

   (both models are sketched in the code below)
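A minimal sketch of both models, assuming a compiler with OpenMP support (the functions and values are illustrative):

#include <cstdio>
#include <omp.h>

// Fork-and-join: a bounded parallel loop with a reduction.
double sum_squares(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * a[i];      // iterations are divided among threads
    return sum;                  // implicit join: all threads finish here
}

// Task-creation-and-join: a recursive algorithm expressed as tasks.
long fib(int n) {
    if (n < 2) return n;
    long x;
    #pragma omp task shared(x)
    x = fib(n - 1);              // child task; may run on another thread
    long y = fib(n - 2);         // computed by the current task
    #pragma omp taskwait         // join: wait for the child task
    return x + y;
}

int main() {
    double a[4] = {1, 2, 3, 4};
    printf("sum = %f\n", sum_squares(a, 4));
    long f = 0;
    #pragma omp parallel
    #pragma omp single           // one thread creates the root task
    f = fib(10);
    printf("fib = %ld\n", f);
    return 0;
}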
Intel TBB (Threading Building Blocks)
   Similar to OpenMP
       API for shared memory multiprocessing
       Fork-and-join
           parallel-for, parallel-reduce
       Task-creation-and-join
           Task scheduler

   Different from OpenMP
       C++ template library
       Concurrent container classes
           Hash map, vector, queue
       Various synchronization mechanisms
           mutex, spin lock, …
       Atomic types, atomic operations
       Scalable memory allocator
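A short sketch of the TBB style (assumes the TBB headers and library are available; the numbers are illustrative): a parallel-reduce over a blocked range, plus one of the concurrent containers.

#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <tbb/concurrent_hash_map.h>
#include <cstdio>

int main() {
    double a[1000];
    for (int i = 0; i < 1000; i++) a[i] = 0.5 * i;

    // parallel-reduce: the task scheduler splits the range across
    // cores and combines the partial sums.
    double sum = tbb::parallel_reduce(
        tbb::blocked_range<int>(0, 1000), 0.0,
        [&](const tbb::blocked_range<int> &r, double acc) {
            for (int i = r.begin(); i != r.end(); ++i) acc += a[i];
            return acc;
        },
        [](double x, double y) { return x + y; });

    // Concurrent container: safe to insert from many threads at once.
    tbb::concurrent_hash_map<int, double> table;
    table.insert({42, sum});

    printf("sum = %f\n", sum);
    return 0;
}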
Nvidia CUDA (Compute Unified Device Architecture)

   CUDA
       Computing engine in Nvidia GPUs
       Programming framework for Nvidia GPUs
       Uses CUDA-extended C
           declspecs, keywords, intrinsics, runtime API, function launch, …




   [Figures: CUDA-extended C; compiling CUDA code; processing flow on CUDA]
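A minimal CUDA-extended C sketch of that processing flow (copy input to device memory, launch the kernel, copy the result back); the kernel and sizes are illustrative:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-in variables
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    static float ha[n], hb[n], hc[n];
    for (int i = 0; i < n; i++) { ha[i] = i; hb[i] = 2.0f * i; }

    float *da, *db, *dc;                       // 1. copy input to the device
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));
    cudaMemcpy(da, ha, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(float), cudaMemcpyHostToDevice);

    vec_add<<<n / 256, 256>>>(da, db, dc, n);  // 2. launch a grid of blocks

    cudaMemcpy(hc, dc, n * sizeof(float),      // 3. copy the result back
               cudaMemcpyDeviceToHost);
    printf("c[7] = %f\n", hc[7]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}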
Nvidia CUDA (cont’d)




   [Figures: execution model; kernel memory access]
OpenCL (Open Computing Language)

   CPU/GPU heterogeneous computing framework standardized by the Khronos Group




   [Figures: OpenCL memory model; CUDA and OpenCL example code]
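For comparison with the CUDA sketch above, the same vector add in OpenCL (assumes an OpenCL runtime is installed; error checks omitted for brevity). The kernel is compiled from source at run time, so one program targets CPUs and GPUs alike:

#include <CL/cl.h>
#include <cstdio>

static const char *src =
    "__kernel void vec_add(__global const float *a, __global const float *b,\n"
    "                      __global float *c) {\n"
    "    int i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main() {
    const int n = 1024;
    float ha[n], hb[n], hc[n];
    for (int i = 0; i < n; i++) { ha[i] = i; hb[i] = 2.0f * i; }

    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    // Build the kernel from source for whatever device was found.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vec_add", NULL);

    size_t bytes = n * sizeof(float);
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, ha, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, hb, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    size_t global = n;                      // NDRange of n work-items
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, bytes, hc, 0, NULL, NULL);

    printf("c[7] = %f\n", hc[7]);
    return 0;
}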
Lithe: Enabling Efficient Composition of
Parallel Libraries
   Who?
       ParLab, UC Berkeley, HotPar’09

   Problem
       Composing parallel libraries causes performance anomalies (each library assumes it owns the whole machine and oversubscribes cores).
Lithe: Enabling Efficient Composition of
Parallel Libraries (cont’d)
   Solution
       Virtualized threads are bad for parallel libraries.
       Harts
           Unvirtualized hardware thread contexts
           Shared between libraries instead of oversubscribed
       Lithe
           Cooperative hierarchical scheduler framework for harts (interface sketched below)
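A hedged sketch of the shape of such a framework (our illustration, not Lithe's actual API): each parallel library implements cooperative callbacks, and a parent scheduler grants harts to its children rather than letting each library spawn its own OS threads.

// Illustrative only: the real Lithe interface differs in detail.
struct Scheduler {
    virtual void enter() = 0;             // a granted hart begins running here
    virtual void yield() = 0;             // hand the hart back to the parent
    virtual void request(int nharts) = 0; // ask the parent for more harts
    virtual ~Scheduler() {}
};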
Concurrency bug detection: DataCollider
   Who?
       Microsoft Research, OSDI’10
   Problem
       Detecting data-race bugs is difficult.
       For a large system such as the Windows kernel, runtime overhead is critical.
   Solution
       Sampling using code breakpoints
       When a code breakpoint traps:
           Set a data breakpoint on its operand
           Sleep for a while
           If the data changed, it may be a data race (see the sketch below)
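A hedged sketch of that detection window (our pseudocode in C++; the platform helpers are hypothetical, not MSR's code):

#include <cstddef>
#include <cstdint>

// Hypothetical platform helpers, assumed for illustration.
uint64_t read_bytes(void *addr, size_t size);
void set_data_breakpoint(void *addr, size_t size);
bool clear_data_breakpoint();           // true if another thread hit it
void sleep_ms(int ms);
int  random_pause_ms();
void report_possible_race(void *addr);

// Fires on a randomly sampled memory access in the monitored code.
void on_code_breakpoint(void *addr, size_t size) {
    uint64_t before = read_bytes(addr, size);   // snapshot the operand

    set_data_breakpoint(addr, size);            // trap conflicting accesses
    sleep_ms(random_pause_ms());                // give a racing thread time
    bool trapped = clear_data_breakpoint();

    uint64_t after = read_bytes(addr, size);
    if (trapped || before != after)             // someone touched it while
        report_possible_race(addr);             // we were paused
}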
Concurrency bug detection: SyncFinder
   Who?
        UC San Diego, OSDI’10
   Problem
        How to find ad-hoc synchronization (hand-rolled sync such as the spin loop sketched below)
   Solution
       Formalize patterns of ad-hoc synchronization
       Detect such patterns using LLVM
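For illustration, the kind of ad-hoc synchronization SyncFinder formalizes: a sync loop whose exit condition depends on a shared variable set by another thread (produce/consume bodies elided).

volatile int done = 0;      // shared flag used instead of a lock/condvar

void producer() {
    /* ... produce data ... */
    done = 1;               // ad-hoc "signal"
}

void consumer() {
    while (!done) { }       // ad-hoc "wait": spin until the flag flips
    /* ... consume data ... */
}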
Optimization of System Software
Memory Allocation: Hoard
   Who?
        UT Austin, ASPLOS’00
   Problem
       The memory allocator is a performance bottleneck in multiprocessor environments.
       Lock contention, false sharing, blowup

   [Figure: allocator-induced false sharing]
Memory Allocation: Hoard (cont’d)
   Solution
       Per-processor heaps to reduce lock contention and false sharing
       Global heap
           Borrow memory from the global heap to grow a per-processor heap
           Return memory to the global heap when a per-processor heap has too much free memory
       (policy sketched below)
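A hedged sketch of that policy (greatly simplified: one size class and made-up thresholds; Hoard itself manages superblocks per size class):

#include <cstdlib>
#include <mutex>
#include <vector>

constexpr size_t BLOCK = 256;            // single size class, for brevity

struct Heap {
    std::mutex lock;
    std::vector<void *> free_blocks;     // stand-in for superblock lists
};

Heap global_heap;
Heap cpu_heap[64];                       // one heap per processor

void *alloc_on(int cpu) {
    Heap &h = cpu_heap[cpu];
    std::lock_guard<std::mutex> g(h.lock);       // threads on other CPUs
    if (h.free_blocks.empty()) {                 // never contend on this lock
        std::lock_guard<std::mutex> gg(global_heap.lock);
        if (!global_heap.free_blocks.empty()) {  // borrow from the global heap
            h.free_blocks.push_back(global_heap.free_blocks.back());
            global_heap.free_blocks.pop_back();
        } else {
            h.free_blocks.push_back(std::malloc(BLOCK));
        }
    }
    void *p = h.free_blocks.back();
    h.free_blocks.pop_back();
    return p;
}

void free_on(void *p, int cpu) {
    Heap &h = cpu_heap[cpu];
    std::lock_guard<std::mutex> g(h.lock);
    h.free_blocks.push_back(p);
    if (h.free_blocks.size() > 64) {             // too much free memory:
        std::lock_guard<std::mutex> gg(global_heap.lock);
        global_heap.free_blocks.push_back(h.free_blocks.back());
        h.free_blocks.pop_back();                // return some to the global
    }                                            // heap, bounding blowup
}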
Memory Allocation: Xmalloc
   Who?
       UIUC, ICCIT’10
   Problem
       A scalable malloc for CUDA, where hundreds of threads allocate concurrently
   Solution
       Memory allocation coalescing (sketched below)
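A hedged sketch of coalescing (the idea behind XMalloc, not its code; assumes a uniform request size per warp to keep it short): simultaneous requests from one warp are combined so only the leader pays for the underlying allocation.

#include <cuda_runtime.h>

__device__ void *coalesced_malloc(size_t sz) {
    unsigned mask = __activemask();               // lanes allocating right now
    int lane   = threadIdx.x & 31;
    int leader = __ffs(mask) - 1;                 // lowest active lane

    size_t total = __popc(mask) * sz;             // one request for the warp

    unsigned long long base = 0;
    if (lane == leader)                           // only the leader allocates
        base = (unsigned long long)malloc(total); // device-side malloc
    base = __shfl_sync(mask, base, leader);       // broadcast the base address

    // Each thread carves its own slice; note that only the leader's base
    // pointer may later be passed to free().
    int my_slot = __popc(mask & ((1u << lane) - 1));
    return (void *)(base + (unsigned long long)my_slot * sz);
}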
System Call: FlexSC
   Who?
       University of Toronto, OSDI’10
   Problem
       System calls have a large negative performance impact.
           Direct mode-switch cost + indirect cost (cache/TLB pollution)
   Solution
       System call batching and asynchronous, exception-less system calls (sketched below)
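A hedged sketch of the exception-less mechanism (names and layout are ours, not FlexSC's): user threads post requests into a shared syscall page and keep running; kernel threads execute batched entries and post results, so no per-call mode switch occurs.

#include <atomic>
#include <cstdint>

enum class Status : uint32_t { FREE, SUBMITTED, DONE };

struct SyscallEntry {
    std::atomic<Status> status{Status::FREE};
    uint32_t number;                // which system call
    uint64_t args[6];
    int64_t  ret;
};

SyscallEntry syscall_page[64];      // shared between user and kernel threads

// User side: submit without trapping into the kernel.
int64_t posted_syscall(uint32_t nr, uint64_t a0) {
    SyscallEntry &e = syscall_page[0];        // picking a FREE slot is elided
    e.number = nr;
    e.args[0] = a0;
    e.status.store(Status::SUBMITTED, std::memory_order_release);

    // FlexSC's M-on-N thread library runs other user threads here instead
    // of spinning; we busy-wait only to keep the sketch small.
    while (e.status.load(std::memory_order_acquire) != Status::DONE) { }
    e.status.store(Status::FREE, std::memory_order_relaxed);
    return e.ret;
}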
Revisiting OS Architecture
Multikernel
   Who?
       ETH Zurich, Microsoft Research Cambridge, SOSP’09
   Problem
       System diversity
           It is no longer acceptable (or useful) to tune a general-purpose OS
            design for a particular hardware model.
Multikernel (cont’d)
   Problem (cont’d)
       The interconnect matters

       [Figures: 8-socket Nehalem topology; on-chip interconnects; shared
        memory vs. message passing (SHM bars are stalled cycles, with no
        locking involved)]

       Core diversity
           Programmable NICs
           GPUs
           FPGAs in CPU sockets
Multikernel (cont’d)
   Solution
       “Today’s computer is already a distributed system. Why isn’t your OS?”
       Barrelfish
           Implementation of the multikernel approach
           Message passing, shared nothing, replica maintenance (a message-channel sketch follows)
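A hedged sketch of the flavor of inter-core communication involved (our illustration of cache-line message passing over shared memory, not Barrelfish's UMP code): each message fills one cache line, so a transfer costs roughly one coherence transaction.

#include <atomic>
#include <cstdint>

struct alignas(64) Message {         // one cache line per message
    uint64_t payload[7];
    std::atomic<uint64_t> seq{0};    // written last: marks the slot valid
};

Message channel[256];                // single-producer/single-consumer ring

void send(uint64_t n, const uint64_t *data) {       // n = sequence number
    Message &m = channel[n % 256];
    for (int i = 0; i < 7; i++) m.payload[i] = data[i];
    m.seq.store(n + 1, std::memory_order_release);  // publish
}

bool try_recv(uint64_t n, uint64_t *out) {
    Message &m = channel[n % 256];
    if (m.seq.load(std::memory_order_acquire) != n + 1)
        return false;                // not yet arrived; poll again
    for (int i = 0; i < 7; i++) out[i] = m.payload[i];
    return true;
}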
An Analysis of Linux Scalability to Many Cores
   Who?
       MIT CSAIL, OSDI’10
   Problem
       Is a traditional kernel such as Linux actually unscalable on many cores?
   Solution
       Tested Linux scalability with 7 applications on a 48-core AMD machine
       No kernel scalability problems up to 48 cores
           3,002 lines of patches
       Sloppy counter: a replicated reference counter (sketched below)
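A hedged sketch of a sloppy counter (simplified; in the kernel the local counts are per core, not per thread): each core counts locally and spills to the shared counter only past a threshold, avoiding cache-line ping-pong on every increment.

#include <atomic>

constexpr long THRESHOLD = 32;

std::atomic<long> global_count{0};
thread_local long local_count = 0;   // per core in the kernel

void inc() {
    if (++local_count >= THRESHOLD) {         // spill: one shared-counter
        global_count.fetch_add(local_count);  // update per THRESHOLD local
        local_count = 0;                      // increments
    }
}

long read_approx() {            // may lag the true value by up to
    return global_count.load(); // (#cores * THRESHOLD); sum the local
}                               // counts when an exact value is needed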
Supporting GPU in a Virtualized Environment
HyVM (Hybrid Virtual Machines)

   Who?
       Georgia Tech
   Problem
       Asymmetries in performance, memory and cache
       Functional differences
           Multiple accelerators
           Vector processor
           Floating point
            Additional instructions for acceleration
   Solution
        Heterogeneity- and asymmetry-aware hypervisors
HyVM (cont’d)
   Solution (cont’d)

       [Figures: HyVM architecture; GViM GPU virtualization architecture;
        memory management in GViM; Harmony CPU/GPU co-scheduling]
VMGL (Virtualizing OpenGL)

   Who?
       University of Toronto, VEE’07
   Problem
       How to support OpenGL in a virtual machine environment
   Solution
        Forward OpenGL commands to the driver domain
Utilizing GPU in Middleware
StoreGPU
   Who?
        University of British Columbia, HPDC’10
   Problem
       In CAS (content-addressable storage), how to minimize hash computation cost
   Solution
       Offloading hashing to the GPU (sketched below)

   [Figure: StoreGPU architecture]
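A hedged sketch of that offload (one GPU thread hashes one chunk; FNV-1a stands in for the cryptographic hashes StoreGPU actually computes):

#include <cuda_runtime.h>
#include <cstdint>

__global__ void hash_chunks(const uint8_t *data, uint64_t *digests,
                            int chunk_size, int nchunks) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= nchunks) return;
    const uint8_t *p = data + (size_t)c * chunk_size;

    uint64_t h = 1469598103934665603ull;   // FNV-1a offset basis
    for (int i = 0; i < chunk_size; i++)
        h = (h ^ p[i]) * 1099511628211ull; // FNV prime
    digests[c] = h;                        // one fingerprint per chunk
}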
PacketShader
   Who?
       KAIST, SIGCOMM’10, NSDI’11
   Problem
        How to boost the performance of a software router
    Solution
        Offload stateless (parallelizable) packet processing to the GPU (sketched below)

   [Figures: PacketShader architecture; basic workflow of PacketShader]
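A hedged sketch of that offload pattern (our illustration, not PacketShader's code; a toy lookup stands in for the real IPv4 forwarding table): the host batches many packet headers into device memory, and each GPU thread handles one packet.

#include <cuda_runtime.h>
#include <cstdint>

struct Ipv4Hdr { uint8_t ttl; uint32_t dst; /* other fields elided */ };

__global__ void forward(Ipv4Hdr *pkts, int *out_port, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;                 // one thread per packet: processing
    pkts[i].ttl -= 1;                   // is stateless, so packets proceed
    out_port[i] = pkts[i].dst & 0xff;   // in parallel (toy "table lookup")
}

// Host side: launch over a batch already copied to device memory.
void process_batch(Ipv4Hdr *d_pkts, int *d_port, int n) {
    forward<<<(n + 255) / 256, 256>>>(d_pkts, d_port, n);
    cudaDeviceSynchronize();
}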
Conclusion