A Survey on in-a-box parallel computing and its implications on system software research
1. A Survey on in-a-box parallel computing and its implications on system software research
Changwoo Min (multics69@gmail.com)
2. Motivation
Technology ratios matter. (Jim Gray)
In the face of such "10X" forces, you can lose control of your destiny. (Andrew S. Grove)
What are the implications of multicore evolution for system software researchers?
3. Survey Scope and Strategy
[Figure: survey scope as a layered stack, top to bottom: Parallel Application and Parallel Middleware; Parallel Programming Model; System Library; Operating System; Virtual Machine Monitor; hardware (Multicore CPU, GPGPU)]
4. Contents
Background
Parallel Programming Model and Productivity Tools
Optimization of System Software
Supporting GPU in a Virtualized Environment
Utilizing GPU in Middleware
Conclusion
9. OpenMP
Parallel programming API for shared-memory
multiprocessing in C, C++, and Fortran
Uses compiler directives (“#pragma omp” in C and C++)
Needs compiler support
10. OpenMP (cont’d)
Fork-and-join model
Bounded parallel loop, reduction
Task-creation-and-join model
Unbounded loop, recursive algorithm, producer/consumer
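The two models above can be sketched in C++ as follows. This is a minimal illustration, not material from the original slides; the function names `parallel_sum`, `fib_task`, and `fib` are made up for the example. With GCC or Clang it is typically compiled with `-fopenmp`; without that flag the pragmas are ignored and the code runs serially with the same results.

```cpp
#include <cassert>
#include <vector>

// Fork-and-join: the loop iterations are split among a team of threads,
// and the per-thread partial sums are combined by the reduction clause.
long parallel_sum(const std::vector<int>& values) {
    long sum = 0;
    const long n = static_cast<long>(values.size());
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) {
        sum += values[i];
    }
    return sum;
}

// Task-creation-and-join: each recursive call may become a task, and
// taskwait joins the two child tasks before their results are combined.
long fib_task(int n) {
    assert(n >= 0);
    if (n < 2) return n;
    long a = 0, b = 0;
    #pragma omp task shared(a)
    a = fib_task(n - 1);
    #pragma omp task shared(b)
    b = fib_task(n - 2);
    #pragma omp taskwait
    return a + b;
}

// Entry point for the task version: one thread of the team seeds the
// recursion, and the other threads pick up the tasks it creates.
long fib(int n) {
    long result = 0;
    #pragma omp parallel
    {
        #pragma omp single
        result = fib_task(n);
    }
    return result;
}
```

The bounded loop fits the fork-and-join form because its iteration count is known up front, while the recursion has no fixed amount of work, which is why it is expressed with tasks.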
11. Intel TBB (Threading Building Blocks)
Similar to OpenMP
API for shared memory multiprocessing
Fork-and-join
parallel-for, parallel-reduce
Task-creation-and-join
Task scheduler
Different from OpenMP
C++ template library
Concurrent container classes
Hash map, vector, queue
Various synchronization mechanisms
mutex, spin lock, …
Atomic types, atomic operations
Scalable memory allocator
12. Nvidia CUDA (Compute Unified Device Architecture)
CUDA
Computing engine in Nvidia GPU
Programming framework for Nvidia GPU
Uses CUDA-extended C
declspecs, keywords, intrinsics, runtime API, function launch, …
[Figures: CUDA extended C; Compiling CUDA code; Processing flow on CUDA]
14. OpenCL (Open Computing Language)
CPU/GPU heterogeneous computing framework
Standardized by the Khronos Group
[Figures: OpenCL memory model; CUDA and OpenCL example]
15. Lithe: Enabling Efficient Composition of Parallel Libraries
Who?
ParLab, UC Berkeley, HotPar’09
Problem
Composing parallel libraries causes performance anomalies
16. Lithe: Enabling Efficient Composition of Parallel Libraries (cont’d)
Solution
Virtualized threads are bad for parallel libraries.
Harts
Unvirtualized hardware thread context
Sharing harts
Lithe
Cooperative hierarchical scheduler framework for harts
17. Concurrency bug detection: DataCollider
Who?
Microsoft Research, OSDI’10
Problem
Detecting data race bugs is difficult.
For a large system such as the Windows kernel, runtime overhead is critical.
Solution
Sampling using code breakpoints
When a code breakpoint is trapped,
Set a data breakpoint on its operand
Sleep for a while
If the data has changed, it could be a data race.
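The "sleep, then check whether the data changed" step can be modeled roughly as below. This is a simplified sketch under stated assumptions: it does not set real hardware breakpoints, the names `SampleResult` and `sample_access` are hypothetical, and the `pause` callable stands in for the delay during which other threads may run. It only shows the value-comparison part of the idea.

```cpp
#include <cassert>
#include <functional>

// Hypothetical, simplified model of one sampling step. The real tool traps
// on a code breakpoint and uses a data breakpoint; here the "pause"
// callable stands in for the delay during which other threads run.
struct SampleResult {
    bool possible_race;  // true if the watched value changed during the pause
    int before;
    int after;
};

SampleResult sample_access(const volatile int* operand,
                           const std::function<void()>& pause) {
    assert(operand != nullptr);
    const int before = *operand;  // snapshot taken when the code breakpoint fires
    pause();                      // sleep for a while; other threads may write
    const int after = *operand;   // re-read after the delay
    return SampleResult{before != after, before, after};
}
```

A changed value only flags a possible race, and a write that stores the same value goes unnoticed by this comparison, so the result is a hint to investigate rather than a proof.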
18. Concurrency bug detection: SyncFinder
Who?
UC San Diego, OSDI’10
Problem
How to find ad-hoc synchronization
Solution
Formalize patterns of ad-hoc synchronization
Detect such patterns using LLVM
20. Memory Allocation: Hoard
Who?
UT Austin, ASPLOS’00
Problem
The memory allocator is a performance bottleneck in multiprocessor environments.
Lock contention, false sharing, blowup
[Figure: Allocator-induced false sharing]
21. Memory Allocation: Hoard (cont’d)
Solution
Per-processor heaps to reduce lock contention and false sharing
Global heap
Borrow memory from the global heap to grow a per-processor heap
Return memory to the global heap if there is too much free memory in a per-processor heap
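The borrow-and-return policy above can be illustrated with a toy model. This is a sketch under assumptions, not Hoard's actual implementation: a "block" is just an integer id, locking is omitted, and the class names, the batch size, and the `max_free` threshold are all made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: a "block" is just an integer id. All names and the batch and
// threshold values are illustrative assumptions, not Hoard's real design.
class GlobalHeap {
public:
    explicit GlobalHeap(std::size_t blocks) {
        for (std::size_t i = 0; i < blocks; ++i) free_.push_back(static_cast<int>(i));
    }
    // Hand out up to `count` free blocks (a real allocator would lock here).
    std::vector<int> borrow(std::size_t count) {
        std::vector<int> out;
        for (std::size_t i = 0; i < count && !free_.empty(); ++i) {
            out.push_back(free_.back());
            free_.pop_back();
        }
        return out;
    }
    void give_back(int block) { free_.push_back(block); }
    std::size_t free_blocks() const { return free_.size(); }
private:
    std::vector<int> free_;
};

class PerProcessorHeap {
public:
    PerProcessorHeap(GlobalHeap& global, std::size_t batch, std::size_t max_free)
        : global_(global), batch_(batch), max_free_(max_free) {
        assert(batch > 0);
    }
    // Allocate from the local free list; refill from the global heap only
    // when the local list is empty, so the common path is purely local.
    int allocate() {
        if (free_.empty()) free_ = global_.borrow(batch_);
        if (free_.empty()) return -1;  // out of memory in this toy model
        int block = free_.back();
        free_.pop_back();
        return block;
    }
    // Free locally; return surplus to the global heap once too much is idle.
    void release(int block) {
        free_.push_back(block);
        while (free_.size() > max_free_) {
            global_.give_back(free_.back());
            free_.pop_back();
        }
    }
    std::size_t free_blocks() const { return free_.size(); }
private:
    GlobalHeap& global_;
    std::size_t batch_;
    std::size_t max_free_;
    std::vector<int> free_;
};
```

Bounding the idle memory held by each per-processor heap is what keeps total consumption from blowing up when one processor allocates and another frees.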
22. Memory Allocation: Xmalloc
Who?
UIUC, ICCIT’10
Problem
Need a scalable malloc for CUDA, where hundreds of threads run concurrently.
Solution
Memory allocation coalescing
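Allocation coalescing can be sketched on the host side as follows: the requests of a group of threads are combined into one underlying allocation, which is then split by an exclusive prefix sum over the request sizes. This is an illustrative CPU-only model, not XMalloc's GPU code; `CoalescedAllocation` and `coalesced_malloc` are hypothetical names, and the whole buffer is freed at once here rather than piece by piece.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Result of one coalesced allocation: a single underlying buffer plus the
// sub-pointer handed to each requesting thread.
struct CoalescedAllocation {
    void* base;
    std::vector<void*> parts;
};

// Combine the requests of one group of threads into a single malloc call,
// then carve the buffer up with an exclusive prefix sum of the sizes.
CoalescedAllocation coalesced_malloc(const std::vector<std::size_t>& sizes) {
    const std::size_t align = alignof(std::max_align_t);
    std::vector<std::size_t> offsets;
    std::size_t total = 0;
    for (std::size_t size : sizes) {
        offsets.push_back(total);
        total += (size + align - 1) / align * align;  // round up to alignment
    }
    CoalescedAllocation result{std::malloc(total > 0 ? total : 1), {}};
    assert(result.base != nullptr);
    for (std::size_t offset : offsets) {
        result.parts.push_back(static_cast<char*>(result.base) + offset);
    }
    return result;
}
```

The point of the technique is that N concurrent requests cost one trip to the shared allocator instead of N, which reduces contention on its data structures.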
23. System Call: FlexSC
Who?
University of Toronto, OSDI’10
Problem
The negative performance impact of system calls is large.
Direct cost (mode switch) + indirect cost (processor state pollution)
Solution
Batching and asynchronous system calls
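A user-space model of the idea: the application posts requests into a shared table without entering the kernel, and a separate pass executes all pending requests together. This is a sketch under assumptions, not FlexSC's interface; `SyscallPage`, `SyscallEntry`, and their methods are hypothetical, and a `std::function` stands in for a system call number plus arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

enum class EntryStatus { Free, Submitted, Done };

// One slot of a hypothetical shared "syscall page".
struct SyscallEntry {
    EntryStatus status = EntryStatus::Free;
    std::function<long()> call;  // stands in for syscall number + arguments
    long result = 0;
};

class SyscallPage {
public:
    explicit SyscallPage(std::size_t slots) : entries_(slots) {}

    // Application side: post a request without a mode switch.
    // Returns the slot index, or -1 if the page is full.
    int submit(std::function<long()> call) {
        for (std::size_t i = 0; i < entries_.size(); ++i) {
            if (entries_[i].status == EntryStatus::Free) {
                entries_[i].call = std::move(call);
                entries_[i].status = EntryStatus::Submitted;
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Kernel side: execute every pending request in one pass.
    // Returns how many requests were served.
    std::size_t process_batch() {
        std::size_t served = 0;
        for (SyscallEntry& e : entries_) {
            if (e.status == EntryStatus::Submitted) {
                e.result = e.call();
                e.status = EntryStatus::Done;
                ++served;
            }
        }
        return served;
    }

    bool done(int slot) const { return entries_.at(slot).status == EntryStatus::Done; }

    // Application side: collect the result and recycle the slot.
    long take_result(int slot) {
        SyscallEntry& e = entries_.at(slot);
        assert(e.status == EntryStatus::Done);
        e.status = EntryStatus::Free;
        return e.result;
    }
private:
    std::vector<SyscallEntry> entries_;
};
```

Separating submission from execution lets many requests share the cost of one switch into the kernel, and lets the application keep working while they are pending.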
25. Multikernel
Who?
ETH Zurich, Microsoft Research Cambridge, SOSP’09
Problem
System diversity
It is no longer acceptable (or useful) to tune a general-purpose OS
design for a particular hardware model.
26. Multikernel (cont’d)
Problem (cont’d)
SHM: stalled cycles (no locking!)
The interconnect matters
[Figures: 8-socket Nehalem; On-chip interconnects; SHM vs. message passing]
Core diversity
Programmable NICs
GPU
FPGA in CPU sockets
27. Multikernel (cont’d)
Solution
Today’s computer is already a distributed system. Why isn’t
your OS?
Barrelfish
Implementation of the multikernel approach
Message passing, shared nothing, replica maintenance
28. An Analysis of Linux Scalability to Many Cores
Who?
MIT CSAIL, OSDI’10
Problem
Is Linux scalable enough for many-core machines?
Solution
Test Linux scalability on a 48-core machine (AMD Opteron) with 7 applications
No remaining kernel scalability problems up to 48 cores after patching
3002 LOC of patches
Sloppy counter: replicated reference counter
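A sloppy counter keeps one shared central counter plus a per-core count of spare references, so most increments and decrements touch only core-local state. The following single-threaded model sketches that bookkeeping; the class name, method names, and the threshold value are assumptions for illustration, and a real kernel version would use an atomic central counter and per-core memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single-threaded model of a sloppy counter. Invariant:
//   central == references in use + sum of per-core spare references.
// The threshold is an assumed tunable, not a value from the slides.
class SloppyCounter {
public:
    SloppyCounter(std::size_t cores, long threshold)
        : spare_(cores, 0), threshold_(threshold) {}

    // Take a reference on `core`: use a local spare if one exists,
    // otherwise fall back to the shared central counter.
    void acquire(std::size_t core) {
        if (spare_.at(core) > 0) {
            --spare_[core];  // fast path: only core-local state is touched
        } else {
            ++central_;      // slow path: shared counter is updated
        }
    }

    // Drop a reference on `core`: keep it as a local spare, and hand
    // spares back to the central counter once too many pile up.
    void release(std::size_t core) {
        assert(in_use() > 0);
        ++spare_.at(core);
        if (spare_[core] > threshold_) {
            central_ -= spare_[core];
            spare_[core] = 0;
        }
    }

    long central() const { return central_; }
    long spare(std::size_t core) const { return spare_.at(core); }

    // True number of references in use, reconstructed from the invariant.
    long in_use() const {
        long spares = 0;
        for (long s : spare_) spares += s;
        return central_ - spares;
    }
private:
    long central_ = 0;
    std::vector<long> spare_;
    long threshold_;
};
```

The trade-off is that the central value alone overstates the true count, so reading the exact number of references requires summing the per-core spares as well.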
32. VMGL (Virtualizing OpenGL)
Who?
University of Toronto, VEE’07
Problem
How to support OpenGL in a virtual machine environment
Solution
Forward OpenGL commands to the driver domain
34. StoreGPU
Who?
University of British Columbia, HPDC’10
Problem
In content-addressable storage (CAS),
how to minimize the cost of hash calculation
Solution
Offloading to GPU
[Figure: StoreGPU architecture]
35. PacketShader
Who?
KAIST, SIGCOMM’10, NSDI’11
Problem
How to boost the performance of a software router
Solution
Offload stateless (parallelizable) packet processing to GPU
[Figures: PacketShader architecture; Basic workflow of PacketShader]