Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

The Rise of Parallel Computing


Published on

Presentation I gave at the SORT Conference in 2011. Was generalized from some work I had done with using GPUs to accelerate image processing at FamilySearch.

Published in: Technology
  • Be the first to like this

The Rise of Parallel Computing

  1. 1. The Rise of ParallelComputing Ben Baker
  2. 2. Moore’s Law"The number of transistors incorporated in a chipwill approximately double every 24 months." Gordon Moore, Intel Co-Founder Originally published in 1965
  3. 3. So What’s the Problem?• Can continue to increase transistors per Moore’s Law• Cannot continue to increase power or chips will melt – Power steadily rose with new chips until ~2005 – now 1 volt• Cannot continue to scale processor frequency – Have you seen any 10 GHz chips? Moore’s Law gave no prediction of continued performance increases
  4. 4. Time to “Take the Leap”“We have reached the limit of what is possible withone or more traditional, serial central processingunits, or CPUs. It is past time for the computingindustry – and everyone who relies on it forcontinued improvements in productivity, economicgrowth and social progress – to take the leap intoparallel processing.” Bill Dally - Chief Scientist at NVIDIA and Professor at Stanford University
  5. 5. Additional Resources• Stanford course available on iTunes U• – Programming Massively Parallel Processors with CUDA – Lectures 1 and 13 are great introductions • Lecture 13 – The Future of Throughput Computing (Bill Dally) • Lecture 1 – Introduction to Massively Parallel Computing
  6. 6. Guiding Principles• Performance = Parallelism – Single-threaded processor performance has flat- lined at 0-5% annual growth since ~2005• Efficiency = Locality – Chips are power limited with most power spent moving data around
  7. 7. Three Types of Parallelism• Instruction-level parallelism – Out of order execution, branch prediction, etc. – Opportunities decreasing• Data-level parallelism – SIMD (Single Instruction Multiple Data), GPUs, etc. – Opportunities increasing• Thread-level parallelism – Multithreading, multi-core CPUs, etc. – Opportunities increasing
  8. 8. Taking the Leap• Three things are required – Lots of processors – Efficienct memory storage – Programming system that abstracts it
  9. 9. CPU VS. GPU ARCHITECTURE CPU GPU• General purpose • Special purpose processors processors• Optimized for • Optimized for data level instruction level parallelism parallelism • Many smaller processors• A few large processors executing single capable of multi- instructions on multiple threading data (SIMD)
  10. 10. High Performance GPU Computing• GPUs are getting faster more quickly than CPUs• Being used in industry for weather simulation, medical imaging, computational finance, etc.• Amazon is now offering access to NVIDIA Tesla GPUs in the cloud as a service ($ vs ¢ per hour)• GPUs are being used as general purpose parallel processors –
  11. 11. Examples• CUDA – NVIDIA• C++ AMP – Microsoft• OpenCL – Open source• NPP – NVIDIA (Research done at FamilySearch)
  12. 12. CUDA• Compute Unified Device Architecture• Proprietary NVIDIA extensions to C for running code on NVIDIA GPUs• Other language bindings – Java – jCUDA, JCuda, JCublas, JCufft – Python – PyCUDA, KappaCUDA – .NET – CUDAfy.NET, CUDA.NET – Ruby – KappaCUDA – More – Fortran, Perl, Mathematica, MATLAB, etc.
  13. 13. C for CUDA Example// Compute vector sum c = a + b// Each thread performs one pair-wise addition__global__ void vector_add(float* A, float* B, float* C){ int i = threadIdx.x + blockDim.x * blockIdx.x; C[i] = A[i] + B[i];{int main(){ // Allocate and initialize host (CPU) memory float* hostA = …, *hostB = …; // Allocate device (GPU) memory cudaMalloc((void**) &deviceA, N * sizeof(float)); cudaMalloc((void**) &deviceB, N * sizeof(float)); cudaMalloc((void**) &deviceC, N * sizeof(float)); // Copy host memory to device cudaMemcpy(deviceA, hostA, N * sizeof(float), cudaMemcpyHostToDevice)); cudaMemcpy(deviceB, hostB, N * sizeof(float), cudaMemcpyHostToDevice)); // Run N/256 blocks of 256 threads each vector_add<<< N/256, 256>>>(deviceA, deviceB, deviceC);}
  14. 14. Heterogeneous Computing with Microsoft C++ AMP• AMP = Accelerated Massive Parallelism• Designed to take advantage of all the available compute resources (CPU, integrated & discrete GPUs)• Coming in the next version of Visual Studio and C++ in the next year or two• Cool demo
  15. 15. EXAMPLE – C++ AMPvoid MatrixMult(float* C, const vector<float>&A, const vector<float>&B, int M, int N, int W){ for (int y = 0; y < M; y++) { for (int x = 0; x < N; x++) { float sum = 0; for (int i = 0; i < W; i++) sum += A(y*W + i] * B[i*N + x); C[y*N + x] = sum; } }}void MatrixMult(float* C, const vector<float>&A, const vector<float>&B, int M, int N, int W){ array_view<const float, 2> a (M, W, A), b(W, N, B); array_view<writeonly<float>, 2>c((M, N, C); parallel_for_each(c.grid, [=](index<2> idx) restrict(direct3d) { float sum = 0; for (int i = 0; i < a.x; i++) sum += a(idx.y, i) * b(i, idx.x); c[idx] = sum; });}
  16. 16. OpenCL• Royalty free, cross-platform, vendor neutral• Managed by Khronos OpenCL working group (• Design goal to use all computational resources – GPUs and CPUs are peers• Based on C• Abstract the specifics of underlying hardware
  17. 17. Example – OpenCLvoid trad_mul(int n, const float *a, const float* b, float* c){ for (int i = 0; i < n; i++) c[i] = a[i] * b[i];} kernel void dp_mul(global const float *a, global const float* b, global float* c){ int id = get_global_id(0); c[id] = a[id] * b[id];} // Execute over “n’ work-items
  18. 18. Image Processing Flow at FamilySearch Preservation Storage (Lossless JPEG-2000)Image Capture(Uncompressed TIFF) Image Post-ProcessingMicrofilm Scanners (DPC)Digital Cameras Distribution Storage (JPEG - original size) (JPEG - thumbnails)
  19. 19. Digital Processing Center (DPC)• Collection of servers in a data center used by FamilySearch to continuously process millions of images annually• Image post processing operations performed include – Automatic skew correction – Automatic document cropping – Image sharpening – Image scaling (thumbnail creation) – Encoding into other image formats• CPU is a current bottleneck (~12 sec/image)• Processing requirements continuously rising (number of images, image size and number of color channels)
  20. 20. Computer Graphics vs. Computer Vision• Approximate inverses of each other: – Computer graphics – converting “numbers into pictures” – Computer vision – converting “pictures into numbers”• GPUs have traditionally been used for computer graphics – (Ex. Graphics intensive computer games)• Recent research, hardware and software are using GPUs for computer vision (Ex. Using Graphics Devices in Reverse)• GPUs generally work well when there is ample data- level parallelism
  21. 21. IMPLEMENTATION OPTIONSRack Mount Servers Personal Supercomputer• Several vendors provide solutions. • GPUs for computing can be placed in (Ex. One is a 3U rack mount unit a standard workstation. Several capable of holding 16 GPUs vendors provide solutions. connected to 8 servers) • Each Tesla GPU requires• “Compared to typical quad-core – Available double-wide PCIe slot CPUs, Tesla 20 series computing – Two 6-pin or one 8-pin PCIe power systems deliver equivalent connectors and sufficient wattage performance at 1/10th the cost – Recommend 4GB RAM per card, at and 1/20th the power least 2.33 GHz quad-core CPU and consumption.” (NVIDIA) 64-bit Linux or Windows • “250x the computing performance of a standard workstation” (NVIDIA)
  22. 22. Image Processing Performance with IPP and NPP• FamilySearch currently uses Intel’s IPP – Intel Performance Primitives – Optimize operations on Intel CPUs – Closed source, licensed• NVIDIA has produced a similar library called NPP – NVIDIA Performance Primitives – Optimize operations on NVIDIA GPUs (CUDA underneath) – Higher level abstraction to perform image processing on GPUs – No license for SDK
  23. 23. EXAMPLE – NPP // Declare a host object for an 8-bit grayscale image npp::ImageCPU_8u_C1 hostSrc; // Load grayscale image from disk npp::loadImage(sFilename, hostSrc); // Declare a device image and upload from host npp::ImageNPP_8u_C1 deviceSrc(hostSrc);… [Create padded image]… [Create Gaussian kernel] … [Create padded image] … [Create Gaussian kernel] // Copy kernel to GPU cudaMemcpy2D(deviceKernel, 12, hostKernel, kernelSize.width * sizeof(Npp32s), kernelSize.width * sizeof(Npp32s), kernelSize.height, cudaMemcpyHostToDevice);// Allocate blurred image of appropriate size // Allocate blurred image of appropriate size (on GPU)Ipp8u* blurredImg = ippiMalloc_8u_C1(img.getWidth(), npp::ImageNPP_8u_C1 deviceBlurredImg(imgSz.width, img.getHeight(), &blurredImgStepSz); imgSz.height);// Perform the filter // Perform the filterippiFilter32f_8u_C1R(paddedImgData, nppiFilter_8u_C1R(, paddedImage.getStepSize(), blurredImg, heightOffset), paddedImg.pitch(), blurredImgStepSz, imgSz, kernel, kernelSize,, deviceBlurredImg.pitch(), kernelAnchor); imgSz, deviceKernel, kernelSize, kernelAnchor, divisor); // Declare a host image for the result npp::ImageCPU_8u_C1 hostBlurredImage(deviceBlurredImg.size()); // Copy the device result data into it deviceBlurredImg.copyTo(, hostBlurredImg.pitch());
  24. 24. Performance Testing Methodology• Test System Specifications – Dual Quad Core Intel® Xeon® 2.80GHz i7 CPUs (8 cores total) – 6 GB RAM – 64-bit Windows 7 operating system – Single Tesla C1060 Compute Processor (240 processing cores total) – PCI-Express x16 Gen2 slot• Three representative grayscale images of increasing size – Small image – 1726 x 1450 (2.5 megapixels) – Average image – 4808 x 3940 (18.9 megapixels) – Large image – 8966 x 6132 (55.0 megapixels)• Results for each image repeated 3 times and averaged• Transfer time to/from the GPU is considered part of all GPU operations
  25. 25. • Combining operations minimizes GPU/CPU transfers• 5 – 6x speed up, increasing slightly with image size
  26. 26. AMDAHL’S LAWSpeeding up 25% of anoverall process by 10x isless of an overallimprovement thanspeeding up 75% of anoverall process by 1.5x
  27. 27. Takeaways• Significant performance increases can be realized through parallelization – may become only way in the future• GPUs are transforming into general purpose data-parallel computational coprocessors and outstripping advances in multi- core CPUs• Languages, tools and APIs for parallel computing remain relatively immature, but are improving rapidly• Relatively small learning curve – For image processing, NPP’s API nearly perfectly matches Intel’s IPP – New paradigms around copying to/from GPU and allocating memory – Can use programming languages familiar to developers without understanding intricacies of GPU architectures – Does require rethinking of algorithms to be parallel and building the computation around the data