The Rise of Parallel Computing


Presentation I gave at the SORT Conference in 2011. It was generalized from work I had done using GPUs to accelerate image processing at FamilySearch.

Published in: Technology

Slide notes
  • Don’t claim to be expert
  • Source of much of what I will present – gives a lot more details, coming from people who know a lot more than I do
  • Even CPU designers recognize that performance is about parallelism, hence multi-core CPUs
  • Power required increases exponentially with distance; Bill Dally says that lots of arithmetic units are actually not hot
  • GPUs initially only for computer graphics acceleration
  • Of course want something that is open
  • Number of images increasing as is size, more color, etc.
  • Data center servers for large-scale places like FamilySearch; workstations could be put in smaller installations such as an archive
  • Based on a limited survey (most sites don’t list prices):
    – ~$5-6K list price for 1U server or personal supercomputer w/2 Teslas
    – ~$8-9K list price for 1U server or personal supercomputer w/4 Teslas
    – ~$1200 per Tesla
  • NVIDIA is going directly at IPP
  • Imaging library structured so that we could create an implementation for GPUs to run on a single GPU-based server concurrently with the current system
  • Rotating, cropping, sharpening and scaling operations parallelized on GPU
  • Transcript

    • 1. The Rise of Parallel Computing
        Ben Baker
    • 2. Moore’s Law
        "The number of transistors incorporated in a chip will approximately double every 24 months."
        Gordon Moore, Intel Co-Founder
        Originally published in 1965
    • 3. So What’s the Problem?
        • Can continue to increase transistors per Moore’s Law
        • Cannot continue to increase power or chips will melt
          – Power steadily rose with new chips until ~2005 – now 1 volt
        • Cannot continue to scale processor frequency
          – Have you seen any 10 GHz chips?
        Moore’s Law gave no prediction of continued performance increases
    • 4. Time to “Take the Leap”
        “We have reached the limit of what is possible with one or more traditional, serial central processing units, or CPUs. It is past time for the computing industry – and everyone who relies on it for continued improvements in productivity, economic growth and social progress – to take the leap into parallel processing.”
        Bill Dally - Chief Scientist at NVIDIA and Professor at Stanford University
    • 5. Additional Resources
        • Stanford course available on iTunes U
          – Programming Massively Parallel Processors with CUDA
          – Lectures 1 and 13 are great introductions
            • Lecture 13 – The Future of Throughput Computing (Bill Dally)
            • Lecture 1 – Introduction to Massively Parallel Computing
    • 6. Guiding Principles
        • Performance = Parallelism
          – Single-threaded processor performance has flat-lined at 0-5% annual growth since ~2005
        • Efficiency = Locality
          – Chips are power limited with most power spent moving data around
    • 7. Three Types of Parallelism
        • Instruction-level parallelism
          – Out of order execution, branch prediction, etc.
          – Opportunities decreasing
        • Data-level parallelism
          – SIMD (Single Instruction Multiple Data), GPUs, etc.
          – Opportunities increasing
        • Thread-level parallelism
          – Multithreading, multi-core CPUs, etc.
          – Opportunities increasing
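As a rough illustration of thread-level parallelism applied to independent data, the sketch below splits a sum across worker threads using the C++ standard library. The function name `parallel_sum` and the one-slice-per-thread split are assumptions made for this example; they are not taken from the presentation.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sum a vector by giving each worker thread one
// contiguous slice, then combine the per-thread partial sums.
double parallel_sum(const std::vector<double>& data, std::size_t num_threads) {
    if (num_threads == 0) num_threads = 1;  // always use at least one worker
    // Slice length, rounded up so the slices cover every element.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(t * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        // Each worker writes only its own slot in `partial`, so no locking is needed.
        workers.emplace_back([&data, &partial, t, begin, end] {
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            partial[t] = std::accumulate(first, last, 0.0);
        });
    }
    for (std::thread& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Calling `parallel_sum(values, 4)` would use four workers. Each thread touches only its own slice and its own output slot, which is the same "build the computation around the data" idea that GPUs push much further.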
    • 8. Taking the Leap
        • Three things are required
          – Lots of processors
          – Efficient memory storage
          – Programming system that abstracts it
    • 9. CPU VS. GPU ARCHITECTURE
        CPU
        • General purpose processors
        • Optimized for instruction level parallelism
        • A few large processors capable of multi-threading
        GPU
        • Special purpose processors
        • Optimized for data level parallelism
        • Many smaller processors executing single instructions on multiple data (SIMD)
    • 10. High Performance GPU Computing
        • GPUs are getting faster more quickly than CPUs
        • Being used in industry for weather simulation, medical imaging, computational finance, etc.
        • Amazon is now offering access to NVIDIA Tesla GPUs in the cloud as a service ($ vs ¢ per hour)
        • GPUs are being used as general purpose parallel processors
    • 11. Examples
        • CUDA – NVIDIA
        • C++ AMP – Microsoft
        • OpenCL – Open standard
        • NPP – NVIDIA (Research done at FamilySearch)
    • 12. CUDA
        • Compute Unified Device Architecture
        • Proprietary NVIDIA extensions to C for running code on NVIDIA GPUs
        • Other language bindings
          – Java – jCUDA, JCuda, JCublas, JCufft
          – Python – PyCUDA, KappaCUDA
          – .NET – CUDAfy.NET, CUDA.NET
          – Ruby – KappaCUDA
          – More – Fortran, Perl, Mathematica, MATLAB, etc.
    • 13. C for CUDA Example
        // Compute vector sum c = a + b
        // Each thread performs one pair-wise addition
        __global__ void vector_add(float* A, float* B, float* C)
        {
            int i = threadIdx.x + blockDim.x * blockIdx.x;
            C[i] = A[i] + B[i];
        }
        int main()
        {
            // Allocate and initialize host (CPU) memory
            float* hostA = …, *hostB = …;
            // Allocate device (GPU) memory
            float *deviceA, *deviceB, *deviceC;
            cudaMalloc((void**) &deviceA, N * sizeof(float));
            cudaMalloc((void**) &deviceB, N * sizeof(float));
            cudaMalloc((void**) &deviceC, N * sizeof(float));
            // Copy host memory to device
            cudaMemcpy(deviceA, hostA, N * sizeof(float), cudaMemcpyHostToDevice);
            cudaMemcpy(deviceB, hostB, N * sizeof(float), cudaMemcpyHostToDevice);
            // Run N/256 blocks of 256 threads each (assumes N is a multiple of 256)
            vector_add<<< N/256, 256>>>(deviceA, deviceB, deviceC);
        }
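To make the index arithmetic in the kernel concrete, here is a CPU-only sketch in plain C++ that walks the same block/thread grid sequentially. It is an emulation for illustration, not CUDA code: the function name `emulated_vector_add`, the rounded-up grid size, and the `i < n` guard are assumptions added here to show what happens when the element count is not a multiple of the block size.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU emulation of a 1-D launch: every (block, thread) pair
// computes i = thread + block_dim * block, mirroring
// threadIdx.x + blockDim.x * blockIdx.x in the kernel.
std::vector<float> emulated_vector_add(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       std::size_t block_dim) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    std::vector<float> c(n);
    if (block_dim == 0) return c;  // nothing can run with empty blocks

    // Round the block count up so a partial last block covers the tail.
    const std::size_t grid_dim = (n + block_dim - 1) / block_dim;
    for (std::size_t block = 0; block < grid_dim; ++block) {
        for (std::size_t thread = 0; thread < block_dim; ++thread) {
            const std::size_t i = thread + block_dim * block;
            // Guard: threads in the last block may fall past the end of the data.
            if (i < n) c[i] = a[i] + b[i];
        }
    }
    return c;
}
```

On a GPU the two loops disappear, because the hardware runs the loop body once per thread. The guard matters whenever the grid is rounded up rather than sized as an exact N/256.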
    • 14. Heterogeneous Computing with Microsoft C++ AMP
        • AMP = Accelerated Massive Parallelism
        • Designed to take advantage of all the available compute resources (CPU, integrated & discrete GPUs)
        • Coming in the next version of Visual Studio and C++ in the next year or two
        • Cool demo
    • 15. EXAMPLE – C++ AMP
        // Serial version
        void MatrixMult(float* C, const vector<float>& A, const vector<float>& B, int M, int N, int W)
        {
            for (int y = 0; y < M; y++) {
                for (int x = 0; x < N; x++) {
                    float sum = 0;
                    for (int i = 0; i < W; i++)
                        sum += A[y*W + i] * B[i*N + x];
                    C[y*N + x] = sum;
                }
            }
        }
        // C++ AMP version
        void MatrixMult(float* C, const vector<float>& A, const vector<float>& B, int M, int N, int W)
        {
            array_view<const float, 2> a(M, W, A), b(W, N, B);
            array_view<writeonly<float>, 2> c(M, N, C);
            parallel_for_each(c.grid, [=](index<2> idx) restrict(direct3d) {
                float sum = 0;
                for (int i = 0; i < a.x; i++)
                    sum += a(idx.y, i) * b(i, idx.x);
                c[idx] = sum;
            });
        }
    • 16. OpenCL
        • Royalty free, cross-platform, vendor neutral
        • Managed by Khronos OpenCL working group
        • Design goal to use all computational resources
          – GPUs and CPUs are peers
        • Based on C
        • Abstract the specifics of underlying hardware
    • 17. Example – OpenCL
        // Traditional loop
        void trad_mul(int n, const float *a, const float* b, float* c)
        {
            for (int i = 0; i < n; i++)
                c[i] = a[i] * b[i];
        }
        // Data-parallel kernel; execute over "n" work-items
        kernel void dp_mul(global const float *a, global const float* b, global float* c)
        {
            int id = get_global_id(0);
            c[id] = a[id] * b[id];
        }
    • 18. Image Processing Flow at FamilySearch
        [Flow diagram; stages recovered from the slide]
        • Image Capture (Uncompressed TIFF)
          – Microfilm Scanners
          – Digital Cameras
        • Image Post-Processing (DPC)
        • Preservation Storage (Lossless JPEG-2000)
        • Distribution Storage
          – JPEG - original size
          – JPEG - thumbnails
    • 19. Digital Processing Center (DPC)
        • Collection of servers in a data center used by FamilySearch to continuously process millions of images annually
        • Image post processing operations performed include
          – Automatic skew correction
          – Automatic document cropping
          – Image sharpening
          – Image scaling (thumbnail creation)
          – Encoding into other image formats
        • CPU is a current bottleneck (~12 sec/image)
        • Processing requirements continuously rising (number of images, image size and number of color channels)
    • 20. Computer Graphics vs. Computer Vision
        • Approximate inverses of each other:
          – Computer graphics – converting “numbers into pictures”
          – Computer vision – converting “pictures into numbers”
        • GPUs have traditionally been used for computer graphics
          – (Ex. Graphics intensive computer games)
        • Recent research, hardware and software are using GPUs for computer vision (Ex. Using Graphics Devices in Reverse)
        • GPUs generally work well when there is ample data-level parallelism
    • 21. IMPLEMENTATION OPTIONS
        Rack Mount Servers
        • Several vendors provide solutions. (Ex. One is a 3U rack mount unit capable of holding 16 GPUs connected to 8 servers)
        • “Compared to typical quad-core CPUs, Tesla 20 series computing systems deliver equivalent performance at 1/10th the cost and 1/20th the power consumption.” (NVIDIA)
        Personal Supercomputer
        • GPUs for computing can be placed in a standard workstation. Several vendors provide solutions.
        • Each Tesla GPU requires
          – Available double-wide PCIe slot
          – Two 6-pin or one 8-pin PCIe power connectors and sufficient wattage
          – Recommend 4GB RAM per card, at least 2.33 GHz quad-core CPU and 64-bit Linux or Windows
        • “250x the computing performance of a standard workstation” (NVIDIA)
    • 22. Image Processing Performance with IPP and NPP
        • FamilySearch currently uses Intel’s IPP
          – Intel Integrated Performance Primitives
          – Optimizes operations on Intel CPUs
          – Closed source, licensed
        • NVIDIA has produced a similar library called NPP
          – NVIDIA Performance Primitives
          – Optimizes operations on NVIDIA GPUs (CUDA underneath)
          – Higher level abstraction to perform image processing on GPUs
          – No license for SDK
    • 23. EXAMPLE – NPP
        [Side-by-side listing on the slide; "…" marks text missing from the transcript]
        // IPP version (CPU)
        … [Create padded image]
        … [Create Gaussian kernel]
        // Allocate blurred image of appropriate size
        Ipp8u* blurredImg = ippiMalloc_8u_C1(img.getWidth(), img.getHeight(), &blurredImgStepSz);
        // Perform the filter
        ippiFilter32f_8u_C1R(paddedImgData, paddedImage.getStepSize(), blurredImg,
                             blurredImgStepSz, imgSz, kernel, kernelSize, kernelAnchor);
        // NPP version (GPU)
        // Declare a host object for an 8-bit grayscale image
        npp::ImageCPU_8u_C1 hostSrc;
        // Load grayscale image from disk
        npp::loadImage(sFilename, hostSrc);
        // Declare a device image and upload from host
        npp::ImageNPP_8u_C1 deviceSrc(hostSrc);
        … [Create padded image]
        … [Create Gaussian kernel]
        // Copy kernel to GPU
        cudaMemcpy2D(deviceKernel, 12, hostKernel, kernelSize.width * sizeof(Npp32s),
                     kernelSize.width * sizeof(Npp32s), kernelSize.height, cudaMemcpyHostToDevice);
        // Allocate blurred image of appropriate size (on GPU)
        npp::ImageNPP_8u_C1 deviceBlurredImg(imgSz.width, imgSz.height);
        // Perform the filter
        nppiFilter_8u_C1R(… heightOffset), paddedImg.pitch(),
                          …, deviceBlurredImg.pitch(),
                          imgSz, deviceKernel, kernelSize, kernelAnchor, divisor);
        // Declare a host image for the result
        npp::ImageCPU_8u_C1 hostBlurredImg(deviceBlurredImg.size());
        // Copy the device result data into it
        deviceBlurredImg.copyTo(…, hostBlurredImg.pitch());
    • 24. Performance Testing Methodology
        • Test System Specifications
          – Dual Quad Core Intel® Xeon® 2.80GHz i7 CPUs (8 cores total)
          – 6 GB RAM
          – 64-bit Windows 7 operating system
          – Single Tesla C1060 Compute Processor (240 processing cores total)
          – PCI-Express x16 Gen2 slot
        • Three representative grayscale images of increasing size
          – Small image – 1726 x 1450 (2.5 megapixels)
          – Average image – 4808 x 3940 (18.9 megapixels)
          – Large image – 8966 x 6132 (55.0 megapixels)
        • Results for each image repeated 3 times and averaged
        • Transfer time to/from the GPU is considered part of all GPU operations
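The methodology above (three runs averaged, with transfer time counted) can be sketched as a small timing harness in standard C++. The names `megapixels` and `average_seconds` are hypothetical helpers written for this example; the presentation does not show the harness that was actually used.

```cpp
#include <chrono>

// Megapixel count for a width x height image, as quoted for the test images.
double megapixels(int width, int height) {
    return static_cast<double>(width) * static_cast<double>(height) / 1.0e6;
}

// Run `operation` `runs` times and return the mean wall-clock seconds per run.
// For a GPU operation, `operation` is assumed to wrap the upload, the
// processing and the download, so transfer time is part of the measurement.
template <typename Operation>
double average_seconds(Operation operation, int runs = 3) {
    if (runs <= 0) return 0.0;
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        const auto start = clock::now();
        operation();
        const auto stop = clock::now();
        total += std::chrono::duration<double>(stop - start).count();
    }
    return total / runs;
}
```

A call such as `average_seconds([&] { process(image); })` would time a hypothetical `process` function over the default three runs; `megapixels(1726, 1450)` gives about 2.5, matching the small test image.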
    • 25. [Chart: performance results]
        • Combining operations minimizes GPU/CPU transfers
        • 5–6x speedup, increasing slightly with image size
    • 26. AMDAHL’S LAW
        Speeding up 25% of an overall process by 10x is less of an overall improvement than speeding up 75% of an overall process by 1.5x
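The comparison on this slide can be checked with the standard form of Amdahl's Law, overall speedup = 1 / ((1 - p) + p / s), where p is the fraction of the process being accelerated and s is the speedup of that fraction. The function name `amdahl_speedup` below is an illustrative choice, not something from the presentation.

```cpp
// Overall speedup when a fraction `p` of the runtime (0 <= p <= 1) is made
// `s` times faster (s > 0) and the remaining (1 - p) is left unchanged.
double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

// amdahl_speedup(0.25, 10.0) is about 1.29 (1 / 0.775)
// amdahl_speedup(0.75, 1.5)  is about 1.33 (1 / 0.75)
```

The modest 1.5x gain on the larger portion wins, which is why it matters how much of an image pipeline actually moves to the GPU, not only how fast the accelerated part becomes.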
    • 27. Takeaways
        • Significant performance increases can be realized through parallelization – may become only way in the future
        • GPUs are transforming into general purpose data-parallel computational coprocessors and outstripping advances in multi-core CPUs
        • Languages, tools and APIs for parallel computing remain relatively immature, but are improving rapidly
        • Relatively small learning curve
          – For image processing, NPP’s API nearly perfectly matches Intel’s IPP
          – New paradigms around copying to/from GPU and allocating memory
          – Can use programming languages familiar to developers without understanding intricacies of GPU architectures
          – Does require rethinking of algorithms to be parallel and building the computation around the data