Sci-Prog seminar series
Talks on computing and programming related topics, ranging from basic to advanced levels.



                Talk: Using GPUs for parallel processing
                          A. Stephen McGough


        Website: http://conferences.ncl.ac.uk/sciprog/index.php
   Research community site: contact Matt Wade for access
            Alerts mailing list: sci-prog-seminars@ncl.ac.uk
                   (sign up at http://lists.ncl.ac.uk )

Organisers: Dr Liz Petrie, Dr Matt Wade, Dr Stephen McGough,
                 Dr Ben Allen and Gregg Iceton
Using GPUs for parallel processing

         A. Stephen McGough
Why?
• Moore’s law is dead?
   – “the number of transistors on integrated circuits doubles approximately every two years”
   – It was always an observation, not a physical law
   – Processors aren’t getting faster… they’re getting fatter

                                  Processor Speed and Energy

                                  • Power scales roughly as the cube of frequency: Power ∝ frequency³
                                  • Assume a 1 GHz core consumes 1 watt
                                  • A 4 GHz core then consumes ~64 watts (4³)
                                  • Four 1 GHz cores consume only ~4 watts

                             Computers are going many-core
What?
• The games industry is a multi-billion-dollar business
• Gamers want photo-realistic games
  – Computationally expensive
  – Requires complex physics calculations
• The latest generation of Graphical Processing Units
  are therefore many-core parallel processors
  – General-Purpose Graphical Processing Units – GPGPUs
Not just normal processors
• 1000s of cores
  – But each core is simpler than a normal processor
  – Multiple cores perform the same action at the same
    time – Single Instruction Multiple Data (SIMD)
• Conventional processor -> minimize latency
  – Of a single program
• GPU -> maximize throughput of all cores
• Potential for orders-of-magnitude speed-up
“If you were plowing a field, which would you
        rather use: two strong oxen or 1024 chickens?”

• Famous quote from Seymour Cray, arguing for
  small numbers of powerful processors
  – But the chickens are now winning
• Need a new way to think about programming
  – Need hugely parallel algorithms
     • Many existing algorithms won’t work (efficiently)
Some Issues with GPGPUs
• Cores are slower than a standard CPU core
   – But you have lots more of them
• No direct control over when your code runs on a core
   – The GPGPU decides where and when
      • Can’t communicate between cores
      • Order of execution is ‘random’
   – Synchronization happens by exiting the parallel GPU code
• SIMD only works (efficiently) if all cores are doing the
  same thing
   – NVIDIA GPUs group threads into Warps of 32 that execute together
      • Code divergence within a Warp forces the branches to run one after the other
• Cores can interfere with each other
   – Overwriting each other’s memory
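
To make the divergence point concrete, here is a minimal sketch (my illustration, not from the talk) of a kernel whose threads take different branches; within a Warp the two paths must run in turn:

__global__ void divergent_example(float *data, int N) {
  int id = blockDim.x * blockIdx.x + threadIdx.x;
  if (id >= N) { return; }
  // Neighbouring threads disagree on this test, so each Warp
  // executes the 'if' path and then the 'else' path serially
  if (id % 2 == 0) {
    data[id] = data[id] * 2.0f;
  } else {
    data[id] = data[id] + 1.0f;
  }
}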
How?
• Many approaches
  – OpenGL – for the mad guru
  – Compute Unified Device Architecture (CUDA)
  – OpenCL – an emerging standard
  – Dynamic Parallelism – for existing code loops
• Focus here on CUDA
  – Well developed and supported
  – Exploits the full power of the GPGPU
CUDA
• CUDA is a set of extensions to C/C++
   – (and Fortran)
• Code consists of sequential and parallel parts
   – Parallel parts are written as kernels
      • A kernel describes what one thread of the code will do

Program flow:
  Start -> Sequential code -> Transfer data to card -> Execute kernel
        -> Transfer data from card -> Sequential code -> Finish
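
The same flow as a host-code skeleton (a sketch only; the CUDA calls named in the comments appear in full in the examples later in the talk):

int main() {
  // Sequential code: create and fill data on the host
  // Transfer data to card: cudaMalloc(...) then cudaMemcpy(..., cudaMemcpyHostToDevice)
  // Execute kernel: my_kernel<<<blocks, threadsPerBlock>>>(...)
  // Transfer data from card: cudaMemcpy(..., cudaMemcpyDeviceToHost)
  // Sequential code: use the result, then free device and host memory
  return 0;
}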
Example: Vector Addition
• One dimensional data
• Add two vectors (A,B) together to produce C
• Need to define the kernel to run and the main
  code
• Each thread can compute a single value for C
Example: Vector Addition
• Pseudo code for the kernel:
  – Identify which element in the vector I’m computing – call it i
  – Compute C[i] = A[i] + B[i]


• How do we identify our index (i)?
Blocks and Threads
• In CUDA the whole data space is the Grid
   – Divided into a number of Blocks
      • Each divided into a number of Threads
• Blocks can be executed in any order
• Threads in a Block are executed together
• Blocks and Threads can be 1D, 2D or 3D (see the sketch below)
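
As an illustration of those shapes, grid and block dimensions are declared with CUDA’s dim3 type (a sketch; my_kernel is a placeholder name), and unspecified dimensions default to 1:

dim3 grid1D(8);        // 8 blocks along x
dim3 grid2D(8, 4);     // an 8 x 4 grid of blocks
dim3 block2D(16, 16);  // 16 x 16 threads per block
my_kernel<<<grid2D, block2D>>>(/* arguments */);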
Blocks
• As Blocks are executed in arbitrary order, CUDA has
  the opportunity to scale the same program to the
  number of cores in a particular device
Thread id
• CUDA provides three pieces of data for
  identifying a thread
  – blockIdx – the identity of the thread’s block
  – blockDim – the size of a block (number of threads in it)
  – threadIdx – the identity of a thread within its block
• Can use these to compute the absolute thread id
        id = blockIdx * blockDim + threadIdx
• E.g.: blockIdx = 2, blockDim = 3, threadIdx = 1
• id = 2 * 3 + 1 = 7

                  Block0   Block1   Block2
  Thread index:   0 1 2    0 1 2    0 1 2
  Absolute id:    0 1 2    3 4 5    6 7 8
Example: Vector Addition
                         Kernel code

__global__ void vector_add(double *A, double *B,
                           double *C, int N) {
  // Find my absolute thread id - from block and thread
  int id = blockDim.x * blockIdx.x + threadIdx.x;
  // I might be invalid - the data size may not divide
  // exactly into blocks
  if (id >= N) { return; }
  C[id] = A[id] + B[id]; // do my work
}

__global__ marks the entry point for a kernel; the rest is a
normal function definition.
Example: Vector Addition
         Pseudo code for sequential code
• Create Data on Host Computer

• Create space on device

• Copy data to device
• Run Kernel
• Copy data back to host and do something with it
• Clean up
Host and Device
• Data needs copying to / from the GPU (device)
• Often end up with the same data on both
  – Suffix variable names with _device or _host
     • To help identify where the data lives

        A_host (on the Host)  <-- copy -->  A_device (on the Device)
Example: Vector Addition
int N = 2000;
double *A_host = new double[N]; // Create data on host computer
double *B_host = new double[N]; double *C_host = new double[N];
for(int i=0; i<N; i++) {    A_host[i] = i; B_host[i] = (double)i/N; }
double *A_device, *B_device, *C_device; // allocate space on device GPGPU
cudaMalloc((void**) &A_device, N*sizeof(double));
cudaMalloc((void**) &B_device, N*sizeof(double));
cudaMalloc((void**) &C_device, N*sizeof(double));
// Copy data from host memory to device memory
cudaMemcpy(A_device, A_host, N*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(B_device, B_host, N*sizeof(double), cudaMemcpyHostToDevice);
// How many blocks will we need? Choose a block size of 256 and round up
int blocks = (N + 255) / 256;
vector_add<<<blocks, 256>>>(A_device, B_device, C_device, N); // run kernel
// Copy data back
cudaMemcpy(C_host, C_device, N*sizeof(double), cudaMemcpyDeviceToHost);
// do something with result

// free device memory
cudaFree(A_device); cudaFree(B_device); cudaFree(C_device);
delete[] A_host; delete[] B_host; delete[] C_host; // free host memory (new[] pairs with delete[])
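
The code above omits error handling for brevity; as a minimal sketch (standard CUDA API calls, not shown in the talk), the kernel launch can be checked like this:

// After the kernel launch: check it started, then wait for it to finish
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
  printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
}
err = cudaDeviceSynchronize(); // also reports errors raised while the kernel runs
if (err != cudaSuccess) {
  printf("Kernel execution failed: %s\n", cudaGetErrorString(err));
}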
More Complex: Matrix Addition
• Now a 2D problem
  – blockIdx, blockDim and threadIdx now have x and y components
• But the general principles hold
  – For the kernel
     • Compute my location in a matrix of two dimensions
  – For the main code
     • Define and transmit the data
• But keep the data 1D
  – Why?
Why data in 1D?
• If you define data as 2D there is no guarantee
  that it will be a contiguous block of memory
  – So it can’t be transmitted to the card in one command

(Diagram: the rows of a 2D array scattered through memory,
with some other data between them)
Faking 2D data
• 2D data of size N*M
• Define a 1D array of size N*M
• Index element [x,y] as
                    y*N + x
  – Rows of length N are stored one after another, matching the
    kernel’s id = idY * N + idX below
• Then the whole array can be transferred to the device in one go

          Row 1 | Row 2 | Row 3 | Row 4   (contiguous in memory)
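
To keep the flattened indexing readable on the host, a small helper can be used (a sketch; the IDX macro is my own naming, chosen to match the kernel’s id = idY * N + idX):

// Element [x,y] of an N-wide, M-tall matrix stored as one 1D array
#define IDX(x, y, N) ((y) * (N) + (x))

double *A = new double[N * M];
for (int y = 0; y < M; y++)
  for (int x = 0; x < N; x++)
    A[IDX(x, y, N)] = 0.0; // rows are contiguous: one cudaMemcpy moves all of A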
Example: Matrix Add
                              Kernel

__global__ void matrix_add(double *A, double *B, double *C, int N, int M)
{
  // Find my thread id in both dimensions - block and thread
  int idX = blockDim.x * blockIdx.x + threadIdx.x;
  int idY = blockDim.y * blockIdx.y + threadIdx.y;
  if (idX >= N || idY >= M) { return; } // I'm not a valid ID
  int id = idY * N + idX;  // compute my 1D location
  C[id] = A[id] + B[id];   // do my work
}
Example: Matrix Addition
                              Main Code

// Define matrices on the host
int N = 20;
int M = 10;
double *A_host = new double[N * M];
double *B_host = new double[N * M];
double *C_host = new double[N * M];
for (int i = 0; i < N; i++) {
  for (int j = 0; j < M; j++) {
    A_host[i + j * N] = i; B_host[i + j * N] = (double)j/M;
  }
}

// Allocate space on the device GPGPU
double *A_device, *B_device, *C_device;
cudaMalloc((void**) &A_device, N*M*sizeof(double));
cudaMalloc((void**) &B_device, N*M*sizeof(double));
cudaMalloc((void**) &C_device, N*M*sizeof(double));

// Copy data from host memory to device memory
cudaMemcpy(A_device, A_host, N*M*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(B_device, B_host, N*M*sizeof(double), cudaMemcpyHostToDevice);

// How many blocks will we need? Choose a block size of 16x16 and round up
int blocksX = (N + 15) / 16;
int blocksY = (M + 15) / 16;
dim3 dimGrid(blocksX, blocksY);
dim3 dimBlocks(16, 16);
matrix_add<<<dimGrid, dimBlocks>>>(A_device, B_device, C_device, N, M); // run kernel

// Copy data back from device to host
cudaMemcpy(C_host, C_device, N*M*sizeof(double), cudaMemcpyDeviceToHost);
// do something with the result, e.g.:
// for (int i = 0; i < N*M; i++) printf("C[%d,%d] = %f\n", i%N, i/N, C_host[i]);

// Tidy up: free device and host memory
cudaFree(A_device); cudaFree(B_device); cudaFree(C_device);
delete[] A_host; delete[] B_host; delete[] C_host;
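
As a quick sanity check on the host (my addition, not on the slide, run before the memory is freed), every element of C should equal the matching elements of A and B added together:

// Verify the result against how A_host and B_host were filled
for (int j = 0; j < M; j++) {
  for (int i = 0; i < N; i++) {
    double expected = i + (double)j / M;
    if (C_host[i + j * N] != expected)
      printf("Mismatch at [%d,%d]\n", i, j);
  }
}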
Running Example
• Computer: condor-gpu01
  – Set the path
     • set path = ( $path /usr/local/cuda/bin/ )
• Compile with the nvcc command, then just run the binary file
  (example below)

• Tesla C2050: 448 cores, 3 GB RAM
  – Single precision: 1.03 Tflops
  – Double precision: 515 Gflops
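
For example, assuming the source file is named vector_add.cu (the name is my own; flags may vary by installation):

    nvcc -o vector_add vector_add.cu
    ./vector_add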
Summary and Questions
• GPGPUs have great potential for parallelism
• But at a cost
   – Not ‘normal’ parallel computing
   – Need to think about problems in a new way
• Further reading
   – NVIDIA CUDA Zone: https://developer.nvidia.com/category/zone/cuda-zone
   – Online course: https://www.coursera.org/course/hetero
