SlideShare a Scribd company logo
1 of 31
Introduction to
Accelerators
CS4532 Concurrent Programming
Dilum Bandara
Dilum.Bandara@uom.lk
Some sides adapted from Prof. Sanath Jayasena, University of Moratuwa
Outline
 GPUs
 CUDA programming
 Xeon Phi
2
 GPU is typically a computer card, installed into a
PCI Express slot
 Market leaders – NVIDIA, Intel, AMD (ATI)
 NVIDIA GPUs at UoM
 Intel MIC (Many Integrated Core)
Graphics Processing Unit (GPU)
3
GeForce GTX 480 Tesla 2070
Example Specifications
GTX 480 Tesla 2070 Tesla K80
Peak double
precision FP
performance
650 Gigaflops 515 Gigaflops 2.91 Teraflops
Peak single
precision FP
performance
1.3 Teraflops 1.03 Teraflops 8.74 Teraflops
CUDA cores 480 448 4992
Frequency of CUDA
Cores
1.40 GHz 1.15 GHz 560/875 MHz
Memory size
(GDDR5)
1536 MB 6 GB 24 GB
Memory bandwidth 177.4 GB/sec 150 GB/sec 480 GB/sec
ECC Memory No Yes Yes 4
GPUs (Cont.)
 Originally designed to accelerate a large no of
computations performed in graphics rendering
 Offloaded numerically intensive computation from CPU
 GPUs grew with demand for high performance
graphics
 Eventually GPUs have become much powerful,
even more than CPUs for many computations
 Cost-power-performance advantage
5
GPU Basics
 Today’s GPUs
 High performance, many-core processors that can be
used to accelerate a wide range of applications
 GPUs lead the race for floating-point performance since
start of 21st century
 GPGPU
 GPUs are being used as parallel processors for
general-purpose computation
6
CPU vs. GPU Architecture
7
GPU devotes more transistors for computation
FLOPS & GFLOPS
 FLOPS = floating-point operations per second
 Example
8
CPU GPU
Number of cores 4 448
FLOPS per core 4 1
Clock speed (GHz) 2.5 1.15
Performance (GFLOPS) 40 515
CPU-GPU Performance Gap
9
Source - http://michaelgalloy.com
Simple Performance Test
 Matrix multiplication
 C = B*A
 CPU
 Intel Core i7
 cblas_dgemm(…) from BLAS library
 GPU
 Nvidia GTX 480
10
Results
Dimensions (A,B,C)
CPU
time (s)
GPU
time (s)
Speedup
[400,800] , [400, 400], [400 , 800] 0.17 0.00109 155
[800,1600] , [800, 800], [800 , 1600] 2.10 0.00846 258
[1200,2400] , [1200, 1200], [2400 , 2400] 6.65 0.02860 232
[1600,2400] , [1600, 1600], [2400 , 2400] 15.18 0.06739 225
[2000,4000] , [4000, 4000], [4000 , 4000] 29.44 0.13178 223
[2400,4800] , [2400, 4800], [4800 , 4800] 50.21 0.22703 221
11
Applications of GPGPU
 Computational Structural Mechanics
 Bio-Informatics and Life Sciences
 Computational Electromagnetics & Electrodynamics
 Computational Finance
 Computational Fluid Dynamics
 Data Mining, Analytics, & Databases
 Imaging & Computer Vision
 Medical Imaging
 Molecular Dynamics
 Numerical Analytics
 Weather, Atmospheric, Ocean Modeling & Space
Sciences 12
Programming GPUs
 CUDA language for Nvidia GPU products
 Compute Unified Device Architecture
 Based on C
 nvcc compiler
 Lots of tools for analysis, debug, profile, …
 OpenCL – Open Computing Language
 Based on C
 Supports GPU & CPU programming
 Support for Java, Python, Matlab, etc.
 Lots of active research
 e.g., automatic code generation for GPUs 13
Multithreaded SIMD Processor
14
Caution!
 GPU designed as a numeric computing engine
 Will not perform well on some tasks as CPUs
 Most applications will use both CPUs & GPUs
 For some computations, cost of transfering data
between CPU & GPU can be high
 SIMD-type data parallelism is key to benefit from
GPUs
 … and enough of it (out of total computation)
15
CUDA Architecture
 CUDA is NVIDA’s solution to access the GPU
 Can be seen as an extension to C/C++
16
CUDA Software Stack
CUDA Architecture (Cont.)
2 main parts
1. Host (CPU part)
• Single Program, Single Data
• Launches kernel on GPU
2. Device (GPU part)
• Single Program, Multiple
Data
• Runs kernel
Function executed on GPU
(device) is called a “kernel”
17
CUDA Architecture (Cont.)
GRID Architecture
18
Grid
• A group of threads all running
the same kernel
• Can run multiple grids at once
Block
• Grids composed of blocks
• Each block is a logical unit
containing a no of
coordinating threads &
some amount of shared
memory
#include <cuda.h>
#include <stdio.h>
__global__ void kernel (void)
{ }
int main (void)
{
kernel <<< 1, 1 >>> ();
printf("Hello World!n");
return 0;
}
Example Program
 “__global__” says
function is to be
compiled to run on a
“device” (GPU), not
“host” (CPU)
 Angle brackets “<<<“
& “>>>” for passing
params/args to
runtime
19
Thread Blocks
 Within host (CPU) code, call kernel by using
<<< & >>> specifying grid size (no of blocks)
& block size (no of threads)
20
Grids, Blocks & Threads
21
 Grid of size 6 (3x2
blocks)
 Each block has 12
threads (4x3)
Thread IDs
22
CUDA Device Memory Model
 Host, devices have separate memory spaces
 e.g., hardware cards with their own DRAM
 To execute a kernel on a device
 Need to allocate memory on device
 Transfer data
 Host memory  device memory
 After device execution
 Transfer results
 Device memory  host memory
 Free device memory no longer needed
23
CUDA Device Memory Model
24
CUDA API – Memory Mgt.
25
Memory Access
int main (void) {
int c, *dev_c;
cudaMalloc ((void **) &dev_c, sizeof (int));
add <<< 1, 1 >>> (2,7, dev_c);
cudaMemcpy(&c, dev_c, sizeof(int),
cudaMemcpyDeviceToHost);
printf(“2 + 7 = %dn“, c);
cudaFree(dev_c);
return 0;
}
26
Example
#include "stdio.h"
#define N 10
__global__ void add(int *a, int *b, int *c)
{
int tID = blockIdx.x;
if (tID < N)
{
c[tID] = a[tID] + b[tID];
}
}
int main()
{
int a[N], b[N], c[N];
int *dev_a, *dev_b, *dev_c;
cudaMalloc((void **) &dev_a, N*sizeof(int));
cudaMalloc((void **) &dev_b, N*sizeof(int));
cudaMalloc((void **) &dev_c, N*sizeof(int));
for (int i = 0; i < N; i++)
a[i] = i, b[i] = 1;
cudaMemcpy(dev_a, a, N*sizeof(int),
cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N*sizeof(int),
cudaMemcpyHostToDevice);
add<<<N,1>>>(dev_a, dev_b, dev_c);
cudaMemcpy(c, dev_c, N*sizeof(int),
cudaMemcpyDeviceToHost);
for (int i = 0; i < N; i++)
printf("%d + %d = %dn", a[i], b[i], c[i]);
return 0;
}
27
Intel Xeon Phi – Many CPUs
28
Source - www.pcgameshardware.de/Xeon-Phi-Hardware-
256199/News/Intel-Xeon-Phi-Hardware-Informationen-1040924/
Intel Xeon Phi Family
29
Intel Xeon Phi (Cont.)
30
Source - www.altera.com/technology/system-design/articles/2012/multicore-many-core.html
Intel Xeon Phi (Cont.)
 More simple cores, many threads, & wider
vector units
 Same programming model across host & device
 Linux on device
 Remote login
 Access to network file systems
 High compute density & energy efficiency
31

More Related Content

Similar to Introduction to Accelerators

Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Rob Gillen
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda enKohei KaiGai
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUsfcassier
 
Nvidia® cuda™ 5 sample evaluationresult_2
Nvidia® cuda™ 5 sample evaluationresult_2Nvidia® cuda™ 5 sample evaluationresult_2
Nvidia® cuda™ 5 sample evaluationresult_2Yukio Saito
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUsShree Kumar
 
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...mouhouioui
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsKohei KaiGai
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computationjtsagata
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxssuser413a98
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDARaymond Tay
 
Introduction to cuda geek camp singapore 2011
Introduction to cuda   geek camp singapore 2011Introduction to cuda   geek camp singapore 2011
Introduction to cuda geek camp singapore 2011Raymond Tay
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
GPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU PlatformsGPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU PlatformsAkihiro Hayashi
 
Kindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievKindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievVolodymyr Saviak
 
Parallel computing with Gpu
Parallel computing with GpuParallel computing with Gpu
Parallel computing with GpuRohit Khatana
 
A beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAA beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAPiyush Mittal
 
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)Kohei KaiGai
 

Similar to Introduction to Accelerators (20)

Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
Nvidia® cuda™ 5 sample evaluationresult_2
Nvidia® cuda™ 5 sample evaluationresult_2Nvidia® cuda™ 5 sample evaluationresult_2
Nvidia® cuda™ 5 sample evaluationresult_2
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
 
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
Cuda materials
Cuda materialsCuda materials
Cuda materials
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computation
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDA
 
Introduction to cuda geek camp singapore 2011
Introduction to cuda   geek camp singapore 2011Introduction to cuda   geek camp singapore 2011
Introduction to cuda geek camp singapore 2011
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
GPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU PlatformsGPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU Platforms
 
Kindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievKindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 Kiev
 
Parallel computing with Gpu
Parallel computing with GpuParallel computing with Gpu
Parallel computing with Gpu
 
A beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAA beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDA
 
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)
 

More from Dilum Bandara

Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningDilum Bandara
 
Time Series Analysis and Forecasting in Practice
Time Series Analysis and Forecasting in PracticeTime Series Analysis and Forecasting in Practice
Time Series Analysis and Forecasting in PracticeDilum Bandara
 
Introduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCAIntroduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCADilum Bandara
 
Introduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive AnalyticsIntroduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive AnalyticsDilum Bandara
 
Introduction to Concurrent Data Structures
Introduction to Concurrent Data StructuresIntroduction to Concurrent Data Structures
Introduction to Concurrent Data StructuresDilum Bandara
 
Hard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
Hard to Paralelize Problems: Matrix-Vector and Matrix-MatrixHard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
Hard to Paralelize Problems: Matrix-Vector and Matrix-MatrixDilum Bandara
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopDilum Bandara
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsDilum Bandara
 
Introduction to Warehouse-Scale Computers
Introduction to Warehouse-Scale ComputersIntroduction to Warehouse-Scale Computers
Introduction to Warehouse-Scale ComputersDilum Bandara
 
Introduction to Thread Level Parallelism
Introduction to Thread Level ParallelismIntroduction to Thread Level Parallelism
Introduction to Thread Level ParallelismDilum Bandara
 
CPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching TechniquesCPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching TechniquesDilum Bandara
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsDilum Bandara
 
Instruction Level Parallelism – Hardware Techniques
Instruction Level Parallelism – Hardware TechniquesInstruction Level Parallelism – Hardware Techniques
Instruction Level Parallelism – Hardware TechniquesDilum Bandara
 
Instruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler TechniquesInstruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler TechniquesDilum Bandara
 
CPU Pipelining and Hazards - An Introduction
CPU Pipelining and Hazards - An IntroductionCPU Pipelining and Hazards - An Introduction
CPU Pipelining and Hazards - An IntroductionDilum Bandara
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
High Performance Networking with Advanced TCP
High Performance Networking with Advanced TCPHigh Performance Networking with Advanced TCP
High Performance Networking with Advanced TCPDilum Bandara
 
Introduction to Content Delivery Networks
Introduction to Content Delivery NetworksIntroduction to Content Delivery Networks
Introduction to Content Delivery NetworksDilum Bandara
 
Peer-to-Peer Networking Systems and Streaming
Peer-to-Peer Networking Systems and StreamingPeer-to-Peer Networking Systems and Streaming
Peer-to-Peer Networking Systems and StreamingDilum Bandara
 

More from Dilum Bandara (20)

Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Time Series Analysis and Forecasting in Practice
Time Series Analysis and Forecasting in PracticeTime Series Analysis and Forecasting in Practice
Time Series Analysis and Forecasting in Practice
 
Introduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCAIntroduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCA
 
Introduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive AnalyticsIntroduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive Analytics
 
Introduction to Concurrent Data Structures
Introduction to Concurrent Data StructuresIntroduction to Concurrent Data Structures
Introduction to Concurrent Data Structures
 
Hard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
Hard to Paralelize Problems: Matrix-Vector and Matrix-MatrixHard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
Hard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with Hadoop
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel Problems
 
Introduction to Warehouse-Scale Computers
Introduction to Warehouse-Scale ComputersIntroduction to Warehouse-Scale Computers
Introduction to Warehouse-Scale Computers
 
Introduction to Thread Level Parallelism
Introduction to Thread Level ParallelismIntroduction to Thread Level Parallelism
Introduction to Thread Level Parallelism
 
CPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching TechniquesCPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching Techniques
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in Microprocessors
 
Instruction Level Parallelism – Hardware Techniques
Instruction Level Parallelism – Hardware TechniquesInstruction Level Parallelism – Hardware Techniques
Instruction Level Parallelism – Hardware Techniques
 
Instruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler TechniquesInstruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler Techniques
 
CPU Pipelining and Hazards - An Introduction
CPU Pipelining and Hazards - An IntroductionCPU Pipelining and Hazards - An Introduction
CPU Pipelining and Hazards - An Introduction
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
High Performance Networking with Advanced TCP
High Performance Networking with Advanced TCPHigh Performance Networking with Advanced TCP
High Performance Networking with Advanced TCP
 
Introduction to Content Delivery Networks
Introduction to Content Delivery NetworksIntroduction to Content Delivery Networks
Introduction to Content Delivery Networks
 
Peer-to-Peer Networking Systems and Streaming
Peer-to-Peer Networking Systems and StreamingPeer-to-Peer Networking Systems and Streaming
Peer-to-Peer Networking Systems and Streaming
 
Mobile Services
Mobile ServicesMobile Services
Mobile Services
 

Recently uploaded

办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 

Recently uploaded (20)

办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 

Introduction to Accelerators

  • 1. Introduction to Accelerators CS4532 Concurrent Programming Dilum Bandara Dilum.Bandara@uom.lk Some sides adapted from Prof. Sanath Jayasena, University of Moratuwa
  • 2. Outline  GPUs  CUDA programming  Xeon Phi 2
  • 3.  GPU is typically a computer card, installed into a PCI Express slot  Market leaders – NVIDIA, Intel, AMD (ATI)  NVIDIA GPUs at UoM  Intel MIC (Many Integrated Core) Graphics Processing Unit (GPU) 3 GeForce GTX 480 Tesla 2070
  • 4. Example Specifications GTX 480 Tesla 2070 Tesla K80 Peak double precision FP performance 650 Gigaflops 515 Gigaflops 2.91 Teraflops Peak single precision FP performance 1.3 Teraflops 1.03 Teraflops 8.74 Teraflops CUDA cores 480 448 4992 Frequency of CUDA Cores 1.40 GHz 1.15 GHz 560/875 MHz Memory size (GDDR5) 1536 MB 6 GB 24 GB Memory bandwidth 177.4 GB/sec 150 GB/sec 480 GB/sec ECC Memory No Yes Yes 4
  • 5. GPUs (Cont.)  Originally designed to accelerate a large no of computations performed in graphics rendering  Offloaded numerically intensive computation from CPU  GPUs grew with demand for high performance graphics  Eventually GPUs have become much powerful, even more than CPUs for many computations  Cost-power-performance advantage 5
  • 6. GPU Basics  Today’s GPUs  High performance, many-core processors that can be used to accelerate a wide range of applications  GPUs lead the race for floating-point performance since start of 21st century  GPGPU  GPUs are being used as parallel processors for general-purpose computation 6
  • 7. CPU vs. GPU Architecture 7 GPU devotes more transistors for computation
  • 8. FLOPS & GFLOPS  FLOPS = floating-point operations per second  Example 8 CPU GPU Number of cores 4 448 FLOPS per core 4 1 Clock speed (GHz) 2.5 1.15 Performance (GFLOPS) 40 515
  • 9. CPU-GPU Performance Gap 9 Source - http://michaelgalloy.com
  • 10. Simple Performance Test  Matrix multiplication  C = B*A  CPU  Intel Core i7  cblas_dgemm(…) from BLAS library  GPU  Nvidia GTX 480 10
  • 11. Results Dimensions (A,B,C) CPU time (s) GPU time (s) Speedup [400,800] , [400, 400], [400 , 800] 0.17 0.00109 155 [800,1600] , [800, 800], [800 , 1600] 2.10 0.00846 258 [1200,2400] , [1200, 1200], [2400 , 2400] 6.65 0.02860 232 [1600,2400] , [1600, 1600], [2400 , 2400] 15.18 0.06739 225 [2000,4000] , [4000, 4000], [4000 , 4000] 29.44 0.13178 223 [2400,4800] , [2400, 4800], [4800 , 4800] 50.21 0.22703 221 11
  • 12. Applications of GPGPU  Computational Structural Mechanics  Bio-Informatics and Life Sciences  Computational Electromagnetics & Electrodynamics  Computational Finance  Computational Fluid Dynamics  Data Mining, Analytics, & Databases  Imaging & Computer Vision  Medical Imaging  Molecular Dynamics  Numerical Analytics  Weather, Atmospheric, Ocean Modeling & Space Sciences 12
  • 13. Programming GPUs  CUDA language for Nvidia GPU products  Compute Unified Device Architecture  Based on C  nvcc compiler  Lots of tools for analysis, debug, profile, …  OpenCL – Open Computing Language  Based on C  Supports GPU & CPU programming  Support for Java, Python, Matlab, etc.  Lots of active research  e.g., automatic code generation for GPUs 13
  • 15. Caution!  GPU designed as a numeric computing engine  Will not perform well on some tasks as CPUs  Most applications will use both CPUs & GPUs  For some computations, cost of transfering data between CPU & GPU can be high  SIMD-type data parallelism is key to benefit from GPUs  … and enough of it (out of total computation) 15
  • 16. CUDA Architecture  CUDA is NVIDA’s solution to access the GPU  Can be seen as an extension to C/C++ 16 CUDA Software Stack
  • 17. CUDA Architecture (Cont.) 2 main parts 1. Host (CPU part) • Single Program, Single Data • Launches kernel on GPU 2. Device (GPU part) • Single Program, Multiple Data • Runs kernel Function executed on GPU (device) is called a “kernel” 17
  • 18. CUDA Architecture (Cont.) GRID Architecture 18 Grid • A group of threads all running the same kernel • Can run multiple grids at once Block • Grids composed of blocks • Each block is a logical unit containing a no of coordinating threads & some amount of shared memory
  • 19. #include <cuda.h> #include <stdio.h> __global__ void kernel (void) { } int main (void) { kernel <<< 1, 1 >>> (); printf("Hello World!n"); return 0; } Example Program  “__global__” says function is to be compiled to run on a “device” (GPU), not “host” (CPU)  Angle brackets “<<<“ & “>>>” for passing params/args to runtime 19
  • 20. Thread Blocks  Within host (CPU) code, call kernel by using <<< & >>> specifying grid size (no of blocks) & block size (no of threads) 20
  • 21. Grids, Blocks & Threads 21  Grid of size 6 (3x2 blocks)  Each block has 12 threads (4x3)
  • 23. CUDA Device Memory Model  Host, devices have separate memory spaces  e.g., hardware cards with their own DRAM  To execute a kernel on a device  Need to allocate memory on device  Transfer data  Host memory  device memory  After device execution  Transfer results  Device memory  host memory  Free device memory no longer needed 23
  • 24. CUDA Device Memory Model 24
  • 25. CUDA API – Memory Mgt. 25
  • 26. Memory Access int main (void) { int c, *dev_c; cudaMalloc ((void **) &dev_c, sizeof (int)); add <<< 1, 1 >>> (2,7, dev_c); cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost); printf(“2 + 7 = %dn“, c); cudaFree(dev_c); return 0; } 26
  • 27. Example #include "stdio.h" #define N 10 __global__ void add(int *a, int *b, int *c) { int tID = blockIdx.x; if (tID < N) { c[tID] = a[tID] + b[tID]; } } int main() { int a[N], b[N], c[N]; int *dev_a, *dev_b, *dev_c; cudaMalloc((void **) &dev_a, N*sizeof(int)); cudaMalloc((void **) &dev_b, N*sizeof(int)); cudaMalloc((void **) &dev_c, N*sizeof(int)); for (int i = 0; i < N; i++) a[i] = i, b[i] = 1; cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice); cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice); add<<<N,1>>>(dev_a, dev_b, dev_c); cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost); for (int i = 0; i < N; i++) printf("%d + %d = %dn", a[i], b[i], c[i]); return 0; } 27
  • 28. Intel Xeon Phi – Many CPUs 28 Source - www.pcgameshardware.de/Xeon-Phi-Hardware- 256199/News/Intel-Xeon-Phi-Hardware-Informationen-1040924/
  • 29. Intel Xeon Phi Family 29
  • 30. Intel Xeon Phi (Cont.) 30 Source - www.altera.com/technology/system-design/articles/2012/multicore-many-core.html
  • 31. Intel Xeon Phi (Cont.)  More simple cores, many threads, & wider vector units  Same programming model across host & device  Linux on device  Remote login  Access to network file systems  High compute density & energy efficiency 31