Programming for GPUs
Alcides Fonseca
me@alcidesfonseca.com
Universidade de Coimbra, Portugal
It turns out we had a Ferrari parked inside our computer, right next to a Citroën 2CV
About me
• Web Developer (Django, Ruby, PHP, …)
• Eccentric Programmer (Haskell, Scala)
• Researcher (GPGPU Programming)
• Lecturer (Distributed Systems, Operating Systems and Compilers)
This presentation
• 20 minutes - Blah blah blah
• 20 minutes - printf("Code\n");
• 20 minutes - Q&A
Moore's Law
Go multicore!
Parallelism
        Workstation 2010       Server #1 2011             Server #2 2013
CPU     Dual Core @ 2.66 GHz   2x6x2 Threads @ 2.80 GHz   2x8x2 Threads @ 2.00 GHz
RAM     4 GB                   24 GB                      32 GB
GPGPU
[Diagram: Memory, CPU, GPU]
GPGPU
• Started with hacker scientists
  • Robot vision
  • Cracking UNIX passwords
  • Neural networks
• Nowadays:
  • DNA sequencing
  • Earthquake prediction
  • Generation of chemical compounds
  • Financial forecasting and analysis
  • Cracking WiFi passwords
  • Bitcoin mining
Parallelism
             Workstation 2010         Server #1 2011             Server #2 2013
CPU          Dual Core @ 2.66 GHz     2x6x2 Threads @ 2.80 GHz   2x8x2 Threads @ 2.00 GHz
RAM          4 GB                     24 GB                      32 GB
GPU          NVIDIA GeForce GTX 285   NVIDIA Quadro 4000         AMD FirePro V4900
GPU #Cores   240 (1508 MHz)           256 (950 MHz)              480 (800 MHz)
GPU memory   1 GB                     2 GB                       1 GB
Back of the napkin
                        Workstation 2010     Server #1 2011             Server #2 2013
CPU                     2 Cores @ 2.66 GHz   2x6x2 Threads @ 2.80 GHz   2x8x2 Threads @ 2.00 GHz
CPU Cores x Frequency   5.32 GHz             < 67.2 GHz                 < 64 GHz
GPU #Cores              240 (1508 MHz)       256 (950 MHz)              480 (800 MHz)
GPU Cores x Frequency   361.92 GHz           243.2 GHz                  384 GHz
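The napkin arithmetic above is just "number of units times clock frequency". A minimal sketch in plain C, assuming hypothetical helper names (`aggregate_ghz`, `napkin_ratio`) that are not part of the talk; it reproduces the figures in the table and says nothing about real-world performance:

```c
#include <assert.h>

/* Aggregate "units x frequency" figure, in GHz.
 * units:        number of cores (or hardware threads)
 * mhz_per_unit: clock frequency of each unit, in MHz */
double aggregate_ghz(int units, double mhz_per_unit) {
    assert(units > 0 && mhz_per_unit > 0.0);
    return units * mhz_per_unit / 1000.0;
}

/* How many times larger the GPU napkin figure is than the CPU one. */
double napkin_ratio(int gpu_cores, double gpu_mhz,
                    int cpu_units, double cpu_mhz) {
    return aggregate_ghz(gpu_cores, gpu_mhz) /
           aggregate_ghz(cpu_units, cpu_mhz);
}
```

For the 2010 workstation, `aggregate_ghz(2, 2660.0)` gives the CPU figure and `aggregate_ghz(240, 1508.0)` the GPU figure; `napkin_ratio(240, 1508.0, 2, 2660.0)` comes out at roughly 68. The server CPU rows are upper bounds ("<") because they count hardware threads rather than physical cores.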
Benchmarks
But if GPUs are that powerful, why do we still use CPUs?
Problem #1 - Limited memory
             Workstation 2010   Server #1 2011   Server #2 2013
RAM          4 GB               24 GB            32 GB
GPU memory   1 GB               2 GB             1 GB
Problem #2 - Different memories
Very slow
Problem #3 - Branching is a bad idea
ATI Stream Computing:
"... in turn, contain numerous processing elements, which are the fundamental, programmable computational units that perform integer, single-precision floating-point, double-precision floating-point, and transcendental operations. All stream cores within a compute unit execute the same instruction sequence; different compute units can execute different instructions."
Figure 1.2: Simplified Block Diagram of the GPU Compute Device (footnote: much of this is transparent to the programmer).
[Diagram labels: Ultra-Threaded Dispatch Processor; four Compute Units; Stream Core with Processing Elements, T-Processing Element, Branch Execution Unit, Instruction and Control Flow, and General-Purpose Registers]
if (threadIdx.x % 2 == 0) {
    // do something
} else {
    // do other thing
}
Thread Divergence
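Why the branch above hurts can be sketched on the host. The following is a plain C simulation, not GPU code: it assumes a hypothetical group of 8 lanes that must all follow the same instruction sequence, so when lanes disagree on the condition the group runs both sides, one after the other, with the non-participating lanes masked off. The names `GROUP_WIDTH` and `passes_for_group` are illustrative only.

```c
#include <assert.h>

#define GROUP_WIDTH 8 /* hypothetical lane count; real GPUs use wider groups */

/* Simulates one SIMD group running
 *     if (tid % 2 == 0) out = tid * 2; else out = tid + 100;
 * Lane i gets thread id first_tid + i * stride.
 * Returns the number of passes the group needs: 1 when every lane takes
 * the same side of the branch, 2 when the lanes diverge. */
int passes_for_group(int first_tid, int stride, int out[GROUP_WIDTH]) {
    int tid[GROUP_WIDTH];
    int any_even = 0, any_odd = 0, passes = 0;

    assert(first_tid >= 0 && stride >= 0);
    for (int lane = 0; lane < GROUP_WIDTH; lane++) {
        tid[lane] = first_tid + lane * stride;
        if (tid[lane] % 2 == 0) any_even = 1; else any_odd = 1;
    }
    if (any_even) { /* pass over the "then" side; odd lanes are masked off */
        for (int lane = 0; lane < GROUP_WIDTH; lane++)
            if (tid[lane] % 2 == 0) out[lane] = tid[lane] * 2;
        passes++;
    }
    if (any_odd) {  /* pass over the "else" side; even lanes are masked off */
        for (int lane = 0; lane < GROUP_WIDTH; lane++)
            if (tid[lane] % 2 != 0) out[lane] = tid[lane] + 100;
        passes++;
    }
    return passes;
}
```

With consecutive thread ids (`stride` of 1) the lanes alternate between the two sides and the group needs two passes; with ids that all share the same parity (`stride` of 2) the condition is uniform inside the group and one pass is enough.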
Summary
CPU              GPU
MIMD             SIMD
task parallel    data parallel
low throughput   high throughput
low latency      high latency
Problem #4 - It's hard
#ifndef GROUP_SIZE
#define GROUP_SIZE (64)
#endif
#ifndef OPERATIONS
#define OPERATIONS (1)
#endif
////////////////////////////////////////////////////////////////////////////////////////////////////
// Load/store an int2 from/to global memory.
#define LOAD_GLOBAL_I2(s, i) \
    vload2((size_t)(i), (__global const int*)(s))
#define STORE_GLOBAL_I2(s, i, v) \
    vstore2((v), (size_t)(i), (__global int*)(s))
////////////////////////////////////////////////////////////////////////////////////////////////////
// Load/store in local memory; the two int2 components are kept GROUP_SIZE apart.
#define LOAD_LOCAL_I1(s, i) \
    ((__local const int*)(s))[(size_t)(i)]
#define STORE_LOCAL_I1(s, i, v) \
    ((__local int*)(s))[(size_t)(i)] = (v)
#define LOAD_LOCAL_I2(s, i) \
    (int2)( (LOAD_LOCAL_I1(s, i)), \
            (LOAD_LOCAL_I1(s, i + GROUP_SIZE)))
// Vector components are accessed with .s0/.s1 rather than array subscripts.
#define STORE_LOCAL_I2(s, i, v) \
    STORE_LOCAL_I1(s, i, (v).s0); \
    STORE_LOCAL_I1(s, i + GROUP_SIZE, (v).s1)
// shared[i] += shared[j]
#define ACCUM_LOCAL_I2(s, i, j) \
{ \
    int2 x = LOAD_LOCAL_I2(s, i); \
    int2 y = LOAD_LOCAL_I2(s, j); \
    int2 xy = (x + y); \
    STORE_LOCAL_I2(s, i, xy); \
}
////////////////////////////////////////////////////////////////////////////////////////////////////
__kernel void
reduce(
    __global int2 *output,
    __global const int2 *input,
    __local int2 *shared,
    const unsigned int n)
{
    const int2 zero = (int2)(0.0f, 0.0f);
    const unsigned int group_id = get_global_id(0) / get_local_size(0);
    const unsigned int group_size = GROUP_SIZE;
    const unsigned int group_stride = 2 * group_size;
    const size_t local_stride = group_stride * group_size;
    unsigned int op = 0;
    unsigned int last = OPERATIONS - 1;
    for(op = 0; op < OPERATIONS; op++)
    {
        const unsigned int offset = (last - op);
        const size_t local_id = get_local_id(0) + offset;
        // Each work-item accumulates a strided slice of the input
        // into its own slot in local memory.
        STORE_LOCAL_I2(shared, local_id, zero);
        size_t i = group_id * group_stride + local_id;
        while (i < n)
        {
            int2 a = LOAD_GLOBAL_I2(input, i);
            int2 b = LOAD_GLOBAL_I2(input, i + group_size);
            int2 s = LOAD_LOCAL_I2(shared, local_id);
            STORE_LOCAL_I2(shared, local_id, (a + b + s));
            i += local_stride;
        }
        // Unrolled tree reduction in local memory: each step halves
        // the number of active work-items.
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 512)
        if (local_id < 256) { ACCUM_LOCAL_I2(shared, local_id, local_id + 256); }
#endif
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 256)
        if (local_id < 128) { ACCUM_LOCAL_I2(shared, local_id, local_id + 128); }
#endif
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 128)
        if (local_id < 64) { ACCUM_LOCAL_I2(shared, local_id, local_id + 64); }
#endif
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 64)
        if (local_id < 32) { ACCUM_LOCAL_I2(shared, local_id, local_id + 32); }
#endif
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 32)
        if (local_id < 16) { ACCUM_LOCAL_I2(shared, local_id, local_id + 16); }
#endif
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 16)
        if (local_id < 8) { ACCUM_LOCAL_I2(shared, local_id, local_id + 8); }
#endif
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 8)
        if (local_id < 4) { ACCUM_LOCAL_I2(shared, local_id, local_id + 4); }
#endif
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 4)
        if (local_id < 2) { ACCUM_LOCAL_I2(shared, local_id, local_id + 2); }
#endif
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 2)
        if (local_id < 1) { ACCUM_LOCAL_I2(shared, local_id, local_id + 1); }
#endif
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    // Work-item 0 writes this work-group's partial result.
    if (get_local_id(0) == 0)
    {
        int2 v = LOAD_LOCAL_I2(shared, 0);
        STORE_GLOBAL_I2(output, group_id, v);
    }
}
CPU sum (the three lines below) vs. GPU sum (the OpenCL kernel above):
int sum = 0;
for (int i = 0; i < array.length; i++)
    sum += array[i];
How to program for GPUs?
• CUDA (NVIDIA)
• OpenCL (Apple, Intel, NVIDIA, AMD)
• OpenACC (Cray, CAPS, NVIDIA, PGI)
• MATLAB
• Accelerate, MARS, ÆminiumGPU
ÆminiumGPU
map(λx . x², [3,4,5,6]) = [9,16,25,36]
reduce(λxy . x+y, [3,4,5,6]) = 18 (partial sums: 3+4 = 7, 5+6 = 11)
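The map and reduce operations above can be sketched sequentially in plain C. This is only an illustration of what the two operations compute, not ÆminiumGPU's API; the names `map_int`, `reduce_int`, `map_reduce_int`, `square` and `add` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*unary_fn)(int);
typedef int (*binary_fn)(int, int);

int square(int x) { return x * x; }
int add(int x, int y) { return x + y; }

/* map: apply f to every element independently (data parallel by nature). */
void map_int(unary_fn f, const int *in, int *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = f(in[i]);
}

/* reduce: combine all elements with f. f should be associative so that a
 * parallel version is free to regroup the operations. */
int reduce_int(binary_fn f, const int *in, size_t n) {
    assert(n > 0);
    int acc = in[0];
    for (size_t i = 1; i < n; i++)
        acc = f(acc, in[i]);
    return acc;
}

/* map followed by reduce, without materialising the intermediate array. */
int map_reduce_int(unary_fn m, binary_fn r, const int *in, size_t n) {
    assert(n > 0);
    int acc = m(in[0]);
    for (size_t i = 1; i < n; i++)
        acc = r(acc, m(in[i]));
    return acc;
}
```

Calling `map_int(square, in, out, 4)` on `{3, 4, 5, 6}` fills `out` with `{9, 16, 25, 36}`, and `reduce_int(add, in, 4)` returns 18, matching the slide.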
ÆminiumGPU Decision Mechanism
Name             Size  C/R  Description
OuterAccess      3     C    Global GPU memory read.
InnerAccess      3     C    Local (thread-group) memory read. This area of the memory is faster than the global one.
ConstantAccess   3     C    Constant (read-only) memory read. This memory is faster on some GPU models.
OuterWrite       3     C    Write in global memory.
InnerWrite       3     C    Write in local memory, which is also faster than in global.
BasicOps         3     C    Simplest and fastest instructions. Includes arithmetic, logical and binary operators.
TrigFuns         3     C    Trigonometric functions, including sin, cos, tan, asin, acos and atan.
PowFuns          3     C    pow, log and sqrt functions.
CmpFuns          3     C    max and min functions.
Branches         3     C    Number of possible branching instructions such as for, if and while.
DataTo           1     R    Size of input data transferred to the GPU in bytes.
DataFrom         1     R    Size of output data transferred from the GPU in bytes.
ProgType         1     R    One of the following values: Map, Reduce, PartialReduce or MapReduce, which are the different types of operations supported by ÆminiumGPU.
Table I: List of features
Code (CUDA & OpenCL)
Reduction
[Figure: tree reduction inside a thread block. Rows: input, reduction step 1, reduction step 2. Elements are added in pairs (+) at each step, with __syncthreads() between steps.]
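The reduction steps can be sketched on the host in plain C. This is a sequential simulation under stated assumptions, not CUDA code: the inner loop stands in for the threads of one block running in parallel, the end of each outer iteration stands in for `__syncthreads()`, and the pairing used (element `tid` with element `tid + stride`) is one common variant that may differ from the pairing drawn on the slide. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* In-place tree reduction (sum) over a buffer whose length is a power of two.
 * After the call, data[0] holds the total; other entries hold partial sums. */
int tree_reduce_sum(int *data, size_t n) {
    assert(n > 0 && (n & (n - 1)) == 0); /* power of two only */
    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        /* On a GPU, each tid below would be a separate thread. */
        for (size_t tid = 0; tid < stride; tid++)
            data[tid] += data[tid + stride];
        /* A barrier (__syncthreads() in CUDA) would go here, so that every
         * partial sum is written before the next step reads it. */
    }
    return data[0];
}

/* Number of reduction steps needed for n elements: log2(n). */
int tree_reduce_steps(size_t n) {
    assert(n > 0 && (n & (n - 1)) == 0);
    int steps = 0;
    for (size_t stride = n / 2; stride > 0; stride /= 2)
        steps++;
    return steps;
}
```

For eight elements the sequential loop needs seven additions one after another, while the tree needs only three steps, since the additions inside each step are independent of each other.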
Recent advances
• Kernel calls from the GPU
• Multi-GPU support
• Unified Memory
• Task parallelism (Hyper-Q)
• Better profilers
• C++ support (auto and lambda)
me@alcidesfonseca.com
Alcides Fonseca

Programming for GPUs

  • 1. Programar para GPUs (Programming for GPUs)
      Alcides Fonseca, me@alcidesfonseca.com
      Universidade de Coimbra, Portugal
      It turns out we had a Ferrari sitting idle in our computer, right next to a Citroën 2CV
  • 2. About me
      • Web Developer (Django, Ruby, PHP, …)
      • Eccentric Programmer (Haskell, Scala)
      • Researcher (GPGPU Programming)
      • Lecturer (Distributed Systems, Operating Systems and Compilers)
  • 3. This presentation
      • 20 minutes - Blah blah blah
      • 20 minutes - printf("Code\n");
      • 20 minutes - Q&A
  • 4. Moore's Law
      Go multicore!
  • 5. Parallelism
      |     | Workstation 2010     | Server #1 2011            | Server #2 2013            |
      |-----|----------------------|---------------------------|---------------------------|
      | CPU | Dual Core @ 2.66 GHz | 2x6x2 Threads @ 2.80 GHz  | 2x8x2 Threads @ 2.00 GHz  |
      | RAM | 4 GB                 | 24 GB                     | 32 GB                     |
  • 7. GPGPU
      • Started with hacker scientists
          • Visual analysis for robots
          • UNIX password cracking
          • Neural networks
      • Nowadays:
          • DNA sequencing
          • Earthquake prediction
          • Generation of chemical compounds
          • Financial forecasting and analysis
          • WiFi password cracking
          • Bitcoin mining
  • 8. Parallelism
      |            | Workstation 2010       | Server #1 2011            | Server #2 2013            |
      |------------|------------------------|---------------------------|---------------------------|
      | CPU        | Dual Core @ 2.66 GHz   | 2x6x2 Threads @ 2.80 GHz  | 2x8x2 Threads @ 2.00 GHz  |
      | RAM        | 4 GB                   | 24 GB                     | 32 GB                     |
      | GPU        | NVIDIA GeForce GTX 285 | NVIDIA Quadro 4000        | AMD FirePro V4900         |
      | GPU #Cores | 240 (1508 MHz)         | 256 (950 MHz)             | 480 (800 MHz)             |
      | GPU memory | 1 GB                   | 2 GB                      | 1 GB                      |
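  • 9. Back of the napkin
      |                       | Workstation 2010   | Server #1 2011            | Server #2 2013            |
      |-----------------------|--------------------|---------------------------|---------------------------|
      | CPU                   | 2 Cores @ 2.66 GHz | 2x6x2 Threads @ 2.80 GHz  | 2x8x2 Threads @ 2.00 GHz  |
      | CPU Cores x Frequency | 5.32 GHz           | <67.2 GHz                 | <64 GHz                   |
      | GPU #Cores            | 240 (1508 MHz)     | 256 (950 MHz)             | 480 (800 MHz)             |
      | GPU Cores x Frequency | 361.92 GHz         | 243.2 GHz                 | 384 GHz                   |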
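The "cores x frequency" rows of slide 9 are plain multiplication, and the short C sketch below reproduces them. The helper names `aggregate_ghz` and `gpu_cpu_ratio` are hypothetical (they do not come from the talk), and the figure they compute is only a crude upper bound: it ignores memory bandwidth, instruction mix and branching, which the later "Problem" slides cover.

```c
#include <assert.h>

/* Hypothetical helper: aggregate "units x clock", returned in GHz.
   units     = number of cores (or hardware threads)
   clock_mhz = clock of each unit in MHz */
double aggregate_ghz(int units, double clock_mhz) {
    assert(units > 0 && clock_mhz > 0.0);
    return (double)units * clock_mhz / 1000.0;
}

/* Hypothetical helper: how many times larger the GPU aggregate is
   than the CPU aggregate for one machine of the table. */
double gpu_cpu_ratio(int gpu_cores, double gpu_mhz,
                     int cpu_threads, double cpu_mhz) {
    return aggregate_ghz(gpu_cores, gpu_mhz) /
           aggregate_ghz(cpu_threads, cpu_mhz);
}
```

For the 2010 workstation, `aggregate_ghz(240, 1508.0)` gives about 361.92 and `gpu_cpu_ratio(240, 1508.0, 2, 2660.0)` is roughly 68. For Server #2, `gpu_cpu_ratio(480, 800.0, 32, 2000.0)` is 6; since the slide marks that CPU figure with "<", the real gap there is presumably somewhat larger.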
  • 11. But if GPUs are that powerful, why do we still use CPUs?
  • 12. Problem #1 - Limited memory
      |            | Workstation 2010 | Server #1 2011 | Server #2 2013 |
      |------------|------------------|----------------|----------------|
      | RAM        | 4 GB             | 24 GB          | 32 GB          |
      | GPU memory | 1 GB             | 2 GB           | 1 GB           |
  • 13. Problem #2 - Different memories
      Extremely slow
  • 14. Problem #2 - Different memories
  • 15. Problem #2 - Different memories
  • 16. Problem #2 - Different memories
  • 17. Problem #3 - Branching is a bad idea
      Excerpt (ATI Stream Computing): "... in turn, contain numerous processing elements, which are the fundamental, programmable computational units that perform integer, single-precision floating-point, double-precision floating-point, and transcendental operations. All stream cores within a compute unit execute the same instruction sequence; different compute units can execute different instructions."
      Figure 1.2: Simplified Block Diagram of the GPU Compute Device (footnote: much of this is transparent to the programmer). Diagram labels: Ultra-Threaded Dispatch Processor, Compute Unit (x4), Stream Core, Processing Element, T-Processing Element, Branch Execution Unit, General-Purpose Registers, Instruction and Control Flow.
      Thread divergence:
          if (threadIdx.x % 2 == 0) {
              // do something
          } else {
              // do other thing
          }
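To make the cost of thread divergence on slide 17 concrete, here is a small CPU-side model in C. It is a sketch under stated assumptions, not real GPU behaviour: `LANES` is an assumed lockstep width (actual widths differ by vendor and generation), and `passes_for_branch`, `even_thread` and `even_group` are hypothetical names. The model only counts how many branch bodies a lockstep group has to issue for an `if`/`else`: one if every lane agrees, two if the lanes disagree.

```c
#include <assert.h>
#include <stdbool.h>

#define LANES 8 /* assumed lockstep width, for illustration only */

typedef bool (*predicate)(int thread_id);

/* Divergent condition from the slide: neighbouring threads disagree. */
bool even_thread(int thread_id) { return thread_id % 2 == 0; }

/* Alternative condition: all threads of one lockstep group agree. */
bool even_group(int thread_id) { return (thread_id / LANES) % 2 == 0; }

/* Number of branch bodies the group starting at first_thread_id must
   issue: the "then" body if any lane takes it, plus the "else" body if
   any lane takes that one. */
int passes_for_branch(int first_thread_id, predicate takes_then) {
    assert(first_thread_id >= 0);
    bool any_then = false;
    bool any_else = false;
    for (int lane = 0; lane < LANES; lane++) {
        if (takes_then(first_thread_id + lane)) {
            any_then = true;
        } else {
            any_else = true;
        }
    }
    return (int)any_then + (int)any_else;
}
```

With the slide's condition, `passes_for_branch(0, even_thread)` is 2, so the group pays for both bodies. Branching on the group index instead, `passes_for_branch(0, even_group)` is 1, which is why restructuring a condition so that neighbouring threads agree can help.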
  • 18. In summary
      | CPU            | GPU             |
      |----------------|-----------------|
      | MIMD           | SIMD            |
      | task parallel  | data parallel   |
      | low throughput | high throughput |
      | low latency    | high latency    |
  • 19. Problem #4 - It's hard
      GPU sum (OpenCL):
          #ifndef GROUP_SIZE
          #define GROUP_SIZE (64)
          #endif

          #ifndef OPERATIONS
          #define OPERATIONS (1)
          #endif

          // Vector loads/stores on global memory.
          #define LOAD_GLOBAL_I2(s, i) vload2((size_t)(i), (__global const int*)(s))
          #define STORE_GLOBAL_I2(s, i, v) vstore2((v), (size_t)(i), (__global int*)(s))

          // Local memory keeps the two int2 components GROUP_SIZE elements apart.
          #define LOAD_LOCAL_I1(s, i) ((__local const int*)(s))[(size_t)(i)]
          #define STORE_LOCAL_I1(s, i, v) ((__local int*)(s))[(size_t)(i)] = (v)
          #define LOAD_LOCAL_I2(s, i) \
              (int2)( (LOAD_LOCAL_I1(s, i)), (LOAD_LOCAL_I1(s, i + GROUP_SIZE)))
          #define STORE_LOCAL_I2(s, i, v) \
              STORE_LOCAL_I1(s, i, (v).x); STORE_LOCAL_I1(s, i + GROUP_SIZE, (v).y)
          #define ACCUM_LOCAL_I2(s, i, j) \
          { \
              int2 x = LOAD_LOCAL_I2(s, i); \
              int2 y = LOAD_LOCAL_I2(s, j); \
              int2 xy = (x + y); \
              STORE_LOCAL_I2(s, i, xy); \
          }

          __kernel void reduce(
              __global int2 *output,
              __global const int2 *input,
              __local int2 *shared,
              const unsigned int n)
          {
              const int2 zero = (int2)(0, 0);
              const unsigned int group_id = get_global_id(0) / get_local_size(0);
              const unsigned int group_size = GROUP_SIZE;
              const unsigned int group_stride = 2 * group_size;
              const size_t local_stride = group_stride * group_size;

              unsigned int op = 0;
              unsigned int last = OPERATIONS - 1;
              for(op = 0; op < OPERATIONS; op++)
              {
                  const unsigned int offset = (last - op);
                  const size_t local_id = get_local_id(0) + offset;

                  STORE_LOCAL_I2(shared, local_id, zero);

                  // Each work-item accumulates a strided slice of the input.
                  size_t i = group_id * group_stride + local_id;
                  while (i < n)
                  {
                      int2 a = LOAD_GLOBAL_I2(input, i);
                      int2 b = LOAD_GLOBAL_I2(input, i + group_size);
                      int2 s = LOAD_LOCAL_I2(shared, local_id);
                      STORE_LOCAL_I2(shared, local_id, (a + b + s));
                      i += local_stride;
                  }

                  // Tree reduction in local memory, halving the active range each step.
                  barrier(CLK_LOCAL_MEM_FENCE);
          #if (GROUP_SIZE >= 512)
                  if (local_id < 256) { ACCUM_LOCAL_I2(shared, local_id, local_id + 256); }
          #endif
                  barrier(CLK_LOCAL_MEM_FENCE);
          #if (GROUP_SIZE >= 256)
                  if (local_id < 128) { ACCUM_LOCAL_I2(shared, local_id, local_id + 128); }
          #endif
                  barrier(CLK_LOCAL_MEM_FENCE);
          #if (GROUP_SIZE >= 128)
                  if (local_id < 64) { ACCUM_LOCAL_I2(shared, local_id, local_id + 64); }
          #endif
                  barrier(CLK_LOCAL_MEM_FENCE);
          #if (GROUP_SIZE >= 64)
                  if (local_id < 32) { ACCUM_LOCAL_I2(shared, local_id, local_id + 32); }
          #endif
                  barrier(CLK_LOCAL_MEM_FENCE);
          #if (GROUP_SIZE >= 32)
                  if (local_id < 16) { ACCUM_LOCAL_I2(shared, local_id, local_id + 16); }
          #endif
                  barrier(CLK_LOCAL_MEM_FENCE);
          #if (GROUP_SIZE >= 16)
                  if (local_id < 8) { ACCUM_LOCAL_I2(shared, local_id, local_id + 8); }
          #endif
                  barrier(CLK_LOCAL_MEM_FENCE);
          #if (GROUP_SIZE >= 8)
                  if (local_id < 4) { ACCUM_LOCAL_I2(shared, local_id, local_id + 4); }
          #endif
                  barrier(CLK_LOCAL_MEM_FENCE);
          #if (GROUP_SIZE >= 4)
                  if (local_id < 2) { ACCUM_LOCAL_I2(shared, local_id, local_id + 2); }
          #endif
                  barrier(CLK_LOCAL_MEM_FENCE);
          #if (GROUP_SIZE >= 2)
                  if (local_id < 1) { ACCUM_LOCAL_I2(shared, local_id, local_id + 1); }
          #endif
              }

              // The first work-item of each group writes the group's partial sum.
              barrier(CLK_LOCAL_MEM_FENCE);
              if (get_local_id(0) == 0)
              {
                  int2 v = LOAD_LOCAL_I2(shared, 0);
                  STORE_GLOBAL_I2(output, group_id, v);
              }
          }
      CPU sum:
          int sum = 0;
          for (int i=0; i<array.length; i++)
              sum += array[i];
  • 20. How to program for GPUs?
      • CUDA (NVIDIA)
      • OpenCL (Apple, Intel, NVIDIA, AMD)
      • OpenACC (Cray, CAPS, NVIDIA, PGI)
      • MATLAB
      • Accelerate, MARS, ÆminiumGPU
  • 21. ÆminiumGPU
      • map(λx . x², [3,4,5,6]): 3 → 9, 4 → 16, 5 → 25, 6 → 36
      • reduce(λxy . x+y, [3,4,5,6]): 3+4 = 7, 5+6 = 11, 7+11 = 18
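The map and reduce operations on slide 21 can be sketched in plain C as follows. This is not ÆminiumGPU's API; the names `map_int`, `reduce_int`, `square` and `add` are hypothetical, and the code runs sequentially on the CPU just to pin down what the two operations compute.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*unary_fn)(int);
typedef int (*binary_fn)(int, int);

int square(int x) { return x * x; }
int add(int x, int y) { return x + y; }

/* map: out[i] = f(in[i]). Every element is independent of the others,
   which is what makes this operation data parallel. */
void map_int(unary_fn f, const int *in, int *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = f(in[i]);
    }
}

/* reduce: combines all elements with op, here as a left-to-right fold.
   Requires n > 0. */
int reduce_int(binary_fn op, const int *in, size_t n) {
    assert(n > 0);
    int acc = in[0];
    for (size_t i = 1; i < n; i++) {
        acc = op(acc, in[i]);
    }
    return acc;
}
```

On `[3,4,5,6]`, `map_int(square, ...)` fills the output with 9, 16, 25, 36 and `reduce_int(add, ...)` returns 18. The slide's intermediate values 7 and 11 come from grouping the same sum as (3+4)+(5+6); because `+` is associative the result is identical, and that regrouping is what lets a GPU compute the partial sums in parallel.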
  • 22. ÆminiumGPU Decision Mechanism
      | Name           | Size | C/R | Description |
      |----------------|------|-----|-------------|
      | OuterAccess    | 3    | C   | Global GPU memory read. |
      | InnerAccess    | 3    | C   | Local (thread-group) memory read. This area of the memory is faster than the global one. |
      | ConstantAccess | 3    | C   | Constant (read-only) memory read. This memory is faster on some GPU models. |
      | OuterWrite     | 3    | C   | Write in global memory. |
      | InnerWrite     | 3    | C   | Write in local memory, which is also faster than in global. |
      | BasicOps       | 3    | C   | Simplest and fastest instructions. Include arithmetic, logical and binary operators. |
      | TrigFuns       | 3    | C   | Trigonometric functions, including sin, cos, tan, asin, acos and atan. |
      | PowFuns        | 3    | C   | pow, log and sqrt functions. |
      | CmpFuns        | 3    | C   | max and min functions. |
      | Branches       | 3    | C   | Number of possible branching instructions such as for, if and while. |
      | DataTo         | 1    | R   | Size of input data transferred to the GPU in bytes. |
      | DataFrom       | 1    | R   | Size of output data transferred from the GPU in bytes. |
      | ProgType       | 1    | R   | One of the following values: Map, Reduce, PartialReduce or MapReduce, which are the different types of operations supported by ÆminiumGPU. |
      Table I: List of features
  • 23. Code (CUDA & OpenCL)
  • 24. Reduction
      Figure: within a thread block, the input is combined pairwise with + in successive steps (Input, Reduction step 1, Reduction step 2), with a __syncthreads() barrier after each step.
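The step-by-step reduction of slide 24 can be modelled on the CPU with the C sketch below. It is an illustration under assumptions, not the code shown in the talk: `BLOCK_SIZE` is an assumed power-of-two thread block size, `reduction_step` and `block_sum` are hypothetical names, and the loop over `tid` stands in for threads that a GPU would run concurrently. Returning from `reduction_step` plays the role of `__syncthreads()`: no step starts before every thread has finished the previous one.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 8 /* assumed power-of-two thread block size */

/* One reduction step: each "thread" tid < stride adds the element
   stride positions away. Reads touch indices >= stride and writes touch
   indices < stride, so the threads of one step never conflict. */
static void reduction_step(int *shared, size_t stride) {
    for (size_t tid = 0; tid < stride; tid++) {
        shared[tid] += shared[tid + stride];
    }
}

/* Sums exactly BLOCK_SIZE values in log2(BLOCK_SIZE) steps. */
int block_sum(const int *input) {
    assert(input != NULL);
    int shared[BLOCK_SIZE]; /* stands in for the block's shared memory */
    for (size_t tid = 0; tid < BLOCK_SIZE; tid++) {
        shared[tid] = input[tid];
    }
    for (size_t stride = BLOCK_SIZE / 2; stride > 0; stride /= 2) {
        reduction_step(shared, stride); /* implicit barrier on return */
    }
    return shared[0];
}
```

With eight inputs the loop runs three steps (strides 4, 2, 1), so summing 1 through 8 yields 36 after three barriers instead of seven sequential additions.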
  • 25. Recent advances
      • Kernel calls from GPU
      • Multi-GPU support
      • Unified Memory
      • Task parallelism (Hyper-Q)
      • Better profilers
      • C++ support (auto and lambda)