[若渴計畫] From GPU Hardware Concepts to Coding CUDA

From GPU Hardware Concepts to Coding CUDA
AJ
2014.6.17
Is a GPU only good for being a graphics card?
Can it be used for parallel computing?
Two Major Vendors
• NVIDIA
• AMD
• Both vendors run open source projects that enthusiasts can join
• Join what, exactly? I haven't dug into it yet
• My lab only has NVIDIA cards, so NVIDIA it is ~"~
• What programming model do you use to program an NVIDIA card?
• The single-instruction, multiple-thread (SIMT) programming model
What design mindset does using this model bring you?
Starting from the design concepts of NVIDIA GPUs
In NVIDIA GPUs, SIMT can be viewed through three characteristics:
• Single instruction, multiple register sets
• Single instruction, multiple addresses
• Single instruction, multiple flow paths
http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
Single Instruction, Multiple Register Sets
// Serial CPU version: one loop walks the whole array.
for (i = 0; i < n; ++i) a[i] = b[i] + c[i];

// SIMT version: each thread has its own register set holding its own i,
// so the loop disappears.
__global__ void add(float *a, float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = b[i] + c[i]; // no loop!
}
Costs:
• Each thread carries its own register set, so values get replicated (redundantly) across threads.
http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
Single Instruction, Multiple Addresses
__global__ void apply(short* a, short* b, short* lut) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = lut[b[i]]; // indirect memory access: each thread computes its own address
    // compare with the direct access: a[i] = b[i];
}
Costs:
• For DRAM, random access is inefficient compared with sequential access.
• For shared memory, random access is slowed down by bank conflicts. (Shared memory is set aside for now.)
http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
Single Instruction, Multiple Flow Paths
http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
__global__ void find(int* vec, int len,
                     int* ind, int* nfound,
                     int nthreads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int last = 0;
    int* myind = ind + tid * len; // each thread writes into its own slice of ind
    for (int i = tid; i < len; i += nthreads) {
        if (vec[i]) { // flow divergence: threads in a warp may take different paths
            myind[last] = i;
            last++;
        }
    }
    nfound[tid] = last; // per-thread count of nonzero elements found
}
[Figure: execution of find() across threads. The vec[] reads are coalesced, then the if (vec[i]) branch splits threads into a taken path and a not-taken path, each thread keeping its own registers.]
Those are the SIMT design characteristics.
First, let's look at the Kepler GK110 chip block diagram.
• 15 SMXs (next-generation streaming multiprocessors) × 192 cores each
• 4 warp schedulers per SMX
• 65536 registers per SMX
From the NVIDIA Kepler GK110 Architecture Whitepaper
• What is the warp scheduler for?
• Allocating execution resources inside the SMX
From the NVIDIA Kepler GK110 Architecture Whitepaper
Warps Operate in SIMT Fashion
1. In NVIDIA GPUs, a "warp" is composed of several (32) threads
that run together, and each thread needs its own registers.
2. Within a warp, execution is SIMT: the 32 threads execute the same
instruction. On flow divergence, the hardware handles the problem by
splitting execution into multiple warp passes, at a cost in performance.
(James Balfour, "CUDA Threads and Atomics", pp. 13-18)
http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
Warps Operate in SIMT Fashion (cont.)
1. Several warps make up a "block". A block is mapped to one
SMX, and the SMX's warp schedulers switch among the warps of a
block for execution. Each warp has its own register sets.
2. As the figure shows, warp scheduling within a block is zero
overhead (a fast context switch), because all state is held in the
register sets. A warp's state is either active or suspended.
http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
3. You can specify how many threads a block has, but the maximum
threads per block depends on the compute capability the hardware supports.
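A quick way to check that limit at runtime (a sketch of my own, not from the deck; the cudaDeviceProp fields shown are standard CUDA runtime API):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // properties of device 0
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max resident warps per SM: %d\n",
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}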
How Do Thread IDs Map to Warps?
• Warp ID (warpid)
• Which warp does a given thread in a block belong to? threadIdx.x / 32
• Thread ID = 0 ~ 31  -> warp 0
• Thread ID = 32 ~ 63 -> warp 1
• …
http://on-demand.gputechconf.com/gtc/2013/presentations/S3174-Kepler-Shuffle-Tips-Tricks.pdf , p.2
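A minimal in-kernel sketch of this mapping (my own illustration; warpSize is the built-in constant, 32 on these GPUs):

__global__ void whoami(int* out) {
    int warp_id = threadIdx.x / warpSize; // threads 0..31 -> warp 0, 32..63 -> warp 1, ...
    int lane_id = threadIdx.x % warpSize; // position within the warp
    out[threadIdx.x] = warp_id * warpSize + lane_id; // store so the compiler keeps both
}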
From the GPU principles above (and more), plus integration with the CPU,
the CUDA coding environment came about.
Things to Watch When Using CUDA
• Which NVIDIA GPU architecture you are using.
• NVIDIA Tesla K20c
• From https://developer.nvidia.com/cuda-gpus you can see the Tesla K20c has
Compute Capability 3.5.
• To install the CUDA environment, see http://docs.nvidia.com/cuda/cuda-getting-
started-guide-for-linux/index.html#axzz33nDhVV00 . The compiler is called
nvcc.
• The latest CUDA version is 6.0, but I installed 5.0 XD (too lazy to upgrade, ha).
• After installing, run the bundled deviceQuery binary to check that the
installation is correct.
/usr/local/cuda-5.0/samples/1_Utilities/deviceQuery
You might wonder: how does the same CUDA program manage to run
on GPUs with different numbers of SMXs?
Block Scalability
Blocks execute independently of one another, so the runtime can distribute them across however many SMXs the GPU has; more SMXs simply means more blocks run at once.
Program Compilation
CUDA 5: Separate Compilation & Linking
From Introducing CUDA 5.pdf
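A hedged sketch of what separate compilation looks like on the command line, reusing file names from the Makefile example below (-dc emits relocatable device code that nvcc links in a final step):

nvcc -arch=sm_35 -dc cu_a.cu -o cu_a.o
nvcc -arch=sm_35 -dc cu_b.cu -o cu_b.o
nvcc -arch=sm_35 cu_a.o cu_b.o -o output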
Makefile Example
##########################################################
# compiler setting
##########################################################
CC = gcc
CXX = g++
NVCC = nvcc
CFLAGS = -g -Wall
CXXFLAGS = $(CFLAGS) -Weffc++ -pg
LIBS = -lm -lglut -lGLU -lGL
INCPATH = -I./
OBJS = main.o \
       c_a.o \
       c_b.o \
       cpp_a.o \
       cpp_b.o \
       cu_a.o \
       cu_b.o
EXEC = output

all: $(OBJS)
	$(NVCC) $(OBJS) -o $(EXEC) $(LIBS) -pg

%.o: %.cu
	$(NVCC) -c $< -o $@ -g -G -arch=sm_35

%.o: %.cpp
	$(CXX) -c $(CXXFLAGS) $(INCPATH) $< -o $@

%.o: %.c
	$(CC) -c $(CFLAGS) $(INCPATH) $< -o $@
#########################################################
Suppose you've been handed someone else's parallelized program;
here is a promising method worth trying to improve performance.
The ILP method <= so the ILP we learned back in school can be
used like this!!
• Merge several threads into one -> ILP increases -> better chance to coalesce global memory accesses -> fewer blocks
-> each thread uses more registers -> occupancy drops
(Vasily Volkov, “Better Performance at Lower Occupancy”)
First, What Is Occupancy?
• Occupancy = number of warps running concurrently on a
multiprocessor divided by the maximum number of warps that can run
concurrently. (In other words: of the maximum number of threads the
GPU can run at the same time, how many are you actually keeping busy?)
From Optimizing CUDA – Part II © NVIDIA Corporation 2009
• Suppose one SMX of some GPU can run at most 1536 concurrent
threads and has 32K registers.
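A worked example with those assumed numbers: 1536 threads / 32 threads per warp = at most 48 resident warps. Staying at full occupancy leaves 32768 / 1536 ≈ 21 registers per thread; if a kernel needs, say, 42 registers per thread, only about 32768 / 42 ≈ 780 threads fit, so occupancy drops to roughly 780 / 1536 ≈ 51% (ignoring allocation granularity).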
An NVIDIA engineer
(http://stackoverflow.com/users/749748/harrism)
says on Stack Overflow:
• In general, as Jared mentions, using too many registers per thread is
not desirable because it reduces occupancy, and therefore reduces
latency hiding ability in the kernel. GPUs thrive on parallelism and do
so by covering memory latency with work from other threads.
• Therefore, you should probably not optimize arrays into registers.
Instead, ensure that your memory accesses to those arrays across
threads are as close to sequential as possible so you maximize
coalescing (i.e. minimize memory transactions).
http://stackoverflow.com/questions/12167926/forcing-cuda-to-use-register-for-a-variable
In other words: whether or not occupancy is high, give memory
accesses the chance to be read coalesced.
Continuing: how ILP affects NVIDIA GPUs
http://continuum.io/blog/cudapy_ilp_opt
Moving data
[Figure: data moving between the cores and the memory controller]
What code corresponds to the effect above?
With ILP = 2, the idea in pseudocode:
# read: each thread loads two independent elements
i = thread.id
ai = a[i]
bi = b[i]
j = i + 5
aj = a[j]
bj = b[j]
# compute: the two computations are independent, so they can overlap in the pipeline
ci = core(ai, bi)
cj = core(aj, bj)
# write
c[i] = ci
c[j] = cj
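A hedged CUDA rendering of the same pseudocode (core() stands in for whatever per-element work is done; bounds checks are omitted, as in the pseudocode):

__device__ float core(float x, float y) { return x + y; } // placeholder work

__global__ void ilp2(const float* a, const float* b, float* c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = i + 5;              // a second, independent element
    float ai = a[i], bi = b[i]; // the four loads can all be in flight at once
    float aj = a[j], bj = b[j];
    c[i] = core(ai, bi);        // two independent computations overlap in the pipeline
    c[j] = core(aj, bj);
}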
With ILP = 4, the practical effect => the GPU pipeline is kept
fuller
http://continuum.io/blog/cudapy_ilp_opt
Summary of the Main Ideas Above
• Hide latency = do other operations when
waiting for latency
• Increase ILP
• Increase occupancy
As the ILP method just showed, the number of registers a single
thread uses is an important consideration.
Interpreting Output of --ptxas-options=-v
http://stackoverflow.com/questions/12388207/interpreting-output-of-ptxas-options-v
http://stackoverflow.com/questions/7241062/is-local-memory-slower-than-shared-memory-in-cuda
• Is each CUDA thread using 46 registers?
Yes, correct.
• Is there no register spilling to local memory?
Yes, correct. (Note: "local memory" is per-thread memory that physically resides in off-chip device memory, not shared memory.)
• Is 72 bytes the sum total of the memory for the stack frames of the __global__ (the
parallel kernel you write) and __device__ (subroutines called from __global__ functions) functions?
Yes, correct.
How do I cap the number of registers a thread uses?
• Control register usage with the nvcc flag: --maxrregcount
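For example, combining both flags (kernel.cu is a made-up file name; both flags are standard nvcc options):

nvcc -arch=sm_35 --ptxas-options=-v --maxrregcount=32 -c kernel.cu -o kernel.o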
If the total registers allocated across threads exceeds the number of
registers on the GPU, what does the compiler do?
A Stack Overflow expert explains:
• PTX level allows many more virtual registers than the hardware.
Those are mapped to hardware registers at load time. The register
limit you specify allows you to set an upper limit on the hardware
resources used by the generated binary. It serves as a heuristic for the
compiler to decide when to spill (see below) registers when compiling
to PTX already so certain concurrency needs can be met.
• For Fermi GPUs there are at most 64 hardware registers. The 64th is
used by the ABI as the stack pointer and thus for "register spilling" (it
means freeing up registers by temporarily storing their values on the
stack and happens when more registers are needed than available) so
it is untouchable.
http://stackoverflow.com/questions/12167926/forcing-cuda-to-use-register-for-a-variable
We just said to spend more registers to buy memory-coalescing time,
yet overusing registers increases memory access time.
What now?
Ha! However much we theorize, you only find out by coding it~~~~~
Where can my program place the data it needs?
Mohamed Zahran,
“Lecture 6: CUDA Memories”
• Access speed:
shared memory >
constant memory >
global memory >
How should data be declared so that it lives in a given kind of memory?
(Any such description is imprecise; it depends on
where the compiler actually places things.)
A Stack Overflow expert:
• Dynamically indexed arrays cannot be stored in registers, because the GPU
register file is not dynamically addressable.
• Scalar variables are automatically stored in registers by the compiler.
• Statically-indexed (i.e. where the index can be determined at compile
time), small arrays (say, less than 16 floats) may be stored in registers by the
compiler.
http://stackoverflow.com/questions/12167926/forcing-cuda-to-use-register-for-a-variable
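A hedged sketch of the usual declarations and where they typically end up, subject to the compiler caveats above:

__constant__ float coeff[16]; // constant memory; filled from the host with cudaMemcpyToSymbol

__global__ void kernel(float* g_data) { // g_data points to global memory (cudaMalloc)
    __shared__ float tile[256];         // shared memory, one copy per block
    float x = g_data[threadIdx.x];      // scalar locals normally live in registers
    tile[threadIdx.x] = x * coeff[0];
    __syncthreads();
    g_data[threadIdx.x] = tile[threadIdx.x];
}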
Let's Look at a Simple Example
Summing two vectors
Jason Sanders, Edward Kandrot, "CUDA by Example"
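The book's listing appears in the slides as an image; a sketch along its lines (N is a small constant, and the launch shown later uses one thread per block, so blockIdx.x alone indexes the data):

#define N 10

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x; // one thread per block: the block index is the element index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}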
Where does the data come from? Copied from CPU memory to global
memory
Jason Sanders, Edward Kandrot, "CUDA by Example"
How do I call the parallel routine I wrote?
• At launch time you specify how many threads per block, and how many blocks per grid
• Here that means a grid of N blocks, with each block running 1 thread
(the <<<N, 1>>> launch shown in the book's figure)
Jason Sanders, Edward Kandrot, "CUDA by Example"
Copy from GPU global memory back to CPU memory for further
processing
Jason Sanders, Edward Kandrot, "CUDA by Example"
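A hedged end-to-end sketch of the flow these slides walk through (allocate device memory, copy in, launch, copy back, free); error checking omitted for brevity:

int main() {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    for (int i = 0; i < N; i++) { a[i] = -i; b[i] = i * i; }

    cudaMalloc(&dev_a, N * sizeof(int)); // global memory on the GPU
    cudaMalloc(&dev_b, N * sizeof(int));
    cudaMalloc(&dev_c, N * sizeof(int));

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice); // CPU -> GPU
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<N, 1>>>(dev_a, dev_b, dev_c); // N blocks, 1 thread each

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost); // GPU -> CPU

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}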
Putting the Whole Flow Together
http://en.wikipedia.org/wiki/CUDA
Why can the thread and block counts be specified as 1D, 2D, or 3D?
[Figure: the same 100 threads arranged either as 1D blocks or as 2D (9x9) blocks; either way, two blocks are needed.]
When the thread count is not a multiple of 32, the point of choosing among
1D/2D/3D layouts is to see which one packs the warps fullest!!!
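A hedged sketch of specifying a multi-dimensional launch with dim3 (the 9x9 shape echoes the figure above; the kernel name is hypothetical):

dim3 block(9, 9); // 81 threads per block, laid out in 2D
dim3 grid(2);     // 2 blocks cover 100 work items (the leftover threads stay idle)
// kernel<<<grid, block>>>(...); // inside, use threadIdx.x / threadIdx.y to index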
How Do I Measure GPU Run Time?
Profiling tool: nvprof
nvprof --events warps_launched,threads_launched ./<executable> <input args> >
result
Q&A
Q&A-1: Discussion of flow divergence
• A JIT approach
• Profile the program to learn which branch outcomes (true/false) occur, then split the cases and hand them to the JIT to execute concurrently
• Browsers use this approach to speed up processing
• This approach is very memory-hungry
Q&A-2: NVIDIA/AMD
• NVIDIA
• Laptops, servers
• AMD
• Mobile phones
Q&A-3: Discussion of Single Instruction, Multiple
Addresses
• How compilers handle random access
• Pointer analysis
Q&A-4:
• CUDA LLVM Compiler
• CUDA does not currently support OpenCL 2.0
• https://developer.nvidia.com/opencl
Q&A-5: Discussion on tracing code
• cuda-gdb
• http://docs.nvidia.com/cuda/cuda-gdb/#axzz34ufkPsqt
• E.g.:
• Note: for disassembly instructions to work
properly, cuobjdump must be installed and present in your $PATH.
Q&A-6: Where is GPU machine code placed for execution?
-> Not sure whether GPU design work discusses locality issues.
Q&A-7: Is there any benefit to splitting a function apart for parallelization?
• Function()
function1()
function2()
function3()
• ?
Q&A-8: Parallelizing collision avoidance for a 5-axis machine
• At every step the cutter takes, use the GPU to check for collisions
• Problem: the GPU draws power continuously
• If the 5-axis machine carves all day, won't the GPU's power draw be scary?
• Trade-off: power consumption vs. speed
CUDA Toolkit Documentation
• http://docs.nvidia.com/cuda/index.html#axzz33uurtJU9
Editor's Notes

  1. splashtop
  2. There was a particular topic at the time.
  3. On interpreting "programming model": at the last workshop I heard an explanation I quite liked. Using a given programming model brings you a particular design mindset.
  4. So next we turn to explaining SIMT.
  5. This is how CUDA describes what each thread does: i identifies a thread. Thread i takes element i of array b and element i of array c and stores their sum into element i of array a. The hardware resources each thread occupies are as shown on the right.
  6. Each thread i fetches element b[i] from the lut array. This means every thread can pick its own memory address to work on.
  7. A warp is a group: each warp is made up of 32 threads. The 32 threads that would have executed in lockstep within one warp instead become two warps executed sequentially [when the flow diverges].
  8. Latency hiding: while waiting on a latency, go do other work.
  9. When each thread uses the maximum number of registers, 8 warps can run concurrently; so a design with 9 dispatch units would leave one dispatch idle in this situation.