[조진현]Kgc2012 c++amp

C++ Accelerated Massive Parallelism
with Visual Studio 2012

Contents

• Introduction to GPGPU
• Introduction to C++ AMP

C++ AMP

General-Purpose computing on Graphics Processing Units

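Why hasn't the GPU been used for general-purpose computing until now?

• Can the CPU access results computed by the GPU?
• Can neighboring data be read within the graphics pipeline?
• Is the GPU's compute capability sufficient?
• Hardware support is essential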

Traditional execution model

CPU GPU

Direct3D
Main Memory Video Memory

Direct3D 11 graphics pipeline

Today's CPUs vs. GPUs

                      CPU          GPU
Compute               50 GFlops    1 TFlop
Memory bandwidth      10 GB/s      100 GB/s
RAM                   4-6 GB       1 GB
CPU-GPU transfer      1 GB/s

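Today's CPUs vs. GPUs

• The GPU has far higher compute throughput than the CPU
• The CPU has far more memory than the GPU
• The GPU has far higher memory bandwidth than the CPU
• Data transfer between the CPU and the GPU is very slow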
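The slide's round figures can be turned into a back-of-the-envelope estimate of when offloading to the GPU pays off. The following is a minimal sketch in plain C++ (not C++ AMP). It assumes the slide's numbers (50 GFlops, 1 TFlop, 1 GB/s transfer), ignores memory bandwidth and launch overhead, and uses hypothetical function names.

```cpp
#include <cassert>
#include <cstddef>

// Round figures taken from the slide; real hardware varies widely.
constexpr double kCpuFlops     = 50e9;  // 50 GFlops
constexpr double kGpuFlops     = 1e12;  // 1 TFlop
constexpr double kTransferRate = 1e9;   // 1 GB/s between CPU and GPU memory

// Estimated seconds to do the work on the CPU (compute only).
double cpu_seconds(std::size_t elements, double flops_per_element) {
    assert(flops_per_element > 0.0);
    return elements * flops_per_element / kCpuFlops;
}

// Estimated seconds on the GPU: copy the floats over, compute, copy them back.
double gpu_seconds(std::size_t elements, double flops_per_element) {
    assert(flops_per_element > 0.0);
    const double bytes    = static_cast<double>(elements) * sizeof(float);
    const double transfer = 2.0 * bytes / kTransferRate;  // to the GPU and back
    const double compute  = elements * flops_per_element / kGpuFlops;
    return transfer + compute;
}
```

Under these assumptions, one operation per element is dominated by the transfer, and the GPU only comes out ahead once each element needs several hundred operations.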

GPGPU

• Computes using an enormous number of threads
• Optimized for floating-point arithmetic
• Well suited to parallel processing

DirectCompute

• Microsoft's first GPGPU platform (with HLSL)
• Supported on DirectX 10-class hardware and later
• Programmed through the Direct3D API

[Diagram, built up step by step: the compute shader SimpleCS runs on the GPU's SIMD engines. In video memory, Buffer0 (for data) is bound through an SRV and Buffer1 (for results) through a UAV; the SIMD engines write their results to Buffer1.]

Current execution model

CPU GPU

Direct3D
Main Memory Video Memory

Problems with DirectCompute

• No Windows XP support
• Difficult and obscure (built on concepts familiar mainly to game programmers)
• GPU-only programming (APUs excluded)

C++ AMP

• Designed to adapt to future hardware changes
• Part of Visual C++
• Integrated into Visual Studio 2012
• Runs on top of Direct3D (DX11 and later)
• Performance, Productivity, Portability

C++ AMP Performance

• Much faster than multi-core CPU code for data-parallel workloads
• Speedups comparable to existing GPGPU platforms

C++ AMP Productivity

• Built on modern C++ (template-based implementation)
• Strong Visual Studio support

C++ AMP Portability

• Any GPU with a DirectX 11 driver
  ◦ NVIDIA GPUs
  ◦ AMD GPUs, APUs
  ◦ Intel GPUs (Ivy Bridge, …)
  ◦ ARM GPUs (Mali design, …)
• Falls back to the CPU when no GPU is available
  ◦ AMD / Intel CPUs (multi-core, SSE)
  ◦ ARM CPUs (multi-core, NEON)

    #include <iostream>

    int main()
    {
        int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

        for (int idx = 0; idx < 11; idx++)
        {
            v[idx] += 1;
        }

        for (unsigned int i = 0; i < 11; i++)
            std::cout << static_cast<char>(v[i]);
    }

Converting the loop to C++ AMP, one change per step:

• Add #include <amp.h> and using namespace concurrency;
• Wrap the array in an array_view: array_view<int> av(11, v);
• Access the data through the view in the loop body: av[idx] += 1;
• Print through the view: std::cout << static_cast<char>(av[i]);

C++ AMP “Hello World”

• File -> New -> Project
• Empty Project
• Project -> Add New Item
• Empty C++ file

    #include <iostream>
    #include <amp.h>
    using namespace concurrency;

    int main()
    {
        int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

        array_view<int> av(11, v);
        parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
        {
            av[idx] += 1;
        });

        for (unsigned int i = 0; i < 11; i++)
            std::cout << static_cast<char>(av[i]);
    }
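For readers without a C++ AMP toolchain, the execution model of the program above can be approximated in standard C++. This is only a sketch: `for_each_index` is a hypothetical sequential stand-in for `parallel_for_each`, and the lambda captures the array by reference rather than going through an `array_view` as the C++ AMP version does.

```cpp
#include <cassert>
#include <string>

// Hypothetical sequential stand-in for parallel_for_each over a 1-D extent:
// it calls the kernel once per index. C++ AMP runs these calls concurrently
// on the accelerator, so the kernel must not rely on any ordering.
template <typename Kernel>
void for_each_index(int extent, Kernel kernel) {
    assert(extent >= 0);
    for (int idx = 0; idx < extent; ++idx)
        kernel(idx);
}

// The slide's computation, expressed against the stand-in.
std::string decode_message() {
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    for_each_index(11, [&v](int idx) { v[idx] += 1; });

    std::string text;
    for (int c : v)
        text += static_cast<char>(c);
    return text;
}
```

Calling `decode_message()` returns the string the slide's program prints, because each element is shifted up by one character code.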

accelerator
Hardware optimized for parallel computation (including the CPU)

[Diagram: a host (CPUs and system memory) connected over PCIe to an accelerator, e.g. a discrete GPU.]

accelerator_view

An abstraction of the hardware

• Scheduling
• Memory management

Execution from the developer's point of view

[Diagram: many threads, each with its own per-thread registers, all sharing global memory.]

Identifying threads in C++ AMP
In C++ AMP, a very large number of threads run at the same time
• index class
  ◦ Thread ID
• extent class
  ◦ The length of each dimension of an array or array_view
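The relationship between an extent (the shape of the data) and an index (one thread's position in it) can be sketched in standard C++. `Extent2` and `Index2` below are hypothetical stand-ins for `extent<2>` and `index<2>`, not the C++ AMP types, and the sketch assumes row-major ordering with the row as the first component.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for concurrency::extent<2> and concurrency::index<2>.
struct Extent2 { int rows; int cols; };
struct Index2  { int row;  int col;  };

// Number of elements (and therefore threads) an extent describes.
std::size_t size_of(Extent2 e) {
    assert(e.rows >= 0 && e.cols >= 0);
    return static_cast<std::size_t>(e.rows) * static_cast<std::size_t>(e.cols);
}

// Row-major offset of a 2-D index inside the extent.
std::size_t linear_offset(Extent2 e, Index2 i) {
    assert(i.row >= 0 && i.row < e.rows && i.col >= 0 && i.col < e.cols);
    return static_cast<std::size_t>(i.row) * e.cols + i.col;
}

// Inverse mapping: recover the 2-D index from a linear element number.
Index2 index_of(Extent2 e, std::size_t offset) {
    assert(offset < size_of(e));
    return Index2{ static_cast<int>(offset / e.cols),
                   static_cast<int>(offset % e.cols) };
}
```

For a 4 x 2 extent, `size_of` gives 8 elements, and the index (1, 1) maps to offset 3.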

Memory in C++ AMP
Data resides in memory on the accelerator
• concurrency::array
  ◦ Data container (deep copy), a contiguous block of memory
  ◦ array<T, N>, N <= 128 (e.g. array<float, 2> b(4, 2);)
• concurrency::array_view
  ◦ Data wrapper (similar to an STL iterator)
  ◦ array_view<T, N>
• concurrency::graphics::texture, concurrency::graphics::writeonly_texture_view
  ◦ #include <amp_graphics.h>

restrict ( … )

• Tells the compiler which target the code is compiled for
• Currently only two specifiers are implemented (cpu, amp)

tiled thread

[Diagram: threads grouped into tiles. Each thread has per-thread registers, each tile has its own programmable cache, and all tiles share global memory.]

tile_static
A C++ AMP storage class for data
• Programmable cache
• Group shared memory
• On current hardware, 16-48 KB is allocated per thread group
• Usable only inside restrict(amp) functions
• Very fast access
• Improves performance by minimizing memory transfers
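The access pattern that tile_static enables can be emulated in standard C++: each tile copies its slice of global memory into a small local buffer once, then works only on that copy. The sketch below is sequential and illustrative only. `tile_sums` is a hypothetical name, and the local array merely stands in for a tile_static array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain C++ emulation: one outer iteration per tile. The local buffer plays
// the role that a tile_static array plays inside a restrict(amp) kernel.
template <std::size_t TileSize>
std::vector<int> tile_sums(const std::vector<int>& global) {
    static_assert(TileSize > 0, "tile size must be positive");
    assert(global.size() % TileSize == 0);  // assume the data splits evenly into tiles

    std::vector<int> sums;
    for (std::size_t origin = 0; origin < global.size(); origin += TileSize) {
        int local[TileSize];  // stands in for: tile_static int local[TileSize];
        for (std::size_t i = 0; i < TileSize; ++i)
            local[i] = global[origin + i];  // one global read per element

        int sum = 0;
        for (std::size_t i = 0; i < TileSize; ++i)
            sum += local[i];  // all further accesses hit the local copy
        sums.push_back(sum);
    }
    return sums;
}
```

With a tile size of 2, the input {1, 2, 3, 4, 5, 6} yields one sum per tile: 3, 7 and 11.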

tiled_index
array_view<int,2> data(2, 6, p_my_data);
parallel_for_each(
    data.extent.tile<2,2>(),
    [=] (tiled_index<2,2> t_idx)… { … });

[Figure: a 2 x 6 grid (row 0-1, col 0-5) divided into 2 x 2 tiles; T marks the thread at row 1, col 3.]

• t_idx.global       // index<2> (1, 3)
• t_idx.local        // index<2> (1, 1)
• t_idx.tile         // index<2> (0, 1)
• t_idx.tile_origin  // index<2> (0, 2)
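The four values on the slide all follow from the global index and the tile dimensions. The sketch below reproduces that arithmetic in standard C++. `TiledIndex2` and `decompose` are hypothetical names that mirror the tiled_index fields rather than the C++ AMP type itself.

```cpp
#include <cassert>

struct Index2 { int row; int col; };

// Hypothetical mirror of the four tiled_index<TileRows, TileCols> fields.
struct TiledIndex2 {
    Index2 global;       // position in the whole extent
    Index2 local;        // position inside the tile
    Index2 tile;         // which tile the thread belongs to
    Index2 tile_origin;  // global position of the tile's first element
};

template <int TileRows, int TileCols>
TiledIndex2 decompose(Index2 global) {
    static_assert(TileRows > 0 && TileCols > 0, "tile dimensions must be positive");
    assert(global.row >= 0 && global.col >= 0);

    TiledIndex2 t;
    t.global      = global;
    t.tile        = Index2{ global.row / TileRows, global.col / TileCols };
    t.local       = Index2{ global.row % TileRows, global.col % TileCols };
    t.tile_origin = Index2{ t.tile.row * TileRows, t.tile.col * TileCols };
    return t;
}
```

For the thread T at global index (1, 3) with 2 x 2 tiles, `decompose<2, 2>` gives local (1, 1), tile (0, 1) and origin (0, 2), matching the slide.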

tile_barrier

• Synchronizes all threads in a tile
• t_idx.barrier.wait();
• all_memory_fence, global_memory_fence, tile_static_memory_fence
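Why a barrier is needed can be seen in a two-phase pattern, emulated here sequentially in standard C++. Each inner loop iteration plays the role of one thread in a tile, and the comment marks where `t_idx.barrier.wait()` would go in a real kernel. `reverse_within_tiles` is a hypothetical name, and the local array only stands in for a tile_static array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain C++ emulation of each tile reversing its elements via a shared buffer.
template <std::size_t TileSize>
void reverse_within_tiles(std::vector<int>& data) {
    static_assert(TileSize > 0, "tile size must be positive");
    assert(data.size() % TileSize == 0);  // assume the data splits evenly into tiles

    for (std::size_t origin = 0; origin < data.size(); origin += TileSize) {
        int shared[TileSize];  // stands in for a tile_static array

        // Phase 1: every "thread" publishes its own element.
        for (std::size_t local = 0; local < TileSize; ++local)
            shared[local] = data[origin + local];

        // In C++ AMP, t_idx.barrier.wait() belongs here: no thread may start
        // phase 2 until every thread in the tile has finished phase 1.

        // Phase 2: every "thread" reads an element written by another thread.
        for (std::size_t local = 0; local < TileSize; ++local)
            data[origin + local] = shared[TileSize - 1 - local];
    }
}
```

In this emulation the two loops guarantee the ordering. On the accelerator the threads run concurrently, so without the barrier a thread could read a slot that has not been written yet.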

Use in games

• Particles (e.g. ParticleSystemAPI)
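Particles are a natural fit because each one can be updated independently of the others. The sketch below shows that shape in standard C++. The `Particle` layout and the update rule (one explicit Euler step) are assumptions for illustration and are not taken from ParticleSystemAPI.

```cpp
#include <cassert>
#include <vector>

// Hypothetical particle layout; not taken from ParticleSystemAPI.
struct Particle {
    float position;
    float velocity;
};

// One explicit Euler step for a single particle. It touches only its own
// particle, so one invocation per particle can run independently.
void step(Particle& p, float acceleration, float dt) {
    p.position += p.velocity * dt;
    p.velocity += acceleration * dt;
}

// Sequential stand-in for launching one thread per particle.
void step_all(std::vector<Particle>& particles, float acceleration, float dt) {
    assert(dt > 0.0f);
    for (Particle& p : particles)
        step(p, acceleration, dt);
}
```

Because `step` has no dependency between particles, the loop in `step_all` is the part that would become a `parallel_for_each` over the particle array.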

References

• http://blogs.msdn.com/b/nativeconcurrency/
