The Symbiotic Relationship between
Game Technology and
Supercomputing
Taeyong Kim, NVIDIA

taeyongk@nvidia.com
A brief introduction to the speaker

Studied computer graphics
Developed special-effects techniques used in Hollywood films
Currently researching GPU-based parallel algorithms
for special effects in games at NVIDIA
Opening remarks

Computers are no longer getting faster
(they are only getting wider)
A brief history of parallel computing

2012: the Titan system at ORNL —
18,688 nodes of
AMD Opteron CPUs and
NVIDIA Tesla GPUs

1985: the first distributed-memory
parallel system (Intel iPSC/1, 32 CPUs),
delivered to Oak Ridge (ORNL)
The new wall in computing
• The limits of single-core CPU performance,
  which depends on clock speed
• Heat
• Energy
"The party isn't over yet.
But the police have arrived
and the music has stopped."
– Peter Kogge –
The problem is power!

Jaguar (Nov. '11)
2.3 petaflops @ 7 megawatts
(224,256 x86 CPU cores)

7,000 homes = 7 megawatts —
the total power consumption of a small city!

Scaled up: 120 petaflops | 376 megawatts —
the total power consumption of San Francisco!
A hero for these troubled times —
ta-dah!

© Unreal Engine 4, Epic Games

The GPU (Graphics Processing Unit)
The booming game industry bankrolls supercomputing technology

The game industry is expected to reach $82 billion
(87 trillion KRW) by 2016
Today's real-time graphics require billions
of parallel operations
Why does playing a game need so much
parallel computation?

Millions of triangles, millions of pixels

[Figure: the real-time rendering pipeline — each input triangle undergoes a vertex transform, surface subdivision (tessellation), and a camera transform onto the image plane, and is then rasterized and shaded to produce colors.]
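The pipeline above can be sketched in miniature. This is a toy illustration, not how a GPU actually implements it: one triangle is perspective-projected onto the image plane and rasterized with an edge-function coverage test. Every pixel test is independent of every other, which is exactly what makes the work parallel. All names here (`project`, `rasterize`) are made up for the sketch.

```python
import numpy as np

def project(v, f=1.0):
    # Perspective-project a camera-space vertex (x, y, z) onto the image plane.
    x, y, z = v
    return np.array([f * x / z, f * y / z])

def edge(p0, p1, p):
    # The classic edge function: signed area of the triangle (p0, p1, p).
    return (p1[0] - p0[0]) * (p[1] - p0[1]) - (p1[1] - p0[1]) * (p[0] - p0[0])

def rasterize(tri, w=8, h=8):
    # A pixel center is covered when all three edge functions share a sign.
    a, b, c = [project(v) for v in tri]
    img = np.zeros((h, w), dtype=np.uint8)
    for j in range(h):
        for i in range(w):
            # Pixel center mapped to [-1, 1] normalized device coordinates.
            p = (2.0 * (i + 0.5) / w - 1.0, 2.0 * (j + 0.5) / h - 1.0)
            e = [edge(a, b, p), edge(b, c, p), edge(c, a, p)]
            if all(v >= 0 for v in e) or all(v <= 0 for v in e):
                img[j, i] = 1  # "shade" the covered pixel
    return img

# One triangle at depth z = 2 in front of the camera.
tri = [np.array([-1.0, -1.0, 2.0]),
       np.array([1.0, -1.0, 2.0]),
       np.array([0.0, 1.0, 2.0])]
coverage = rasterize(tri)
```

On a GPU, the per-pixel loop body would run as thousands of threads at once.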
Techniques used in games are no longer
just for games?
Scientific computing on supercomputers requires quadrillions
of parallel operations per second
e.g., why was today's weather forecast wrong?
GPU = graphics by day, but a supercomputer inside

CPU = a few processors specialized
for complex operations,
designed for maximum single-threaded
performance on general-purpose work

GPU = a large number of processors
optimized for simple operations;
slower at general-purpose single-threaded work,
but designed for total throughput
and optimized for performance per watt
The evolution of the GPU
• 1995 — RIVA 128, 3M transistors (fixed-function processing)
• 2000 — GeForce 256, 23M transistors
• 2001 — GeForce 3, 60M transistors (programmable shaders)
• 2003 — GeForce FX, 250M transistors
• 2006 — GeForce 8800, 681M transistors (general-purpose programmable: CUDA)
• 2012 — "Kepler", 7B transistors
Moore's law
• Circuit density doubles every 18 months
  (or every 2 years)
• For CPUs, decades of
  single-core performance gains
• "The Power Wall"
Moore's law (the GPU edition)
• Circuit density doubles every 18 months
  (or every 2 years)
• The added density goes into
  a growing number of cores
• Hundreds to thousands of 'cores'
The "new" Moore's law
• Individual cores are no longer getting faster; there will only be
  more of them
• New parallel algorithms are needed for performance gains
• Data-parallelism suited to the new many-core environment is required
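The shift described above — from faster cores to more cores — changes how code must be written. A minimal illustration of the data-parallel formulation, using numpy as a stand-in for a many-core processor (the function names are mine, for illustration):

```python
import numpy as np

# Serial formulation: one element after another on a single core.
def saxpy_serial(a, x, y):
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

# Data-parallel formulation: one whole-array operation that a many-core
# processor can split across its cores; no element depends on another.
def saxpy_parallel(a, x, y):
    return a * x + y

x = np.arange(4, dtype=np.float64)   # [0, 1, 2, 3]
y = np.ones(4)
serial = saxpy_serial(2.0, x, y)
parallel = saxpy_parallel(2.0, x, y)
```

The serial loop gets no faster on a wider machine; the array form does.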
Common ground between game techniques
and supercomputing algorithms
1. Particle simulation
Particle simulation — in games
e.g., hair simulation

NVIDIA Hair Demo
Particle simulation — in games
Cloth simulation with vertical, horizontal,
and shear constraints

© Samaritan demo, Epic Games / NVIDIA APEX Clothing
Particle simulation — in molecular biology

Ribosome computation (simulated by NAMD, visualized by VMD):
forces acting on atoms and bonds
Used for new drug-discovery research and for testing
existing biological theory
Game and supercomputing workloads in common, part one
• Huge numbers of independent operations (millions to hundreds of millions)
• Parallelizing the logic for maximum performance
• Dependencies that break that independence
  • e.g., for particle computations, interdependence between particles
  • several operations touching the same memory address at once
• Remedies
  • new algorithms designed for parallel execution
  • removing dependencies via graph coloring, applied over multiple passes

Techniques to avoid data hazards

• Keep two write operations from landing on the same node
• Atomic operations -> serialize the computation -> slow
• Use graph coloring to build passes
• Within each pass everything is fully independent -> maximum parallelism
• The key is minimizing the number of passes
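The graph-coloring idea above can be made concrete: treat each constraint as a node, connect constraints that share a particle, and color the graph greedily; each color becomes one fully parallel pass. A toy sketch (the names are illustrative, and real engines use more sophisticated colorings):

```python
from collections import defaultdict

def color_constraints(constraints):
    # constraints: list of (i, j) particle-index pairs. Two constraints
    # conflict (cannot run in the same pass) when they share a particle.
    by_particle = defaultdict(list)
    for c, (i, j) in enumerate(constraints):
        by_particle[i].append(c)
        by_particle[j].append(c)
    colors = {}
    for c, (i, j) in enumerate(constraints):
        # Greedy: smallest color unused by any conflicting constraint.
        used = {colors[n] for p in (i, j) for n in by_particle[p] if n in colors}
        k = 0
        while k in used:
            k += 1
        colors[c] = k
    # Group constraints into passes; within a pass, no particle is
    # written by two constraints, so all of them can run in parallel.
    passes = defaultdict(list)
    for c, k in colors.items():
        passes[k].append(c)
    return [passes[k] for k in sorted(passes)]

# A chain of 4 particles: constraints (0,1), (1,2), (2,3) need only 2 passes.
print(color_constraints([(0, 1), (1, 2), (2, 3)]))  # → [[0, 2], [1]]
```

Fewer colors means fewer serial passes, which is exactly the goal stated above.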
2. Convolution (signal processing, image processing, etc.)

[Figure: a convolution kernel slides over a grid of input pixel values.]
Each new output value is the weighted sum of all the pixel values
around that pixel's position, with the weights determined by the kernel.
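The definition above can be written out directly. A naive sketch (illustrative names; production code would use a library or a cache-aware GPU kernel):

```python
import numpy as np

def convolve2d(img, kernel):
    # Each output pixel is the weighted sum of the pixel's neighborhood,
    # with weights given by the kernel; borders are zero-padded. (For the
    # symmetric kernel below, correlation and true convolution coincide.)
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros(img.shape)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * kernel)
    return out

# Blur a single bright pixel with a 3x3 box filter: its energy (9.0)
# spreads evenly over the 3x3 neighborhood.
img = np.zeros((5, 5))
img[2, 2] = 9.0
blur = np.full((3, 3), 1.0 / 9.0)
result = convolve2d(img, blur)
```

Every output pixel is computed independently — one thread per pixel on a GPU.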
Convolution — in games

Camera depth-of-field effect

In focus

© Halo 3, Bungie Studios
Convolution — in games
Film bloom (the "soft glow" effect)

Excess light
scattering around
a light source

© Crysis, Crytek GmbH
Convolution — in games

Rendering translucent skin (subsurface scattering)

With an opaque object, light stops at the surface;
with a translucent object, light scatters
internally around the light's entry point

Jimenez 2008
Used in Samaritan, Crysis 2, and other games
Convolution — in supercomputing

Reverse Time Migration, used in seismic exploration (e.g., locating oil wells)
Convolution is used to compute spatial derivatives, improving the accuracy
of the wave simulation

Petroleum Geo-Services: complex wave interaction near a salt tooth, propagated using AxRTM
Game and supercomputing workloads in common, part two

• Huge numbers of independent operations (millions to hundreds of millions)
• Convolution reads data from the pixels near each pixel's position
• In games — better performance via the texture cache
• In general-purpose computing — better performance via
  shared memory and constant memory
• Same memory hierarchy in the hardware, so memory-access
  optimization techniques carry over
3. Partial differential equations (PDEs)
Solving PDEs in fluid dynamics

Velocity

Pressure
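As a toy instance of such a PDE solve, here is one explicit finite-difference step of the diffusion (viscosity) term on a periodic 1-D grid. Every cell updates from its neighbors' old values, so all updates can happen at once — which is why such solvers map well to parallel hardware. The names and parameter values are illustrative:

```python
import numpy as np

def diffuse(u, nu=0.1, dt=0.1, dx=1.0):
    # One explicit Euler step of du/dt = nu * d2u/dx2 on a periodic grid,
    # using a central-difference Laplacian (stable here since
    # nu*dt/dx^2 = 0.01 is well below the 0.5 limit).
    lap = (np.roll(u, 1) - 2.0 * u + np.roll(u, -1)) / dx**2
    return u + dt * nu * lap

u = np.zeros(8)
u[4] = 1.0          # a spike of some quantity (e.g., a velocity component)
for _ in range(10):
    u = diffuse(u)  # the spike spreads out; the total is conserved
```

A full fluid solver adds advection, forces, and a pressure projection on top of this same grid-update pattern.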
Physics computation for fluid effects in games
Fluid dynamics on supercomputers

Weather prediction, product design, aerodynamic design, etc.

On the Development of a High-Order, Multi-GPU Enabled, Compressible
Viscous Flow Solver for Mixed Unstructured Grids, P. Castonguay et al.
Game and supercomputing workloads in common, part three
• Huge numbers of independent operations (millions to hundreds of millions)
• Used wherever heavy mathematical throughput (FLOPS) is needed
• Games
  • maximum effect within a fixed time budget
  • speed > accuracy
  • optimized for single-precision floating point / the GeForce line
• Supercomputing
  • maximum data at a high level of accuracy
  • accuracy > speed
  • optimized for double-precision floating point / the Tesla line
4. Fourier transform (FFT)

[Figure: a signal decomposed into the sum of individual frequencies]
FFT — in games

Post-processing effects such as lens flare:

high-intensity pixel input (e.g., from HDRI)
→ FFT → frequency-domain image
× frequency-domain kernel
→ inverse FFT → image with the effect applied
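The FFT → multiply → inverse-FFT pipeline above rests on the convolution theorem, and can be sketched with numpy (illustrative names; a real bloom or lens-flare pass would run on the GPU):

```python
import numpy as np

def fft_convolve(img, kernel):
    # Convolution theorem: FFT both inputs, multiply per element,
    # inverse-FFT. One multiply per frequency replaces a kernel-sized
    # sum per pixel, which pays off for large blur kernels.
    F = np.fft.rfft2(img)
    K = np.fft.rfft2(np.fft.ifftshift(kernel), s=img.shape)
    return np.fft.irfft2(F * K, s=img.shape)

n = 16
img = np.zeros((n, n))
img[n // 2, n // 2] = 1.0          # one high-intensity (HDR) pixel
yy, xx = np.mgrid[0:n, 0:n]
g = np.exp(-((yy - n // 2) ** 2 + (xx - n // 2) ** 2) / 8.0)
g /= g.sum()                       # normalized Gaussian "bloom" kernel
bloom = fft_convolve(img, g)       # the bright pixel spreads into a glow
```

Since the input is a single centered impulse, the output reproduces the kernel — the glow shape itself.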
© 3D Mark 11, Futuremark
FFT — in games
Ocean wave simulation

NVIDIA Ocean Demo
FFT — in supercomputing
Turbulence simulation, protein docking, molecular dynamics, image processing, cryptography
Game and supercomputing workloads in common, part four
• Libraries exist for the frequently used core primitives
  (FFT, linear algebra, etc.)
• Games
  • mostly via the DirectX / DirectCompute development environment
  • game engines and standalone middleware
• Supercomputing
  • the core CUDA libraries
  • CUFFT, CUBLAS, etc.
• The key kernels inside these accelerated libraries are much the same
  (same hardware underneath)
A fundamental similarity between supercomputing and game applications
Memory-bound cases (memory traffic > computation)

Gaming: ambient occlusion

HPC: sparse matrix-vector multiply
A fundamental similarity between supercomputing and game applications
Compute-bound cases (computation > memory traffic)

© Team Fortress 2, Valve

Gaming: complex lighting computations

HPC: blood-coagulation (protein and lipid) simulation using AMBER
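Whether a kernel lands on the memory-bound or compute-bound side comes down to its arithmetic intensity — flops performed per byte moved. A back-of-the-envelope sketch (8-byte values and 4-byte indices assumed; all numbers here are illustrative, not measurements):

```python
def arithmetic_intensity_spmv(nnz, n):
    # Sparse matrix-vector multiply: ~2 flops per nonzero (multiply + add),
    # but each nonzero drags a value and a column index in from memory,
    # plus the x and y vectors.
    flops = 2 * nnz
    bytes_moved = nnz * (8 + 4) + 2 * n * 8
    return flops / bytes_moved

def arithmetic_intensity_gemm(n):
    # Dense n x n matrix multiply: 2*n^3 flops over roughly 3*n^2 values.
    return (2 * n**3) / (3 * n**2 * 8)

# SpMV stays far below 1 flop/byte — bandwidth bound, like ambient
# occlusion; dense multiply's intensity grows with n — compute bound,
# like complex lighting.
spmv = arithmetic_intensity_spmv(nnz=5_000_000, n=1_000_000)
gemm = arithmetic_intensity_gemm(n=1024)
```

Comparing that ratio against the hardware's flops-per-byte balance point tells you which resource you will saturate first.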
Game and supercomputing workloads in common, the last part

• Software approaches to improving parallel performance

• Find the bottleneck
  • memory-bound (bandwidth bound) vs compute-bound?
• Requires an understanding of the hardware
• Use tools such as Parallel Nsight
GPU Computing – Game On!
Growth of GPU computing (2008 → 2013):
• CUDA-capable GPUs: 100M → 430M
• CUDA downloads: 150K → 1.6M
• Supercomputers: 1 → 50
• University courses: 60 → 640
• Academic papers: 4,000 → 37,000
2008 Supercomputing exhibit floor
2012 Supercomputing exhibit floor
ADVANCING HEALTHCARE

The Chinese Academy of Sciences used
a GPU-powered supercomputer to
model a complete H1N1 virus for the first time.
ENHANCING FINANCE

With NVIDIA Tesla GPUs, J.P. Morgan
achieved a 40x speedup of its risk
calculations.
MEDICAL BREAKTHROUGHS

GPUs enable doctors to perform beating-heart
surgery with robotic arms that predict
and adjust for movement.
Information search

[Chart: tweets per day, 2007–2012, rising to 500 million;
callouts: 500M tweets, 1M expressions, <5 minutes]
Audio search

SHAZAM
[Chart: user queries per month, 2008–2013, growing to 300M]
Pattern search
The future of game effects?

Today vs. the near future
The future of supercomputing?

Today vs. the future
Another future for supercomputing?

Today vs. the future
Mobile GPUs are coming

[Chart: relative graphics horsepower, 2009–2014 — from the iPhone 4 (1X)
through the Galaxy S2, iPad 4, and PS3, up to the 8800 GTX and
Mobile Kepler (~300X)]

Gesture, augmented reality, beautification,
speech recognition, facial recognition —

a mobile supercomputer?
http://nvidiakoreapsc.com
The graphics processor (GPU), which evolved for graphics work and games, has recently become an indispensable component of supercomputers.
General-purpose development tools such as CUDA are being applied, in a variety of ways, to almost every kind of programming problem.
But are game technology development and HPC programming really entirely different domains?
This session examines, from the standpoint of parallel algorithms for GPU speed optimization, the similarities between the techniques used in games and graphics and those used in supercomputers.
It also explains how the various GPU techniques learned in game development can be applied and extended to other problems.

  • But no matter whether you are doing large-scale computing on thousands of nodes or computing on a workstation, you need parallel computing if you want to increase performance. This is because, roughly speaking, computers are no longer growing faster; they are only growing wider and more parallel. And the most important problem that all of these computers — supercomputers, workstations, mobile computers — have to deal with is power.
  • Parallel computing enjoyed a heyday in the late 80s and early 90s. The first distributed-memory parallel computer, consisting of 32 Intel CPUs, was delivered to ORNL in 1985. For the next 10 years, proponents of parallel computing built massively parallel machines — exotic, expensive, and accessible only to a relatively small technical elite. Since then, these machines have been replaced by clusters of commodity hardware. One example: last year the Titan supercomputer came online at Oak Ridge National Labs, powered by more than 18 thousand nodes consisting of AMD Opteron CPUs and NVIDIA Kepler GPUs.
  • To get a sense of the importance of this issue, let's look at an example. The Jaguar supercomputer last year delivered 2.3 petaflops using 7 megawatts, which is the power consumed by a small city. What if we want to scale this system to deliver 120 petaflops?
  • It turns out that to deliver 120 petaflops we would need more than 370 megawatts of power, which is enough to power all of San Francisco. So clearly this approach is unsustainable — we need to deliver ever higher amounts of flops for HPC, but with significantly less power.
  • Fortunately there is good news – high performance computing has an unlikely hero – the video game industry…
  • and the GPUs that power it.
  • As you all know, GPUs have a day job, which is rendering computer graphics for the massive and growing video game industry. This day job is pretty demanding: gamers demand higher and higher fidelity, instantaneous response, and immersive experiences. The desire for better computer graphics is insatiable and drives the industry forward. As I mentioned, this is a huge industry — it is expected to reach $82 billion by 2016, which is about the size of the movie industry. The economies of scale derived from such a large base have enabled GPUs to fund R&D into parallel computing, which has benefited and continues to benefit the HPC community. Photo: 275,000 people attended Gamescom — the world's largest gaming event — held in Cologne, Germany, in 2011. Data: gaming market $82B (does not include hardware): PricewaterhouseCoopers, "Global entertainment and media outlook: 2012–2016".
  • So why is it that computer graphics requires so much parallel computing? To answer that, let's take a look at just one example — one frame rendered from Unreal Engine 4.
  • In order to calculate the color of each of the millions of pixels on the screen, we need to start with the 3D representation of the scene — millions of independent triangles. For each triangle we need to transform it, tessellate it, clip it, project it to 2D space, rasterize it to pixels, and finally shade each pixel; and for each location on the screen we are shading multiple overlapping pixels. The shading itself has to take into account complex effects such as the interaction of light with different materials, and then we need to apply a host of environmental and post-processing effects. All of these calculations, over all the triangles and all the pixels, are done in parallel, and they have to finish in less than a 30th of a second.
  • It's also easy to see why HPC requires quadrillions of parallel computations — we are trying to simulate and compute massive systems, whether it's the climate over large regions of the earth or a detailed simulation over the wing of an airplane. As the most energy-efficient parallel processors, GPUs accelerate energy simulation, protein folding, DNA alignment, climate modeling, astrophysics, seismic exploration, computational finance, radioastronomy, heart surgery… the list goes on and on.
  • To get an idea of why GPUs are better at delivering performance per watt, let's look at an architectural representation of the two. CPUs have a few cores and are optimized for delivering fast single-threaded performance. This means they run each core at a higher frequency, which costs a lot more power, and they spend far more power scheduling an instruction than executing it: as an example, an out-of-order core spends 2 nJ to schedule a 25 pJ FMUL (80x more energy). So a lot of power goes into things like speculative execution, out-of-order execution, and branch prediction, and only a tiny fraction goes into flops. In contrast, GPUs have lots of tiny cores running at a lower clock, and each core devotes most of its area to flops rather than instruction scheduling. This means slow single-threaded performance with longer latencies, but many more cores to improve the overall throughput.
  • GPUs have always been massively parallel throughput machines, but both their processing power and their programmability evolved over time. Let's take a quick look at the evolution of the GPU. Starting from 1995 with the RIVA 128 and its 3M transistors, we have come a long way in 15 years, increasing the transistor count by more than 3 orders of magnitude. As the computing power of GPUs has gone up, so has their programmability. The first couple of generations of GPUs were fixed-function, so writing any new algorithm on them meant tricks with multitexture, stencil buffers, depth buffers, blending, etc. Today, writing HPC applications that take advantage of the GPU is extremely straightforward using CUDA, NVIDIA's parallel programming platform, but this wasn't always the case. In 2001, DirectX 8 introduced programmable pixel and vertex shaders, which made computing on GPUs a bit simpler — now people could write computing applications on GPUs using tricks like render-to-texture, with each pixel serving as a unit of computation. Finally, in 2006 the first CUDA-capable card, the GeForce 8, was introduced, and this truly enabled GPUs to be used in a seamless and transparent way for any parallel computing workload, without having to go through any graphics abstractions.
  • In HPC, particle simulation is used, among other things, for molecular dynamics. Molecular dynamics is a computer simulation of the physical movements of atoms and molecules. NAMD is a powerful and widely used tool for simulating complex molecular systems and biomolecular processes on GPUs, and has applications in quickening the pace of drug discovery and other vital research in unraveling biological processes.
  • So let's talk a bit about what HPC and gaming workloads and algorithms have in common such that they are effectively able to utilize the same piece of hardware. It turns out they have more similarities than you might think! It is easy to imagine GPUs being good at things like medical imaging, but many more application domains run very well on GPUs.
  • Perhaps the simplest, and the most popular, approach to solving this problem is the Gauss-Seidel method. The idea is very simple: we apply the constraint correction for each constraint, iterating through the constraints one by one until things converge. This simple approach works pretty well in practice, but may not converge very well for large numbers of constraints.
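The Gauss-Seidel sweep described here can be sketched on a small linear system (illustrative; a game constraint solver applies the same one-at-a-time correction pattern to distance constraints instead of matrix rows):

```python
import numpy as np

def gauss_seidel(A, b, iters=50):
    # Sweep the rows (constraints) one by one, immediately reusing each
    # freshly corrected value. Simple and effective, but inherently
    # serial — which is why parallel solvers first split the work into
    # independent groups (e.g., by graph coloring).
    x = np.zeros_like(b)
    for _ in range(iters):
        for i in range(len(b)):
            s = A[i] @ x - A[i, i] * x[i]   # contribution of the others
            x[i] = (b[i] - s) / A[i, i]     # correct constraint i
    return x

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = gauss_seidel(A, b)   # converges for diagonally dominant systems
```

Jacobi-style variants update all rows from the old values instead, trading convergence speed for full parallelism.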
  • The next example is convolution. I am going to explain it with reference to a 2D image, but it can really be done in any number of dimensions. We have a kernel, and it is translated over each pixel in the source image. At each location we do a pairwise multiplication of the kernel and the source values, sum these together, and that is the value we place in the destination image. Source left: http://www.biomachina.org/courses/structures/01.html Source right: http://www.westworld.be
  • Convolution is useful for many effects in video games, for example depth of field. This is a cinematic effect that tries to mimic the properties of a real camera lens, where objects away from the focal point appear blurred. In this image, for example, the character in the front is in focus and everything else is out of focus. The effect helps the game look more cinematic and also helps direct the player's attention to the areas the game developer would like them to focus on.
  • Seismic imaging is a method of exploration geophysics that estimates the properties of the earth's subsurface from reflected seismic waves. RTM is the current state of the art in seismic imaging. It is a computationally expensive technique that uses the full two-way acoustic wave equation, instead of a simple one-way propagation, in order to better analyze complex situations, particularly sub-salt. The image on the right, for example, shows complex wave interaction near a salt tooth, which can be useful for trying to understand the oil content beneath the surface of the earth. The different colors denote different velocities, and the concentric half circles show pressure waves. RTM in the time domain uses convolution to compute discrete spatial derivatives in 3D by doing 1D convolutions in all dimensions of interest. Source left: http://www.pgs.com/ja/Pressroom/News/PGS_Releases_the_Ultimate_Dep/ Source right: http://www.acceleware.com/rtm
  • This PDE basically states that the rate at which some quantity x changes over time is given by some function f, which may itself depend on x and t. To use this equation we discretize our space (x) and time (t). After that, there are many approaches to solving the equations at each position and time step.
  • This equation specifies the evolution of the velocity field u over time t. We compute the solution by discretizing the space over a regular grid, initializing the velocity and pressure fields, and then evolving them at each timestep using the three steps discussed earlier. The first term is the advection term, which states that the velocity field pushes itself forward. The second term states that the velocity is affected by external forces f, like gravity. Finally, the projection operator P projects the field onto its divergence-free part. This ensures the fluid satisfies important properties, like the amount of fluid flowing into a region being the same as the amount flowing out. In order to do this we need to calculate and use the pressure field.
  • The Navier-Stokes equations are used for a number of important applications in HPC, including modeling the weather, ocean currents, water flow in a pipe, blood flow, and air flow around a wing. The particular example above simulates the flow of air around the deformable wings of an insect or bird, and runs on multiple GPUs. P. Castonguay, D.M. Williams, P.E. Vincent, A. Jameson, "On the Development of a High-Order, Multi-GPU Enabled, Compressible Viscous Flow Solver for Mixed Unstructured Grids," 20th AIAA Computational Fluid Dynamics conference, AIAA 2011-3229.
  • A Fourier transform can decompose a signal into individual frequencies.
  • The Fast Fourier Transform (FFT) is an important computational scheme commonly used to reduce the amount of overall calculation by transforming operations into spectral space. In particular, 3-D FFT is used in high-performance computing applications such as direct numerical simulation, protein docking simulations, cryptography, large polynomial multiplications, image and audio processing, and molecular dynamics. One particular use of the FFT is simulating turbulent flows, which is central to studying everything from the formation of hurricanes to the mixing process in the chemical industry. An example of such work has been carried out by researchers at Peking University using the Tianhe-1A GPU supercomputer.
  • These use cases were just some examples. There are similarities between HPC and gaming workloads at a very fundamental level.
  • Image on left: Team Fortress. Image on right: blood coagulation factor IX simulated by AMBER (90K atoms including water). AMBER: a molecular dynamics package primarily designed for biomolecular systems such as proteins and nucleic acids.
  • The Chinese Academy of Sciences uses GPU supercomputers to conduct some of the largest-scale molecular simulations in the world. Recently the Academy became the first to model a complete H1N1 virus. The approach marks a new supercomputer-centric way of dealing with the problems of epidemiology and virology that wasn't possible just a few years ago.
  • For investment banks, the ability to quickly calculate risks across a range of complex variables is critical to success. With NVIDIA Tesla GPUs, J.P. Morgan achieved a 40x speedup of its risk calculations. Data: NVIDIA press release: http://pressroom.nvidia.com/easyir/customrel.do?easyirid=A0D622CE9F579F09&version=live&prid=784689&releasejsp=release_157&xhtml=true
  • This is truly a medical breakthrough — GPUs are helping doctors perform beating-heart surgery. Performing beating-heart surgery is extremely risky and can be done by only 2% of surgeons. Medical researchers at France's LIRMM use GPUs to 'virtually' still a beating heart, which enables surgeons to treat patients by guiding robotic arms that predict and adjust for movement.
  • The video game industry is driven by an insatiable demand for more realistic images and effects. The images that we are able to simulate and render today, for example the dust cloud on the left, are still far away from where we would like them to be, for example the image on the right. This improvement in fidelity is going to come both from an increase in computational throughput and an improvement in the algorithms, where gaming will be borrowing more and more from HPC.
  • HPC requires a similar increase in computational throughput per watt in order to be able to solve the greatest challenges the world faces. For example, today, computer simulations provide great insight into weather. But to tackle climate change, we need a more accurate view for which we need systems that improve both throughput and energy efficiency.
  • Device: GFLOPS, relative (release year) — iPhone 4: 1.6, 1x (2010); Galaxy S2: 10.8, 6.75x (2011); iPad 4: 76, 47.5x (2012); [T4: 81, 50.6x (2013)]; PS3: 234, 146.25x; GT 550M: 284, 177.5x; 8800 GTX: 346, 216.25x; Mobile Kepler: 384, 240x