This document discusses accelerated computing using GPUs and OpenCL. It begins by covering the evolution of x86 processors towards multi-core designs and the use of GPUs as accelerators. It then introduces accelerated processing units that combine CPU and GPU components. The document concludes by introducing OpenCL as an open standard for programming GPUs and heterogeneous systems that allows developers to write code that scales across CPUs and GPUs.
5. 4P/24-core system example: very good scalability. One memory controller for every processor. Full-duplex HyperTransport links (up to 5.2GHz). Bus optimization: HT Assist (cache probe filtering). Still the only available 4P system with Direct Connect Architecture.
6. Direct Connect Architecture 1.0: balanced and scalable design to support up to 6 cores. Per CPU: 2 memory channels, 8 DIMMs. No front-side bus. HyperTransport™ technology. Integrated memory controller. NUMA memory architecture.
15. 2011 GPU Architecture: AMD Radeon™ HD 6900 Series. Dual graphics engines. New VLIW4 core architecture. Up to 24 SIMD engines. Up to 96 texture units. Upgraded render back-ends. Improved anti-aliasing performance. Fast 256-bit GDDR5 memory interface (up to 5.5 Gbps). New GPU compute features.
17. Old and New in High Performance Computing. Old: power is free, transistors are expensive. New: power is expensive, transistors are free (we can put more transistors on a chip than we can afford to turn on). Old: multiplies are slow, memory access is fast. New: multiplies are fast, memory is slow (up to 200 clocks to DRAM, 4 clocks for an FP multiply). Old: increasing instruction-level parallelism via compiler innovation. New: explicit thread and data parallelism must be exploited.
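The latency figures quoted on this slide imply a simple ratio, which the sketch below works out. The two constants come from the slide; the helper name is hypothetical, and the model deliberately ignores caches, pipelining, and overlapping of memory accesses with computation.

```python
# Latency figures quoted on the slide (the DRAM figure is an upper bound).
DRAM_ACCESS_CLOCKS = 200
FP_MULTIPLY_CLOCKS = 4


def multiplies_per_dram_access(dram_clocks, multiply_clocks):
    """Back-to-back FP multiplies that fit in the time of one DRAM access.

    Hypothetical helper for illustration only: it ignores caches,
    pipelining, and any overlap between memory and arithmetic.
    """
    return dram_clocks / multiply_clocks


ratio = multiplies_per_dram_access(DRAM_ACCESS_CLOCKS, FP_MULTIPLY_CLOCKS)
print(f"One DRAM access costs about {ratio:.0f} FP multiplies")
```

Under these assumptions a single trip to DRAM costs as much as roughly 50 multiplies, which is why keeping many independent threads in flight to hide memory latency matters more than shaving arithmetic.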
18. GPUs: more than just gaming. Oil exploration platform (2010) and Wii Sports Golf: both use GPUs.
20. Tasks like loading a texture or compiling a shader can execute in parallel with the main rendering thread. (Diagram compares DirectX® 10 and DirectX® 11.)
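The idea on this slide, a resource load running alongside the main rendering thread, can be sketched with ordinary threads. This is a generic illustration and not DirectX code: `load_texture` and `render_frames` are hypothetical stand-ins that sleep instead of doing real I/O or drawing.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_texture(name):
    """Stand-in for a slow resource load (hypothetical; sleeps instead of I/O)."""
    time.sleep(0.05)
    return f"texture:{name}"


def render_frames(count):
    """Stand-in for the main rendering loop; returns the frame indices drawn."""
    frames = []
    for i in range(count):
        time.sleep(0.01)  # pretend to draw one frame
        frames.append(i)
    return frames


with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(load_texture, "grass")  # runs on a worker thread
    frames = render_frames(3)                     # main thread keeps rendering
    texture = pending.result()                    # waits only if the load is unfinished

print(frames, texture)
```

The main loop never blocks on the load until it actually needs the result, which is the behavior the slide describes.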
28. Great use for additional CPU cores. Graphics workloads; other highly parallel workloads; serial/task-parallel workloads. Delivers optimal performance for a wide range of platform configurations.
29. ATI Stream Technology is… Heterogeneous: developers leverage AMD GPUs and x86 CPUs for optimal application performance and user experience. High performance: massively parallel, programmable GPU architecture delivers unprecedented performance and power efficiency. Industry standards: OpenCL™ and DirectCompute 11 enable cross-platform development. Engineering; Sciences; Government; Gaming; Digital Content Creation; Productivity.
32. Video Transcoding Sample: No GPU Acceleration. CPU usage: 100% (frames), using four CPU cores. GPU usage: 1%.
33. Video Transcoding Sample: ATI GPU Acceleration. CPU usage: 45% (control and frames). GPU usage: 35%, using hundreds of Stream Processors.
35. Today. Multi-core CPU: ~800 million transistors; multi-tasking. TeraFLOPS-class GPU: up to 2 billion transistors; games on multiple monitors; Full HD video and audio.
37. Power efficient. Cons: software availability? [Charts: performance plotted against time and against time x cores, including single-thread performance, each marked "We are here".]
38. A new era in performance evolution: Multi-Core; Single-Core; CPU; Core efficiency; Software Acceleration; Low power consumption; Multimedia; Gaming; GPU.
78. Comparing OpenCL™ and DirectX® 11 DirectCompute. How will developers choose between OpenCL™ and DirectX® 11 DirectCompute? The feature set is similar in both APIs. DirectX® 11 DirectCompute: easiest path to add compute capabilities to existing DirectX applications; Windows Vista® and Windows® 7 only. OpenCL™: ideal path for new applications porting to the GPU for the first time; true multiplatform (Windows®, Linux®, Mac OS); natural programming without dealing with a graphics API.
80. Subset of ISO C99 with language extensions - familiar to developers
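To show what "C99 with language extensions" looks like in practice, the sketch below holds a small OpenCL C vector-add kernel as a string and mirrors its body in plain Python. The kernel text is for reference only and is not compiled here; `vadd_work_item` and `run_ndrange` are hypothetical helpers that emulate one work-item per index sequentially, whereas a real OpenCL runtime would schedule the work-items in parallel on the device.

```python
# OpenCL C source for a vector-add kernel, shown for reference only; it is
# not compiled or run here. __kernel and __global are qualifiers, and
# get_global_id() is a built-in, that OpenCL C adds on top of its C99 subset.
KERNEL_SOURCE = """
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c)
{
    int i = get_global_id(0);   /* index of this work-item */
    c[i] = a[i] + b[i];
}
"""


def vadd_work_item(global_id, a, b, c):
    """Pure-Python mirror of the kernel body for a single work-item."""
    c[global_id] = a[global_id] + b[global_id]


def run_ndrange(kernel, global_size, *args):
    """Hypothetical emulator: runs one work-item per index, sequentially."""
    for global_id in range(global_size):
        kernel(global_id, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
c = [0.0] * len(a)
run_ndrange(vadd_work_item, len(a), a, b, c)
print(c)
```

Apart from the qualifiers and the work-item index lookup, the kernel body is ordinary C, which is the familiarity the slide refers to.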
Our new technology pillars that will help the channel differentiate
Explain how three monitors can be less expensive than a single 30” monitor, e.g. a 3x22” solution at about $500 vs. a single 30” at over $1000. On productivity, also explain that ISVs continue to leverage multi-monitor setups, e.g. in Microsoft Office 2010, PowerPoint can open multiple files in multiple windows.
Original legal approval: Maranello Platform Launch, March 2010. The first-generation DCA introduced features now expected in the market. [Cover features at bottom quickly and go to next slide.]
Original legal approval: Maranello Platform Launch, March 2010. Today’s introduction brings DCA 2.0: four memory channels; 12 DIMMs per CPU; supports up to 12 cores today and will support the next-generation core with up to 16 per CPU. Let’s take a closer look at the effect of memory on workloads. [Next slide.]
Add more deep blue computers
Add “All models ATI Radeon™”. Add “as of this date the HD5870 GPU has the highest GFLOPS/mm² of all known products”.
Work on the slide (larger text)
Using ATI Stream technology, enjoy better visual quality when you watch streaming video online (YouTube/Hulu) with new video enhancement features.*
Let’s look at today’s compute platforms. On the left you have a Phenom II with 758 million transistors on 45nm process technology. On the right you see a 5870 DX11 GPU with 2.15 billion transistors on 40nm process technology. Today, with the emergence of visual computing, there is more work than ever before for the GPU, especially with what is arguably the most important workload for consumers: video. The explosion of HD video, and now HD gaming, means the GPU matters more than ever in the PC platform. More user-generated content puts more of the work onto the GPU, such as video processing, rendering, and 3D user interfaces. The era of visual computing is also becoming more about mobility and being able to do more of what I’ve just described on the go. However, users do not want more compute capability at the expense of battery life or smaller form factors. Favoring one component over the other, or taking a niche approach to balanced visual computing platforms, does not meet the needs of the mass market. Usage scenarios favor a combination of GPU/CPU balance and low power.
Now, many of you are technologists, so you are probably glad to see me finally start talking about some technology: the workload changes are also dramatically impacting chip architectures. This chart does a good job of demonstrating the evolution of chip architectures. Starting on the left of the X axis, you go back in time to highly programmable, single-core CPUs, which aimed to increase throughput (Y axis) over time by first adding threads, then cores. GPUs, on the other hand, started out way to the right in terms of throughput and have been becoming more and more programmable. We call this evolution the move from Homogeneous Computing to Heterogeneous Computing, ending at the top right, where the two arrows meet, in what we call an APU: a combination of different types of cores, working closely together on different types of workloads for optimum performance per watt per mm². This is AMD’s architectural vision of the future and where we are heading with our first APU in 2011, the Llano processor, our first integrated CPU + GPU on a single piece of silicon.
WHERE WE ARE TODAY: An attempt to provide an environment in which optimized hardware can deliver higher absolute performance, better power efficiency, and lower cost. At the same time, the goal is to dramatically improve programmer productivity, as the cost of software development is substantially the same as that of hardware development. This means support for heterogeneous multi-core hardware and a much more effective application programming environment are critical. This chart does a good job of summarizing the evolution of chip architectures. Starting on the left of the X axis, you go back in time to highly programmable, single-core CPUs, which aimed to increase throughput (Y axis) over time by first adding threads, then cores. GPUs, on the other hand, started out way to the right in terms of throughput and have been becoming more and more programmable. We call this evolution the move from Homogeneous Computing to Heterogeneous Computing, ending at the top right, where the two arrows meet, in what we call an APU: a combination of different types of cores, working closely together on different types of workloads for optimum performance per watt per mm².
The need for this optimal energy-efficient balance of CPU and GPU represents the beginning of a new era of computing in 2011. The Fusion of CPU and GPU compute power is what the next chapter in visual computing requires: a powerful visual computing experience at home or on the go without compromise. Our AMD Fusion™ design is driven by mobility and is based on a low-power visual compute architecture that will enhance active and resting battery life while increasing both CPU and GPU performance. This is the culmination of the vision of ‘One AMD’, and only AMD can deliver the GPU and CPU combination that will be the future of computing.
Review slide to determine message
The industry has always tried to move away from proprietary technology and towards open standards when available. The proprietary Apple Display Connector never became popular, since DVI was license-free and widely available. 3dfx’s Glide API for 3D graphics failed to stick around in the market long after DirectX was available on a wide variety of hardware. NVIDIA’s Cg language was never widely used, since OpenGL and DirectX provided a compelling open alternative. The Unified Display Interface was a failed interface backed by Intel and NVIDIA, which was deprecated in favor of the license-free DisplayPort standard. Rambus has tried to bring many proprietary memory technologies to market, but they have always been displaced by JEDEC open memory standards. CUDA is a proprietary GPGPU model whose specification is controlled by only one company; we believe it will soon be replaced by OpenCL and the DirectX Compute Shader.