AMD accelerated computing - UFRJ
Presentation on accelerated computing and APUs - UFRJ - May 2011

Transcript of "Amd accelerated computing -ufrj"

  1. Agenda: x86 processor evolution; the GPU as an accelerator; Accelerated Processing Units; introduction to OpenCL
  2. Evolving x86 Processors
  3. AMD architecture: “Istanbul” six-core diagram. Cores 1 to 6, each with its own L2 cache, share an L3 cache and connect through a crossbar to the memory controller and to HyperTransport (PCI-e, chipset). Callouts: balanced native six-core processor; caches; lower memory latency; fast full-duplex bus.
  4. 4P/24-core system example: very good scalability. One memory controller for every processor; full-duplex HyperTransport links (up to 5.2GHz); bus optimization: HT Assist (cache probe filtering). Still the only available 4P system with Direct Connect Architecture.
  5. Direct Connect Architecture 1.0: balanced and scalable design to support up to 6 cores. 2 memory channels and 8 DIMMs per CPU. No front-side bus; HyperTransport™ technology; integrated memory controller; NUMA memory architecture.
  6. Direct Connect Architecture 2.0: balanced and scalable design to support up to 16 cores* per CPU. 4 memory channels and 12 DIMMs per CPU.
      • 1 hop between processors
      • Four memory channels
      • Up to 50% more DIMMs
      • Up to 33% increase in CPU-to-CPU communication speed±
  7. What is next for x86 CPUs
      • More processor cores to come (12, 16, 16 double cores)
      • More memory channels (improves memory bandwidth per core)
      • Improved IPC (8 per cycle is a target)
  8. Top500 list: beyond the petaflop. Datacenters in the USA will spend more than $3 billion on energy in 2009.
  9. 1997: Garry Kasparov vs. IBM Deep Blue
  10. The World’s Most Powerful GPU =
  11. 2011 GPU Architecture: AMD Radeon™ HD 6900 Series
      • Dual graphics engines
      • New VLIW4 core architecture
      • Up to 24 SIMD engines
      • Up to 96 texture units
      • Upgraded render back-ends: improved anti-aliasing performance
      • Fast 256-bit GDDR5 memory interface: up to 5.5 Gbps
      • New GPU compute features
  12. Designing very efficient GPUs. Full load: 180 W; idle: 27 W. [Chart: GFLOPS/W and GFLOPS/mm2 for ATI Radeon™ X1800 XT (Nov-05), X1900 XTX (Jan-06), HD 2900 PRO (Sep-07), HD 3870 (Nov-07), HD 4870 (Jun-08) and HD 5870 (Oct-09); data labels range from 0.42 to 14.47.]
  13. Old and New in High Performance Computing
      • Old: power is free, transistors are expensive. New: power is expensive, transistors are free (we can put more transistors on a chip than we can afford to turn on).
      • Old: multiplies are slow, memory access is fast. New: multiplies are fast, memory is slow (up to 200 clocks to DRAM, 4 clocks for an FP multiply).
      • Old: instruction-level parallelism increased through compiler innovation. New: explicit thread and data parallelism must be exploited.
  14. GPUs: more than just gaming. Processing power (millions of operations per second): single core 12; dual core 24; quad core 48; hexa core 72; 12 cores 144; Radeon HD 5970 2700. Both use GPUs: Wii Sports - Golf and an oil exploration platform (2010).
  15. DirectX® 11 Multi-Threading: the application, the DirectX runtime, and the DirectX driver can each run in separate threads. Tasks like loading a texture or compiling a shader can execute in parallel with the main rendering thread. [Diagram: DirectX® 10 vs. DirectX® 11]
  16. Today’s GPUs focused on: gaming, entertainment, productivity
  17. DirectX® 11 Tessellation: DirectX® 10 (no tessellation) vs. DirectX® 11 (tessellation). Images courtesy of Unigine Corp.
  18. 5/26/2011
  19. 5/26/2011
  20. Research companies already using GPUs: oil exploration; weather forecasting; fluid dynamics; nature simulation
  21. AMD Balanced Platform
      • CPU is excellent for running some algorithms: ideal place to process if the GPU is fully loaded; great use for additional CPU cores
      • GPU is ideal for data-parallel algorithms like image processing, CAE, etc.: great use for ATI Stream technology; great use for additional GPUs
      • Workloads: graphics workloads; serial/task-parallel workloads; other highly parallel workloads
      • Delivers optimal performance for a wide range of platform configurations
  22. ATI Stream Technology is…
      • Heterogeneous: developers leverage AMD GPUs and x86 CPUs for optimal application performance and user experience
      • High performance: massively parallel, programmable GPU architecture delivers unprecedented performance and power efficiency
      • Industry standards: OpenCL™ and DirectCompute 11 enable cross-platform development
      • Markets: sciences; government; engineering; gaming; digital content creation; productivity
  23. Improvements already reached consumers. [Chart: processor utilization, axis from 0% to 80%, series label “ATI Stream”.] Adobe Flash plugin used by Youtube.com: better image quality and video smoothness; lower processor usage.
  24. GPU-accelerated video transcoding (iPod video, HD video): up to 6x faster when using an AMD graphics card
  25. Video Transcoding Sample: no GPU acceleration, using four CPU cores. CPU usage: 100%; GPU usage: 1%; time to finish: 1h 52m; total energy: 0.23 kWh; peak power: 145 W; energy price: $0.15.
  26. Video Transcoding Sample: ATI GPU acceleration, using hundreds of Stream Processors (values without GPU acceleration in parentheses). CPU usage: 45% (100%); GPU usage: 35% (1%); time to finish: 26m (1h 52m); total energy: 0.11 kWh (0.23 kWh); peak power: 198 W (145 W); energy price: $0.07 ($0.15).
  27. FUSION TECHNOLOGY
  28. Today: multi-core CPU (~800 million transistors) and TeraFLOPS-class GPU (up to 2 billion transistors). Multi-tasking; gaming on multiple monitors; Full HD video and audio.
  29. A new era in performance evolution
      • Single-core. Challenges: power consumption, complexity. [Chart: single-thread performance over time, marked “We are here”]
      • Multi-core. Challenges: power consumption, software. [Chart: performance over time x cores, marked “We are here”]
      • Heterogeneous computing. Pros: performance, power efficient. Cons: software availability. [Chart: performance (marked “?”) over time, marked “We are here”]
  30. A new era in performance evolution. [Diagram labels: Single-Core; Multi-Core; CPU; Core efficiency; Software; Acceleration; Multimedia; Gaming; GPU]
  31. Putting it all together: the future is Fusion. AMD “Istanbul” six-core processor (cores 1 to 6, L2 caches, L3 cache, crossbar, memory controller, HyperTransport, PCI-e, chipset) next to the RV500 GPU core (2006) (ring stops, client interfaces, memory controller).
  32. Putting it all together: the future is Fusion. AMD “Istanbul” six-core processor (cores 1 to 6, L2 caches, L3 cache, crossbar, memory controller, HyperTransport, PCI-e, chipset) next to the RV700 GPU core (2008-2009).
  33. Putting it all together: the future is Fusion. AMD “Istanbul” six-core processor and RV700 GPU core, each labeled with its crossbar.
  34. 2011: welcome to the APU era! CPU, APU, GPU. “Supercomputing power in a notebook platform whose battery lasts for a full day”
  35. One Design, Fewer Watts, Massive Capability: “Zacate”. Northbridge (66 sq. mm, 13 watts) + AMD dual-core CPU (117 sq. mm, 25 watts) + discrete-level DirectX® 11 GPU (59 sq. mm, 8 watts) = Fusion APU (75 sq. mm, 18 watts)
  36. Graphics and Media Processing Efficiency Improvements
      • 2010 IGP-based platform: CPU chip (CPU cores, UNB / MC) with DDR3 DIMM memory, plus a separate chip with GPU, UVD, PCIe and SB functions. Bandwidth pinch points and latency hold back the GPU capabilities.
      • 2011 APU-based platform: APU chip (CPU cores, UNB, MC, GPU, UVD) with DDR3 DIMM memory and PCIe. Graphics requires memory bandwidth to bring full capabilities to life.
      • Bandwidth labels in the diagrams: ~17 GB/sec, ~7 GB/sec and ~27 GB/sec
      • 3X bandwidth between GPU and memory
      • Even the same sized GPU is substantially more effective in this configuration
      • Eliminate latency and power associated with the extra chip crossing
      • Substantially smaller physical footprint
  37. “Ontario” & “Zacate” Architecture
      APU
      • 2 x86 CPU cores (40nm “Bobcat” core, 1 MB L2, 64-bit FPU)
      • C6 and power gating
      • Array of SIMD engines: DX11 graphics performance; industry-leading 3D and graphics processing
      • 3rd generation Unified Video Decoder
      • H.264, VC1, DivX/Xvid formats
      • DDR3 800-1066, 2 DIMMs, 64-bit channel
      • BGA package
      Display and I/O
      • Two dedicated digital display interfaces: configurable externally as HDMI, DVI, and/or DisplayPort; also supports a single-link LVDS for internal panels
      • Integrated VGA
      • 5x8 PCIe®
      • “Hudson” Fusion Controller Hub
  38. OpenCL: Working together
  39. ATI Stream SDK: OpenCL™ for Multicore x86 CPUs and GPUs (http://developer.amd.com/). The Power of Fusion: developers leverage heterogeneous architecture to deliver superior user experience.
      • First complete OpenCL™ development platform
      • Certified OpenCL 1.0 compliant by the Khronos Group
      • Write code that can scale well on multi-core CPUs and GPUs
      • AMD delivers on the promise of OpenCL™, with both high-performance CPU and GPU technologies
      • Available for download now as part of the ATI Stream SDK beta program: includes documentation, samples, and developer support
  40. OpenCL™: Game-Changing Development. Enabling broad adoption of GP-GPU capabilities.
      • Industry standard API: open, multiplatform development platform for heterogeneous architectures
      • The power of Fusion: leverages CPUs and GPUs for a balanced system approach
      • Broad industry support: created by architects from AMD, Apple, IBM, Intel, Nvidia, Sony, etc.
      • Fast track development: ratified in December; AMD is the first company to provide a complete OpenCL solution
      • Momentum: enormous interest from mainstream developers and application ISVs
      • More stream-enabled applications across all markets
  41. Open Standards: Maximize Developer Freedom and Addressable Market
      • Vendor specific (cross-platform limiters): Apple Display Connector; 3dfx Glide; Nvidia CUDA; Nvidia Cg; Rambus; Unified Display Interface
      • Vendor neutral (cross-platform enablers): Digital Visual Interface; OpenCL™; DirectX®; Certified DP; JEDEC; OpenGL®
  42. Comparing OpenCL™ and DirectX® 11 DirectCompute. How will developers choose between OpenCL™ and DirectX® 11 DirectCompute? The feature set is similar in both APIs.
      • DirectX® 11 DirectCompute: easiest path to add compute capabilities to existing DirectX applications; Windows Vista® and Windows® 7 only
      • OpenCL™: ideal path for new applications porting to the GPU for the first time; true multiplatform (Windows®, Linux®, MacOS); natural programming without dealing with a graphics API
  43. Anatomy of OpenCL™
      Language Specification
      • C-based cross-platform programming interface
      • Subset of ISO C99 with language extensions, familiar to developers
      • Well-defined numerical accuracy: IEEE 754 rounding behavior with defined maximum error
      • Online or offline compilation and build of compute kernel executables
      • Includes a rich set of built-in functions
      Platform Layer API
      • A hardware abstraction layer over diverse computational resources
      • Query, select and initialize compute devices
      • Create compute contexts and work-queues
      Runtime API
      • Execute compute kernels
      • Manage scheduling, compute, and memory resources
  44. OpenCL Example
      Scalar:
          void square(int n, const float *a, float *result)
          {
              int i;
              for (i = 0; i < n; i++)
                  result[i] = a[i] * a[i];
          }
      Data-parallel:
          __kernel void dp_square(__global const float *a, __global float *result)
          {
              /* Each work-item squares the one element selected by its global ID. */
              int id = get_global_id(0);
              result[id] = a[id] * a[id];
          }
          // dp_square executes over "n" work-items
  45. Summary: x86 processor evolution; the GPU as an accelerator; Accelerated Processing Units; introduction to OpenCL. http://developer.amd.com
  46. Thank you! roberto.brandao@amd.com
  47. roberto.brandao@amd.com Thank you!
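
The two video transcoding slides (25 and 26) give enough figures to work out the gain directly. The following is a minimal sketch in C, not part of the original deck; the function names `speedup` and `fraction_saved` are hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>

/* How many times faster the accelerated run finished.
   Both arguments are wall-clock times in the same unit (here: minutes). */
double speedup(double baseline_minutes, double accelerated_minutes)
{
    assert(accelerated_minutes > 0.0);
    return baseline_minutes / accelerated_minutes;
}

/* Fraction of a quantity (energy, cost, ...) saved relative to the baseline.
   Both arguments must use the same unit (here: kWh). */
double fraction_saved(double baseline, double accelerated)
{
    assert(baseline > 0.0);
    return 1.0 - accelerated / baseline;
}
```

With the slide values, `speedup(112.0, 26.0)` (1h 52m against 26m) is about 4.3, and `fraction_saved(0.23, 0.11)` is about 0.52, so the GPU-accelerated run uses roughly half the energy even though its peak power is higher (198 W against 145 W).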
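
The scalar and data-parallel versions on slide 44 compute the same result; what changes is who owns the loop. The sketch below, in plain C, emulates that idea on the CPU under stated assumptions: `dp_square_item` and `emulate_ndrange` are hypothetical names, and the work-item ID is passed as an argument instead of coming from `get_global_id(0)`. A real OpenCL runtime would schedule the work-items in parallel on the device rather than in a sequential loop.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar version from the slide: one call loops over all n elements. */
void square(int n, const float *a, float *result)
{
    for (int i = 0; i < n; i++)
        result[i] = a[i] * a[i];
}

/* Body of the data-parallel kernel written as plain C.
   Assumption: the id that OpenCL would return from get_global_id(0)
   is passed in explicitly here. */
static void dp_square_item(size_t id, const float *a, float *result)
{
    result[id] = a[id] * a[id];
}

/* Hypothetical stand-in for the runtime: runs n work-items one after
   another. The kernel body contains no loop; the launch does. */
void emulate_ndrange(size_t n, const float *a, float *result)
{
    assert(a != NULL && result != NULL);
    for (size_t id = 0; id < n; id++)
        dp_square_item(id, a, result);
}
```

Calling `square(4, a, r1)` and `emulate_ndrange(4, a, r2)` on the same input fills `r1` and `r2` with identical values. Because no work-item reads another work-item's output, the iterations are independent, which is what lets a GPU run them concurrently.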