GPU Computing: Past, Present and Future
with ATI Stream Technology



Michael Monkang Chu
Product Manager, ATI Stream Computing Software
michael.chu@amd.com


March 9, 2010
Harnessing the Computational Power of GPUs

• GPU architecture increasingly
  emphasizes programmable shaders
  instead of fixed function logic
• Enormous computational capability
  for data parallel workloads
• New math for datacenters: enables
  high performance/watt and
  performance/$

[Diagram: Software Applications, divided into Serial and Task Parallel Workloads, Data Parallel Workloads, and Graphics Workloads]




        2 | GPU Computing – Past, Present and Future with ATI Stream Technology
ATI Stream Technology is…
Heterogeneous: Developers leverage AMD GPUs and CPUs for
optimal application performance and user experience

Industry Standards: OpenCL™ and DirectCompute 11 enable
cross-platform development

High performance: Massively parallel, programmable GPU
architecture enables superior performance and power efficiency




[Icons: Gaming, Digital Content Creation, Productivity, Sciences, Government, Engineering]



ATI Radeon™ X800 Architecture (2004)




ATI Radeon™ HD 3870 Architecture (2007)




GPU Compute Timeline

2006
• Folding@Home: proof of concept achieving >30x speedup over CPUs
• Stream Computing Development Platform: CTM for data parallel programming

2007
• ATI Stream SDK v0.9: open systems approach to drive broad customer adoption (Brook+ & CAL/IL)
• AMD FireStream™ 9170 GPU Compute Accelerator: first GPU Stream processor with double-precision floating point

2008
• ATI Stream SDK v1.0: enhancements to improve computation performance
• AMD FireStream™ 9250 GPU Compute Accelerator: breaks the 1 TFLOPS barrier; up to 8 GFLOPS/watt

2009
• ATI Stream SDK v2.0: first production version of OpenCL™ for both x86 CPUs and AMD GPUs
• ATI Radeon™ HD 5870 GPU: 2.72 TFLOPS single precision; 544 GFLOPS double precision
Enhancing GPUs for Computation
              ATI Radeon™ HD 4870 Architecture (2008)
• 800 stream processing units
  arranged in 10 SIMD cores
• Up to 1.2 TFLOPS peak single
  precision floating point performance;
  Fast double-precision processing w/
  up to 240 GFLOPS
• 115 GB/sec GDDR5 memory
  interface
• Up to 480 GB/s L1 & 384 GB/s L2
  cache bandwidth
• Data sharing between threads
• Improved scatter/gather operations
  for improved GPGPU memory
  performance
• Integer bit shift operations
  for all units – useful for crypto,
  compression, video processing
• More aggressive clock gating for
  improved performance per watt
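The peak figures above follow from simple arithmetic. The sketch below shows one way to reproduce them. The 750 MHz engine clock, the 256-bit memory bus and the 3.6 Gbps per-pin data rate are assumptions taken from the card's published specifications, not from this slide, and the function names are illustrative only.

```c
#include <assert.h>

/* Peak GFLOPS = units * FLOPs per unit per clock * engine clock in GHz.
   A multiply-add counts as 2 FLOPs. */
double peak_gflops(int units, int flops_per_clock, double clock_ghz)
{
    assert(units > 0 && flops_per_clock > 0 && clock_ghz > 0.0);
    return units * flops_per_clock * clock_ghz;
}

/* Peak GB/s = bus width in bits * data rate per pin in Gbps / 8 bits per byte. */
double peak_bandwidth_gbs(int bus_bits, double gbps_per_pin)
{
    assert(bus_bits > 0 && gbps_per_pin > 0.0);
    return bus_bits * gbps_per_pin / 8.0;
}
```

With the assumed values, peak_gflops(800, 2, 0.75) gives 1200 GFLOPS and peak_bandwidth_gbs(256, 3.6) gives about 115.2 GB/s, in line with the 1.2 TFLOPS and 115 GB/sec quoted above.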

ATI Radeon™ HD 4870 Architecture
[Block diagram: GDDR5 Memory Interface, Texture Units, SIMD Cores, UVD & Display Controllers, PCI Express Bus Interface]
GPU Compute Processing Power Trend

[Chart: peak single-precision GigaFLOPS* (0 to 3000) by GPU generation, Sep-05 to Jul-09]

• R520: ATI Radeon™ X1800; ATI FireGL™ V7200, V7300, V7350
• R580(+): ATI Radeon™ X19xx; ATI FireStream™
• R600: ATI Radeon™ HD 2900; ATI FireGL™ V7600, V8600, V8650
• RV670: ATI Radeon™ HD 3800; ATI FireGL™ V7700; AMD FireStream™ 9170
• RV770: ATI Radeon™ HD 4800; ATI FirePro™ V8700; AMD FireStream™ 9250, 9270
• Cypress: ATI Radeon™ HD 5870

Chart annotations, left to right: GPGPU via CTM; Unified Shaders; Double-precision floating point; ATI Stream SDK (CAL+IL/Brook+); 2.5x ALU increase; OpenCL™ 1.0, DirectX® 11, 2.25x Perf., <+18% TDP

* Peak single-precision performance. For RV670, RV770 & Cypress divide by 5 for peak double-precision performance.
The World’s Most Efficient GPU*

[Bar chart: GFLOPS/W and GFLOPS/mm2 across GPU generations, rising from 1.07 GFLOPS/W and 0.42 GFLOPS/mm2 to 14.47 GFLOPS/W and 7.90 GFLOPS/mm2]

     *Based on comparison of consumer client single-GPU configurations as of 12/08/09. ATI Radeon™ HD 5870 provides
     14.47 GFLOPS/W and 7.90 GFLOPS/mm2 vs. NVIDIA GTX 285 at 5.21 GFLOPS/W and 2.26 GFLOPS/mm2.
ATI Radeon™ HD 5870 (“Cypress”) Architecture (2009)


•    2.72 Teraflops Single Precision,
     544 Gigaflops Double Precision
•    Full Hardware Implementation
     of DirectCompute 11 and
     OpenCL™ 1.0
•    IEEE754-2008 Compliance
     Enhancements
•    Additional Compute Features:
      • 32-bit Atomic Operations
      • Flexible 32kB Local Data
        Shares
      • 64kB Global Data Share
      • Global synchronization
      • Append/consume buffers




SIMD Cores
Each core:
   – Includes 80 scalar stream processing units in total + 32KB Local Data
     Share
   – Has its own control logic and runs from a shared set of threads
   – Has dedicated fetch unit w/ 8KB L1 cache
   – Communicates with other SIMD cores via 64KB global data share
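The per-core figures above can be tied back to the headline 2.72 TFLOPS and 544 GFLOPS numbers. The sketch below is one way to do that. It assumes 20 SIMD cores and an 850 MHz engine clock, neither of which is stated on this slide. The five-units-per-thread-processor grouping and the one 64-bit MAD per thread processor per clock come from the Thread Processors slide. All names are illustrative.

```c
#include <assert.h>

enum {
    SIMD_CORES            = 20, /* assumed: not stated on the slide */
    UNITS_PER_SIMD        = 80, /* from the slide */
    UNITS_PER_THREAD_PROC = 5   /* 4 stream cores + 1 special function core */
};

/* Number of thread processors across the whole chip. */
int thread_processors(void)
{
    return SIMD_CORES * UNITS_PER_SIMD / UNITS_PER_THREAD_PROC;
}

/* Single precision: every unit issues one 32-bit MAD (2 FLOPs) per clock. */
double peak_sp_gflops(double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return SIMD_CORES * UNITS_PER_SIMD * 2 * clock_ghz;
}

/* Double precision: each thread processor issues one 64-bit MAD per clock. */
double peak_dp_gflops(double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return thread_processors() * 2 * clock_ghz;
}
```

Under these assumptions there are 320 thread processors, and at 0.85 GHz the two functions give roughly 2720 and 544 GFLOPS. That is the same factor of 5 between single and double precision noted on the trend chart.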




Thread Processors
2.7 TeraFLOPS Single Precision
544 GigaFLOPS Double Precision
      7x more than Nvidia Tesla C1060*
Increased IPC
      More flexible dot products
      Co-issue MUL & dependent ADD in a single clock
      Sum of Absolute Differences (SAD)
           – 12x speed-up with native instruction
           – Used for video encoding, computer vision
           – Exposed via OpenCL extension
      DirectX 11 bit-level ops
           – Bit count, insert, extract, etc.
      Fused Multiply-Add
      Improved IEEE-754 FP compliance
           – All rounding modes
           – FMA (Cypress only)
           – Denorms (Cypress only)
           – Flags

Each Thread Processor includes:
      4 Stream Cores + 1 Special Function Stream Core
      Branch Unit
      General Purpose Registers

Stream Cores, per clock:
      4 32-bit FP MAD
      2 64-bit FP MUL or ADD
      1 64-bit FP MAD
      4 24-bit Int MUL or ADD

Special Function Stream Core, per clock:
      Special functions
      1 32-bit FP MAD

* Based on published figure of 78 GigaFLOPS
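For readers unfamiliar with SAD, the sketch below is a scalar reference for what a sum of absolute differences computes over two blocks of 8-bit pixels. It is not the OpenCL extension's API, and the function name sad_u8 is hypothetical. A native instruction would do the same work on several packed bytes at once.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences over two blocks of n 8-bit pixels. */
uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        /* Widen to int so the subtraction cannot wrap around. */
        int d = (int)a[i] - (int)b[i];
        sum += (uint32_t)abs(d);
    }
    return sum;
}
```

In block-matching video encoders, a lower SAD between a candidate block and a reference block indicates a closer match, which is why the operation sits on the critical path of motion estimation.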

Memory Hierarchy

Optimized memory controller area
EDC (Error Detection Code)
       CRC Checks on Data Transfers for
        Improved Reliability at High Clock Speeds
GDDR5 Memory Clock temperature
compensation
       Enables Speeds Approaching 5 Gbps
Fast GDDR5 Link Retraining
       Allows Voltage & Clock Switching on the
        Fly without Glitches
Increased texture bandwidth
       Up to 68 billion bilinear filtered texels/sec
       Up to 272 billion 32-bit fetches/sec
Increased cache bandwidth
       Up to 1 TB/sec L1 texture fetch bandwidth
       Up to 435 GB/sec between L1 & L2
Doubled L2 cache
       128kB per memory controller
New DirectX 11 texture features
       16k x 16k max resolution
       New 32-bit and 64-bit HDR
        block compression modes (BC6/7)



OpenCL™: Game-Changing Development
Enabling Broad Adoption of GP-GPU Capabilities


    Industry standard API: Open, multiplatform development
     platform for heterogeneous architectures
    The power of Fusion: Leverages CPUs and GPUs for
     balanced system approach
    Broad industry support: Created by architects from AMD,
     Apple, IBM, Intel, Nvidia, Sony, etc.
    Fast track development: Ratified in December 2008; AMD is
     the first company to provide a complete OpenCL solution
    Momentum: Enormous interest from mainstream
     developers and application ISVs


                   More stream-enabled applications across
                     all markets



ATI Stream SDK v2.01:
OpenCL™ For Multicore x86 CPUs and GPUs


   The Power of Fusion: Developers leverage heterogeneous
      architecture to enable superior user experience
   • First complete OpenCL™ development platform
   • Certified OpenCL™ 1.0 compliant by the Khronos Group [1]
   • Write code that can scale well on multi-core CPUs and GPUs
   • AMD delivers on the promise of support for OpenCL™, with both
     high-performance CPU and GPU technologies
   • Available for download now – includes documentation, samples,
     and developer support

[Diagram: Software Applications, divided into Serial and Task Parallel Workloads, Data Parallel Workloads, and Graphics Workloads]

Product Page: http://developer.amd.com/stream

[1] Conformance logs submitted for the ATI Radeon™ HD 5800 series GPUs, ATI Radeon™ HD 5700 series GPUs, ATI Radeon™ HD 4800 series GPUs, ATI FirePro™ V8700 series GPUs, AMD FireStream™ 9200 series GPUs, ATI Mobility Radeon™ HD 4800 series GPUs and x86 CPUs with SSE3.
Anatomy of OpenCL™


                            Language Specification
 • C-based cross-platform programming interface
 • Subset of ISO C99 with language extensions - familiar to developers
 • Well-defined numerical accuracy - IEEE 754 rounding behavior with defined maximum error
 • Online or offline compilation and build of compute kernel executables
 • Includes a rich set of built-in functions



                                Platform Layer API

 • A hardware abstraction layer over diverse computational resources
 • Query, select and initialize compute devices
 • Create compute contexts and work-queues



                                    Runtime API
 • Execute compute kernels
 • Manage scheduling, compute, and memory resources


OpenCL™ Programming Model
Execution Model

•   Compute kernel is basic unit of execution
•   Execution can occur in-order or out-of-order
•   Kernel can be data-parallel (GPU) or
    task-parallel (CPU)
•   N-dimensional execution domain for kernels
•   Ability to group work-items into work-groups
    for sync/comm


Memory Model

•   Multi-level memory model:
    private, local, constant and global
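The relationship between the execution domain, work-groups and work-items can be sketched in plain C. The example below assumes a 1-D domain whose global size is a multiple of the work-group size. The struct and function names are hypothetical. Inside a kernel the same values come from get_global_id, get_group_id and get_local_id.

```c
#include <assert.h>
#include <stddef.h>

/* Position of one work-item: which work-group it is in, and where. */
typedef struct {
    size_t group_id;
    size_t local_id;
} work_item_pos;

/* Split a 1-D global ID into (work-group ID, local ID). */
work_item_pos split_global_id(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    work_item_pos p = { global_id / local_size, global_id % local_size };
    return p;
}

/* Inverse mapping: global ID = group ID * work-group size + local ID. */
size_t join_global_id(work_item_pos p, size_t local_size)
{
    return p.group_id * local_size + p.local_id;
}

/* Number of work-groups needed to cover the domain. */
size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For example, with a work-group size of 64, global ID 70 is work-item 6 of work-group 1. Work-items in the same group can synchronize and share local memory. Work-items in different groups cannot.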




OpenCL™ View



[Diagram: GPU block diagram annotated with the OpenCL™ view: Constants, Workgroups, Local memory, Image cache, L2 cache, Global memory]
OpenCL™ Memory Space on AMD GPU

AMD GPU hardware             OpenCL™ term
Registers/LDS                Private Memory (one per work-item)
Thread Processor Unit        Work Item (1 to M per compute unit)
SIMD                         Compute Unit (1 to N per compute device)
Local Data Share             Local Memory (one per compute unit)
Board Mem/Constant Cache     Global / Constant Memory Data Cache
Board Memory                 Global Memory (compute device memory)
Example Walk Through – Kernel
__kernel void vec_add (__global const float *a,
                       __global const float *b,
                       __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
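One way to read the kernel is as the body of a loop that the runtime spreads across work-items. The sketch below is a serial host-side stand-in, useful for checking device results. It is an illustration and not part of the original walk-through, and the function names are hypothetical.

```c
#include <stddef.h>

/* Body of one work-item; gid plays the role of get_global_id(0). */
static void vec_add_work_item(size_t gid, const float *a,
                              const float *b, float *c)
{
    c[gid] = a[gid] + b[gid];
}

/* Serial stand-in for a 1-D NDRange of n work-items. */
void vec_add_reference(const float *a, const float *b, float *c, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid)
        vec_add_work_item(gid, a, b, c);
}
```

On a device the iterations run in no guaranteed order, which is safe here because each work-item writes only its own element of c.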




Example Walk Through – Host Code (Init)
// create the OpenCL context on a GPU device
context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
                                  NULL, NULL, NULL);

// get the list of GPU devices associated with context
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

// create a command-queue
cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY |
                                      CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY |
                                      CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcB, NULL);
memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                            sizeof(cl_float)*n, NULL, NULL);

Example Walk Through – Host Code (Compile)

// create the program
program = clCreateProgramWithSource(context, 1, &program_source,
                                    NULL, NULL);

// build the program
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// create the kernel
kernel = clCreateKernel(program, "vec_add", NULL);




Example Walk Through – Host Code (Run)
// set the args values
// clSetKernelArg takes the argument size before the pointer to its value
err = clSetKernelArg(kernel, 0, sizeof(cl_mem),
                     (void *)&memobjs[0]);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem),
                      (void *)&memobjs[1]);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem),
                      (void *)&memobjs[2]);

// set work-item dimensions
global_work_size[0] = n;

// execute kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                              global_work_size,
                              NULL, 0, NULL, NULL);

// read output array
// blocking read (CL_TRUE) on the command-queue, not the context
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                           n*sizeof(cl_float),
                           dst, 0, NULL, NULL);


OpenCL™ Development Directions

 Khronos Group is working to evolve specification to
support future architectural models and features
    Moving compute-oriented optional features into the
     core specification
       – Double Precision, Atomics
    Developing extensions to support specific
     applications
       – Video, Physics, etc.
    Improving cross-platform interoperability
    Tightening mathematical precisions
    Developing more advanced scheduling models


OpenCL™ Backend for HMPP                                              (Scheduled to be released)




 • A compiler integrating OpenCL™ stream generator
    o Build portable CPU and GPU hardware specific computations
 • C & Fortran programming directives
    o High level programming interface for scientific applications
 • Runtime library
    o Ease application deployment on multi-GPU systems

[Diagram: HMPP compilation flow]
C/Fortran HMPP annotated source code -> HMPP Compiler (HMPP Preprocessor, HMPP OpenCL™ Generator)
CPU path: CPU Source -> C/Fortran Std. Compiler -> HMPP Runtime -> CPU
GPU path: OpenCL™ Source -> AMD OpenCL™ Compiler -> AMD OpenCL™ Driver -> ATI Stream Hardware

www.caps-entreprise.com
Comparing OpenCL™ and DirectX® 11 DirectCompute


How will developers choose between OpenCL™ and DirectX® 11
DirectCompute?
    Feature set is similar in both APIs
DirectX® 11 DirectCompute
    Easiest path to add compute capabilities to existing
     DirectX® applications
    Windows Vista® and Windows® 7 only
OpenCL™
    Ideal path for new applications porting to the GPU for the
     first time
    True multiplatform: Windows®, Linux®, MacOS
    Natural programming without dealing with a graphics API

ATI Stream Technology
Enabled Multimedia Applications



• MediaShow 5, MediaShow Espresso
• PowerDirector 8, PowerDirector 7
• SimHD™ Plug-in for TotalMedia Theatre
• Roxio Creator™ 2010, Roxio Creator™ 2010 Pro
GPU Acceleration in Technical Applications

Tomographic Reconstruction: Alain Bonissent, Centre de Physique des Particules de Marseille
  • Reporting 42-60x* speedups
  • This image: 7 minutes in optimized C++; 10 seconds in Brook+

EDA Simulation: ACCIT
  • Currently beta testing applications and reporting >10x speedup**

Seismic Processing: Brown Deer
  • Achieving 120x speedup vs CPU on a 3D 2nd-order finite-difference time-domain (FDTD) seismic processing algorithm

Options Trading: Scotia Capital
  • Reported a 28x speedup over a quad-core CPU

Neural Networks: Neurala
  • Developing Neurala Technology Platform for advanced brain-based machine learning applications
  • Reports achieving 10-200x speedups on biologically inspired neural models
High Performance With ATI Stream Technology:
Cryptography through Distributed.net

  Distributed.net provides a distributed model allowing users to donate compute
     cycles to large compute-intensive projects
  RC5-72 – Cryptography algorithm that searches for encryption keys


[Bar chart: RC5-72 throughput in MKeys/sec (axis 0 to 1400) for ATI Radeon HD 5870, ATI Radeon HD 4870, ATI Radeon HD 4850, Nvidia GeForce GTX 285, Nvidia GeForce GTX 280 and Nvidia GeForce GTX 260]




      *Based on AMD internal testing using RC5-72 clients as of 9/04/09. Results shown in MKeys evaluated per second. Configuration: AMD Phenom™ X4 9950 Black
      Edition processor, 8GB DDR2 RAM, Windows Vista® 32-bit. AMD drivers: ATI Catalyst™ 9.8 (ATI Radeon™ HD 48xx), prerelease driver (ATI Radeon HD 5870).
      Nvidia driver: GeForce 190.62. AMD client: [x86/Stream], v2.9106.513 (beta8). Nvidia client: [x86/CUDA-2.2], v2.9105.512 (beta8).




Application Acceleration

[Chart: ACML GPU Accelerated DGEMM Performance. Y axis: GFLOPS (double-precision), 0 to 240. X axis: Matrix Dimension (square matrices), 0 to 8000. Series: FireStream 9270 (measured), Quad-Core CPU (peak theoretical), Tesla C1060 (peak theoretical)]



                         • AMD FireStream™ 9270 on AMD Phenom™ X4 9950/790FX/4GB DDR2 running RHEL 5.1 x86_64
                         • AMD FireStream measured performance includes transfer of operand and result matrices
                         • Quad-Core peak theoretical performance quoted for 3.2GHz Nehalem processor
                         • C1060 peak theoretical performance derived from published specifications
                         • ACML-GPU library freely available from: http://developer.amd.com/gpu/acmlgpu
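For context on how a DGEMM GFLOPS figure is usually derived, the sketch below uses the standard count of 2N^3 floating-point operations for an N x N matrix multiply. It also estimates the data moved when operand and result matrices are transferred, assuming two operand matrices in and one result matrix out. The function names are hypothetical, and no measured timing from the chart is reproduced here.

```c
#include <assert.h>

/* GFLOPS for an N x N DGEMM that completes in `seconds`.
   N^3 multiply-add pairs = 2 * N^3 FLOPs. */
double dgemm_gflops(double n, double seconds)
{
    assert(n > 0.0 && seconds > 0.0);
    return 2.0 * n * n * n / seconds / 1e9;
}

/* Bytes transferred for two operand matrices and one result matrix
   of 8-byte doubles (assumed transfer pattern). */
double dgemm_transfer_bytes(double n)
{
    assert(n > 0.0);
    return 3.0 * 8.0 * n * n;
}
```

Compute grows as N^3 while transfers grow as N^2, so a measurement that includes transfers tends to improve as the matrix dimension increases.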



NUDT’s Tianhe-1




         1.206 Pflops peak - 563.1 Tflops LINPACK
     6,144 Intel CPUs - 5,120 ATI RV770 GPUs


Hybrid Parallel Gas Dynamics in OpenCLTM

LANL, IBM, AMD, NVIDIA Booths at SC09

•   SC09 demos on
     •   x86 CPU – Opteron
     •   x86 CPU – Xeon
     •   GPU – NVIDIA
     •   GPU – AMD
     •   Power6
     •   PowerXCell




  Common Source!!


A New Era of Processor Performance

Single-Core Era
•   Constrained by: Power, Complexity
•   [Chart: Single-thread Performance vs. Time; markers "we are here" and "?"]

Multi-Core Era
•   Constrained by: Power, Parallel SW availability, Scalability
•   [Chart: Throughput Performance vs. Time (# of processors); marker "we are here"]

Heterogeneous Systems Era
•   Enabled by: Abundant data parallelism, Power efficient GPUs
•   Constrained by: Programming models
•   [Chart: Targeted Application Performance vs. Time (Data-parallel exploitation); marker "we are here"]




A New Era of Processor Performance


[Diagram; axis labels: Programmability, Throughput Performance]

•   CPU, Microprocessor Advancement: Single-Core Era, Multi-Core Era, Heterogeneous Systems Era
•   GPU, GPU Advancement: Graphics driver-based programs, OpenCL™/DirectX® driver-based programs, System-level programmable
•   Regions: Homogeneous Computing, Heterogeneous Computing




AMD Fusion™ APUs Fill the Need




x86 CPU owns the Software World
•   Windows®, MacOS and Linux® franchises
•   Thousands of apps
•   Established programming and memory model
•   Mature tool chain
•   Extensive backward compatibility for applications and OSs
•   High barrier to entry

GPU Optimized for Modern Workloads
•   Enormous parallel computing capacity
•   Outstanding performance-per-watt-per-dollar
•   Very efficient hardware threading
•   SIMD architecture well matched to modern workloads: video, audio, graphics




Heterogeneous Computing:
Next-Generation Software Ecosystem

Software stack (top to bottom):
•   End-user Applications
•   High Level Frameworks
•   Middleware/Libraries: Video, Imaging, Math/Sciences, Physics
•   OpenCL™ & DirectX® Compute
•   Hardware & Drivers: AMD Fusion™, Discrete CPUs/GPUs

Alongside the stack:
•   Tools: HLL compilers, Debuggers, Profilers
•   Advanced Optimizations & Load Balancing: Load balance across CPUs and GPUs; leverage AMD Fusion™ performance capabilities
•   Ease of application development
•   Drive new features into industry standards




ONLY AMD!




Backup Slides




Training and Related Resources

•   Training Resources
     •   Introductory Tutorial to OpenCL™
     •   AMD Developer Inside Track: Introduction to OpenCL™
     •   ATI Stream OpenCL™ Technical Overview Video Series
     •   Porting CUDA to OpenCL™
     •   Image Convolution Using OpenCL™ - A Step-by-Step Tutorial
     •   OpenCL™ Tutorial: N-Body Simulation
•   Related Resources
     •   OpenCL™: The Open Standard for Parallel Programming of GPUs and Multi-core
         CPUs
     •   The Khronos™ Group – OpenCL™ Overview Page
     •   ATI Stream Profiler Product Page
     •   ACML-GPU Product Page
     •   ATI Stream Power Toys Product Page
     •   ATI Stream Developer Articles & Publications
     •   ATI Stream Developer Showcase
     •   ATI Stream Developer Training Resources
     •   KB75 - Tips and suggestions for running SiSoftware Sandra 2010 OpenCL™ GPGPU
         benchmarks
•   ATI Stream SDK v2.01 Documentation



OpenCL™ vs. CUDA

Feature                     OpenCL™                               CUDA
Compilation Methods         Online + Offline                      Offline Only
Mathematical Precision      Well Defined                          Undefined
Math Libraries              Defined Standard                      Proprietary
CPU Support                 OpenCL™ CPU Device                    No CPU Support
Native Host Task Support    Task Parallel Compute Model w/        No Native Thread Support
                            Ability To Enqueue Native Threads
Extension Mechanism         Defined Mechanism                     Proprietary
Vendor Support              Industry-Wide Support                 NVIDIA Only
                            (AMD, Apple, etc.)
C Language Support          Yes                                   Yes


Disclaimer & Attribution
 DISCLAIMER
 The information presented in this document is for informational purposes only and may contain technical
 inaccuracies, omissions and typographical errors.

 The information contained herein is subject to change and may be rendered inaccurate for many reasons,
 including but not limited to product and roadmap changes, component and motherboard version changes,
 new model and/or product releases, product differences between differing manufacturers, software changes,
 BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or
 revise this information. However, AMD reserves the right to revise this information and to make changes
 from time to time to the content hereof without obligation of AMD to notify any person of such revisions or
 changes.

 AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND
 ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN
 THIS INFORMATION.

 AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY
 PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT,
 SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED
 HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

 ATTRIBUTION
 © 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI Logo,
 Phenom, Fusion, Radeon, FirePro, FireStream and combinations thereof are trademarks of Advanced Micro
 Devices, Inc. Microsoft, Windows, Windows Vista, and DirectX are registered trademarks of Microsoft
 Corporation in the United States and/or other jurisdictions. Other names are for informational purposes only
 and may be trademarks of their respective owners.

 OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.





Gpu Compute

  • 1. GPU Computing: Past, Present and Future with ATI Stream Technology Michael Monkang Chu Product Manager, ATI Stream Computing Software michael.chu@amd.com March 9, 2010
  • 2. Harnessing the Computational Power of GPUs • GPU architecture increasingly emphasizes programmable shaders instead of fixed function logic • Enormous computational capability for data parallel workloads • New math for datacenters: enables high performance/watt and Software performance/$ Applications Graphics Workloads Serial and Task Parallel Data Parallel Workloads Workloads 2 | GPU Computing – Past, Present and Future with ATI Stream Technology
  • 3. ATI Stream Technology is… Heterogeneous: Developers leverage AMD GPUs and CPUs for optimal application performance and user experience Industry Standards: OpenCL™ and DirectCompute 11 enable cross-platform development High performance: Massively parallel, programmable GPU architecture enables superior performance and power efficiency Gaming Digital Content Productivity Creation Sciences Government Engineering 3 | ATI Stream ComputingPast, Present and Future with ATI Stream Technology 3 | GPU Computing – Update | Confidential
  • 4. ATI Radeon™ X800 Architecture (2004) 4 | GPU Computing – Past, Present and Future with ATI Stream Technology
  • 5. ATI Radeon™ HD 3870 Architecture (2007) 5 | GPU Computing – Past, Present and Future with ATI Stream Technology
  • 6. GPU Compute Timeline ATI Stream SDK v0.9 ATI Stream SDK v1.0 ATI Stream SDK v2.0 Folding @Home Open systems approach Enhancements to improve First production version of Proof of concept achieving OpenCL™ for both x86 to drive broad customer computation performance >30x speedup over CPUs CPUs and AMD GPUs adoption (Brook+ & CAL/IL) Stream Computing AMD FireStream™ 9250 ATI Radeon™ HD 5870 GPU AMD FireStream™ 9170 Development Platform GPU Compute Accelerator GPU Compute Accelerator 2.72 TFLOPS - SP CTM for data parallel First GPU Stream processor with Breaks the 1 TFLOPS barrier 544 GFLOPS - DP programming double-precision Up to 8 GFOPS/watt floating point 2006 2007 2008 2009 6 | ATI Stream ComputingPast, Present and Future with ATI Stream Technology 6 | GPU Computing – Update | Confidential
  • 7. Enhancing GPUs for Computation: ATI Radeon™ HD 4870 Architecture (2008) • 800 stream processing units arranged in 10 SIMD cores • Up to 1.2 TFLOPS peak single-precision floating-point performance; fast double-precision processing with up to 240 GFLOPS • 115 GB/sec GDDR5 memory interface • Up to 480 GB/s L1 & 384 GB/s L2 cache bandwidth • Data sharing between threads • Improved scatter/gather operations for better GPGPU memory performance • Integer bit shift operations for all units, useful for crypto, compression, video processing • More aggressive clock gating for improved performance per watt
  • 8. ATI Radeon™ HD 4870 Architecture [Block diagram labels: GDDR5 Memory Interface, Texture Units, SIMD Cores, UVD & Display Controllers, PCI Express Bus Interface]
  • 9. GPU Compute Processing Power Trend [Chart: peak single-precision GigaFLOPS* (axis 0 to 3000) over time, 2005 to 2009, rising through R520 (ATI Radeon™ X1800), R580(+) (ATI Radeon™ X19xx), R600 (ATI Radeon™ HD 2900), RV670 (ATI Radeon™ HD 3800), RV770 (ATI Radeon™ HD 4800) and Cypress (ATI Radeon™ HD 5870), with the matching ATI FireGL™, ATI FirePro™ and AMD FireStream™ products; annotations: GPGPU via CTM, unified shaders, double-precision floating point, ATI Stream SDK CAL+IL/Brook+, 2.5x ALU increase, OpenCL™ 1.0, DirectX® 11, 2.25x perf. at <+18% TDP] * Peak single-precision performance; for RV670, RV770 & Cypress divide by 5 for peak double-precision performance
  • 10. The World’s Most Efficient GPU* [Bar chart: GFLOPS/W and GFLOPS/mm² by GPU, peaking at 14.47 GFLOPS/W and 7.90 GFLOPS/mm² for the ATI Radeon™ HD 5870] *Based on comparison of consumer client single-GPU configurations as of 12/08/09. ATI Radeon™ HD 5870 provides 14.47 GFLOPS/W and 7.90 GFLOPS/mm² vs. NVIDIA GTX 285 at 5.21 GFLOPS/W and 2.26 GFLOPS/mm².
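The GFLOPS/W figures in the footnote can be reproduced by dividing peak single-precision throughput by board power. The sketch below is only an illustration: the function name is hypothetical, and the board-power values (188 W for the HD 5870, 204 W for the GTX 285) and the GTX 285 peak of 1062.72 GFLOPS are commonly quoted public specifications assumed here; the slide does not state them.

```c
/* Hypothetical helper: peak GFLOPS delivered per watt of board power. */
double gflops_per_watt(double peak_gflops, double board_power_w)
{
    return peak_gflops / board_power_w;
}

/* Assumed inputs (not stated on the slide):
 *   ATI Radeon HD 5870: 2720.00 GFLOPS peak SP, 188 W board power
 *   NVIDIA GTX 285:     1062.72 GFLOPS peak SP, 204 W board power */
```

With those assumed inputs, `gflops_per_watt(2720.0, 188.0)` is about 14.47 and `gflops_per_watt(1062.72, 204.0)` is about 5.21, matching the slide's footnote.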
  • 11. ATI Radeon™ HD 5870 (“Cypress”) Architecture (2009): 2.72 TeraFLOPS single precision, 544 GigaFLOPS double precision • Full hardware implementation of DirectCompute 11 and OpenCL™ 1.0 • IEEE 754-2008 compliance enhancements • Additional compute features: 32-bit atomic operations; flexible 32kB Local Data Shares; 64kB Global Data Share; global synchronization; append/consume buffers
  • 12. SIMD Cores Each core: – Includes 80 scalar stream processing units in total + 32KB Local Data Share – Has its own control logic and runs from a shared set of threads – Has dedicated fetch unit w/ 8KB L1 cache – Communicates with other SIMD cores via 64KB global data share
  • 13. Thread Processors • 2.7 TeraFLOPS single precision; 544 GigaFLOPS double precision, 7x more than Nvidia Tesla C1060* • Increased IPC: more flexible dot products; co-issue MUL & dependent ADD in a single clock • Sum of Absolute Differences (SAD): 12x speed-up with native instruction; used for video encoding, computer vision; exposed via OpenCL extension • DirectX 11 bit-level ops: bit count, insert, extract, etc. • Fused Multiply-Add • Improved IEEE-754 FP compliance: all rounding modes; FMA (Cypress only); denorms (Cypress only); flags • Each Thread Processor includes: 4 Stream Cores + 1 Special Function Stream Core; Branch Unit; General Purpose Registers • Stream Cores, per clock: 4 32-bit FP MAD; 2 64-bit FP MUL or ADD; 1 64-bit FP MAD; 4 24-bit Int MUL or ADD • Special functions: 1 32-bit FP MAD per clock * Based on published figure of 78 GigaFLOPS
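The peak figures quoted on this slide follow from the per-clock issue rates listed above. A minimal sketch of the arithmetic, with a hypothetical function name: each thread processor issues 5 single-precision MADs per clock (4 stream cores plus the special-function core) or 1 double-precision MAD, and a MAD counts as 2 FLOPs. The thread-processor count (20 SIMD cores x 16 thread processors = 320) and the 850 MHz engine clock are public Cypress specifications assumed here; the slides state only 80 stream processing units per SIMD core.

```c
/* Hypothetical helper: peak GFLOPS from unit count, issue rate and clock.
 * A multiply-add (MAD) is counted as two floating-point operations. */
double peak_gflops(unsigned thread_processors,
                   unsigned mads_per_clock,
                   double clock_ghz)
{
    const double flops_per_mad = 2.0; /* one multiply + one add */
    return thread_processors * mads_per_clock * flops_per_mad * clock_ghz;
}

/* Assumed Cypress configuration (not stated on the slide):
 *   320 thread processors, 0.85 GHz engine clock
 *   single precision: 5 MADs per thread processor per clock
 *   double precision: 1 MAD  per thread processor per clock */
```

Under those assumptions, `peak_gflops(320, 5, 0.85)` gives about 2720 GFLOPS and `peak_gflops(320, 1, 0.85)` gives about 544 GFLOPS. The double-precision result is one fifth of the single-precision one, consistent with the "divide by 5" note on the processing power trend slide, and 544 / 78 is roughly the 7x advantage claimed over the Tesla C1060.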
  • 14. Memory Hierarchy • Optimized memory controller area • EDC (Error Detection Code): CRC checks on data transfers for improved reliability at high clock speeds • GDDR5 memory clock temperature compensation: enables speeds approaching 5 Gbps • Fast GDDR5 link retraining: allows voltage & clock switching on the fly without glitches • Increased texture bandwidth: up to 68 billion bilinear filtered texels/sec; up to 272 billion 32-bit fetches/sec • Increased cache bandwidth: up to 1 TB/sec L1 texture fetch bandwidth; up to 435 GB/sec between L1 & L2 • Doubled L2 cache: 128kB per memory controller • New DirectX 11 texture features: 16k x 16k max resolution; new 32-bit and 64-bit HDR block compression modes (BC6/7)
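The texture and L1 bandwidth figures on this slide are mutually consistent, which a short calculation makes visible. The sketch below uses hypothetical function names and assumes 80 texture units at 850 MHz, each delivering one bilinear-filtered texel or four unfiltered 32-bit fetches per clock; those inputs are public Cypress specifications, not values stated on the slide.

```c
/* Hypothetical helper: fetch rate in billions of fetches per second. */
double giga_fetches_per_sec(unsigned texture_units,
                            unsigned fetches_per_unit_per_clock,
                            double clock_ghz)
{
    return texture_units * fetches_per_unit_per_clock * clock_ghz;
}

/* Hypothetical helper: convert a fetch rate into bandwidth in GB/s. */
double fetch_bandwidth_gb_per_sec(double giga_fetches, unsigned bytes_per_fetch)
{
    return giga_fetches * bytes_per_fetch;
}
```

With the assumed inputs, `giga_fetches_per_sec(80, 1, 0.85)` is about 68 (the bilinear texel rate), `giga_fetches_per_sec(80, 4, 0.85)` is about 272 (the 32-bit fetch rate), and 272 billion 4-byte fetches per second is about 1088 GB/s, in line with the "up to 1 TB/sec" L1 texture fetch bandwidth.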
  • 15. OpenCL™: Game-Changing Development Enabling Broad Adoption of GP-GPU Capabilities • Industry standard API: Open, multiplatform development platform for heterogeneous architectures • The power of Fusion: Leverages CPUs and GPUs for balanced system approach • Broad industry support: Created by architects from AMD, Apple, IBM, Intel, Nvidia, Sony, etc. • Fast track development: Ratified in December 2008; AMD is the first company to provide a complete OpenCL solution • Momentum: Enormous interest from mainstream developers and application ISVs • More stream-enabled applications across all markets
  • 16. ATI Stream SDK v2.01: OpenCL™ For Multicore x86 CPUs and GPUs • The Power of Fusion: Developers leverage heterogeneous architecture to enable superior user experience • First complete OpenCL™ development platform • Certified OpenCL™ 1.0 compliant by the Khronos Group¹ • Write code that can scale well on multi-core CPUs and GPUs • AMD delivers on the promise of support for OpenCL™, with both high-performance CPU and GPU technologies • Available for download now; includes documentation, samples, and developer support • Product Page: http://developer.amd.com/stream [Diagram: Software Applications spanning Serial and Task Parallel Workloads, Data Parallel Workloads, and Graphics Workloads] ¹ Conformance logs submitted for the ATI Radeon™ HD 5800 series GPUs, ATI Radeon™ HD 5700 series GPUs, ATI Radeon™ HD 4800 series GPUs, ATI FirePro™ V8700 series GPUs, AMD FireStream™ 9200 series GPUs, ATI Mobility Radeon™ HD 4800 series GPUs and x86 CPUs with SSE3.
  • 17. Anatomy of OpenCL™ • Language Specification: C-based cross-platform programming interface; subset of ISO C99 with language extensions, familiar to developers; well-defined numerical accuracy (IEEE 754 rounding behavior with defined maximum error); online or offline compilation and build of compute kernel executables; includes a rich set of built-in functions • Platform Layer API: a hardware abstraction layer over diverse computational resources; query, select and initialize compute devices; create compute contexts and work-queues • Runtime API: execute compute kernels; manage scheduling, compute, and memory resources
  • 18. OpenCL™ Programming Model • Execution Model: compute kernel is the basic unit of execution; execution can occur in-order or out-of-order; kernel can be data-parallel (GPU) or task-parallel (CPU); N-dimensional execution domain for kernels; ability to group work-items into work-groups for sync/comm • Memory Model: multi-level memory model with private, local, constant and global memory
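The grouping of work-items into work-groups amounts to a simple index decomposition. The sketch below, in plain C with hypothetical names, shows it for a one-dimensional execution domain with no global offset: a work-item's global ID equals its work-group ID times the work-group size plus its local ID. It only illustrates the relationship and is not OpenCL runtime code.

```c
#include <stddef.h>

/* Hypothetical type: position of a work-item inside the execution domain. */
typedef struct {
    size_t group_id; /* which work-group the work-item belongs to */
    size_t local_id; /* index of the work-item within that work-group */
} work_item_pos;

/* Split a 1-D global ID into (work-group, local) coordinates. */
work_item_pos split_global_id(size_t global_id, size_t local_size)
{
    work_item_pos pos;
    pos.group_id = global_id / local_size;
    pos.local_id = global_id % local_size;
    return pos;
}

/* Recombine: global_id = group_id * local_size + local_id. */
size_t join_global_id(work_item_pos pos, size_t local_size)
{
    return pos.group_id * local_size + pos.local_id;
}
```

For example, with a work-group size of 64, global ID 70 falls in work-group 1 at local index 6. Only work-items that share a work-group can synchronize with each other and share local memory.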
  • 19. OpenCL™ View [Diagram labels: Workgroups, Constants, Local memory, Image cache, L2 cache, Global memory]
  • 20. OpenCL™ Memory Space on AMD GPU [Diagram mapping OpenCL™ terms to AMD GPU hardware: Work Item → Thread Processor Unit; Private Memory → Registers/LDS; Compute Unit 1…N (each with Work Items 1…M) → SIMD; Local Memory → Local Data Share; Global/Constant Memory Data Cache → Board Mem/Constant Cache; Global Memory in Compute Device Memory → Board Memory]
  • 21. Example Walk Through – Kernel
      __kernel void vec_add(__global const float *a,
                            __global const float *b,
                            __global float *c)
      {
          int gid = get_global_id(0);  // index of this work-item
          c[gid] = a[gid] + b[gid];
      }
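To see what the kernel computes, it can help to emulate it on the host. The following plain-C sketch (hypothetical function names, no OpenCL dependency) treats each loop iteration as one work-item, with the loop counter standing in for `get_global_id(0)`. On a GPU the work-items run in parallel and in no guaranteed order; the sequential loop is only a reference for checking results.

```c
#include <stddef.h>

/* Body of the vec_add kernel for a single work-item.
 * gid stands in for the value returned by get_global_id(0). */
static void vec_add_work_item(size_t gid,
                              const float *a, const float *b, float *c)
{
    c[gid] = a[gid] + b[gid];
}

/* Host-side reference: execute n work-items one after another. */
void vec_add_reference(const float *a, const float *b, float *c, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid) {
        vec_add_work_item(gid, a, b, c);
    }
}
```

A reference like this can be compared against the buffer read back from the device to validate a port.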
  • 22. Example Walk Through – Host Code (Init)
      // create the OpenCL context on a GPU device
      context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);
      // get the list of GPU devices associated with context
      clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
      devices = malloc(cb);
      clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);
      // create a command-queue
      cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);
      // allocate the buffer memory objects
      memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(cl_float)*n, srcA, NULL);
      memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(cl_float)*n, srcB, NULL);
      memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_float)*n, NULL, NULL);
  • 23. Example Walk Through – Host Code (Compile)
      // create the program
      program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);
      // build the program
      err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
      // create the kernel
      kernel = clCreateKernel(program, "vec_add", NULL);
  • 24. Example Walk Through – Host Code (Run)
      // set the kernel arguments (argument size comes before the value pointer)
      err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
      err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
      err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);
      // set work-item dimensions
      global_work_size[0] = n;
      // execute kernel
      err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);
      // read output array (blocking read on the command queue)
      err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0, n*sizeof(cl_float), dst, 0, NULL, NULL);
  • 25. OpenCL™ Development Directions: Khronos Group is working to evolve the specification to support future architectural models and features • Moving compute-oriented optional features into the core specification: double precision, atomics • Developing extensions to support specific applications: video, physics, etc. • Improving cross-platform interoperability • Tightening mathematical precisions • Developing more advanced scheduling models
  • 26. OpenCL™ Backend for HMPP (scheduled to be released) • A compiler integrating an OpenCL™ stream generator: build portable CPU and GPU hardware-specific computations • C & Fortran programming directives: high-level programming interface for scientific applications • Runtime library: ease application deployment on multi-GPU systems [Flow diagram: C/Fortran HMPP-annotated source code → HMPP Compiler (HMPP Preprocessor, HMPP OpenCL™ Generator) → CPU source → C/Fortran standard compiler → HMPP Runtime → CPU; and OpenCL™ source → AMD OpenCL™ Compiler → AMD OpenCL™ Driver → ATI Stream Hardware] www.caps-entreprise.com
  • 27. Comparing OpenCL™ and DirectX® 11 DirectCompute: How will developers choose between OpenCL™ and DirectX® 11 DirectCompute? • Feature set is similar in both APIs • DirectX® 11 DirectCompute: easiest path to add compute capabilities to existing DirectX® applications; Windows Vista® and Windows® 7 only • OpenCL™: ideal path for new applications porting to the GPU for the first time; true multiplatform (Windows®, Linux®, MacOS); natural programming without dealing with a graphics API
  • 28. ATI Stream Technology Enabled Multimedia Applications: MediaShow 5; PowerDirector 8; MediaShow Espresso; PowerDirector 7; SimHD™ Plug-in for TotalMedia Theatre; Roxio Creator™ 2010; Roxio Creator™ 2010 Pro
  • 29. GPU Acceleration in Technical Applications • Tomographic Reconstruction (Alain Bonissent, Centre de Physique des Particules de Marseille): reporting 42-60x* speedups; the image shown took 7 minutes in optimized C++ and 10 seconds in Brook+ • EDA Simulation (ACCIT): currently beta testing applications and reporting >10x speedup** • Seismic Processing (Brown Deer): achieving 120x speedup vs. CPU on a 3D 2nd-order finite-difference time-domain (FDTD) seismic processing algorithm • Options Trading (Scotia Capital): reported a 28x speedup over a quad-core CPU • Neural Networks (Neurala): developing the Neurala Technology Platform for advanced brain-based machine learning applications; reports achieving 10-200x speedups on biologically inspired neural models
  • 30. High Performance With ATI Stream Technology: Cryptography through Distributed.net • Distributed.net provides a distributed model allowing users to donate compute cycles to large compute-intensive projects • RC5-72: cryptography algorithm that searches for encryption keys [Bar chart*: RC5-72 throughput in MKeys/sec (axis 0 to 1400) for ATI Radeon HD 5870, HD 4870 and HD 4850 and Nvidia GeForce GTX 285, GTX 280 and GTX 260] *Based on AMD internal testing using RC5-72 clients as of 9/04/09. Results shown in MKeys evaluated per second. Configuration: AMD Phenom™ X4 9950 Black Edition processor, 8GB DDR2 RAM, Windows Vista® 32-bit. AMD drivers: ATI Catalyst™ 9.8 (ATI Radeon™ HD 48xx), prerelease driver (ATI Radeon HD 5870). Nvidia driver: GeForce 190.62. AMD client: [x86/Stream], v2.9106.513 (beta8). Nvidia client: [x86/CUDA-2.2], v2.9105.512 (beta8).
  • 31. Application Acceleration: ACML GPU Accelerated DGEMM Performance [Line chart: double-precision GFLOPS (axis 0 to 240) vs. matrix dimension (square matrices, 0 to 8000) for FireStream 9270 (measured), Quad-Core CPU (peak theoretical) and Tesla C1060 (peak theoretical)] • AMD FireStream™ 9270 on AMD Phenom™ X4 9950/790FX/4GB DDR2 running RHEL 5.1 x86_64 • AMD FireStream measured performance includes transfer of operand and result matrices • Quad-Core peak theoretical performance quoted for 3.2GHz Nehalem processor • C1060 peak theoretical performance derived from published specifications • ACML-GPU library freely available from: http://developer.amd.com/gpu/acmlgpu
  • 32. NUDT’s Tianhe-1 • 1.206 PFLOPS peak, 563.1 TFLOPS LINPACK • 6,144 Intel CPUs, 5,120 ATI RV770 GPUs
  • 33. Hybrid Parallel Gas Dynamics in OpenCL™ • LANL, IBM, AMD, NVIDIA booths at SC09 • SC09 demos on: x86 CPU (Opteron), x86 CPU (Xeon), GPU (NVIDIA), GPU (AMD), Power6, PowerXCell • Common source!!
  • 34. A New Era of Processor Performance • Single-Core Era: constrained by power and complexity [chart: single-thread performance vs. time, "we are here"] • Multi-Core Era: constrained by power, parallel SW availability and scalability [chart: throughput performance vs. time (# of processors), "we are here"] • Heterogeneous Systems Era: enabled by abundant data parallelism and power-efficient GPUs; constrained by programming models [chart: targeted application performance vs. time (data-parallel exploitation), "we are here", trajectory marked "?"]
  • 35. A New Era of Processor Performance [Diagram: CPU microprocessor advancement (Single-Core Era, Multi-Core Era, Heterogeneous Systems Era) plotted against GPU programmability advancement (graphics driver-based programs, OpenCL™/DirectX® driver-based programs, system-level programmable) and GPU throughput performance; the path moves from Homogeneous Computing to Heterogeneous Computing]
  • 36. AMD Fusion™ APUs Fill the Need • x86 CPU owns the Software World: Windows®, MacOS and Linux® franchises; thousands of apps; established programming and memory model; mature tool chain; extensive backward compatibility for applications and OSs; high barrier to entry • GPU Optimized for Modern Workloads: enormous parallel computing capacity; outstanding performance-per-watt-per-dollar; very efficient hardware threading; SIMD architecture well matched to modern workloads (video, audio, graphics)
  • 37. Heterogeneous Computing: Next-Generation Software Ecosystem [Stack diagram, top to bottom: End-user Applications; High Level Frameworks; Middleware/Libraries (Video, Imaging, Math/Sciences, Physics); OpenCL™ & DirectX® Compute; Hardware & Drivers (AMD Fusion™, Discrete CPUs/GPUs); alongside: Tools (HLL compilers, Debuggers, Profilers)] • Ease of application development • Advanced Optimizations & Load Balancing: load balance across CPUs and GPUs; leverage AMD Fusion™ performance capabilities • Drive new features into industry standards
  • 38. ONLY AMD!
  • 39. Backup Slides
  • 40. Training and Related Resources • Training Resources: Introductory Tutorial to OpenCL™; AMD Developer Inside Track: Introduction to OpenCL™; ATI Stream OpenCL™ Technical Overview Video Series; Porting CUDA to OpenCL™; Image Convolution Using OpenCL™ - A Step-by-Step Tutorial; OpenCL™ Tutorial: N-Body Simulation • Related Resources: OpenCL™: The Open Standard for Parallel Programming of GPUs and Multi-core CPUs; The Khronos™ Group – OpenCL™ Overview Page; ATI Stream Profiler Product Page; ACML-GPU Product Page; ATI Stream Power Toys Product Page; ATI Stream Developer Articles & Publications; ATI Stream Developer Showcase; ATI Stream Developer Training Resources; KB75 - Tips and suggestions for running SiSoftware Sandra 2010 OpenCL™ GPGPU benchmarks; ATI Stream SDK v2.01 Documentation
  • 41. OpenCL™ vs. CUDA
      Feature                  | OpenCL™                                                          | CUDA
      Compilation Methods      | Online + Offline                                                 | Offline Only
      Mathematical Precision   | Well Defined                                                     | Undefined
      Math Libraries           | Defined Standard                                                 | Proprietary
      CPU Support              | OpenCL™ CPU Device                                               | No CPU Support
      Native Host Task Support | Task Parallel Compute Model w/ Ability To Enqueue Native Threads | No Native Thread Support
      Extension Mechanism      | Defined Mechanism                                                | Proprietary
      Vendor Support           | Industry-Wide Support: AMD, Apple, etc.                          | NVIDIA Only
      C Language Support       | Yes                                                              | Yes
  • 42. Disclaimer & Attribution DISCLAIMER The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI Logo, Phenom, Fusion, Radeon, FirePro, FireStream and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, Windows Vista, and DirectX are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. 