Languages, APIs and Development Tools
for GPU Computing
Phillip Miller | Director of Product Management, Workstation Software

SIGGRAPH ASIA | December 16, 2010
Agenda for This Morning’s Overview
 Introduction – GPU Computing

 Options – Languages

 Assistance – Development Tools

 Approach – Application Design Patterns

 Leveraging – Libraries and Engines

 Scaling – Grid and Cluster

 Learning – Developer Resources
“GPGPU or GPU Computing”
 Using all processors in the system
  for the things they are best at doing:

   — Evolution of CPUs makes them good at sequential, serial tasks
   — Evolution of GPUs makes them good at parallel processing




                  [Diagram: CPU vs. GPU]

GPU Computing Ecosystem
[Diagram: the ecosystem around CUDA: Mathematical Packages, Libraries,
Research & Education, Consultants/Training & Certification, an Integrated
Development Environment (Parallel Nsight for MS Visual Studio), Tools &
Partners, and Languages & APIs (C/C++, Fortran, DirectX, OpenCL) on all
major platforms]
CUDA - NVIDIA’s Architecture for GPU Computing

Broad Adoption
 +250M CUDA-enabled GPUs in use
 +650K CUDA Toolkit downloads in last 2 yrs
 +350 universities teaching GPU Computing on the CUDA Architecture
 Cross platform: Linux, Windows, MacOS
 Uses span HPC to consumer

GPU Computing Applications
 CUDA C/C++: +100K developers, in production usage since 2008; SDK + libs + Visual Profiler and debugger
 OpenCL: commercial conformant driver, publicly available for all CUDA-capable GPUs; SDK + Visual Profiler
 DirectCompute: Microsoft API for GPU Computing; supports all CUDA-architecture GPUs (DX10 and DX11)
 Fortran: PGI Accelerator, PGI CUDA Fortran
 Python, Java, .NET, …: PyCUDA, GPU.NET, jCUDA

All on the NVIDIA GPU with the CUDA Parallel Computing Architecture

   OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
GPU Computing Software Stack

                  Your GPU Computing Application


                  Application Acceleration Engines
                      Middleware, Modules & Plug-ins


                         Foundation Libraries
                       Low-level Functional Libraries

                      Development Environment
        Languages, Device APIs, Compilers, Debuggers, Profilers, etc.


                           CUDA Architecture
Options -
Languages & APIs



         © NVIDIA Corporation 2010
Language & APIs for GPU Computing

Approach                                     Examples

Application Level Integration                MATLAB, Mathematica, LabVIEW


Implicit Parallel Languages (high level)     PGI Accelerator, HMPP

Abstraction Layer or API Wrapper             PyCUDA, CUDA.NET, jCUDA

Explicit Language Integration (high level)   CUDA C/C++, PGI CUDA Fortran

Device API (low level)                       CUDA C/C++, DirectCompute, OpenCL
Example: Application Level Integration
 GPU support with MathWorks Parallel Computing Toolbox™
 and Distributed Computing Server™




                            Workstation                        Compute Cluster

MATLAB Parallel Computing Toolbox (PCT)   MATLAB Distributed Computing Server (MDCS)

• PCT enables high performance through    • MDCS allows a MATLAB PCT application to be
  parallel computing on workstations        submitted and run on a compute cluster

• NVIDIA GPU acceleration now available   • NVIDIA GPU acceleration now available
MATLAB Performance on Tesla                                   (previous GPU generation)

Relative Performance, Black-Scholes Demo
Compared to Single Core CPU Baseline

[Chart: relative execution speed (0 to 12x) for Single Core CPU, Quad Core CPU,
Single Core CPU + Tesla C1060, and Quad Core CPU + Tesla C1060, at input sizes
of 256 K, 1,024 K, 4,096 K and 16,384 K]

Core 2 Quad Q6600 2.4 GHz, 6 GB RAM, Windows 7 64-bit, Tesla C1060, single precision operations
Example: Implicit Parallel Languages
                        PGI Accelerator Compilers

      SUBROUTINE SAXPY (A,X,Y,N)
      INTEGER N
      REAL A,X(N),Y(N)
!$ACC REGION
      DO I = 1, N
          X(I) = A*X(I) + Y(I)
      ENDDO
!$ACC END REGION
      END

Compiling produces a host x64 asm file plus auto-generated GPU code for the
accelerator region: the host side calls __pgi_cu_init, __pgi_cu_function,
__pgi_cu_alloc, __pgi_cu_upload, __pgi_cu_call and __pgi_cu_download, while
the GPU side is a CUDA kernel that computes i = blockIdx.x * 64 + threadIdx.x
and, for i < tc, performs the array update. Linking produces a unified a.out.

… no change to existing makefiles, scripts, IDEs,
programming environment, etc.
Example: Abstraction Layer/Wrapper
                         PyCUDA / PyOpenCL




                                    Slide courtesy of Andreas Klöckner, Brown University

http://mathema.tician.de/software/pycuda
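As a concrete illustration of the wrapper approach, here is a hedged sketch built around the "double each entry" kernel from PyCUDA's documentation. The plain-Python fallback is our addition (not part of PyCUDA's API) so the sketch also runs where no CUDA device or PyCUDA install is present:

```python
# The kernel source follows PyCUDA's documented example; the fallback branch
# mirrors its semantics on the CPU when PyCUDA/CUDA is unavailable.
KERNEL_SRC = """
__global__ void double_it(float *a)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    a[idx] *= 2.0f;
}
"""

def double_values(values):
    try:
        import numpy as np
        import pycuda.autoinit              # noqa: F401 -- creates a CUDA context
        import pycuda.driver as drv
        from pycuda.compiler import SourceModule
        mod = SourceModule(KERNEL_SRC)      # run-time (JIT) compilation via nvcc
        kernel = mod.get_function("double_it")
        a = np.asarray(values, dtype=np.float32)
        kernel(drv.InOut(a), block=(len(a), 1, 1), grid=(1, 1))
        return a.tolist()
    except Exception:                       # no GPU / no PyCUDA: CPU fallback
        return [2.0 * v for v in values]

print(double_values([1.0, 2.0, 3.0]))  # [2.0, 4.0, 6.0]
```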
Example: Language Integration
              CUDA C: C with a few keywords

 void saxpy_serial(int n, float a, float *x, float *y)
 {
     for (int i = 0; i < n; ++i)
         y[i] = a*x[i] + y[i];
 }
                                                      Standard C Code
 // Invoke serial SAXPY kernel
 saxpy_serial(n, 2.0, x, y);

 __global__ void saxpy_parallel(int n, float a, float *x, float *y)
 {
     int i = blockIdx.x*blockDim.x + threadIdx.x;
     if (i < n) y[i] = a*x[i] + y[i];
 }
                                                       CUDA C Code
 // Invoke parallel SAXPY kernel with 256 threads/block
 int nblocks = (n + 255) / 256;
 saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
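To make the index mapping concrete, the launch above can be emulated on the CPU in plain Python (an illustrative sketch, not NVIDIA code): every (block, thread) pair computes one global index i, and the i < n guard handles the final, possibly partial, block.

```python
def saxpy_parallel_emulated(n, a, x, y, threads_per_block=256):
    # Each (block, thread) pair computes one global index i, exactly as
    # blockIdx.x*blockDim.x + threadIdx.x does on the GPU.
    nblocks = (n + threads_per_block - 1) // threads_per_block  # == (n + 255) / 256
    for block_idx in range(nblocks):
        for thread_idx in range(threads_per_block):
            i = block_idx * threads_per_block + thread_idx
            if i < n:                    # guard: the last block may be partial
                y[i] = a * x[i] + y[i]
    return y

print(saxpy_parallel_emulated(4, 2.0, [1, 2, 3, 4], [10, 10, 10, 10]))
# [12.0, 14.0, 16.0, 18.0]
```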
Example: Low-level Device API
         OpenCL
 Cross-vendor open standard
   — Managed by the Khronos Group

 Low-level API for device management and                        http://www.khronos.org/opencl

  launching kernels
   — Close-to-the-metal programming interface
   — JIT compilation of kernel programs

 C-based language for compute kernels
   — Kernels must be optimized for each processor architecture

     NVIDIA released the first OpenCL conformant driver for
     Windows and Linux to thousands of developers in June 2009
Example: Low-level Device API
         Direct Compute
 Microsoft standard for all GPU vendors
   — Released with DirectX® 11 / Windows 7
   — Runs on all +100M CUDA-enabled DirectX 10 class GPUs and later


 Low-level API for device management and launching kernels
   — Good integration with DirectX 10 and 11


 Defines HLSL-based language for compute shaders
   — Kernels must be optimized for each processor architecture
Example: New Approach


 Write GPU kernels in C#, F#, VB.NET, etc.
 Exposes a minimal API accessible from any
  .NET-based language
   — Learn a new API instead of a new language

 JIT compilation = dynamic language support

 Don’t rewrite your existing code
   — Just give it a "touch-up"
Language & APIs for GPU Computing

Approach                                     Examples

Application Level Integration                MATLAB, Mathematica, LabVIEW


Implicit Parallel Languages (high level)     PGI Accelerator, HMPP

Abstraction Layer or API Wrapper             PyCUDA, CUDA.NET, jCUDA

Explicit Language Integration (high level)   CUDA C/C++, PGI CUDA Fortran

Device API (low level)                       CUDA C/C++, DirectCompute, OpenCL
Assistance -
Development Tools



Parallel Nsight for Visual Studio
Integrated development for CPU and GPU




             Build   Debug   Profile
Windows GPU Development for 2010
NVIDIA Parallel Nsight ™ 1.5




[Diagram: compute tools (nvcc, cuda-gdb, cuda-memcheck, Visual Profiler,
cudaprof) alongside graphics tools (FX Composer, Shader Debugger, PerfHUD,
ShaderPerf, Platform Analyzer)]
4 Flexible GPU Development Configurations
Desktop           Single machine, single NVIDIA GPU
                  Analyzer, Graphics Inspector

                  Single machine, dual NVIDIA GPUs
                  Analyzer, Graphics Inspector, Compute Debugger

Networked         Two machines connected over the network (TCP/IP)
                  Analyzer, Graphics Inspector, Compute Debugger, Graphics Debugger

Workstation SLI   SLI Multi-OS workstation with multiple Quadro GPUs
                  Analyzer, Graphics Inspector, Compute Debugger, Graphics Debugger
Linux: NVIDIA cuda-gdb

CUDA debugging integrated
into GDB on Linux

  Supported on 32-bit & 64-bit Linux;
  MacOS support to come.
  Seamlessly debug both the
  host/CPU and device/GPU code
  Set breakpoints on any source
  line or symbol name
  Access and print all CUDA
  memory allocs, local, global,
  constant and shared vars

  [Screenshot: parallel source debugging]


  Included in the CUDA Toolkit
3rd Party: DDT debugger
                      Latest News from Allinea
                           CUDA SDK 3.0 with DDT 2.6
                                Released June 2010
                                Fermi and Tesla support
                                cuda-memcheck support for memory errors
                                Combined MPI and CUDA support
                                Stop on kernel launch feature
                                Kernel thread control, evaluation and
                                 breakpoints
                                Identify thread counts, ranges and CPU/GPU
                                 threads easily
                           SDK 3.1 in beta with DDT 2.6.1
                           SDK 3.2
                                Coming soon: multiple GPU device support
3rd Party: TotalView Debugger
                     Latest from TotalView debugger (in Beta)
                       —   Debugging of application running on the GPU device

                       —   Full visibility of both Linux threads and GPU device threads

                                 Device threads shown as part of the parent Unix process

                                 Correctly handle all the differences between the CPU and GPU

                       —   Fully represent the hierarchical memory

                                 Display data at any level (registers, local, block, global or host memory)

                                 Making it clear where data resides with type qualification

                       —   Thread and Block Coordinates
                                 Built in runtime variables display threads in a warp, block and thread
                                  dimensions and indexes

                                 Displayed on the interface in the status bar, thread tab and stack frame

                       —   Device thread control
                                 Warps advance synchronously

                       —   Handles CUDA function inlining
                                 Step in to or over inlined functions

                       —   Reports memory access errors
                                 CUDA memcheck

                       —   Can be used with MPI
NVIDIA Visual Profiler
 Analyze GPU HW performance
 signals, kernel occupancy,
 instruction throughput, and more
 Highly configurable
 tables and graphical views
 Save/load profiler sessions or
 export to CSV for later analysis
 Compare results visually
 across multiple sessions to
 see improvements
 Windows, Linux and Mac OS X
 OpenCL support on Windows and Linux


 Included in the CUDA Toolkit
GPU Computing SDK
Hundreds of code samples for
CUDA C, DirectCompute and OpenCL
   Finance
   Oil & Gas
   Video/Image Processing
   3D Volume Rendering
   Particle Simulations
   Fluid Simulations
   Math Functions
Approach -
Design Patterns



Accelerating Existing Applications
  Identify Possibilities      Profile for bottlenecks, inspect for parallelism

  Port Relevant Portion       A debugger is a good starting point;
                              consider libraries & engines vs. custom code

  Validate Gains              Benchmark vs. CPU version

  Optimize                    Parallel Nsight, Visual Profiler, GDB, Tau CUDA, etc.

  Deploy                      Maintain original as CPU fallback if desired.
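The "profile for bottlenecks" step can be sketched with Python's standard profiler (an illustrative stand-in for Parallel Nsight or the Visual Profiler; the function names here are hypothetical):

```python
import cProfile
import io
import math
import pstats

def serial_setup(n):                       # serial task: stays on the CPU
    return list(range(n))

def transform(data):                       # candidate hotspot: a data-parallel loop
    return [math.sqrt(v) * 2.0 for v in data]

def app():
    return transform(serial_setup(50_000))

prof = cProfile.Profile()
prof.enable()
app()
prof.disable()

stream = io.StringIO()
pstats.Stats(prof, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
# The report lists per-function cumulative times; if `transform` dominates,
# its data-parallel loop is the "relevant portion" worth porting to the GPU.
print("transform" in report)  # True
```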
Trivial Application (Accelerating a Process)
Design Rules:
  Serial task processing on CPU
  Data Parallel processing on GPU
       Copy input data to GPU
       Perform parallel processing
       Copy results back
                                                      Application

                                          CPU C Runtime       CUDA C Runtime
                                                              (or CUDA Fortran, CUDA Driver API,
                                                              OpenCL Driver, CUDA.NET, PyCUDA)
  Follow guidance in the
  CUDA C Best Practices Guide
                                             CPU                       GPU
                                          CPU Memory                GPU Memory



The CUDA C Runtime could be substituted
with other methods of accessing the GPU
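The three steps above can be sketched in plain Python. This is illustrative only: device_buffer is an ordinary list standing in for a GPU allocation, and on a real system steps 1 and 3 would be cudaMemcpy calls around a kernel launch.

```python
def gpu_process(input_data, kernel):
    # device_* names are placeholders for GPU allocations, not a real API.
    device_buffer = list(input_data)                      # 1. copy input data to GPU
    device_results = [kernel(v) for v in device_buffer]   # 2. parallel processing
    return list(device_results)                           # 3. copy results back

print(gpu_process([1, 2, 3], lambda v: v * v))  # [1, 4, 9]
```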
Basic Application – using multi-GPU
“Trivial Application” plus:
  Maximize overlap of data transfers and computation
  Minimize communication required between processors
  Use one CPU thread to manage each GPU

                                                        Application

                                     CPU C Runtime            CUDA C Runtime



                                           CPU              GPU              GPU
                                        CPU Memory       GPU Memory       GPU Memory

Multi-GPU notebook, desktop,
workstation and cluster node
configurations are increasingly common
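The "one CPU thread per GPU" rule can be sketched with plain Python threads (a hedged illustration: the doubling loop is a stand-in kernel, and a real manager thread would bind to its device with cudaSetDevice before issuing transfers and launches):

```python
import threading

def manage_gpu(gpu_id, chunk, results):
    # A real manager thread would call cudaSetDevice(gpu_id), then issue
    # transfers and kernel launches; here the "kernel" doubles each value.
    results[gpu_id] = [2.0 * v for v in chunk]

def run_on_gpus(data, num_gpus=2):
    chunks = [data[g::num_gpus] for g in range(num_gpus)]   # static partition
    results = [None] * num_gpus
    threads = [threading.Thread(target=manage_gpu, args=(g, chunks[g], results))
               for g in range(num_gpus)]
    for t in threads:
        t.start()
    for t in threads:                      # all GPUs work concurrently
        t.join()
    return results

print(run_on_gpus([1, 2, 3, 4]))  # [[2.0, 6.0], [4.0, 8.0]]
```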
Graphics Application
“Basic Application” plus:
  Use graphics interop to avoid unnecessary copies
  In Multi-GPU systems, put buffers to be displayed in GPU
  Memory of GPU attached to the display

                                        Application

                     CPU C Runtime     CUDA C Runtime       OpenGL / Direct3D



                         CPU                       GPU
                      CPU Memory                GPU Memory
Basic Library
“Basic Application” plus:
  Avoid unnecessary memory transfers
        Use data already in GPU memory
        Create and leave data in GPU memory


                                                          Library

                                        CPU C Runtime               CUDA C Runtime



                                             CPU                        GPU
                                          CPU Memory                 GPU Memory




These rules apply to plug-ins as well
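A hedged sketch of the "leave data in GPU memory" rule: the class below is illustrative only (a dict stands in for device allocations), showing how a library can hand out device handles so chained calls avoid host round-trips.

```python
class GPULibrary:
    def __init__(self):
        self._device = {}                 # handle -> data resident "on the GPU"
        self._next = 0

    def upload(self, host_data):          # one explicit transfer in
        self._next += 1
        self._device[self._next] = list(host_data)
        return self._next                 # caller keeps the device handle

    def scale(self, handle, a):           # operates on data already "on the GPU"
        self._device[handle] = [a * v for v in self._device[handle]]
        return handle                     # result stays resident

    def download(self, handle):           # one explicit transfer out
        return list(self._device[handle])

lib = GPULibrary()
h = lib.upload([1, 2, 3])
lib.scale(lib.scale(h, 2.0), 10.0)        # chained ops, no host round-trips
print(lib.download(h))  # [20.0, 40.0, 60.0]
```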
Application with Plug-ins

“Basic Application” plus:
  Plug-in Mgr
     Allows Application and Plug-ins
     to (re)use same GPU memory                      Application
     Multi-GPU aware                                     Plug-in Mgr

                                               Plug-in     Plug-in    Plug-in
  Follow “Basic Library” rules
  for the Plug-ins                     CPU C Runtime                 CUDA C Runtime



                                           CPU                             GPU
                                        CPU Memory                      GPU Memory
Multi-GPU Cluster Application
“Basic Application” plus:

  Use Shared Memory for intra-node communication
       or pthreads, OpenMP, etc.

  Use MPI to communicate between nodes

[Diagram: multiple cluster nodes connected by MPI over Ethernet, Infiniband,
etc.; each node runs the Application over the CPU C Runtime and CUDA C
Runtime, with one CPU (CPU memory) and multiple GPUs, each with its own
GPU memory]
Leveraging –
Libraries & Engines



GPU Computing Software Stack

          Your GPU Computing Application


          Application Acceleration Engines
              Middleware, Modules & Plug-ins


                 Foundation Libraries
               Low-level Functional Libraries

              Development Environment
Languages, Device APIs, Compilers, Debuggers, Profilers, etc.


                   CUDA Architecture


CUFFT Library 3.2:                                                         Improving Radix-3, -5, -7

[Charts: Radix-3 GFLOPS vs. log3(size), single precision and double precision
(ECC off), comparing C2070 R3.2, C2070 R3.1 and MKL]

Radix-5, -7 and mixed radix improvements not shown

CUFFT 3.2 & 3.1 on NVIDIA Tesla C2070 GPU
MKL 10.2.3.029 on Quad-Core Intel Core i7 (Nehalem)
CUBLAS Library 3.2                                                                 performance gains

Up to 2x average speedup over CUBLAS 3.1

Less variation in performance for different dimensions vs. 3.1

[Chart: speedup vs. MKL (0x to 12x) across matrix dimensions (NxN) from
1024 to 7168, for MKL, v3.1 and v3.2]

Average speedup of {S/D/C/Z}GEMM x {NN,NT,TN,TT}
CUBLAS 3.2 & 3.1 on NVIDIA Tesla C2050 GPU
MKL 10.2.3.029 on Quad-Core Intel Core i7 (Nehalem)
3rd Party Example: CULA    (LAPACK for heterogeneous systems)


                                        GPU Accelerated
                                         Linear Algebra


 “CULAPACK” Library       Partnership

 » Dense linear algebra   Developed in
 » C/C++ & FORTRAN        partnership with
 » 150+ Routines          NVIDIA

 MATLAB Interface         Supercomputer Speeds

 » 15+ functions          Performance 7x of
 » Up to 10x speedup      Intel’s MKL LAPACK
CULA Library 2.2                                            Performance
 Supercomputing Speeds
  This graph shows the relative speed of many CULA functions when compared to
  Intel’s MKL 10.2. Benchmarks were obtained comparing an NVIDIA Tesla C2050
  (Fermi) and an Intel Core i7 860. More at www.culatools.com
CUSPARSE Library:                        Matrix Performance vs. CPU
                  Multiplication of a sparse matrix by multiple vectors

[Chart: speedup vs. MKL 10.2 (0x to 35x) for "non-transposed" and
"transposed" cases]

Average speedup across S,D,C,Z

CUSPARSE 3.2 on NVIDIA Tesla C2050 GPU
MKL 10.2.3.029 on Quad-Core Intel Core i7 (Nehalem)
CURAND Library:                        Random Number Generation
                    Generating 100K Sobol' Samples - GPU vs. CPU

[Chart: speedup vs. MKL 10.2 (0x to 25x) for uniform and normal
distributions in single (SP) and double (DP) precision]

CURAND 3.2 on NVIDIA Tesla C2050 GPU
MKL 10.2.3.029 on Quad-Core Intel Core i7 (Nehalem)
NAG GPU Library
 Monte Carlo related
     L’Ecuyer, Sobol RNGs
     Distributions, Brownian Bridge
 Coming soon
     Mersenne Twister RNG
     Optimization, PDEs
 Seeking input from the community
 For up-to-date information:
  www.nag.com/numeric/gpus

NPP Library:                         NVIDIA Performance Primitives
   Similar to Intel IPP, focused on image and video processing

   6x - 10x average speedup vs. IPP
       — 2800 performance tests
       — Core i7 (new) vs. GTX 285 (old)

   Now available with CUDA Toolkit

[Chart: aggregate performance results, NPP performance suite grand totals:
relative aggregate speed for Core2Duo t=1, Core2Duo t=2, Nehalem t=1,
Nehalem t=8, GeForce 9800 GTX+, GeForce GTX 285]

www.nvidia.com/npp
OpenVIDIA
 Open source, supported by NVIDIA
 Computer Vision Workbench (CVWB)
     GPU imaging & computer vision

     Demonstrates most commonly used image
     processing primitives on CUDA

     Demos, code & tutorials/information




    http://openvidia.sourceforge.net
More Open Source Projects

  Thrust: Library of parallel algorithms
  with high-level STL-like interface                        http://code.google.com/p/thrust




  OpenCurrent: C++ library for solving PDEs
  over regular grids    http://code.google.com/p/opencurrent

  200+ projects on Google Code & SourceForge
       Search for CUDA, OpenCL, GPGPU
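Thrust itself is C++; purely to illustrate its "high-level STL-like interface" idea, the same sort-then-reduce pipeline is sketched below in Python (not Thrust's API), with the corresponding Thrust calls noted in comments:

```python
from functools import reduce

# In Thrust, data would live in a thrust::device_vector on the GPU and the
# algorithms would run as CUDA kernels; here plain Python stands in.
data = [5, 1, 4, 2, 3]
device_vec = sorted(data)                           # thrust::sort(d_vec.begin(), d_vec.end())
total = reduce(lambda a, b: a + b, device_vec, 0)   # thrust::reduce(d_vec.begin(), d_vec.end())
print(device_vec, total)  # [1, 2, 3, 4, 5] 15
```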
GPU Computing Software Stack

          Your GPU Computing Application


          Application Acceleration Engines
              Middleware, Modules & Plug-ins


                 Foundation Libraries
               Low-level Functional Libraries

              Development Environment
Languages, Device APIs, Compilers, Debuggers, Profilers, etc.


                   CUDA Architecture
NVIDIA PhysX™ -          the World’s Most Deployed Physics API


 Major PhysX
 Site Licensees


  Integrated in Major       Cross Platform      Middleware & Tool
     Game Engines              Support          Integration (APEX)

    UE3        Diesel
                                                  SpeedTree      3ds Max
  Gamebryo    Unity 3d
                                                Natural Motion     Maya
   Vision      Hero
                                                Fork Particles   Softimage
   Instinct   BigWorld
                                                 Emotion FX
   Trinigy
Scaling – Grid &
Cluster Mngmnt.



GPU Management & Monitoring
NVIDIA Systems Management Interface (nvidia-smi)

 All GPUs:
  • List of GPUs
  • Product ID
  • GPU utilization
  • PCI address to device enumeration
  • Exclusive-use mode
  • ECC error count & location (Fermi only)
  • GPU temperature

 Server products additionally:
  • Unit fan speeds
  • PSU voltage/current
  • LED state
  • Serial number
  • Firmware version

Use CUDA_VISIBLE_DEVICES to assign GPUs to a process
Bright Cluster Manager

   Most Advanced Cluster Management Solution for GPU clusters

   Includes:
    NVIDIA CUDA, OpenCL libraries and GPU drivers
    Automatic sampling of all available NVIDIA GPU metrics
    Flexible graphing of GPU metrics against time
    Visualization of GPU metrics in Rackview
    Powerful cluster automation, setting alerts, alarms and actions
      when GPU metrics exceed set thresholds
    Health-checking framework based on GPU metrics
    Support for all Tesla GPU cards and GPU Computing Systems,
      including the most recent “Fermi” models
Symphony Architecture and GPU

 [Architecture diagram] Clients (an Excel spreadsheet model via the COM API, a
 Java application via the Java API, a C++ application via the C++ API, and a C#
 application via the .NET API) connect to management hosts: x64 host computers
 with GPU support running the Symphony Repository Service, the Symphony Service
 Director, and Session Managers. Compute hosts (x64 computers with dual
 quad-core CPUs and GPU support) run Service Instance Managers and GPU-aware
 Service Instances that use the CUDA libraries to drive GPU 1 and GPU 2 on the
 host OS. The EGO resource-aware orchestration layer underpins the whole stack.

                  Copyright © 2010 Platform Computing Corporation. All Rights Reserved.
Selecting GPGPU Nodes
Learning –
Developer Resources




           © NVIDIA Corporation 2010
NVIDIA Developer Resources
DEVELOPMENT                      SDKs AND                          VIDEO                                ENGINES &
TOOLS                            CODE SAMPLES                      LIBRARIES                            LIBRARIES
CUDA Toolkit                     GPU Computing SDK                 Video Decode Acceleration            Math Libraries
 Complete GPU computing           CUDA C, OpenCL, DirectCompute      NVCUVID / NVCUVENC                  CUFFT, CUBLAS, CUSPARSE,
 development kit                  code samples and documentation     DXVA                                CURAND, …
                                                                     Win7 MFT
cuda-gdb                         Graphics SDK                                                           NPP Image Libraries
 GPU hardware debugging           DirectX & OpenGL code samples    Video Encode Acceleration             Performance primitives
                                                                     NVCUVENC                            for imaging
Visual Profiler                  PhysX SDK                           Win7 MFT
 GPU hardware profiler for        Complete game physics solution                                        App Acceleration Engines
 CUDA C and OpenCL               OpenAutomate                      Post Processing                       Optimized software modules
                                                                    Noise reduction / De-interlace/      for GPU acceleration
Parallel Nsight                   SDK for test automation           Polyphase scaling / Color process
 Integrated development                                                                                 Shader Library
 environment for Visual Studio                                                                            Shader and post processing
NVPerfKit                                                                                               Optimization Guides
 OpenGL|D3D performance tools                                                                            Best Practices for
                                                                                                         GPU computing and
FX Composer                                                                                              Graphics development
 Shader Authoring IDE




http://developer.nvidia.com
10 published books (4 in Japanese, 3 in English, 2 in Chinese, 1 in Russian)
GPU Computing Research & Education

World Class Research Leadership and Teaching
 University of Cambridge
 Harvard University
 University of Utah
 University of Tennessee
 University of Maryland
 University of Illinois at Urbana-Champaign
 Tsinghua University
 Tokyo Institute of Technology
 Chinese Academy of Sciences
 National Taiwan University
Premier Academic Partners: Academic Partnerships / Fellowships

Proven Research Vision: launched June 1st with 5 premiere Centers and more in review
 Johns Hopkins University, USA
 Nanyang Technological University, Singapore
 Technical University of Ostrava, Czech Republic
 CSIRO, Australia
 SINTEF, Norway
Exclusive events, latest HW, discounts

Quality GPGPU Teaching: launched June 1st with 7 premiere Centers and more in review
 McMaster University, Canada
 Potsdam, USA
 UNC-Charlotte, USA
 Cal Poly San Luis Obispo, USA
 ITESM, Mexico
 Czech Technical University, Prague, Czech Republic
 Qingdao University, China
Teaching kits, discounts, training

NV Research: http://research.nvidia.com          Education: 350+ Universities

   Supporting hundreds of researchers around the globe every year
Thank You!
Database Application

   Minimize network communication

   Move the analysis “upstream” to stored procedures

   Treat each stored procedure like a “Basic Application” in the DB itself

   An App Server could also be a “Basic Application”

   A dedicated Client could also be a “Basic Application”

 [Diagram] A Client Application or Application Server talks to the Database
 Engine, whose Stored Procedures run against both the CPU C Runtime
 (CPU + CPU memory) and the CUDA C Runtime (GPU + GPU memory).

Data Mining, Business Intelligence, etc.

[01][gpu 컴퓨팅을 위한 언어, 도구 및 api] miller languages tools

  • 1. Languages, APIs and Development Tools for GPU Computing Phillip Miller | Director of Product Management, Workstation Software SIGGRAPH ASIA | December 16, 2010
  • 2. Agenda for This Morning’s Overview  Introduction – GPU Computing  Options – Languages  Assistance – Development Tools  Approach – Application Design Patterns  Leveraging – Libraries and Engines  Scaling – Grid and Cluster  Learning – Developer Resources
  • 3. “GPGPU or GPU Computing”  Using all processors in the system for the things they are best at doing: — Evolution of CPUs makes them good at sequential, serial tasks — Evolution of GPUs makes them good at parallel processing CPU GPU
  • 4. Mathematical Libraries Packages Research & Education Consultants, Training & Certification DirectX Integrated Tools & Partners Development Environment Parallel Nsight for MS Visual Studio GPU Computing Ecosystem Languages & API’s All Major Platforms Fortran NVIDIA Confidential
  • 5. CUDA - NVIDIA’s Architecture for GPU Computing Broad Adoption +250M CUDA-enabled GPUs in use GPU Computing Applications +650k CUDA Toolkit downloads in last 2 Yrs +350 Universities teaching GPU Computing CUDA OpenCL Direct Fortran Python, on the CUDA Architecture C/C++ Compute Java, .NET, … Cross Platform: Linux, Windows, MacOS +100k developers Commercial OpenCL Microsoft API for PGI Accelerator PyCUDA In production usage Conformant Driver GPU Computing PGI CUDA Fortran GPU.NET since 2008 Publicly Available for all Supports all CUDA- jCUDA Uses span CUDA capable GPU’s Architecture GPUs SDK + Libs + Visual HPC to Consumer Profiler and Debugger SDK + Visual Profiler (DX10 and DX11) NVIDIA GPU with the CUDA Parallel Computing Architecture OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
  • 6. GPU Computing Software Stack Your GPU Computing Application Application Acceleration Engines Middleware, Modules & Plug-ins Foundation Libraries Low-level Functional Libraries Development Environment Languages, Device APIs, Compilers, Debuggers, Profilers, etc. CUDA Architecture
  • 7. Options - Languages & APIs © NVIDIA Corporation 2010
  • 8. Language & APIs for GPU Computing Approach Examples Application Level Integration MATLAB, Mathematica, LabVIEW Implicit Parallel Languages (high level) PGI Accelerator, HMPP Abstraction Layer or API Wrapper PyCUDA, CUDA.NET, jCUDA Explicit Language Integration (high level) CUDA C/C++, PGI CUDA Fortran Device API (low level) CUDA C/C++, DirectCompute, OpenCL
  • 9. Example: Application Level Integration GPU support with MathWorks Parallel Computing Toolbox™ and Distributed Computing Server™ Workstation Compute Cluster MATLAB Parallel Computing Toolbox (PCT) MATLAB Distributed Computing Server (MDCS) • PCT enables high performance through • MDCS allows a MATLAB PCT application to be parallel computing on workstations submitted and run on a compute cluster • NVIDIA GPU acceleration now available • NVIDIA GPU acceleration now available
• 10. MATLAB Performance on Tesla (previous GPU generation)
Chart: relative execution speed on the Black-Scholes demo compared to a single-core CPU baseline, for input sizes of 256K to 16,384K, across four configurations: single-core CPU, quad-core CPU, single-core CPU + Tesla C1060, and quad-core CPU + Tesla C1060.
System: Core 2 Quad Q6600 2.4 GHz, 6 GB RAM, Windows 7 64-bit, Tesla C1060, single-precision operations.
• 11. Example: Implicit Parallel Languages
PGI Accelerator Compilers: annotated Fortran is compiled into host x64 code plus auto-generated GPU code, then linked into a unified a.out; no change to existing makefiles, scripts, IDEs, programming environment, etc.

SUBROUTINE SAXPY (A,X,Y,N)
  INTEGER N
  REAL A,X(N),Y(N)
!$ACC REGION
  DO I = 1, N
    X(I) = A*X(I) + Y(I)
  ENDDO
!$ACC END REGION
END

(The slide also shows fragments of the generated host assembly, with calls to __pgi_cu_init, __pgi_cu_alloc, __pgi_cu_upload, __pgi_cu_function, __pgi_cu_call and __pgi_cu_download, alongside the auto-generated CUDA kernel source.)
• 12. Example: Abstraction Layer/Wrapper
PyCUDA / PyOpenCL
Slide courtesy of Andreas Klöckner, Brown University
http://mathema.tician.de/software/pycuda
• 13. Example: Language Integration
CUDA C: C with a few keywords

Standard C code:
void saxpy_serial(int n, float a, float *x, float *y)
{
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

CUDA C code:
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
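The CUDA launch on this slide is, at bottom, index arithmetic. The sketch below (ours, in Python rather than CUDA C) emulates the same block/thread decomposition on the CPU, to show why the rounded-up block count plus the `i < n` guard together cover exactly `n` elements:

```python
# CPU-side emulation (not NVIDIA code) of the saxpy_parallel launch above:
# each of `nblocks` blocks runs `threads_per_block` threads, and the thread
# at (block, tid) handles element i = block*threads_per_block + tid.

def saxpy_emulated(n, a, x, y, threads_per_block=256):
    nblocks = (n + threads_per_block - 1) // threads_per_block  # (n + 255) / 256
    for block in range(nblocks):            # in CUDA these run in parallel
        for tid in range(threads_per_block):
            i = block * threads_per_block + tid
            if i < n:                       # guard against the partial last block
                y[i] = a * x[i] + y[i]
    return y

x = [1.0] * 1000
y = [3.0] * 1000
saxpy_emulated(1000, 2.0, x, y)
print(y[0], y[999])  # 5.0 5.0
```

The last block is usually only partially full, which is exactly what the `if (i < n)` test in the real kernel handles.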
• 14. Example: Low-level Device API
OpenCL (http://www.khronos.org/opencl)
 Cross-vendor open standard
— Managed by the Khronos Group
 Low-level API for device management and launching kernels
— Close-to-the-metal programming interface
— JIT compilation of kernel programs
 C-based language for compute kernels
— Kernels must be optimized for each processor architecture
NVIDIA released the first OpenCL conformant driver for Windows and Linux to thousands of developers in June 2009
  • 15. Example: Low-level Device API Direct Compute  Microsoft standard for all GPU vendors — Released with DirectX® 11 / Windows 7 — Runs on all +100M CUDA-enabled DirectX 10 class GPUs and later  Low-level API for device management and launching kernels — Good integration with DirectX 10 and 11  Defines HLSL-based language for compute shaders — Kernels must be optimized for each processor architecture
• 16. Example: New Approach
 Write GPU kernels in C#, F#, VB.NET, etc.
 Exposes a minimal API accessible from any .NET-based language
— Learn a new API instead of a new language
 JIT compilation = dynamic language support
 Don’t rewrite your existing code
— Just give it a “touch-up”
• 17. Language & APIs for GPU Computing
Approach: Examples
— Application Level Integration: MATLAB, Mathematica, LabVIEW
— Implicit Parallel Languages (high level): PGI Accelerator, HMPP
— Abstraction Layer or API Wrapper: PyCUDA, CUDA.NET, jCUDA
— Explicit Language Integration (high level): CUDA C/C++, PGI CUDA Fortran
— Device API (low level): CUDA C/C++, DirectCompute, OpenCL
  • 18. Assistance - Development Tools © NVIDIA Corporation 2010
  • 19. Parallel Nsight for Visual Studio Integrated development for CPU and GPU Build Debug Profile
• 20. Windows GPU Development for 2010
NVIDIA Parallel Nsight™ 1.5, alongside: nvcc, cuda-gdb, cuda-memcheck, Visual Profiler, cudaprof, FX Composer, Shader Debugger, PerfHUD, ShaderPerf, Platform Analyzer
• 21. 4 Flexible GPU Development Configurations
— Desktop: single machine, single NVIDIA GPU (Analyzer, Graphics Inspector)
— Desktop: single machine, dual NVIDIA GPUs (Analyzer, Graphics Inspector, Compute Debugger)
— Networked: two machines connected over the network via TCP/IP (Analyzer, Graphics Inspector, Compute Debugger, Graphics Debugger)
— Workstation SLI: multi-OS workstation with multiple Quadro GPUs (Analyzer, Graphics Inspector, Compute Debugger, Graphics Debugger)
© NVIDIA Corporation 2010
  • 22. Linux: NVIDIA cuda-gdb CUDA debugging integrated into GDB on Linux Supported on 32bit & 64bit Linux, MacOS to come. Seamlessly debug both the host/CPU and device/GPU code Set breakpoints on any source line or symbol name Access and print all CUDA memory allocs, local, global, constant and shared vars Parallel Source Debugging Included in the CUDA Toolkit
  • 23. 3rd Party: DDT debugger  Latest News from Allinea  CUDA SDK 3.0 with DDT 2.6  Released June 2010  Fermi and Tesla support  cuda-memcheck support for memory errors  Combined MPI and CUDA support  Stop on kernel launch feature  Kernel thread control, evaluation and breakpoints  Identify thread counts, ranges and CPU/GPU threads easily  SDK 3.1 in beta with DDT 2.6.1  SDK 3.2  Coming soon: multiple GPU device support
  • 24. 3rd Party: TotalView Debugger  Latest from TotalView debugger (in Beta) — Debugging of application running on the GPU device — Full visibility of both Linux threads and GPU device threads  Device threads shown as part of the parent Unix process  Correctly handle all the differences between the CPU and GPU — Fully represent the hierarchical memory  Display data at any level (registers, local, block, global or host memory)  Making it clear where data resides with type qualification — Thread and Block Coordinates  Built in runtime variables display threads in a warp, block and thread dimensions and indexes  Displayed on the interface in the status bar, thread tab and stack frame — Device thread control  Warps advance Synchronously — Handles CUDA function inlining  Step in to or over inlined functions — Reports memory access errors  CUDA memcheck — Can be used with MPI
  • 25. NVIDIA Visual Profiler Analyze GPU HW performance signals, kernel occupancy, instruction throughput, and more Highly configurable tables and graphical views Save/load profiler sessions or export to CSV for later analysis Compare results visually across multiple sessions to see improvements Windows, Linux and Mac OS X OpenCL support on Windows and Linux Included in the CUDA Toolkit
• 26. GPU Computing SDK
Hundreds of code samples for CUDA C, DirectCompute and OpenCL: finance, oil & gas, video/image processing, 3D volume rendering, particle simulations, fluid simulations, math functions
  • 27. Approach - Design Patterns © 2009 NVIDIA Corporation
• 28. Accelerating Existing Applications
— Profile for bottlenecks, identify possibilities: a debugger is a good starting point
— Inspect for parallelism
— Port the relevant portion: consider libraries & engines vs. custom code
— Validate gains: benchmark vs. the CPU version
— Optimize: Parallel Nsight, Visual Profiler, GDB, TAU CUDA, etc.
— Deploy: maintain the original as a CPU fallback if desired
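The deploy step above keeps the original CPU code as a fallback. A minimal dispatch sketch (ours, not from the slides; `gpu_available` and both implementations are hypothetical stand-ins) looks like:

```python
# Sketch of "maintain original as CPU fallback": try the GPU port when a
# device is present, otherwise run the already-validated CPU version.

def gpu_available():
    # Real code would probe the driver (e.g. a CUDA runtime device query);
    # this sketch simply reports no GPU.
    return False

def saxpy_cpu(n, a, x, y):
    # Original serial implementation, kept unchanged.
    return [a * x[i] + y[i] for i in range(n)]

def saxpy_gpu(n, a, x, y):
    # Placeholder for the ported GPU kernel launch.
    raise RuntimeError("no GPU in this sketch")

def saxpy(n, a, x, y):
    """Dispatch to the GPU port when present, else the CPU original."""
    if gpu_available():
        return saxpy_gpu(n, a, x, y)
    return saxpy_cpu(n, a, x, y)
```

This also makes the validate step easy: both paths share one entry point, so the GPU result can be benchmarked and checked against the CPU baseline.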
• 29. Trivial Application (Accelerating a Process)
Design rules: serial task processing on the CPU; data-parallel processing on the GPU. Copy input data to the GPU, perform parallel processing, copy results back.
Follow guidance in the CUDA C Best Practices Guide.
(Diagram: Application running on the CPU C Runtime and CUDA C Runtime, with separate CPU and GPU memories. The CUDA C Runtime could be substituted with other methods of accessing the GPU: CUDA Fortran, CUDA Driver API, OpenCL Driver, CUDA.NET, PyCUDA.)
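The copy-in / compute / copy-out flow of the "Trivial Application" pattern can be sketched in Python; `DeviceBuffer` and its methods are our hypothetical stand-ins for device memory, not a real CUDA binding:

```python
# Sketch of the "Trivial Application" pattern: serial tasks stay on the CPU,
# the data-parallel step runs on the "GPU" behind explicit copies.

class DeviceBuffer:
    """Models GPU memory: data must be copied in and out explicitly."""
    def __init__(self, size):
        self._mem = [0.0] * size
    def copy_from_host(self, host):
        self._mem = list(host)           # like cudaMemcpy(..., HostToDevice)
    def copy_to_host(self):
        return list(self._mem)           # like cudaMemcpy(..., DeviceToHost)
    def launch(self, kernel):
        self._mem = kernel(self._mem)    # a kernel<<<...>>> launch would go here

def process(host_input):
    dev = DeviceBuffer(len(host_input))
    dev.copy_from_host(host_input)               # 1. copy input data to GPU
    dev.launch(lambda m: [v * v for v in m])     # 2. perform parallel processing
    return dev.copy_to_host()                    # 3. copy results back

print(process([1.0, 2.0, 3.0]))  # [1.0, 4.0, 9.0]
```

The later patterns in this section mostly refine steps 1 and 3: overlapping, batching, or eliminating those two transfers.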
• 30. Basic Application – using multi-GPU
“Trivial Application” plus:
— Maximize overlap of data transfers and computation
— Minimize communication required between processors
— Use one CPU thread to manage each GPU
(Diagram: Application on the CPU C Runtime and CUDA C Runtime, driving two GPUs, each with its own memory.)
Multi-GPU notebook, desktop, workstation and cluster node configurations are increasingly common
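The rule "use one CPU thread to manage each GPU" can be sketched with host threads driving simulated devices (the GPUs here are stand-ins; real code would bind each thread to a device with a call like `cudaSetDevice` before launching work):

```python
# Sketch of the multi-GPU pattern: split the input, give each (simulated)
# GPU its own host thread, then merge the per-device results in order.
import threading

def run_on_gpus(data, num_gpus, kernel):
    chunk = (len(data) + num_gpus - 1) // num_gpus
    results = [None] * num_gpus

    def worker(gpu_id):
        # A real worker would select its device (e.g. cudaSetDevice(gpu_id)),
        # copy its chunk over, launch kernels, and copy results back.
        part = data[gpu_id * chunk:(gpu_id + 1) * chunk]
        results[gpu_id] = [kernel(v) for v in part]

    threads = [threading.Thread(target=worker, args=(g,)) for g in range(num_gpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [v for part in results for v in part]

print(run_on_gpus(list(range(10)), 2, lambda v: v + 1))  # [1, 2, ..., 10]
```

Giving each device a dedicated host thread also makes it natural to overlap each GPU's transfers with the others' computation, per the first rule on the slide.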
• 31. Graphics Application
“Basic Application” plus:
— Use graphics interop to avoid unnecessary copies
— In multi-GPU systems, put buffers to be displayed in the GPU memory of the GPU attached to the display
(Diagram: Application on the CPU C Runtime, CUDA C Runtime and OpenGL / Direct3D, with CPU and GPU memories.)
• 32. Basic Library
“Basic Application” plus:
— Avoid unnecessary memory transfers
— Use data already in GPU memory
— Create and leave data in GPU memory
These rules apply to plug-ins as well
(Diagram: Library on the CPU C Runtime and CUDA C Runtime, with CPU and GPU memories.)
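The "leave data in GPU memory" rule amounts to a library API that passes device handles between calls and only copies back on request. A hedged sketch (our hypothetical `DeviceArray` and functions, not a real library):

```python
# Sketch of the "Basic Library" rules: library calls take and return
# device-resident data, so chained calls incur no host<->device round trips.

class DeviceArray:
    """Hypothetical stand-in for a device allocation / device pointer."""
    def __init__(self, data):
        self.data = list(data)
        self.transfers = 0            # count host<->device copies

def lib_scale(dev, a):
    dev.data = [a * v for v in dev.data]
    return dev                        # result stays in "GPU memory"

def lib_offset(dev, b):
    dev.data = [v + b for v in dev.data]
    return dev                        # still no copy back to the host

def to_host(dev):
    dev.transfers += 1                # the one explicit transfer
    return list(dev.data)

d = DeviceArray([1.0, 2.0])
result = to_host(lib_offset(lib_scale(d, 2.0), 1.0))
print(result, d.transfers)  # [3.0, 5.0] 1
```

Two library calls, one transfer: had each call copied its result back and re-uploaded it, the transfer count would grow with the call chain instead.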
• 33. Application with Plug-ins
“Basic Application” plus:
— A Plug-in Manager allows the application and plug-ins to (re)use the same GPU memory
— The Plug-in Manager is multi-GPU aware
— Follow the “Basic Library” rules for the plug-ins
(Diagram: Application with a Plug-in Manager hosting multiple plug-ins, on the CPU C Runtime and CUDA C Runtime, with CPU and GPU memories.)
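The plug-in manager's job of sharing GPU memory can be sketched as a registry of named device buffers; every name here (`PluginManager`, the plug-in functions) is a hypothetical stand-in for illustration:

```python
# Sketch of a GPU-aware plug-in manager: the manager owns named device
# buffers so the application and every plug-in (re)use the same allocation
# instead of each plug-in copying data in and out on its own.

class PluginManager:
    def __init__(self):
        self._device_buffers = {}     # name -> simulated GPU allocation

    def get_buffer(self, name, size):
        # Allocate once; later requests return the same "device" memory.
        if name not in self._device_buffers:
            self._device_buffers[name] = [0.0] * size
        return self._device_buffers[name]

def brighten_plugin(mgr):
    buf = mgr.get_buffer("frame", 4)
    for i in range(len(buf)):
        buf[i] += 1.0                 # operates in place on the shared data

def scale_plugin(mgr):
    buf = mgr.get_buffer("frame", 4)  # same allocation, no extra transfer
    for i in range(len(buf)):
        buf[i] *= 2.0

mgr = PluginManager()
brighten_plugin(mgr)
scale_plugin(mgr)
print(mgr.get_buffer("frame", 4))  # [2.0, 2.0, 2.0, 2.0]
```

Both plug-ins follow the "Basic Library" rules: they consume and produce device-resident data and never force a copy back to the host between stages.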
• 34. Multi-GPU Cluster Application
“Basic Application” plus:
— Use shared memory, pthreads, OpenMP, etc. for intra-node communication
— Use MPI (over Ethernet, Infiniband, etc.) to communicate between nodes
(Diagram: three cluster nodes, each running the application on the CPU C Runtime and CUDA C Runtime with multiple GPUs and their memories, connected via MPI.)
  • 35. Leveraging – Libraries & Engines © 2009 NVIDIA Corporation
  • 36. GPU Computing Software Stack Your GPU Computing Application Application Acceleration Engines Middleware, Modules & Plug-ins Foundation Libraries Low-level Functional Libraries Development Environment Languages, Device APIs, Compilers, Debuggers, Profilers, etc. CUDA Architecture © 2009 NVIDIA Corporation
• 37. CUFFT Library 3.2: Improving Radix-3, -5, -7
Charts: Radix-3 GFLOPS vs. log3(size) for single precision and double precision (ECC off), comparing CUFFT 3.2, CUFFT 3.1 and MKL. Radix-5, -7 and mixed-radix improvements not shown.
CUFFT 3.2 & 3.1 on NVIDIA Tesla C2070 GPU; MKL 10.2.3.029 on quad-core Intel Core i7 (Nehalem)
• 38. CUBLAS Library 3.2 performance gains
Up to 2x average speedup over CUBLAS 3.1, with less variation in performance across matrix dimensions than 3.1.
Chart: speedup vs. MKL for matrix dimensions 1024 to 7168 (NxN); average speedup of {S/D/C/Z}GEMM x {NN,NT,TN,TT}.
CUBLAS 3.2 & 3.1 on NVIDIA Tesla C2050 GPU; MKL 10.2.3.029 on quad-core Intel Core i7 (Nehalem)
• 39. 3rd Party Example: CULA (LAPACK for heterogeneous systems)
GPU-accelerated linear algebra:
— “CULAPACK” library: dense linear algebra, C/C++ & FORTRAN, 150+ routines
— MATLAB interface: 15+ functions, up to 10x speedup
— Partnership: developed in partnership with NVIDIA
— Supercomputer speeds: performance 7x that of Intel’s MKL LAPACK
• 40. CULA Library 2.2 Performance
Supercomputing speeds: chart showing the relative speed of many CULA functions compared to Intel’s MKL 10.2. Benchmarks were obtained comparing an NVIDIA Tesla C2050 (Fermi) and an Intel Core i7 860. More at www.culatools.com
• 41. CUSPARSE Library: Matrix Performance vs. CPU
Chart: multiplication of a sparse matrix by multiple vectors, speedup vs. MKL 10.2 for the “non-transposed” and “transposed” cases; average speedup across S, D, C, Z.
CUSPARSE 3.2 on NVIDIA Tesla C2050 GPU; MKL 10.2.3.029 on quad-core Intel Core i7 (Nehalem)
• 42. CURAND Library: Random Number Generation
Chart: generating 100K Sobol’ samples, GPU vs. CPU; single and double precision, uniform and normal distributions; CURAND 3.2 vs. MKL 10.2.
CURAND 3.2 on NVIDIA Tesla C2050 GPU; MKL 10.2.3.029 on quad-core Intel Core i7 (Nehalem)
• 43. NAG GPU Library
 Monte Carlo related: L’Ecuyer, Sobol RNGs; distributions, Brownian Bridge
 Coming soon: Mersenne Twister RNG; optimization, PDEs
 Seeking input from the community
 For up-to-date information: www.nag.com/numeric/gpus
• 44. NPP Library: NVIDIA Performance Primitives
 Similar to Intel IPP, focused on image and video processing
 6x - 10x average speedup vs. IPP
— 2800 performance tests
— Core i7 (new) vs. GTX 285 (old)
 Now available with the CUDA Toolkit
Chart: aggregate performance results, NPP performance suite grand totals; relative aggregate speed for Core2Duo t=1, Core2Duo t=2, Nehalem t=1, Nehalem t=8, GeForce 9800 GTX+, GeForce GTX 285.
www.nvidia.com/npp
  • 45. OpenVIDIA  Open source, supported by NVIDIA  Computer Vision Workbench (CVWB) GPU imaging & computer vision Demonstrates most commonly used image processing primitives on CUDA Demos, code & tutorials/information http://openvidia.sourceforge.net
  • 46. More Open Source Projects Thrust: Library of parallel algorithms with high-level STL-like interface http://code.google.com/p/thrust OpenCurrent: C++ library for solving PDE’s over regular grids http://code.google.com/p/opencurrent 200+ projects on Google Code & SourceForge Search for CUDA, OpenCL, GPGPU
  • 47. GPU Computing Software Stack Your GPU Computing Application Application Acceleration Engines Middleware, Modules & Plug-ins Foundation Libraries Low-level Functional Libraries Development Environment Languages, Device APIs, Compilers, Debuggers, Profilers, etc. CUDA Architecture
  • 48.
  • 49. NVIDIA PhysX™ - the World’s Most Deployed Physics API Major PhysX Site Licensees Integrated in Major Cross Platform Middleware & Tool Game Engines Support Integration (APEX) UE3 Diesel SpeedTree 3ds Max Gamebryo Unity 3d Natural Motion Maya Vision Hero Fork Particles Softimage Instinct BigWorld Emotion FX Trinigy
  • 50. Scaling – Grid & Cluster Mngmnt. © 2009 NVIDIA Corporation
• 51. GPU Management & Monitoring
NVIDIA Systems Management Interface (nvidia-smi)
All GPUs: list of GPUs, product ID, GPU utilization, PCI address to device enumeration, exclusive use mode, ECC error count & location (Fermi only), GPU temperature
Server products: unit fan speeds, PSU voltage/current, LED state, serial number, firmware version
Use CUDA_VISIBLE_DEVICES to assign GPUs to a process
© 2009 NVIDIA Corporation
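`CUDA_VISIBLE_DEVICES` restricts which physical GPUs a process sees and remaps them to logical device IDs starting at 0. The environment variable is real; the parsing helper below is our illustrative sketch of that mapping, not an NVIDIA API:

```python
# Sketch of what CUDA_VISIBLE_DEVICES does: the listed physical GPUs become
# the process's logical devices 0, 1, ..., in the order given.
import os

def visible_devices(total_gpus):
    """Return physical GPU IDs in logical-device order."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return list(range(total_gpus))    # unset: all GPUs visible
    return [int(tok) for tok in raw.split(",") if tok.strip()]

os.environ["CUDA_VISIBLE_DEVICES"] = "2,0"
# Logical device 0 maps to physical GPU 2; logical device 1 to physical GPU 0.
print(visible_devices(4))  # [2, 0]
```

Setting the variable per process is how a scheduler can pin each job on a multi-GPU node to its own device.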
• 52. Bright Cluster Manager
Most advanced cluster management solution for GPU clusters. Includes:
 NVIDIA CUDA, OpenCL libraries and GPU drivers
 Automatic sampling of all available NVIDIA GPU metrics
 Flexible graphing of GPU metrics against time
 Visualization of GPU metrics in Rackview
 Powerful cluster automation, setting alerts, alarms and actions when GPU metrics exceed set thresholds
 Health checking framework based on GPU metrics
 Support for all Tesla GPU cards and GPU Computing Systems, including the most recent “Fermi” models
• 53. Symphony Architecture and GPU
(Diagram: Symphony clients (an Excel spreadsheet model via the COM API; C++, Java and .NET/C# client applications) submit work through the repository, session managers and GPU-aware service instance managers on the management hosts to x64 compute hosts with dual quad-core CPUs and GPU support, each running service instances against its GPUs, all on top of the EGO resource-aware orchestration layer.)
Copyright © 2010 Platform Computing Corporation. All Rights Reserved.
  • 55. Learning – Developer Resources © NVIDIA Corporation 2010
• 56. NVIDIA Developer Resources
DEVELOPMENT TOOLS: CUDA Toolkit (complete GPU computing development kit), cuda-gdb (GPU hardware debugging), Visual Profiler (GPU hardware profiler for CUDA C and OpenCL), Parallel Nsight (integrated development environment for Visual Studio), NVPerfKit (OpenGL|D3D performance tools), FX Composer (shader authoring IDE)
SDKs AND CODE SAMPLES: GPU Computing SDK (CUDA C, OpenCL, DirectCompute code samples and documentation), Graphics SDK (DirectX & OpenGL code samples), PhysX SDK (complete game physics solution), OpenAutomate (SDK for test automation)
VIDEO LIBRARIES: Video Decode Acceleration (NVCUVID, DXVA, Win7 MFT), Video Encode Acceleration (NVCUVENC, Win7 MFT), Post Processing (noise reduction / de-interlace / polyphase scaling / color process)
ENGINES & LIBRARIES: Math Libraries (CUFFT, CUBLAS, CUSPARSE, CURAND, …), NPP Image Libraries (performance primitives for imaging), App Acceleration Engines (optimized software modules for GPU acceleration), Shader Library (shader and post processing), Optimization Guides (best practices for GPU computing and graphics development)
http://developer.nvidia.com
• 57. 10 published books: 4 in Japanese, 3 in English, 2 in Chinese, 1 in Russian
• 59. GPU Computing Research & Education
World-class research leadership and teaching; proven research vision; quality GPGPU teaching. Two programs launched June 1st, one with 5 premiere centers and one with 7 premiere centers, with more in review.
Institutions include: University of Cambridge, Harvard University, University of Utah, University of Tennessee, Johns Hopkins University (USA), McMaster University (Canada), University of Maryland, Nanyang University (Singapore), Potsdam (USA), University of Illinois at Urbana-Champaign, Technical University of Ostrava (Czech), UNC-Charlotte (USA), Tsinghua University, CSIRO (Australia), Cal Poly San Luis Obispo (USA), Tokyo Institute of Technology, SINTEF (Norway), ITESM (Mexico), Chinese Academy of Sciences, Czech Technical University Prague (Czech), National Taiwan University, Qingdao University (China)
Premier academic partners: exclusive events, latest HW, discounts; teaching kits, discounts, training. NV Research: academic partnerships / fellowships, http://research.nvidia.com
350+ universities, supporting 100s of researchers around the globe every year
• 60. Thank You!
• 61. Database Application
Minimize network communication:
— Move the analysis “upstream” to stored procedures; treat each stored procedure like a “Basic Application” in the DB itself
— An app server could also be a “Basic Application”
— A dedicated client could also be a “Basic Application”
(Diagram: Client Application, Application Server, and Database Engine with stored procedures running on the CPU C Runtime and CUDA C Runtime, with CPU and GPU memories.)
Data mining, business intelligence, etc.