Cuda meetup presentation 5

CUDA PARALLEL
PROGRAMMING AND
GAMING MEETUP
RIGA, LATVIA, JANUARY 12, 2015

Rihards Gailums
Twitter: @RihardsGailums
rihards.gailums@rhtu.edu.lv

• Personal Introductions;
• CES Nvidia news;
• Jetson TK1;
• Announcements;
• Summary
Agenda

Who am I
What I am doing
What is my parallel programming
[CUDA] experience
Introduction

Exponential Technologies
Drones AUTOMOTIVE 3D Printing Artificial intelligence
Biotechnology Design ROBOTICS ComputerVision

The Startup
University
Latvia Tech and
Entrepreneurship
meetup network

Riga BioTechnology Meetup
UI/UX Riga Meetup
Riga Drone Meetup
Riga Mobile App Developer
Meetup
Kick-off Meetup
December 12, THE Mill
December Meetup
Drone Kick-off Meetup
Kick-off Meetup

Riga Startup: Idea to IPO
CUDA parallel programming and
gaming meetup Riga
3D Printing Riga
Meetup
Bitcoin and Cryptocurrencies
Meetup
Find your Co-founder
December CUDA Meetup
January 3D Printing
January 8, RTU Design
Factory
January Meetup
January 26, THE Mill

Technology and Entrepreneurship education

Parallel computing
Research Center
CUDA LABORATORY

http://www.anandtech.com/show/7905/nvidia-announces-jetson-tk1-dev-board-adds-erista-to-tegra-roadmap

https://www.youtube.com/watch?v=aw01HwTN1MM

Login Credentials
Username: ubuntu
Password: ubuntu

Install the NVIDIA Linux driver binary release on your target located in:
${HOME}/NVIDIA-INSTALLER
Step 1)
Change directories into the NVIDIA installation directory:
cd ${HOME}/NVIDIA-INSTALLER
Step 2)
Run the installer script to extract and install the Linux driver binary release:
sudo ./installer.sh
Step 3)
Reboot the system to have the graphical desktop UI come up.

CUDA SDK Demo Samples
• Particles
• Nbody
• Smokeparticles
• waves

Jetson/Installing CUDA
http://elinux.org/Jetson/Installing_CUDA

Jetson/Installing OpenCV
http://elinux.org/Jetson/Installing_OpenCV

Open CV
http://docs.opencv.org/doc/tutorials/tutorials.html

http://docs.opencv.org/doc/tutorials/objdetect/cascade_classifier/cascade_classifier.html#cascade-classifier
Object Detection

Jetson TEGRA TK1
Tegra K1 SOC
• Kepler GPU with 192 CUDA cores
• 4-Plus-1 quad-core ARM Cortex A15 CPU
• 2 GB x16 memory with 64 bit width
• 16 GB 4.51 eMMC memory
• 1 Half mini-PCIE slot
• 1 Full size SD/MMC connector
• 1 Full-size HDMI port
• 1 USB 2.0 port, micro AB
• 1 USB 3.0 port, A
• 1 RS232 serial port
• 1 ALC5639 Realtek Audio codec with Mic in and Line out
• 1 RTL8111GS Realtek GigE LAN
• 1 SATA data port
• SPI 4MByte boot flash

• The IT industry is experiencing growing demand for high-resolution display surfaces
• Use cases for such surfaces include satellite and map data, x-ray and microscope images, multimedia, CCTV, etc.
• Existing solutions are not scalable, do not offer hardware abstraction, and suffer from wiring limitations
Proposed Virtual Machine Based Monitor Wall Architecture
Introduction Scalability
Conclusions
• The current experiments show that this architecture is very feasible
for non FPS intensive use cases where the display wall can be driven
by a single physical GPU
• The total resolution provided by this architecture, even using the
currently available compression technology, greatly exceeds the
resolutions of existing solutions; the resolution would be expected
to grow in the future
• The architecture itself scales very well; it is limited mainly by OS
support for multiple monitors (this can be overcome by simulating
a single high resolution display in the virtual machine that spans the
whole resolution of the physical wall) and the possibility to stack
multiple GPUs in the host system
• Future work should focus on the ability to virtualize OpenGL and
Direct3D to remove the advantages of non-virtualized architectures
[Figure: block diagrams of the traditional architectures; labels: OS, GPU, Monitor (four per diagram), Splitter / Scaler]
Currently there are two main alternatives to be used as the H.264 encoder in this architecture: Intel Quick Sync and NVENC. NVENC is more feasible because:
• The total encoding power can be increased by stacking up multiple GPUs that support NVENC without penalties, while not all Intel Quick Sync GPUs have built-in video memory, so scaling these cards introduces a performance penalty
of using system memory
• NVENC does not put any limitations on other components of the system, while Intel Quick Sync supports a limited number of CPUs
• Current benchmarks seem to show that the overall FPS performance for a single GPU (which is the main criterion for this architecture) is better for NVENC than for Intel Quick Sync
Why NVENC?
Pro: OS can natively manage the displays
Con: Power consumption; supported monitor count limited by the output count of the GPUs and
the expansion slots for the GPUs on the motherboard; deployment is limited by wiring
Pro: Software complexity is reduced since it does not have to be multiple monitor aware
Con: Small resolution and DPI; visualization is not displayed in its native resolution
Con: Expensive
Currently Popular Monitor Wall Architectures
Pro: Scalable; host machine can run multiple virtual machines; multiple
virtualized GPUs map to physical GPUs to maximize efficiency
Pro: LAN connection to the display wall removes wire length
limitations forced by DVI/HDMI cables
Pro: Total resolution of the wall goes beyond the ones that can be
achieved using physical hardware
Con: Lossy compression
Con: No Direct3D, OpenGL support
• The host machine collects the framebuffer data from the virtual machine GPUs and performs H.264 encoding of the video stream on the physical host
GPU; thus the architecture heavily relies on a fast hardware H.264 encoder, allowing the hosted virtual machines to fully use the CPU
• Non FPS intensive use cases allow a great number of virtual monitors to be hosted on a single physical GPU, thus reducing the power consumption
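The graph demonstrates the scalability possibilities in terms of the
possible maximal amount of connected monitors for the traditional
architecture versus the proposed one on a Quadro K4000 card that has 4
outputs.
[Chart: maximum number of connected monitors; axis ticks 0, 50, 100; series labels: "Using Video…", "Maximum"; bar values not recoverable]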
Display wall architecture where each output of the GPU maps to a tile on the display wall
Display wall architecture where each output of the GPU is split/upscaled among the tiles on the display wall
[Figure: block diagram; labels: Host Machine, Virtual Machine, 16 GPU blocks, H.264/RTP/LAN, GPU]
The proposed display wall architecture where each output of a virtual GPU maps to a
tile on the display wall and is transmitted as an H.264 stream over LAN
Virtual machine based monitor wall running Google Maps inside the Chrome web browser on 16 tiles at 1920x1080 pixels each, giving a total
resolution of 32 megapixels
Each tile has a dedicated LAN connection and H.264 decoder
Scalability of supported monitor count
NVENC Based H.264 Encoding for Virtual
Machine Based Monitor Wall Architecture
R. Bundulis (rudolfs.bundulis@lu.lv), G. Arnicans (guntis.arnicans@lu.lv), and R. Gailums (rihards.gailums@rhtu.edu.lv)
University of Latvia / Riga High Tech University, Latvia

For Startups by Meetup members:
$1800 per year of FREE Azure cloud services
Free Microsoft software and tools

Latvijas Garantiju Aģentūra
Government corporation which supports Latvian entrepreneurs
and helps them realise their business ideas.

GPU CPU
Add GPUs: Accelerate Science Applications
© NVIDIA 2013

Small Changes, Big Speed-up
Application Code
+
GPU CPU
Use GPU to
Parallelize
Compute-Intensive
Functions
Rest of Sequential
CPU Code

Fastest Performance on Scientific Applications
Tesla K20X Speed-Up over Sandy Bridge CPUs
CPU results: Dual socket E5-2687w, 3.10 GHz, GPU results: Dual socket E5-2687w + 2 Tesla K20X GPUs
*MATLAB results comparing one i7-2600K CPU vs with Tesla K20 GPU
Disclaimer: Non-NVIDIA implementations may not have been fully optimized
[Bar chart: speed-up axis from 0.0x to 20.0x; bar values not recoverable]
Engineering: MATLAB (FFT)*
Physics: Chroma
Earth Science: SPECFEM3D
Molecular Dynamics: AMBER

Why Computing Perf/Watt Matters?
Traditional CPUs are
not economically feasible
2.3 PFlops: 7.0 Megawatts
7000 homes: 7.0 Megawatts
CPU
Optimized for
Serial Tasks
GPU Accelerator
Optimized for Many
Parallel Tasks
10x performance/socket
> 5x energy efficiency
Era of GPU-accelerated
computing is here

World’s Fastest, Most Energy Efficient Accelerator
[Scatter plot: axes SGEMM (TFLOPS) and DGEMM (TFLOPS), tick ranges 0.0 to 3.0 and 0.0 to 1.5; point values not recoverable]
Plotted: Tesla K20X, Tesla K20, Xeon CPU E5-2690, Xeon Phi 225W
Tesla K20X vs Xeon CPU
8x Faster SGEMM
6x Faster DGEMM
Tesla K20X vs Xeon Phi
90% Faster SGEMM
60% Faster DGEMM

Introduction to the
CUDA Platform

CUDA Parallel Computing Platform
Programming Approaches
• Libraries: “Drop-in” Acceleration
• OpenACC Directives: Easily Accelerate Apps
• Programming Languages: Maximum Flexibility
Development Environment
• Nsight IDE: Linux, Mac and Windows; GPU Debugging and Profiling
• CUDA-GDB debugger
• NVIDIA Visual Profiler
Open Compiler Tool Chain
• Enables compiling new languages to CUDA platform, and CUDA languages to other architectures
Hardware Capabilities
• SMX
• Dynamic Parallelism
• HyperQ
• GPUDirect
www.nvidia.com/getcuda

3 Ways to Accelerate Applications
• Libraries: “Drop-in” Acceleration
• OpenACC Directives: Easily Accelerate Applications
• Programming Languages: Maximum Flexibility

Libraries: Easy, High-Quality
Acceleration
• Ease of use: Using libraries enables GPU acceleration without in-depth
knowledge of GPU programming
• “Drop-in”: Many GPU-accelerated libraries follow standard APIs, thus
enabling acceleration with minimal code changes
• Quality: Libraries offer high-quality implementations of functions
encountered in a broad range of applications
• Performance: NVIDIA libraries are tuned by experts

Some GPU-accelerated Libraries
• NVIDIA cuBLAS
• NVIDIA cuRAND
• NVIDIA cuSPARSE
• NVIDIA NPP
• NVIDIA cuFFT
• Vector Signal Image Processing
• GPU Accelerated Linear Algebra
• Matrix Algebra on GPU and Multicore
• C++ STL Features for CUDA
• IMSL Library
• Building-block Algorithms for CUDA
• ArrayFire Matrix Computations
• Sparse Linear Algebra

3 Steps to CUDA-accelerated
application
• Step 1: Substitute library calls with equivalent CUDA library calls
saxpy ( … ) → cublasSaxpy ( … )
• Step 2: Manage data locality
- with CUDA: cudaMalloc(), cudaMemcpy(), etc.
- with CUBLAS: cublasAlloc(), cublasSetVector(), etc.
• Step 3: Rebuild and link the CUDA-accelerated library
nvcc myobj.o -lcublas

Explore the CUDA (Libraries) Ecosystem
• CUDA Tools and Ecosystem
described in detail on NVIDIA
Developer Zone:
developer.nvidia.com/cuda-tools-ecosystem

OpenACC Directives
Program myscience
   ... serial code ...
!$acc kernels
   do k = 1,n1
      do i = 1,n2
         ... parallel code ...
      enddo
   enddo
!$acc end kernels
   ...
End Program myscience
CPU GPU
Your original Fortran or C code
Simple Compiler hints
OpenACC compiler hint
Compiler Parallelizes code
Works on many-core GPUs & multicore CPUs

• Easy: Directives are the easy path to accelerate
compute intensive applications
• Open: OpenACC is an open GPU directives standard,
making GPU programming straightforward and
portable across parallel and multi-core processors
• Powerful: GPU Directives allow complete access to the
massive parallel power of a GPU
OpenACC
The Standard for GPU Directives

Directives: Easy & Powerful
• Real-Time Object Detection (Global Manufacturer of Navigation Systems): 5x in 40 Hours
• Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company): 2x in 4 Hours
• Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 Hours
“Optimizing code with directives is quite easy, especially compared to CPU threads or writing
CUDA kernels. The most important thing is avoiding restructuring of existing code for
production applications.”
-- Developer at the Global Manufacturer of Navigation Systems

Start Now with OpenACC Directives
Free trial license to PGI
Accelerator
Tools for quick ramp
www.nvidia.com/gpudirectives
Sign up for a free trial of
the directives compiler
now!

GPU Programming Languages
Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: Thrust, CUDA C++
Python: PyCUDA, Copperhead
F#: Alea.cuBase
Numerical analytics: MATLAB, Mathematica, LabVIEW

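#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main() {
    // generate 32M random numbers on host
    thrust::host_vector<int> h_vec(32 << 20);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);
    // transfer data to device (GPU)
    thrust::device_vector<int> d_vec = h_vec;
    // sort data on device
    thrust::sort(d_vec.begin(), d_vec.end());
    // transfer data back to host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    return 0;
}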
Rapid Parallel C++ Development
• Resembles C++ STL
• High-level interface
• Enhances developer
productivity
• Enables performance
portability between GPUs and
multicore CPUs
• Flexible
• CUDA, OpenMP, and TBB
backends
• Extensible and customizable
• Integrates with existing
software
• Open source
http://developer.nvidia.com/thrust or http://thrust.googlecode.com

Learn More
These languages are supported on all CUDA-capable GPUs.
You might already have a CUDA-capable GPU in your laptop
or desktop PC!
CUDA C/C++
http://developer.nvidia.com/cuda-toolkit
Thrust C++ Template Library
http://developer.nvidia.com/thrust
CUDA Fortran
http://developer.nvidia.com/cuda-toolkit
GPU.NET
http://tidepowerd.com
PyCUDA (Python)
http://mathema.tician.de/software/pycuda
MATLAB
http://www.mathworks.com/discovery/matlab-gpu.html
Mathematica
http://www.wolfram.com/mathematica/new-in-8/cuda-and-opencl-support/

Getting Started
• Download CUDA Toolkit & SDK: www.nvidia.com/getcuda
• Nsight IDE (Eclipse or Visual Studio): www.nvidia.com/nsight
• Programming Guide/Best Practices:
• docs.nvidia.com
• Questions:
• NVIDIA Developer forums: devtalk.nvidia.com
• Search or ask on: www.stackoverflow.com/tags/cuda
• General: www.nvidia.com/cudazone

https://www.youtube.com/watch?v=IzU4AVcMFys
Intro to CUDA - An introduction, how-to, to NVIDIA's GPU parallel
Mythbusters Demo GPU versus CPU: http://youtu.be/-P28LKWTzrI

NVIDIA GTX 750Ti
• Nvidia MAXWELL technology
• Cost – 170 USD
• Only 60 W of power, no dedicated power connections
• 250 MHash/s
vs.
• Nvidia GTX 780 – 350 MHash/s + power consumption
• Nvidia TESLA K40 – 560 MHash/s + power consumption

Latvian CUDA & parallel programming
ecosystem
Next meetups, frequency
Speakers
Topics
Group marketing channels
