CUDA lab slides for the "parallel programming" course, by Shuai Yuan
online version:
http://yszheda.github.io/CUDA-lab
I made the slides as a part-time TA for the lab course.
The slides are generated by the great reveal.js.
6. __global__ void dot( float *a, float *b, float *c )
{
__shared__ float cache[threadsPerBlock];
int cacheIndex = threadIdx.x;
...
// set the cache values
cache[cacheIndex] = temp;
// synchronize threads in this block
__syncthreads();
...
}
int main( void )
{
...
dot<<<blocksPerGrid,threadsPerBlock>>>( d_a, d_b, d_c );
...
}
shared memory
7. • thread cooperation & shared memory are useful for reduction algorithms
• avoid race conditions by using __syncthreads()
• avoid bank conflicts
• every thread in the block needs to call __syncthreads()
keep in mind
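The reduction step elided on slide 6 can be completed along these lines (a minimal sketch in the classic dot-product style; `N` and `threadsPerBlock` are assumed to be defined constants, with `threadsPerBlock` a power of two):

```cuda
__global__ void dot( float *a, float *b, float *c )
{
    __shared__ float cache[threadsPerBlock];
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int cacheIndex = threadIdx.x;

    // each thread accumulates a partial dot product
    float temp = 0;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }

    // set the cache values
    cache[cacheIndex] = temp;
    // synchronize threads in this block
    __syncthreads();

    // tree-style reduction within the block: halve the active
    // threads each step, syncing between steps
    for (int i = blockDim.x / 2; i > 0; i /= 2) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
    }

    // thread 0 writes this block's partial sum
    if (cacheIndex == 0)
        c[blockIdx.x] = cache[0];
}
```

Note that `__syncthreads()` sits outside the `if (cacheIndex < i)` branch, so every thread in the block reaches it, as the bullets above require.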
9. __constant__ float constFloat;
__device__ float getConstFloat() { return constFloat; }
__global__ void addConstant(float *vec, int N)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i<N)
vec[i] += getConstFloat();
}
#include <cutil_inline.h>
int main( int argc, char** argv)
{
float constValue = 4.0f;
cutilSafeCall( cudaMemcpyToSymbol(constFloat,
&constValue,
sizeof(float), 0,
cudaMemcpyHostToDevice) );
...
}
constant mem.
10. • read-only, but conserves memory bandwidth
• a single read can be broadcast and cached for additional reads
• painfully slow when each thread reads a different address from constant memory
keep in mind
12. • read-only, like constant memory
• great when memory access exhibits spatial locality, i.e. each thread reads a location near where the next or previous thread reads
• comes in 1-D, 2-D and 3-D versions & is typically used in finite-difference apps
keep in mind
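A minimal 1-D texture sketch to go with these points; it uses the legacy texture-reference API (`texture<>`, `cudaBindTexture`, `tex1Dfetch`), which matches the cutil-era CUDA used in the other examples here (the API is deprecated in modern CUDA in favor of texture objects):

```cuda
// global texture reference bound to linear device memory
texture<float, 1, cudaReadModeElementType> texRef;

__global__ void copyViaTexture(float *out, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, i);  // cached, read-only fetch

}

int main(void)
{
    int n = 1024;
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    // bind the device buffer to the texture reference
    cudaBindTexture(NULL, texRef, d_in, n * sizeof(float));
    copyViaTexture<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaUnbindTexture(texRef);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```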
13. surface<void, 2> output_surface;
__global__ void surfaceWrite(float* g_idata, int width, int height) {
// calculate surface coordinates
unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
// read from global memory and write to cuarray (via surface reference)
surf2Dwrite(g_idata[y*width+x], output_surface, x*4, y, cudaBoundaryModeTrap);
}
int main( int argc, char** argv) {
...
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0,
cudaChannelFormatKindFloat);
cudaArray* cu_array;
cutilSafeCall( cudaMallocArray(&cu_array, &channelDesc, width, height,
cudaArraySurfaceLoadStore) );
cutilSafeCall( cudaMemcpy( d_data, h_data, size, cudaMemcpyHostToDevice) );
cutilSafeCall( cudaBindSurfaceToArray(output_surface, cu_array) );
surfaceWrite<<<dimGrid, dimBlock>>>(d_data, width, height);
...
cutilSafeCall( cudaFree(d_data) );
cutilSafeCall( cudaFreeArray(cu_array) );
}
surface mem.
15. // OpenGL Graphics includes
#include <GL/glew.h>
#if defined (__APPLE__) || defined(MACOSX)
#include <GLUT/glut.h>
#else
#include <GL/freeglut.h>
#endif
int main(int argc, char **argv) {
// Initialize GL
glutInit(&argc, argv);
glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);
glutInitWindowSize(1000, 1000);
// Create a window with rendering context and all else we need
glutCreateWindow("CUDA Interop.");
// initialize necessary OpenGL extensions
glewInit();
// Select CUDA device with OpenGL interoperability
if (cutCheckCmdLineFlag(argc, (const char**)argv, "device")) {
cutilGLDeviceInit(argc, argv);
}
else {
cudaGLSetGLDevice( cutGetMaxGflopsDeviceId() );
}
}
set device
16. // vbo variables
GLuint vbo;
struct cudaGraphicsResource *cuda_vbo_resource;
void *d_vbo_buffer = NULL;
// create buffer object
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
// initialize buffer object
unsigned int size = mesh_width * mesh_height * 4 * sizeof(float);
glBufferData(GL_ARRAY_BUFFER, size, 0, GL_DYNAMIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, 0);
// register this buffer object with CUDA
cutilSafeCall(cudaGraphicsGLRegisterBuffer(&cuda_vbo_resource, vbo,
cudaGraphicsMapFlagsWriteDiscard));
register data with CUDA
17. // map OpenGL buffer object for writing from CUDA
float4 *dptr;
cutilSafeCall( cudaGraphicsMapResources(1, &cuda_vbo_resource, 0) );
size_t num_bytes;
cutilSafeCall( cudaGraphicsResourceGetMappedPointer((void **)&dptr,
&num_bytes,
cuda_vbo_resource) );
// run kernel
kernel<<<blocks,threads>>>(dptr,...);
// unmap buffer object
cutilSafeCall( cudaGraphicsUnmapResources(1, &cuda_vbo_resource, 0) );
pass data via shared buffers
18. • need to tell the CUDA runtime which device we intend to use for CUDA and OpenGL
• initialize OpenGL first and then use the cudaGLSetGLDevice() method
• DirectX interop. is nearly identical
keep in mind
25. • creating and recording events is tricky since some CUDA calls are asynchronous
• all kernel launches are asynchronous
• instruct the CPU to synchronize on an event via cudaEventSynchronize()
keep in mind
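The points above can be sketched with a minimal timing example (`kernel`, `blocks`, `threads`, and `d_data` are placeholder names): record events around the asynchronous launch, then block the CPU with cudaEventSynchronize() before reading the elapsed time.

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<<<blocks, threads>>>(d_data);   // launch returns immediately
cudaEventRecord(stop, 0);

// wait until the GPU has actually reached the 'stop' event
cudaEventSynchronize(stop);
float elapsedMs;
cudaEventElapsedTime(&elapsedMs, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Reading the time without the synchronize would race with the still-running kernel, which is exactly the trickiness the bullets warn about.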
29. // Initialize the driver and create a context for the first device.
cuInit(0);
CUdevice device = new CUdevice(); cuDeviceGet(device, 0);
CUcontext context = new CUcontext(); cuCtxCreate(context, 0, device);
// Create the PTX file by calling the NVCC and load it
String ptxFileName = preparePtxFile("JCudaVectorAddKernel.cu");
CUmodule module = new CUmodule(); cuModuleLoad(module, ptxFileName);
// Obtain a function pointer to the "add" function.
CUfunction function = new CUfunction(); cuModuleGetFunction(function, module, "add");
// Allocate the device input data
float hostInputA[] = new float[numElements]; CUdeviceptr deviceInputA = new CUdeviceptr();
cuMemAlloc(deviceInputA, numElements * Sizeof.FLOAT);
cuMemcpyHtoD(deviceInputA, Pointer.to(hostInputA), numElements * Sizeof.FLOAT);
...
// Set up the kernel parameters
Pointer kernelParameters = Pointer.to(Pointer.to(deviceInputA),...);
// Call the kernel function
int blockSizeX = 256; int gridSizeX = (int)Math.ceil((double)numElements / blockSizeX);
cuLaunchKernel(function,
gridSizeX, 1, 1, // Grid dimension
blockSizeX, 1, 1, // Block dimension
0, null, // Shared memory size and stream
kernelParameters, null); // Kernel- and extra parameters
cuCtxSynchronize();
jcuda
33. cublasHandle_t handle;
cublasStatus_t status = cublasCreate(&handle);
float* h_A = (float*)malloc(N * N * sizeof(h_A[0]));
...
/* Fill the matrices with test data */
...
/* Allocate device memory for the matrices */
cudaMalloc((void**)&d_A, N * N * sizeof(d_A[0]));
...
/* Initialize the device matrices with the host matrices */
status = cublasSetVector(N * N, sizeof(h_A[0]), h_A, 1, d_A, 1);
...
/* Performs Sgemm: C <- alpha*A*B + beta*C */
status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
&alpha, d_A, N, d_B, N, &beta, d_C, N);
/* Allocate host mem & read back the result from device mem */
h_C = (float*)malloc(N * N * sizeof(h_C[0]));
status = cublasGetVector(N * N, sizeof(h_C[0]), d_C, 1, h_C, 1);
/* Memory clean up */
cudaFree(d_A);
...
/* Shutdown */
status = cublasDestroy(handle);
cublas
34. cudaSetDevice( cutGetMaxGflopsDeviceId() );
// Allocate & init. host memory for the signal
Complex* h_signal = (Complex*)malloc(sizeof(Complex) * SIGNAL_SIZE);
...
// Pad signal
Complex* h_padded_signal;
...
// Allocate device memory for signal
Complex* d_signal;
cutilSafeCall( cudaMalloc((void**)&d_signal, mem_size) );
// Copy host memory to device
cutilSafeCall( cudaMemcpy(d_signal, h_padded_signal, mem_size,
cudaMemcpyHostToDevice) );
// CUFFT plan
cufftHandle plan;
cufftSafeCall( cufftPlan1d(&plan, new_size, CUFFT_C2C, 1) );
// Transform signal
cufftSafeCall( cufftExecC2C(plan, (cufftComplex *)d_signal,
(cufftComplex *)d_signal, CUFFT_FORWARD) );
// Destroy CUFFT context
cufftSafeCall( cufftDestroy(plan) );
// Cleanup memory
cutilSafeCall( cudaFree(d_signal) );
...
cutilDeviceReset();
cufft
35. cusparseHandle_t handle = 0;
cusparseStatus_t status = cusparseCreate(&handle);
// create a matrix description for the matrix M
cusparseMatDescr_t descrM = 0; status = cusparseCreateMatDescr(&descrM);
cusparseSetMatType ( descrM, CUSPARSE_MATRIX_TYPE_TRIANGULAR );
cusparseSetMatIndexBase ( descrM, CUSPARSE_INDEX_BASE_ZERO );
cusparseSetMatDiagType ( descrM, CUSPARSE_DIAG_TYPE_NON_UNIT );
cusparseSetMatFillMode ( descrM, CUSPARSE_FILL_MODE_LOWER );
// create & perform analysis info for the non-trans & trans case
cusparseSolveAnalysisInfo_t info = 0, infoTrans = 0;
cusparseCreateSolveAnalysisInfo(&info);
cusparseCreateSolveAnalysisInfo(&infoTrans);
cusparseScsrsv_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, descrM,
d_valsICP, d_rowPtrsICP, d_colIndsICP, info);
cusparseScsrsv_analysis(handle, CUSPARSE_OPERATION_TRANSPOSE, N, descrM,
d_valsICP, d_rowPtrsICP, d_colIndsICP, infoTrans);
...
// Solve M z = H H^T z = r by first doing a forward solve: H y = r
cusparseScsrsv_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, 1.0, descrM,
d_valsICP, d_rowPtrsICP, d_colIndsICP, info, d_r, d_y);
// and then a back substitution: H^T z = y
cusparseScsrsv_solve(handle, CUSPARSE_OPERATION_TRANSPOSE, N, 1.0, descrM,
d_valsICP, d_rowPtrsICP, d_colIndsICP, infoTrans, d_y, d_z);
...
cusparseDestroy(handle);
cusparse
37. // declare a host image object for an 8-bit grayscale image
npp::ImageCPU_8u_C1 oHostSrc;
// load gray-scale image from disk
npp::loadImage(sFilename, oHostSrc);
// declare a device image and copy from the host image to the device
npp::ImageNPP_8u_C1 oDeviceSrc(oHostSrc);
// create struct with box-filter mask size
NppiSize oMaskSize = {5, 5};
// create struct with ROI size given the current mask
NppiSize oSizeROI = {oDeviceSrc.width() - oMaskSize.width + 1,
oDeviceSrc.height() - oMaskSize.height + 1};
// allocate device image of appropriately reduced size
npp::ImageNPP_8u_C1 oDeviceDst(oSizeROI.width, oSizeROI.height);
// set anchor point inside the mask to (0, 0)
NppiPoint oAnchor = {0, 0};
// run box filter
nppiFilterBox_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
oDeviceDst.data(), oDeviceDst.pitch(),
oSizeROI, oMaskSize, oAnchor);
// declare a host image for the result
npp::ImageCPU_8u_C1 oHostDst(oDeviceDst.size());
// and copy the device result data into it
oDeviceDst.copyTo(oHostDst.data(), oHostDst.pitch());
npp
40. // loop over full data, in bite-sized chunks
for (int i=0; i<FULL_DATA_SIZE; i+= N) {
// copy the locked memory to the device, async
cutilSafeCall( cudaMemcpyAsync(dev_a, host_a+i,
N * sizeof(int),
cudaMemcpyHostToDevice,
stream) );
cutilSafeCall( cudaMemcpyAsync(dev_b, host_b+i,
N * sizeof(int),
cudaMemcpyHostToDevice,
stream) );
kernel<<<N/256,256,0,stream>>>(dev_a, dev_b, dev_c);
// copy the data from device to locked memory
cutilSafeCall( cudaMemcpyAsync(host_c+i, dev_c,
N * sizeof(int),
cudaMemcpyDeviceToHost,
stream) );
}
// wait for all operations to finish
cutilSafeCall( cudaStreamSynchronize(stream) );
chunked computation
41. cudaStream_t *streamArray = 0;
streamArray = (cudaStream_t *)malloc(N * sizeof(cudaStream_t));
...
for ( int i = 0; i < N ; i++) {
cudaStreamCreate(&streamArray[i]);
...
}
...
for ( int i = 0; i < N ; i++) {
cublasSetMatrix (..., devPtrA[i], ...);
...
}
...
for ( int i = 0; i < N ; i++) {
cublasSetStream(handle, streamArray[i]);
cublasSgemm(handle, ..., devPtrA[i], devPtrB[i], devPtrC[i], ...);
}
cudaThreadSynchronize();
batched computation
42. • use streams to specify the order in which operations get executed asynchronously
• the idea is to use more than one stream
• requires a new kind of memory copy, which in turn requires pinned (page-locked) memory
• free pinned memory when it is no longer needed
keep in mind
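A minimal sketch of the pinned allocation these bullets refer to, with `dev_a` and `N` assumed allocated/defined as on slide 40: cudaMemcpyAsync needs a page-locked host buffer to actually overlap with other work, so the buffer comes from cudaHostAlloc rather than malloc.

```cuda
int *host_a;
cudaStream_t stream;
cudaStreamCreate(&stream);

// allocate pinned (page-locked) host memory
cudaHostAlloc((void **)&host_a, N * sizeof(int), cudaHostAllocDefault);

// queue an asynchronous host-to-device copy on the stream
cudaMemcpyAsync(dev_a, host_a, N * sizeof(int),
                cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);

// free pinned memory as soon as it is no longer needed
cudaFreeHost(host_a);
cudaStreamDestroy(stream);
```

Pinned memory is a scarce resource: over-allocating it can degrade overall system performance, hence the advice to free it promptly.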
43. // Allocate resources
for( int i =0; i<STREAM_COUNT; ++i ) {
cudaHostAlloc(&h_data_in[i], memsize, cudaHostAllocDefault);
cudaMalloc(&d_data_in[i], memsize);
...
}
int current_stream = 0;
// Do processing in a loop...
{
int next_stream = (current_stream + 1 ) % STREAM_COUNT;
// Ensure that processing and copying of the last cycle has finished
cudaEventSynchronize(cycleDone[next_stream]);
// Process current frame
kernel<<<grid, block, 0, stream[current_stream]>>>(d_data_out[current_stream],
d_data_in[current_stream],
N, ...);
// Upload next frame
cudaMemcpyAsync(d_data_in[next_stream], ..., cudaMemcpyHostToDevice,
stream[next_stream]);
// Download current frame
cudaMemcpyAsync(h_data_out[current_stream], ..., cudaMemcpyDeviceToHost,
stream[current_stream]);
cudaEventRecord(cycleDone[current_stream], stream[current_stream]);
current_stream = next_stream;
}
overlap kernel exec. & memcpy
44. • devices with CC 1.1 and above can overlap a kernel execution & memcpy as long as they are issued from different streams
• kernels are serialized
• queue work in a way that independent streams can execute in parallel
keep in mind
46. float *a, *d_a;
...
/* Allocate mapped CPU memory. */
cutilSafeCall( cudaHostAlloc((void **)&a, bytes, cudaHostAllocMapped) );
...
/* Initialize the vectors. */
for(n = 0; n < nelem; n++) { a[n] = rand() / (float)RAND_MAX; ... }
/* Get the device pointers for the pinned CPU memory mapped into the GPU
memory space. */
cutilSafeCall( cudaHostGetDevicePointer((void **)&d_a, (void *)a, 0) );
...
/* Call the GPU kernel using the device pointers for the mapped memory. */
...
kernel<<<grid, block>>>(d_a, d_b, d_c, nelem);
...
/* Memory clean up */
cutilSafeCall( cudaFreeHost(a) );
...
zero-copy host memory
47. //Create streams for issuing GPU command asynchronously and allocate memory
for(int i = 0; i < GPU_N; i++) {
cutilSafeCall( cudaStreamCreate(&stream[i]) );
cutilSafeCall( cudaMalloc((void**)&d_Data[i], dataN * sizeof(float)) );
cutilSafeCall( cudaMallocHost((void**)&h_Data[i], dataN * sizeof(float)) );
//init h_Data
}
//Copy data to GPU, launch the kernel and copy data back. All asynchronously
for(int i = 0; i < GPU_N; i++) {
//Set device
cutilSafeCall( cudaSetDevice(i) );
// Copy input data from CPU
cutilSafeCall( cudaMemcpyAsync(d_Data[i], h_Data[i], dataN * sizeof(float),
cudaMemcpyHostToDevice, stream[i]) );
// Perform GPU computations
kernel<<<blocks, threads, 0, stream[i]>>>(...);
// Copy back the result
cutilSafeCall( cudaMemcpyAsync(h_Sum_from_device[i], d_Sum[i],
ACCUM_N * sizeof(float),
cudaMemcpyDeviceToHost, stream[i]) );
}
streams
48. // Process GPU results
for(i = 0; i < GPU_N; i++) {
// Set device
cutilSafeCall( cudaSetDevice(i) );
// Wait for all operations to finish
cudaStreamSynchronize(stream[i]);
// Shut down this GPU
cutilSafeCall( cudaFreeHost(h_Data[i]) );
cutilSafeCall( cudaFree(d_Data[i]) );
cutilSafeCall( cudaStreamDestroy(stream[i]) );
}
// shutdown
for(int i = 0; i < GPU_N; i++) {
cutilSafeCall( cudaSetDevice(i) );
cutilDeviceReset();
}
process the result
49. • can also control each GPU by a separate CPU thread
• need to allocate portable pinned memory if a different thread needs access to one thread's memory
• pass the flag cudaHostAllocPortable to cudaHostAlloc()
keep in mind
50. // Initialize MPI state
MPI_CHECK( MPI_Init(&argc, &argv) );
// Get our MPI node number and node count
int commSize, commRank;
MPI_CHECK( MPI_Comm_size(MPI_COMM_WORLD, &commSize) );
MPI_CHECK( MPI_Comm_rank(MPI_COMM_WORLD, &commRank) );
if(commRank == 0) {// Are we the root node?
//initialize dataRoot...
}
// Allocate a buffer on each node
float * dataNode = new float[dataSizePerNode];
// Dispatch a portion of the input data to each node
MPI_CHECK( MPI_Scatter(dataRoot, dataSizePerNode, MPI_FLOAT, dataNode,
dataSizePerNode, MPI_FLOAT, 0, MPI_COMM_WORLD) );
// if commRank == 0 then free dataRoot...
kernel<<<gridSize, blockSize>>>(dataNode, ...);
// Reduction to the root node
float sumNode = sum(dataNode, dataSizePerNode);
float sumRoot;
MPI_CHECK( MPI_Reduce(&sumNode, &sumRoot, 1, MPI_FLOAT, MPI_SUM, 0,
MPI_COMM_WORLD) );
MPI_CHECK( MPI_Finalize() );
mpi + cuda
51. // Enable peer access
cutilSafeCall(cudaSetDevice(gpuid_tesla[0]));
cutilSafeCall(cudaDeviceEnablePeerAccess(gpuid_tesla[1], gpuid_tesla[0]));
...
// Allocate buffers
cudaSetDevice(gpuid_tesla[0]); cudaMalloc(&g0, buf_size);
cudaSetDevice(gpuid_tesla[1]); cudaMalloc(&g1, buf_size);
// Ping-pong copy between GPUs
cudaMemcpy(g1, g0, buf_size, cudaMemcpyDefault);
// Prepare host buffer and copy to GPU 0
cudaSetDevice(gpuid_tesla[0]); cudaMemcpy(g0, h0, buf_size, cudaMemcpyDefault);
// Run kernel on GPU 1, reading input from the GPU 0 buffer, writing
// output to the GPU 1 buffer: dst[idx] = src[idx] * 2.0f
cudaSetDevice(gpuid_tesla[1]); kernel<<<blocks, threads>>>(g0, g1);
cutilDeviceSynchronize();
// Disable peer access (also unregisters memory for non-UVA cases)
cudaSetDevice(gpuid_tesla[0]); cudaDeviceDisablePeerAccess(gpuid_tesla[1]);
cudaSetDevice(gpuid_tesla[1]); cudaDeviceDisablePeerAccess(gpuid_tesla[0]);
cudaFree(g0);
...
P2P & unified virtual address space