Conflux: GPGPU for .NET (en)


Conflux provides a parallel programming framework for using CPUs and GPUs together as components of an integrated computing system. Conflux adopts the well-known kernel-based architecture that is compatible with CUDA,

Published in: Technology

  1. CONFLUX: GPGPU FOR .NET
     Eugene Burmako, 2010
  2. Videocards: state of the art
     Hardware – tens to hundreds of ALUs clocked at ~1 GHz
     Peak performance – 1 SP TFLOPS, > 100 DP GFLOPS
     API – random memory access, data structures, pointers, subroutines
     API maturity – nearly four years, several generations of graphics processors
  3. Videocards: programmer’s PoV
     Modern GPU programming models (CUDA, AMD Stream, OpenCL, DirectCompute):
     A parallel algorithm is defined by a pair: 1) a kernel (one loop iteration), 2) the iteration bounds.
     The kernel is compiled by the driver.
     The iteration bounds are used to create a grid of threads.
     Input data is copied to video memory.
     Execution gets kicked off.
     The result is copied back to main memory.
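The kernel-plus-iteration-bounds model described on slide 3 can be illustrated on the CPU. The sketch below is not a GPU API: `ThreadCtx`, `launch` and `scaleOnDevice` are hypothetical names introduced only for illustration, the "device" buffer is an ordinary vector, and the threads run one after another rather than in parallel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the coordinates a GPU thread sees in a 1D grid.
struct ThreadCtx {
    std::size_t blockIdx;   // index of the block within the grid
    std::size_t blockDim;   // number of threads per block
    std::size_t threadIdx;  // index of the thread within its block
    std::size_t globalIdx() const { return blockIdx * blockDim + threadIdx; }
};

// Runs `kernel` once for every thread of a gridDim x blockDim grid.
// A GPU would execute these iterations in parallel; here they are sequential.
template <typename Kernel>
void launch(std::size_t gridDim, std::size_t blockDim, Kernel kernel) {
    for (std::size_t b = 0; b < gridDim; ++b)
        for (std::size_t t = 0; t < blockDim; ++t)
            kernel(ThreadCtx{b, blockDim, t});
}

// Mirrors the steps on the slide: copy in, build the grid, run, copy out.
std::vector<float> scaleOnDevice(const std::vector<float>& host, float factor) {
    std::vector<float> device(host);  // stands in for the copy to video memory
    const std::size_t n = device.size();
    const std::size_t blockDim = 4;   // arbitrary block size for the sketch
    const std::size_t gridDim = (n + blockDim - 1) / blockDim;  // iteration bounds -> grid
    launch(gridDim, blockDim, [&](ThreadCtx ctx) {
        const std::size_t i = ctx.globalIdx();
        if (i < n) device[i] *= factor;  // the kernel: one loop iteration
    });
    return device;                    // stands in for the copy back to main memory
}
```

The point of the split is that the kernel body never mentions the loop: the grid shape alone decides how many times, and for which indices, it runs.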
  4. Example: SAXPY via CUDA
         __global__ void Saxpy(int n, float a, float* X, float* Y)
         {
             int i = blockDim.x * blockIdx.x + threadIdx.x;
             if (i < n) Y[i] = a * X[i] + Y[i];  // guard: the grid may overshoot n
         }
         // X, Y are device buffers; hX, hY are host arrays of N floats.
         cudaMemcpy(X, hX, N * sizeof(float), cudaMemcpyHostToDevice);
         cudaMemcpy(Y, hY, N * sizeof(float), cudaMemcpyHostToDevice);
         Saxpy<<<(N + 255) / 256, 256>>>(N, a, X, Y);  // <<<blocks, threads per block>>>
         cudaMemcpy(hY, Y, N * sizeof(float), cudaMemcpyDeviceToHost);
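The expression `(N + 255) / 256` in the SAXPY launch is a ceiling division: it yields enough 256-thread blocks to cover N elements, which usually means a few surplus threads in the last block. The CPU reference below makes that visible. `blocksFor` and `saxpyGrid` are hypothetical helper names, and the loops only imitate the grid; nothing here runs on a GPU.

```cpp
#include <cstddef>
#include <vector>

// Smallest number of blocks such that blocks * threadsPerBlock >= n.
std::size_t blocksFor(std::size_t n, std::size_t threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// CPU reference for y = a * x + y that walks every thread of the launched grid.
// Assumes y holds at least x.size() elements and threadsPerBlock > 0.
// Returns the number of launched threads that had no element to process.
std::size_t saxpyGrid(float a, const std::vector<float>& x, std::vector<float>& y,
                      std::size_t threadsPerBlock) {
    const std::size_t n = x.size();
    const std::size_t blocks = blocksFor(n, threadsPerBlock);
    std::size_t idle = 0;
    for (std::size_t block = 0; block < blocks; ++block) {
        for (std::size_t thread = 0; thread < threadsPerBlock; ++thread) {
            const std::size_t i = block * threadsPerBlock + thread;
            if (i >= n) {  // surplus thread in the last block: must not touch memory
                ++idle;
                continue;
            }
            y[i] = a * x[i] + y[i];
        }
    }
    return idle;
}
```

For N = 1000 and 256 threads per block this launches 4 blocks, that is 1024 threads, so 24 of them fall outside the arrays. A kernel without a bounds check would let those threads read and write out of range.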
  5. Hot question
  6. Official answer
  7. In fact
     Brahma:
       Data structures: data parallel array.
       Computations: C# expressions, LINQ combinators.
     Accelerator v2:
       Data structures: data parallel array.
       Computations: arithmetic operators, a number of predefined functions.
     This does the trick for a lot of algorithms. But what if we’ve got branching or irregular memory access?
  8. Example: CUDA interop
         saxpy = @"__global__ void Saxpy(float a, float* X, float* Y)
         {
             int i = blockDim.x * blockIdx.x + threadIdx.x;
             Y[i] = a * X[i] + Y[i];
         }";
         nvcuda.cuModuleLoadDataEx(saxpy);
         nvcuda.cuMemcpyHtoD(X, Y);
         nvcuda.cuParamSeti(a, X, Y);
         nvcuda.cuLaunchGrid(256, (N + 255) / 256);
         nvcuda.cuMemcpyDtoH(Y);
  9. Conflux
     Kernels are written in C#: data structures, local variables, branching, loops.
         float a;
         float[] x;
         [Result] float[] y;
         var i = GlobalIdx.X;
         y[i] = a * x[i] + y[i];
  10. Conflux
      Avoids explicit interop with unmanaged code and lets the programmer use native .NET data types.
          float[] x, y;
          var cfg = new CudaConfig();
          var kernel = cfg.Configure<Saxpy>();
          y = kernel.Execute(a, x, y);
  11. How does it work?
      Front end: decompiles C#.
      AST transformer: inlines calls, destructures classes and arrays, maps intrinsics.
      Back end: generates PTX (NVIDIA GPU assembler) and/or multicore IL.
      Interop: binds to the nvcuda driver, which is capable of executing GPU assembler.
  12. Current progress
      Proof of concept.
      Capable of running the hello world of parallel computing: matrix multiplication.
      If we don’t take into account the (currently) high overhead incurred by JIT compilation, the idea works well even with a naïve code generator: 1x CPU < 2x CPU << GPU.
      Triple license: AGPL, exception for OSS projects, commercial.
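For readers unfamiliar with the matrix multiplication "hello world", the usual naive formulation assigns one thread to each element of the result. The sketch below shows that decomposition on the CPU. It is an illustration of the general technique, not Conflux's generated code; `matMulKernel` and `matMul` are hypothetical names, and matrices are assumed square and stored row-major.

```cpp
#include <cstddef>
#include <vector>

// Work of a single "thread": the dot product that yields c[row][col].
// All matrices are n x n, stored row-major in flat vectors.
void matMulKernel(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& c, std::size_t n,
                  std::size_t row, std::size_t col) {
    float sum = 0.0f;
    for (std::size_t k = 0; k < n; ++k)
        sum += a[row * n + k] * b[k * n + col];
    c[row * n + col] = sum;
}

// Iterates over the n x n "grid"; on a GPU each (row, col) pair would be
// handled by its own thread instead of by these two loops.
std::vector<float> matMul(const std::vector<float>& a,
                          const std::vector<float>& b, std::size_t n) {
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t row = 0; row < n; ++row)
        for (std::size_t col = 0; col < n; ++col)
            matMulKernel(a, b, c, n, row, col);
    return c;
}
```

Every output element is independent of the others, which is why even a naive code generator can spread the work across many GPU threads.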
  13. Demo
  14. Future work
      GPU-specific optimizations (e.g. diagonal stripes for optimizing the bandwidth utilization of matrix transposition).
      Polyhedral model for loop nest optimization (it can be configured to fit specific levels and sizes of the memory hierarchy; there exist GPU-specific linear heuristics that optimize spatial and temporal locality).
      Distributed execution (a new level of the memory hierarchy if we use the polyhedral model).
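To give a feel for what fitting a loop nest to the memory hierarchy means, here is a hand-written tiled matrix transposition. It is only a simple example of loop tiling, the kind of transformation a loop nest optimizer might derive. It is neither the diagonal-stripe scheme nor a polyhedral framework, and `transposeTiled` and its `tile` parameter are hypothetical names chosen for this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Transposes a row-major rows x cols matrix one tile at a time, so that the
// elements read and written together stay close in memory.
// Assumes tile > 0 and in.size() == rows * cols.
std::vector<float> transposeTiled(const std::vector<float>& in,
                                  std::size_t rows, std::size_t cols,
                                  std::size_t tile) {
    std::vector<float> out(rows * cols);
    for (std::size_t r0 = 0; r0 < rows; r0 += tile) {
        for (std::size_t c0 = 0; c0 < cols; c0 += tile) {
            // Clamp the tile at the matrix edges.
            const std::size_t r1 = std::min(r0 + tile, rows);
            const std::size_t c1 = std::min(c0 + tile, cols);
            for (std::size_t r = r0; r < r1; ++r)
                for (std::size_t c = c0; c < c1; ++c)
                    out[c * rows + r] = in[r * cols + c];  // out is cols x rows
        }
    }
    return out;
}
```

The tile size is the knob that would be matched to a particular level of the hierarchy, for example a cache on the CPU or shared memory on a GPU.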
  15. Conclusion
      Conflux: GPGPU for .NET