Conflux: gpgpu for .net (en)
Transcript

  • 1. CONFLUX: GPGPU FOR .NET
    Eugene Burmako, 2010
  • 2. Videocards: state of the art
    Hardware – tens to hundreds of ALUs clocked at ~1 GHz
    Peak performance – 1 SP TFLOPS, > 100 DP GFLOPS
    API – random memory access, data structures, pointers, subroutines
    API maturity – nearly four years, several generations of graphics processors
  • 3. Videocards: programmer’s PoV
    Modern GPU programming models (CUDA, AMD Stream, OpenCL, DirectCompute):
    Parallel algorithm is defined by the pair: 1) kernel (loop iteration), 2) iteration bounds.
    Kernel is compiled by the driver.
    Iteration bounds are used to create grid of threads.
    Input data is copied to video memory.
    Execution gets kicked off.
    Result is copied to main memory.
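The split above can be illustrated with a plain C# sketch (not GPU code): the "kernel" is the loop body, parameterized by an index, and the loop bounds play the role of the thread grid a GPU would launch.

```csharp
using System;

class SaxpyCpu
{
    // The "kernel": one loop iteration, parameterized by index i.
    public static void SaxpyKernel(int i, float a, float[] x, float[] y)
    {
        y[i] = a * x[i] + y[i];
    }

    static void Main()
    {
        float a = 2.0f;
        float[] x = { 1, 2, 3, 4 };
        float[] y = { 10, 20, 30, 40 };

        // The "iteration bounds": on a GPU this loop becomes a grid of threads.
        for (int i = 0; i < x.Length; i++)
            SaxpyKernel(i, a, x, y);

        Console.WriteLine(string.Join(", ", y)); // 12, 24, 36, 48
    }
}
```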
  • 4. Example: SAXPY via CUDA
    __global__ void Saxpy(float a, float* X, float* Y)
    {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    Y[i] = a * X[i] + Y[i];
    }
    cudaMemcpy(X, hX, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(Y, hY, N * sizeof(float), cudaMemcpyHostToDevice);
    Saxpy<<<(N + 255) / 256, 256>>>(a, X, Y);
    cudaMemcpy(hY, Y, N * sizeof(float), cudaMemcpyDeviceToHost);
  • 5. Hot question
  • 6. Official answer
  • 7. In fact
    Brahma:
    Data structures: data parallel array.
    Computations: C# expressions, LINQ combinators.
    Accelerator v2:
    Data structures: data parallel array.
    Computations: arithmetic operators, a number of predefined functions.
    This does the trick for many algorithms. But what if we’ve got branching or irregular memory access?
  • 8. Example: CUDA interop
    saxpy = @”__global__ void Saxpy(float a, float* X, float* Y)
    {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    Y[i] = a * X[i] + Y[i];
    }”;
    nvcuda.cuModuleLoadDataEx(saxpy);
    nvcuda.cuMemcpyHtoD(X, Y);
    nvcuda.cuParamSeti(a, X, Y);
    nvcuda.cuLaunchGrid(256, (N + 255) / 256);
    nvcuda.cuMemcpyDtoH(Y);
  • 9. Conflux
    Kernels are written in C#: data structures, local variables, branching, loops
    float a;
    float[] x;
    [Result] float[] y;
    var i = GlobalIdx.X;
    y[i] = a * x[i] + y[i];
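To make the fragment above concrete, here is a self-contained sketch that emulates the same kernel semantics on the CPU. The class name, the sequential driver loop, and the emulation of [Result] and GlobalIdx are assumptions for illustration, not Conflux's actual API.

```csharp
using System;

// Sketch only: [Result] and GlobalIdx are Conflux concepts; here we emulate
// them sequentially to show what the kernel body from the slide computes.
class SaxpyKernelSketch
{
    float a;
    float[] x;
    float[] y; // marked [Result] in Conflux: copied back after execution

    // The kernel body, executed once per "thread" index i.
    void RunThread(int i)
    {
        y[i] = a * x[i] + y[i];
    }

    public float[] Execute(float a, float[] x, float[] y)
    {
        this.a = a; this.x = x; this.y = (float[])y.Clone();
        for (int i = 0; i < x.Length; i++)  // the thread grid, run sequentially
            RunThread(i);
        return this.y;
    }

    static void Main()
    {
        var r = new SaxpyKernelSketch()
            .Execute(3.0f, new float[] { 1, 2 }, new float[] { 5, 5 });
        Console.WriteLine(string.Join(" ", r)); // 8 11
    }
}
```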
  • 10. Conflux
    Avoids explicit interop with unmanaged code, lets programmer use native .NET data types.
    float[] x, y;
    var cfg = new CudaConfig();
    var kernel = cfg.Configure<Saxpy>();
    y = kernel.Execute(a, x, y);
  • 11. How does it work?
    Front end: decompiles C#.
    AST transformer: inlines calls, destructures classes and arrays, maps intrinsics.
    Back end: generates PTX (NVIDIA GPU assembly) and/or multicore IL.
    Interop: binds to nvcuda driver that is capable of executing GPU assembler.
  • 12. Current progress
    http://bitbucket.org/conflux/conflux
    Proof of concept.
    Capable of computing the hello world of parallel computing: matrix multiplication.
    If we don’t take into account the [currently] high overhead incurred by JIT compilation, the idea works fine even with a naïve code generator: 1x CPU < 2x CPU << GPU.
    Triple license: AGPL, exception for OSS projects, commercial.
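For reference, a sequential C# version of that hello world: the body of the (row, col) loop nest is exactly the per-element kernel that a GPU thread grid would execute.

```csharp
using System;

class MatMul
{
    // Naïve O(n^3) matrix multiplication. The (row, col) loop body is the
    // per-element kernel; on a GPU each (row, col) pair becomes one thread.
    public static float[,] Multiply(float[,] a, float[,] b)
    {
        int n = a.GetLength(0), k = a.GetLength(1), m = b.GetLength(1);
        var c = new float[n, m];
        for (int row = 0; row < n; row++)
            for (int col = 0; col < m; col++)
            {
                float sum = 0;
                for (int t = 0; t < k; t++)
                    sum += a[row, t] * b[t, col];
                c[row, col] = sum;
            }
        return c;
    }

    static void Main()
    {
        var c = Multiply(new float[,] { { 1, 2 }, { 3, 4 } },
                         new float[,] { { 5, 6 }, { 7, 8 } });
        Console.WriteLine($"{c[0, 0]} {c[0, 1]} {c[1, 0]} {c[1, 1]}"); // 19 22 43 50
    }
}
```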
  • 13. Demo
  • 14. Future work
    GPU-specific optimizations (e.g. diagonal stripes for optimizing bandwidth utilization of matrix transposition)
    Polyhedral model for loop nest optimization (can be configured to fit specific levels and sizes of memory hierarchy, there exist GPU-specific linear heuristics that optimize spatial and temporal locality).
    Distributed execution (a new level of memory hierarchy if we use polyhedral model).
  • 15. Conclusion
    Conflux: GPGPU for .NET
    http://bitbucket.org/conflux/conflux
    eugene.burmako@confluxhpc.net

×