
byteLAKE's experience on NVIDIA



Explore byteLAKE's experience on NVIDIA's platforms. Contact us if you find these interesting.

Published in: Services

  1. byteLAKE on NVIDIA EXPERIENCE. Brief summary of byteLAKE’s experience with NVIDIA’s architecture. This is an appendix to the one-slider on the same topic.
     Artificial Intelligence, HPC, Machine Learning, Deep Learning, Computer Vision, Edge Intelligence.
     byteLAKE, pl. Solny 14/3, 50-062 Wroclaw, Poland; +48 508 091 885, +48 505 322 282, +1 650 735 2063.
  2. byteLAKE on NVIDIA: EXPERIENCE, Jun-18
     HPC Simulations
     • Parallelization of the EULAG model (used for weather simulations).
     • Porting of various applications and algorithms to HPC (CPU + GPU) architectures.
     About EULAG: the model has a proven record of successful applications, with excellent efficiency and scalability on conventional supercomputer architectures. For instance, it is being implemented as the new dynamical core of the COSMO weather prediction framework.
     Expertise in CUDA, OpenCL, OpenACC and NVIDIA hardware (from supercomputers to small embedded devices):
     • NVIDIA architectures such as Kepler (e.g. K80 for servers, GeForce GTX Titan for desktop, Jetson for mobile), Maxwell (e.g. GeForce GTX 980 for desktop) and Pascal (e.g. P100 for servers).
     • We have been working on NVIDIA platforms since the Tesla architecture (e.g. the C1060 card, 2008) and the Fermi architecture (e.g. the C2050 card).
     • We have also started running technical courses on the Volta architecture (e.g. Tesla V100).
  3. More about HPC weather simulations:
     • We analyzed the algorithm’s overall resource usage and its influence on system performance.
     • Based on that analysis, we removed bottlenecks and developed a method for distributing computation efficiently across GPU kernels.
     • Our method analyzes memory transactions between GPU global and shared memory. This lets us apply several strategies to accelerate code execution: stencil decomposition, block decomposition (weighing computation against communication), reduction of inter-memory communication, and register file reuse.
     • We also applied further optimization techniques, including 2.5D blocking, coalesced memory access, padding and maintaining high GPU occupancy, as well as algorithm-specific optimizations such as rearranging boundary conditions (to reduce branch divergence) and managing the exchange of halo areas between graphics processors within a single node.
     • Together, these significantly improved the overall performance of the simulation algorithm.
     • On top of these, we built a machine-learning-based auto-tuning procedure that automates the adaptation of the simulation to a set of GPUs, taking their individual characteristics into account. The algorithm- and GPU-specific parameters include the CUDA block size for each kernel of the algorithm, the data alignment boundary for each of the algorithm’s arrays, the configuration of GPU shared memory, cached or non-cached memory access, and the CUDA compute capability setting.
     Results of the HPC weather simulation improvements:
     • We validated our methods experimentally on NVIDIA Kepler-based GPUs (Tesla K20X, GeForce GTX TITAN, a single Tesla K80 GPU, and a multi-GPU system with two K80 cards) as well as on the GeForce GTX 980 GPU, which is based on the NVIDIA Maxwell architecture.
     • Depending on the grid size and device architecture, our method achieved a speed-up over the basic version of the HPC simulation (without the auto-tuning mechanism) ranging from 1.1x on the GeForce GTX 980 to 1.92x on 2x Tesla K80 (side note: the low speed-up on the GeForce GTX 980 is case-specific).
     • We then also focused on overlapping data transfers with GPU computations, both within and between nodes, on the GPU-accelerated cluster.
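The auto-tuning idea on this slide can be illustrated with a minimal sketch. It is not byteLAKE's procedure, which the slide describes as machine learning based; this version simply tries every configuration exhaustively. The parameter names (`block_x`, `block_y`, `prefer_shared`, `alignment`), the `BASELINE` configuration and the cost model `fake_kernel_time` are hypothetical stand-ins. A real tuner would time actual kernel launches on each GPU instead.

```python
import itertools

# Hypothetical search space, loosely mirroring the parameters named on the
# slide: CUDA block size, shared-memory preference, data alignment boundary.
SEARCH_SPACE = {
    "block_x": [16, 32, 64],
    "block_y": [2, 4, 8, 16],
    "prefer_shared": [False, True],
    "alignment": [32, 64, 128],
}


def fake_kernel_time(cfg):
    """Made-up cost model standing in for a real GPU timing run."""
    threads = cfg["block_x"] * cfg["block_y"]
    t = 1.0
    t += abs(threads - 256) / 256               # pretend 256 threads per block is ideal
    t += 0.01 * abs(cfg["block_x"] - 32) / 32   # pretend 32-wide rows are best
    t += 0.0 if cfg["prefer_shared"] else 0.1   # pretend shared memory helps
    t += 0.0 if cfg["alignment"] == 128 else 0.05
    return t


def tune(measure, search_space):
    """Exhaustively evaluate every configuration and return the fastest one."""
    names = list(search_space)
    best_cfg, best_time = None, float("inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        cfg = dict(zip(names, values))
        t = measure(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time


# Hypothetical untuned configuration used as the reference point.
BASELINE = {"block_x": 16, "block_y": 2, "prefer_shared": False, "alignment": 32}

best_cfg, best_time = tune(fake_kernel_time, SEARCH_SPACE)
speedup = fake_kernel_time(BASELINE) / best_time  # baseline time / tuned time
print(best_cfg, round(speedup, 2))
```

Exhaustive search is only practical for small parameter spaces like this one; with one block size per kernel and one alignment per array, the space grows multiplicatively, which is presumably why a learned model is attractive for the real procedure.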
  4. • On the Piz Daint cluster (equipped with NVIDIA Tesla K20 GPUs as of 2015), our approach achieved weak scalability up to 136 nodes. The obtained performance exceeded 16 TFlop/s in double precision. All in all, our improved code was almost twice as fast as the basic one.
     Besides performance, we also reduced energy consumption. To that end, we applied mixed-precision arithmetic to the algorithm and managed it dynamically using a modified version of the random forest (machine learning) algorithm.
     • We deployed it on the Piz Daint supercomputer (ranked 3rd on the TOP500 list as of Nov. 2017), which is equipped with NVIDIA Tesla P100 GPU accelerators based on the NVIDIA Pascal architecture.
     • We also deployed it on the MICLAB cluster, which contains NVIDIA Tesla K80 GPUs (Kepler-based). As a result, we reduced energy consumption by up to 36%.
     Example research publications using NVIDIA hardware:
     • Parallelization of 2D MPDATA EULAG algorithm on hybrid architectures with GPU accelerators, Parallel Computing 40(8), 2014, 425-447
     • Adaptation of fluid model EULAG to graphics processing unit architecture, Concurrency and Computation: Practice and Experience 27(4), 2015, 937-957
     • Performance modeling of 3D MPDATA simulations on GPU cluster, Journal of Supercomputing 73(2), 2017, 664-675
     • Systematic adaptation of stencil-based 3D MPDATA algorithm to GPU architectures, Concurrency and Computation: Practice and Experience 29(9), 2017
     • Machine learning method for energy reduction by utilizing dynamic mixed precision on GPU-based supercomputers, Concurrency and Computation: Practice and Experience
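The mixed-precision idea on this slide can be sketched as a per-array choice between single and double precision. This is an illustration under stated assumptions, not the method from the slide: the slide describes a modified random forest that manages precision dynamically, whereas the sketch below uses a fixed error threshold. The array names, the sample values and the tolerance are all hypothetical, and the sketch only estimates storage size; it does not model energy consumption.

```python
import struct

BYTES_PER_VALUE = {"float32": 4, "float64": 8}


def to_float32(x):
    """Round a Python float (a double) to the nearest float32 value.

    Assumes x lies within the float32 range; larger values raise OverflowError.
    """
    return struct.unpack("f", struct.pack("f", x))[0]


def worst_relative_error(values):
    """Largest relative rounding error caused by storing values as float32."""
    worst = 0.0
    for v in values:
        if v != 0.0:
            worst = max(worst, abs(to_float32(v) - v) / abs(v))
    return worst


def choose_precisions(arrays, tolerance):
    """Pick float32 for an array when its rounding error stays within tolerance."""
    return {
        name: "float32" if worst_relative_error(vals) <= tolerance else "float64"
        for name, vals in arrays.items()
    }


def storage_bytes(arrays, precisions):
    """Total bytes needed to store all arrays at the chosen precisions."""
    return sum(len(vals) * BYTES_PER_VALUE[precisions[name]]
               for name, vals in arrays.items())


# Hypothetical sample data: one array is exactly representable in float32,
# the other carries small perturbations that float32 cannot hold.
fields = {
    "density": [0.5, 1.25, 2.0],
    "velocity": [1.0 + 1e-10, 3.0 + 1e-10],
}
all_double = {name: "float64" for name in fields}
choice = choose_precisions(fields, tolerance=1e-12)
print(choice, storage_bytes(fields, choice), storage_bytes(fields, all_double))
```

Halving the storage of an array also halves the memory traffic needed to move it, which is one plausible route from reduced precision to reduced energy use; the actual savings reported on the slide come from measurements, not from a model like this.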
  5. Thank you! Let’s stay in touch:
  6. Learn how we work:
     1. Listen Actively: We start with a consultancy session to better understand our client’s requirements and assumptions.
     2. Suggest: We thoroughly analyze the gathered information and prepare a draft offer.
     3. Agree: We fine-tune the offer further and wrap everything up into a binding contract.
     4. Deliver: Finally, the execution starts. We deliver projects in a fully transparent, Agile (Scrum-based) fashion.
  7. byteLAKE: Helping companies transform for the era of Artificial Intelligence. We are a team of scientists, programmers, designers and technology enthusiasts helping industries incorporate AI techniques into products.
     • We build Artificial Intelligence software and integrate it into products.
     • We port and optimize algorithms for parallel CPU+GPU HPC architectures.
     • We deploy AI in data centers, in the cloud and on constrained embedded devices (AI on Edge).
     We are specialists in: Machine Learning, Deep Learning, Computer Vision, High Performance Computing, Heterogeneous Computing, Edge Intelligence.