This document discusses various ways to accelerate applications using GPUs and CUDA programming. It provides examples of using libraries, programming models like OpenACC and CUDA, and tools like Nsight to add GPU acceleration. It also highlights success stories in which applications from fields like HPC, deep learning, and computational chemistry achieved speedups using these techniques. Resources and compilers are available to help developers get started with GPU programming.
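As a concrete flavor of the GPU programming the summary mentions, here is a minimal sketch of offloading a vector addition to a GPU from Python using Numba's CUDA support. Numba is an assumption of this illustration (the document covers CUDA and OpenACC generally); it requires an NVIDIA GPU plus the numba and numpy packages.

```python
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)          # global thread index across the whole launch
    if i < out.size:          # guard threads past the end of the array
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads = 256
blocks = (n + threads - 1) // threads
vector_add[blocks, threads](a, b, out)   # Numba copies the arrays to/from the GPU
```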
Implementing AI: High Performance Architectures: A Universal Accelerated Comp... (KTN)
The Implementing AI: High Performance Architectures webinar, hosted by KTN and eFutures, was the fourth event in the Implementing AI webinar series.
The focus of the webinar was the impact of processing AI data on data centres - particularly from the technology perspective. Timothy Lanfear, Director of Solution Architecture and Engineering EMEA, NVIDIA, presented on a Universal Accelerated Computing Platform.
The document discusses NVIDIA data center GPUs such as the A100, A30, A40, and A10 and their performance capabilities. It provides examples of GPU-accelerated application performance, showing simulations in Simulia CST Studio, Altair CFD, and Rocky DEM achieving excellent speedups on GPUs. It also discusses ParaView visualization being accelerated with NVIDIA OptiX ray tracing, further sped up using RT cores. Looking ahead, the document outlines NVIDIA Grace CPUs, which are designed to improve memory bandwidth between CPUs and GPUs for giant AI and HPC models.
In this deck from Switzerland HPC Conference, Gunter Roeth from NVIDIA presents: Deep Learning on the SaturnV Cluster.
"Machine Learning is among the most important developments in the history of computing. Deep learning is one of the fastest growing areas of machine learning and a hot topic in both academia and industry. It has dramatically improved the state-of-the-art in areas such as speech recognition, computer vision, predicting the activity of drug molecules, and many other machine learning tasks. The basic idea of deep learning is to automatically learn to represent data in multiple layers of increasing abstraction, thus helping to discover intricate structure in large datasets. NVIDIA has invested in SaturnV, a large GPU-accelerated cluster, (#28 on the November 2016 Top500 list) to support internal machine learning projects. After an introduction to deep learning on GPUs, we will address a selection of open questions programmers and users may face when using deep learning for their work on these clusters."
Watch the video: http://wp.me/p3RLHQ-gDv
Learn more: http://www.nvidia.com/object/dgx-saturnv.html
and
http://hpcadvisorycouncil.com/events/2017/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document provides an update on PGI compilers and tools for heterogeneous supercomputing. It discusses PGI's support for OpenACC directives to accelerate applications on multicore CPUs and NVIDIA GPUs from a single source. It highlights new compiler features including support for Intel Skylake, AMD EPYC and IBM POWER9 CPUs as well as NVIDIA Volta GPUs. Benchmark results show strong performance of OpenACC applications on these platforms. The document also discusses the growing adoption of OpenACC in HPC applications and resources available to support OpenACC development.
Jensen Huang, founder and CEO of NVIDIA, discusses the rise of GPU computing and artificial intelligence. He outlines how GPUs have enabled massive performance increases for deep learning workloads. NVIDIA is introducing new products like the Tesla V100 GPU and DGX-1 server to further accelerate AI research and commercial applications. These announcements position NVIDIA to power continued growth in AI and deep learning.
Jensen Huang, Founder and CEO of NVIDIA, discusses the new era of AI and how GPU computing is driving innovations in deep learning. He highlights NVIDIA's role in accelerating scientific breakthroughs using AI, developing the Holodeck for virtual design collaboration, and powering autonomous machines and vehicles. Huang also announces that Taiwan's Ministry of Science and Technology has adopted NVIDIA's platform to develop AI capabilities in manufacturing, IoT, healthcare and more.
This document discusses NVIDIA's chips for automotive, HPC, and networking. For automotive, it describes the Tegra line of SOC chips used in cars like Tesla, and upcoming chips like Orin and Atlan. For HPC, it introduces the upcoming Grace CPU designed for giant AI models. For networking, it presents the BlueField line of data processing units (DPUs) including the new 400Gbps BlueField-3 chip and the DOCA software framework. The document emphasizes that NVIDIA's GPU, CPU, and DPU chips make yearly leaps while sharing a common architecture.
GPU Accelerated Data Science with RAPIDS - ODSC West 2020 (John Zedlewski)
This document provides an overview of RAPIDS, an open source suite of libraries for GPU-accelerated data science. It discusses how RAPIDS uses GPUs to accelerate ETL, machine learning, and other data science workflows. Key points include (a minimal code sketch follows the list below):
- RAPIDS includes libraries like cuDF for dataframes, cuML for machine learning, and cuGraph for graph analytics. It aims to provide familiar Python APIs for these tasks.
- cuDF provides over 10x speedups for ETL tasks like data loading, transformations, and feature engineering by keeping data on the GPU.
- cuML provides GPU-accelerated versions of popular scikit-learn algorithms like linear regression, random forests, …
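As a minimal sketch of the cuDF and cuML workflow those points describe, assuming the cudf and cuml packages and a hypothetical data.csv with numeric sales and visits columns:

```python
import cudf
from cuml.linear_model import LinearRegression

df = cudf.read_csv("data.csv")            # ETL happens in GPU memory
df["ratio"] = df["sales"] / df["visits"]  # feature engineering, no host copies

X = df[["visits", "ratio"]]
y = df["sales"]

model = LinearRegression()                # scikit-learn-style API, GPU-backed
model.fit(X, y)
preds = model.predict(X)
```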
Part 3: Maximizing the utilization of GPU resources on-premise and in the cloud (Univa, an Altair Company)
This document discusses Multi-Instance GPU (MIG) on the NVIDIA A100 GPU. MIG allows a single GPU to be partitioned into multiple isolated GPU instances. This increases GPU utilization by right-sizing allocations to workloads. MIG provides benefits like guaranteed quality of service, versatile profiles to maximize utilization, and no required code changes. MIG is well-suited for workloads that don't use the full GPU, like HPC, prototyping, inference, and light training. It allows optimizing GPU utilization and expanding access to more users simultaneously with predictable performance.
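MIG instances are created administratively (for example with nvidia-smi mig commands) rather than in application code. As a minimal sketch of inspecting MIG state from Python, here is a query through the pynvml NVML bindings; the package (nvidia-ml-py) and this usage are assumptions of the illustration, and pre-A100 GPUs simply report MIG as unsupported.

```python
import pynvml  # NVML bindings, typically installed as nvidia-ml-py

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):       # older bindings return bytes
        name = name.decode()
    try:
        current, pending = pynvml.nvmlDeviceGetMigMode(handle)
        mig = "enabled" if current == 1 else "disabled"  # NVML reports 1 when on
    except pynvml.NVMLError:          # GPUs before A100 don't support MIG
        mig = "not supported"
    print(f"GPU {i}: {name}, MIG {mig}")
pynvml.nvmlShutdown()
```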
This is a presentation I gave at the NVIDIA AI Conference in Korea. It's about building the largest GPU, the DGX-2: the most powerful supercomputer in one node.
JMI Techtalk: 한재근 - How to use GPU for developing AI (Lablup Inc.)
This Techtalk surveys the various techniques NVIDIA provides for improving performance when using GPUs for AI development, along with supporting technical resources. In particular, it covers in detail the process of improving performance by adopting mixed precision on the Volta architecture, as sketched below.
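As a minimal sketch of that mixed-precision technique, here is a training step with PyTorch's torch.cuda.amp, which runs eligible operations in FP16 on Volta-class Tensor Cores. PyTorch, the tiny model, and the random data are assumptions of this illustration; the talk itself is framework-agnostic.

```python
import torch

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()      # rescales grads to avoid FP16 underflow

x = torch.randn(64, 512, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # runs eligible ops in FP16
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```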
02 AI inference acceleration with components all in open hardware: OpenCAPI a... (Yutaka Kawai)
This was presented by Peng Fei GOU (IBM China) at OpenPOWER summit EU 2019. The original one is uploaded at:
https://static.sched.com/hosted_files/opeu19/68/NVDLA%20on%20OpenCAPI.pdf
dCUDA: Distributed GPU Computing with Hardware Overlap (inside-BigData.com)
Torsten Hoefler from ETH Zurich presented this deck at the Switzerland HPC Conference.
"Over the last decade, CUDA and the underlying GPU hardware architecture have continuously gained popularity in various high-performance computing application domains such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent programming model for GPU clusters. We therefore introduce the dCUDA programming model, which implements device-side remote memory access with target notification. To hide instruction pipeline latencies, CUDA programs over-decompose the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a thread stalls, the hardware scheduler immediately proceeds with the execution of another thread ready for execution. This latency-hiding technique is key to make best use of the available hardware resources. With dCUDA, we apply latency hiding at cluster scale to automatically overlap computation and communication. Our benchmarks demonstrate perfect overlap for memory bandwidth-bound tasks and good overlap for compute-bound tasks."
Watch the video presentation: http://wp.me/p3RLHQ-gCB
GPUIterator: Bridging the Gap between Chapel and GPU Platforms (Akihiro Hayashi)
The ACM SIGPLAN 6th Annual Chapel Implementers and Users Workshop (CHIUW2019) co-located with PLDI 2019 / ACM FCRC 2019.
PGAS (Partitioned Global Address Space) programming models were originally designed to facilitate productive parallel programming at both the intra-node and inter-node levels in homogeneous parallel machines. However, there is a growing need to support accelerators, especially GPU accelerators, in heterogeneous nodes in a cluster. Among high-level PGAS programming languages, Chapel is well suited for this task due to its use of locales and domains to help abstract away low-level details of data and compute mappings for different compute nodes, as well as for different processing units (CPU vs. GPU) within a node. In this paper, we address some of the key limitations of past approaches on mapping Chapel onto GPUs as follows. First, we introduce a Chapel module, GPUIterator, which is a portable programming interface that supports GPU execution of a Chapel forall loop. This module makes it possible for Chapel programmers to easily use hand-tuned native GPU programs/libraries, which is an important requirement in practice since there is still a big performance gap between compiler-generated GPU code and hand-tuned GPU code; hand-optimization of CPU-GPU data transfers is also an important contributor to this performance gap. Second, though Chapel programs are regularly executed on multi-node clusters, past work on GPU enablement of Chapel programs mainly focused on single-node execution. In contrast, our work supports execution across multiple CPU+GPU nodes by accepting Chapel's distributed domains. Third, our approach supports hybrid execution of a Chapel parallel (forall) loop across both a GPU and CPU cores, which is beneficial for specific platforms. Our preliminary performance evaluations show that the use of the GPUIterator is a promising approach for Chapel programmers to easily utilize a single or multiple CPU+GPU node(s) while maintaining portability.
The document discusses NVIDIA's role in powering the world's fastest supercomputers. It notes that the Summit supercomputer at Oak Ridge National Laboratory is now the fastest system, powered by 27,648 Volta Tensor Core GPUs to achieve over 122 petaflops. NVIDIA GPUs also power 17 of the world's 20 most energy efficient supercomputers, including Europe's fastest Piz Daint and Japan's fastest ABCI supercomputer. Over 550 applications are now accelerated using NVIDIA GPUs.
The RAPIDS suite of software libraries gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
AI and Big Data Infrastructure Leveraging the Latest HPC Technologies: Tokyo Tech TSUBAME3.0 and AIST ABCI (NVIDIA Japan)
- The document discusses the latest HPC technologies used in AI/Big Data infrastructures such as TSUBAME3.0 at Tokyo Institute of Technology and ABCI at AIST.
- It provides an overview of the capabilities and achievements of these supercomputers, including TSUBAME2.0 receiving the 2011 ACM Gordon Bell Prize.
- It emphasizes that future supercomputers need to focus on "BYTES" capabilities like bandwidth and capacity to better support large-scale data processing for AI/Big Data applications.
Supercomputing has swept rapidly from the far edges of science to the heart of our everyday lives. And propelling it forward – bringing it into the mobile phone already in your pocket and the car in your driveway – is GPU acceleration, NVIDIA CEO Jen-Hsun Huang told a packed house at a rollicking event kicking off this week’s SC15 annual supercomputing show in Austin. The event draws 10,000 researchers, national lab directors and others from around the world.
This document discusses the evolution of data storage needs from traditional structured data to modern unstructured data like objects and machine data. It outlines the four industrial revolutions defined by major technological advances. Pure Storage's FlashBlade is introduced as the industry's first data hub purpose-built for AI and deep learning, with massively parallel architecture powered by Purity software to scale without limits. Real-world customer examples demonstrate how FlashBlade accelerates AI initiatives for autonomous vehicles and powers some of the world's most powerful AI supercomputers.
1. The document discusses GPUs and their advantages for machine learning tasks like deep learning and parallel computing. GPUs have many parallel processors that can accelerate matrix multiplications and other computations used in machine learning algorithms.
2. It introduces CUDA and how it allows GPUs to be programmed for general-purpose processing through a parallel computing model. Examples are given of how matrix multiplications and convolutional neural network operations can be parallelized on GPUs (see the sketch after this list).
3. H2O is presented as a machine learning platform that supports GPU acceleration for algorithms like gradient boosted machines, enabling faster training on large datasets. Instructions are provided on getting started with CUDA, cuDNN and using GPUs for machine learning.
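As a minimal sketch of the GPU-parallelized matrix multiplication described in point 2, here is the same operation with CuPy's NumPy-compatible API. CuPy is an assumption of this illustration (the deck works at the CUDA level); the multiply is dispatched to cuBLAS on the device.

```python
import numpy as np
import cupy as cp

a_cpu = np.random.rand(4096, 4096).astype(np.float32)
b_cpu = np.random.rand(4096, 4096).astype(np.float32)

a_gpu = cp.asarray(a_cpu)        # copy operands into GPU memory
b_gpu = cp.asarray(b_cpu)
c_gpu = a_gpu @ b_gpu            # runs on the GPU via cuBLAS
c_cpu = cp.asnumpy(c_gpu)        # copy the result back to the host
```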
This document discusses the evolution of computing from PCs to mobile-cloud to AI and IoT. It highlights how deep learning using GPUs has become a new computing model, with neural network complexity exploding to tackle increasingly complex challenges. It introduces Nvidia's Volta GPU and how it delivers revolutionary performance for deep learning training and inference through new tensor cores and optimizations for deep learning frameworks and models.
The NVIDIA Tesla K80 GPU accelerator delivers significantly faster performance than CPUs and previous GPU models for demanding workloads. It pairs two GPUs on a single board, delivering up to 2.91 TFLOPS double-precision and 8.74 TFLOPS single-precision performance in aggregate, with 24 GB of total memory and other advanced capabilities. Benchmark results show the K80 providing up to 2.2x faster performance than the Tesla K20X, up to 2.5x faster than the Tesla K10, and up to 10x faster than CPUs for real-world applications in scientific computing, data analytics, and other fields.
In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com
and
http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Backend.AI Technical Introduction (19.09 / 2019 Autumn) (Lablup Inc.)
This slide introduces technical specs and details about Backend.AI 19.09.
* On-premise clustering / container orchestration / scaling on cloud
* Container-level fractional GPU technology that shares a single GPU across many containers at the same time, as if it were many GPUs.
* NVIDIA GPU Cloud integrations
* Enterprise features
This document discusses NVIDIA's technologies for artificial intelligence and accelerated computing. It highlights NVIDIA's GPUs, systems, SDKs, and frameworks that power AI workloads at scale. These include the H100 GPU, DGX systems, Triton inference server, RAPIDS libraries, and Omniverse platform for simulation and digital twins. The document also outlines key applications and industries that are being accelerated by NVIDIA's technologies like autonomous vehicles, healthcare, robotics, and more.
The document discusses NVIDIA's full stack computing platform and ecosystem. It highlights that NVIDIA now has over 2.5 million developers, 24 million CUDA downloads, and powers over 1 billion GPUs. It then summarizes several key technologies in NVIDIA's portfolio including Triton inference serving, Jarvis conversational AI, Maxine virtual collaboration, and developments in HPC and quantum computing simulation.
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS (Databricks)
Abstract: We will introduce RAPIDS, a suite of open source libraries for GPU-accelerated data science, and illustrate how it operates seamlessly with MLflow to enable reproducible training, model storage, and deployment. We will walk through a baseline example that incorporates MLflow locally, with a simple SQLite backend, and briefly introduce how the same workflow can be deployed in the context of GPU enabled Kubernetes clusters.
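As a minimal sketch of the workflow the abstract describes, assuming the cuml and mlflow packages, a local SQLite backend store, and a random placeholder dataset:

```python
import numpy as np
import mlflow
from cuml.ensemble import RandomForestClassifier

mlflow.set_tracking_uri("sqlite:///mlflow.db")   # simple local backend store
mlflow.set_experiment("rapids-demo")

X = np.random.rand(1000, 8).astype(np.float32)
y = (X[:, 0] > 0.5).astype(np.int32)

with mlflow.start_run():
    n_estimators = 100
    mlflow.log_param("n_estimators", n_estimators)
    model = RandomForestClassifier(n_estimators=n_estimators)
    model.fit(X, y)                               # training runs on the GPU
    acc = float((model.predict(X) == y).mean())
    mlflow.log_metric("train_accuracy", acc)      # reproducible tracking
```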
Harnessing the virtual realm for successful real world artificial intelligence (Alison B. Lowndes)
Artificial Intelligence is impacting all areas of society, from healthcare and transportation to smart cities and energy. This talk covers how NVIDIA invests both in internal pure research and in accelerated computing to enable its diverse customer base across gaming & extended reality, graphics, AI, robotics, simulation, high performance scientific computing, healthcare & more. You will be introduced to the GPU computing platform and shown successfully deployed real-world applications, as well as a glimpse into the current state of the art across academia, enterprise and startups.
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019 (NVIDIA)
Broadening support for GPU-accelerated supercomputing to a fast-growing new platform, NVIDIA founder and CEO Jensen Huang introduced a reference design for building GPU-accelerated Arm servers, with wide industry backing.
1) NVIDIA-Iguazio Accelerated Solutions for Deep Learning and Machine Learning (30 mins):
About the speaker:
Dr. Gabriel Noaje, Senior Solutions Architect, NVIDIA
http://bit.ly/GabrielNoaje
2) GPUs in Data Science Pipelines (30 mins)
- GPU as a Service for enterprise AI
- A short demo on the usage of GPUs for model training and model inferencing within a data science workflow
About the speaker:
Anant Gandhi, Solutions Engineer, Iguazio Singapore. https://www.linkedin.com/in/anant-gandhi-b5447614/
Get updates about OpenACC. This month focuses on: A new OpenACC Online Course, book and number of exciting events highlighted in the OpenACC September Update
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a... (E-Commerce Brasil)
NVIDIA technologies applied to e-commerce: far beyond the hardware.
Jomar Silva
Developer Relations Manager for Latin America, NVIDIA
https://eventos.ecommercebrasil.com.br/forum/
RAPIDS – Open GPU-accelerated Data Science (Data Works MD)
RAPIDS is an initiative driven by NVIDIA to accelerate the complete end-to-end data science ecosystem with GPUs. It consists of several open source projects that expose familiar interfaces making it easy to accelerate the entire data science pipeline- from the ETL and data wrangling to feature engineering, statistical modeling, machine learning, and graph analysis.
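As a minimal sketch of the graph-analysis end of that pipeline, assuming the cudf and cugraph packages and a tiny hand-made edge list:

```python
import cudf
import cugraph

edges = cudf.DataFrame({
    "src": [0, 0, 1, 2, 2, 3],
    "dst": [1, 2, 2, 0, 3, 0],
})
G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="src", destination="dst")

ranks = cugraph.pagerank(G)     # returns a cuDF DataFrame of vertex scores
print(ranks.sort_values("pagerank", ascending=False).head())
```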
Corey J. Nolet
Corey has a passion for understanding the world through the analysis of data. He is a developer on the RAPIDS open source project focused on accelerating machine learning algorithms with GPUs.
Adam Thompson
Adam Thompson is a Senior Solutions Architect at NVIDIA. With a background in signal processing, he has spent his career participating in and leading programs focused on deep learning for RF classification, data compression, high-performance computing, and managing and designing applications targeting large collection frameworks. His research interests include deep learning, high-performance computing, systems engineering, cloud architecture/integration, and statistical signal processing. He holds a master's degree in Electrical & Computer Engineering from Georgia Tech and a bachelor's from Clemson University.
S51281 - Accelerate Data Science in Python with RAPIDS (DLow6)
RAPIDS accelerates data science and machine learning workflows in Python by leveraging GPUs. It includes cuDF for GPU-accelerated pandas functionality, cuML for scikit-learn compatible machine learning algorithms, cuGraph for graph analytics, and integrations with Dask and Spark. RAPIDS has a large community of contributors and is used by many Fortune 100 companies to speed up workflows, reduce costs, and scale to large datasets.
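As a minimal sketch of the Dask integration mentioned above, assuming the dask-cuda and dask-cudf packages and placeholder file and column names:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster()     # starts one worker per local GPU
client = Client(cluster)

ddf = dask_cudf.read_csv("data-*.csv")            # partitioned across workers
result = ddf.groupby("key")["value"].mean().compute()
print(result)
```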
In this deck from FOSDEM'19, Thomas Schwinge presents: Speeding up Programs with OpenACC in GCC.
"Proven in production use for decades, GCC (the GNU Compiler Collection) offers C, C++, Fortran, and other compilers for a multitude of target systems. Over the last few years, we -- formerly known as "CodeSourcery", now a group in "Mentor, a Siemens Business" -- added support for the directive-based OpenACC programming model. Requiring only few changes to your existing source code, OpenACC allows for easy parallelization and code offloading to accelerators such as GPUs. We will present a short introduction of GCC and OpenACC, implementation status, examples, and performance results.
OpenACC is a user-driven directive-based performance-portable parallel programming model designed for scientists and engineers interested in porting their codes to a wide-variety of heterogeneous HPC hardware platforms and architectures with significantly less programming effort than required with a low-level model."
Watch the video: https://wp.me/p3RLHQ-jOR
Learn more: https://fosdem.org/2019/
and
https://www.openacc.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Check out the latest in OpenACC this month including the PGI 18.1 release, GTC 2018 activity, paper highlights, upcoming events and a call for paper submissions.
Axel Koehler from Nvidia presented this deck at the 2016 HPC Advisory Council Switzerland Conference.
“Accelerated computing is transforming the data center that delivers unprecedented throughput, enabling new discoveries and services for end users. This talk will give an overview of the NVIDIA Tesla accelerated computing platform, including the latest developments in hardware and software. In addition it will be shown how deep learning on GPUs is changing how we use computers to understand data.”
In related news, the GPU Technology Conference takes place April 4-7 in Silicon Valley.
Watch the video presentation: http://insidehpc.com/2016/03/tesla-accelerated-computing/
See more talks in the Swiss Conference Video Gallery:
http://insidehpc.com/2016-swiss-hpc-conference/
Sign up for our insideHPC Newsletter:
http://insidehpc.com/newsletter
The document discusses several pillars for national AI initiatives, including establishing AI centers of excellence, reskilling the workforce, and investing in key industries to drive growth and solve economic and social challenges. It also outlines different approaches for designing and optimizing AI systems, such as using GANs and GPU-accelerated simulations. Overall, the document promotes the development and application of AI through collaboration between universities, industry, and government.
Stay up-to-date with the OpenACC Monthly Highlights. February's edition covers the updated OpenACC 3.2 specification, upcoming GPU Hackathons and Bootcamps, OpenACC's BOF at SC21, recent research, new resources and more!
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights covers working on applications for the new Frontier supercomputer, using OpenACC for weather forecasting, upcoming GPU Hackathons and Bootcamps, and new resources!
Similar to Application Optimisation using OpenPOWER and Power 9 systems (20)
The document describes a 5-day residency program hosted by the OpenPOWER Academic Discussion Group (ADG) at NIE Mysore from June 6-10, 2022. The program aims to bridge industry and academia knowledge in chip design by developing curriculum on OpenPOWER technology and training lab assistants. Engineers and academics with 5+ years of experience in chip design/verification are eligible to participate. They will collaborate on developing course materials and lab exercises to teach undergraduate students in fields like ECE and CSE. The program seeks to help fulfill India's goals of chip-design manpower and self-reliance through initiatives like Make in India and the India Semiconductor Mission.
This document provides an overview of digital design and Verilog. It discusses binary numbers and boolean algebra as the foundation of digital systems. It also describes logic gates, combinational and sequential circuits, finite state machines, and datapath and control units. Finally, it introduces Verilog, describing different modeling types like gate level, behavioral, dataflow, and switch level modeling. It positions Verilog as a hardware description language used to more easily design digital circuits compared to manual drawing.
The Libre-SOC Project aims to create an entirely Libre-Licensed, transparently-developed fully auditable Hybrid 3D CPU-GPU-VPU, using the Supercomputer-class OpenPOWER ISA as the foundation.
Our first test ASIC is a 180nm "Fixed-Point" Power ISA v3.0B processor, 5.1mm x 5.9mm, as a proof-of-concept for the team, whose primary expertise is in Software Engineering. Software Engineering training brings a radically different approach to Hardware development: extensive unit tests, source code revision control, automated development tools are normal. Libre Project Management brings even more: bug trackers, mailing lists, auditable IRC logs and a wiki are standard fare for Libre Projects that are simply not normal Industry-Standard practice.
This talk therefore goes through the workflow, from the original HDL through to the GDS-II layout, showing how we were able to keep track of the development that led to the IMEC 180nm tape-out in July 2021. In particular, following a parallel development process involving "Real" and "Symbolic" Cell Libraries developed by Chips4Makers, we show how our developers did not need to sign a Foundry NDA but were still able to work side-by-side with a University that did. With this parallel development process, the University upheld its NDA obligations, and Libre-SOC was simultaneously able to honour its Transparency Objectives.
Workload Transformation and Innovations in POWER Architecture (Ganesan Narayanasamy)
The IT industry is going through two major transformations. The first is the adoption of AI and its tight integration into commercial applications and enterprise workflows. The second is the transformation of software architecture through concepts like microservices and cloud-native design. These transformations, alongside the aggressive adoption of IoT, mobile and 5G in our day-to-day activities, are making the world operate in a more real-time manner, which opens up a new challenge: improving hardware architecture to meet these requirements. Together they push the boundary of the entire systems stack, making designers rethink hardware. This talk presents how the industry-leading enterprise POWER architecture is transforming to fulfill the performance demands of these newer-generation workloads, with a primary focus on on-chip AI acceleration.
Friday, July 16th, 2021: our newest workshop with DoMS, IIT Roorkee, Concept to Solutions using the OpenPOWER Stack. It's time to discover advances in #DeepLearning tools and techniques from the world's leading innovators across industries, research, and public speakers.
Register here:
https://lnkd.in/ggxMq2N
This presentation covers two uses cases using OpenPOWER Systems
1. Diabetic Retinopathy using AI on NVIDIA Jetson Nano: the objective is to classify the diabetic level solely from a retina image in a remote area with minimal intervention from a doctor. The model uses the VGG16 network architecture and is trained from scratch on POWER9. The model was deployed on the Jetson Nano board.
2. Classifying COVID positivity using lung X-ray images: the idea is to build ML models to detect positive cases from X-ray images. The model was trained on POWER9, and the application was developed using Python.
IBM Bayesian Optimization Accelerator (BOA) is a do-it-yourself toolkit to apply state-of-the-art Bayesian inferencing techniques and obtain optimal solutions for complex, real-world design simulations without requiring deep machine learning skills. This talk will describe IBM BOA, its differentiation and ease of use, and how researchers can take advantage of it for optimizing any arbitrary HPC simulation.
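IBM BOA's own API is not shown in this summary, so the sketch below illustrates the generic Bayesian-optimization loop such a toolkit automates, using the open-source scikit-optimize library and a toy stand-in for an expensive HPC simulation; both are assumptions of this illustration, not BOA itself.

```python
from skopt import gp_minimize

def simulation(params):
    # placeholder for an expensive HPC simulation returning a cost to minimize
    x, y = params
    return (x - 2.0) ** 2 + (y + 1.0) ** 2

result = gp_minimize(
    simulation,
    dimensions=[(-5.0, 5.0), (-5.0, 5.0)],  # search space for the two inputs
    n_calls=25,                             # budget of simulation runs
    random_state=0,
)
print("best inputs:", result.x, "best cost:", result.fun)
```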
This presentation covers the partners and collaborators currently working with the OpenPOWER Foundation, use cases of OpenPOWER systems across multiple industries, OpenPOWER workgroups, and OpenCAPI features.
The IBM POWER10 processor represents the 10th generation of the POWER family of enterprise computing engines. Its performance is a result of both powerful processing cores and high-bandwidth intra- and inter-chip interconnect. POWER10 systems can be configured with up to 16 processor chips and 1920 simultaneous threads of execution. Cross-system memory sharing, through the new Memory Inception technology, and 2 Petabytes of addressing space support an expansive memory system. The POWER10 processing core has been significantly enhanced over its POWER9 predecessor, including a doubling of vector units and the addition of an all-new matrix math engine. Throughput gains from POWER9 to POWER10 average 30% at the core level and three-fold at the socket level. Those gains can reach ten- or twenty-fold at the socket level for matrix-intensive computations.
Everything is changing, from healthcare to the automotive market, without forgetting financial markets or any type of engineering: everything has stopped being created by an individual or, at best, a team, and is now developed and perfected using AI and hundreds of computers. Even AI is something we can no longer run on a single computer, no matter how powerful it is. What drives everything today is HPC, or High-Performance Computing, heavily linked to AI. In this session we will discuss AI, HPC computing and the IBM Power architecture, and how they can help develop better healthcare, better automobiles, better financials and better everything that we run on them.
Macromolecular crystallography is an experimental technique for exploring the 3D atomic structure of proteins, used by academics for research in biology and by pharmaceutical companies in rational drug design. While development of the technique was previously limited by the performance of scientific instruments, computing performance has recently become a key limitation. In my presentation I will describe the computing challenge of handling the 18 GB/s data stream coming from the new X-ray detector. I will show PSI's experience applying conventional hardware to the task and why this attempt failed. I will then present how an IC922 server with OpenCAPI-enabled FPGA boards allowed us to build a sustainable and scalable solution for high-speed data acquisition. Finally, I will give a perspective on how advances in hardware development will enable better science for users of the Swiss Light Source.
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems (Ganesan Narayanasamy)
As the adoption of AI technologies increases and matures, the focus will shift from exploration to time to market, productivity and integration with existing workflows. Governing enterprise data, scaling AI model development, and selecting a complete, collaborative hybrid platform and tools for rapid solution deployment are key focus areas for growing data-scientist teams tasked with responding to business challenges. This talk will cover the challenges and innovations of AI at scale for industries such as healthcare and automotive, the AI ladder and AI life cycle, and infrastructure architecture considerations.
This talk gives an introduction to healthcare use cases, the AI ladder and lifecycle, and AI-at-scale themes. It discusses the iterative nature of the workflow and some of the important components to be aware of when developing AI healthcare solutions, as well as the different types of algorithms and when machine learning might be more appropriate than deep learning, or the other way around. Example use cases are also shared as part of this presentation.
Healthcare has become one of the most important aspects of everyone's life. Its importance has surged due to the latest outbreaks, and with this latest pandemic it has become imperative to collaborate to improve everyone's healthcare as soon as possible.
IBM has reacted quickly, sharing not only its knowledge but also its Artificial Intelligence supercomputers all around the world.
Those supercomputers are helping to prevail over this outbreak, and future ones as well.
They have completely different features compared to the offerings of other players in the supercomputer market.
We will take a quick look at the differences between those AI-focused supercomputers and how they can help in the R&D of healthcare solutions for everyone, from those with access to a big IBM AI supercomputer to those with access to only a single small IBM AI-focused server.
Moving object recognition (MOR) corresponds to the localization and classification of moving objects in videos. Discriminating moving objects from static objects and background in videos is an essential task for many computer vision applications. MOR has widespread applications in intelligent visual surveillance, intrusion detection, anomaly detection and monitoring, industrial site monitoring, detection-based tracking, autonomous vehicles, etc. In this session, Murari presented a poster on deep learning algorithms that identify both the locations and the corresponding categories of moving objects with a convolutional network. The challenges in developing such algorithms were also discussed.
The document discusses AI in the enterprise, including use cases, infrastructure considerations, and the AI lifecycle. It provides examples of how AI can be applied in various industries and common patterns of analytics using AI. It also outlines the data science model development workflow and considerations for AI infrastructure, software, and data management throughout the AI lifecycle.
AppSec PNW: Android and iOS Application Security with MobSF (Ajin Abraham)
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
"Choosing proper type of scaling", Olena SyrotaFwdays
Imagine an IoT processing system that is already quite mature and production-ready and for which client coverage is growing and scaling and performance aspects are life and death questions. The system has Redis, MongoDB, and stream processing based on ksqldb. In this talk, firstly, we will analyze scaling approaches and then select the proper ones for our system.
High performance Serverless Java on AWS- GoTo Amsterdam 2024Vadym Kazulkin
Java has been one of the most popular programming languages for many years, but it has had a hard time in the Serverless community. Java is known for high cold-start times and a high memory footprint compared to other programming languages like Node.js and Python. In this talk I'll look at the general best practices and techniques we can use to decrease memory consumption and cold-start times for Java Serverless development on AWS, including GraalVM (Native Image) and AWS's own offering SnapStart, based on Firecracker microVM snapshot-and-restore and CRaC (Coordinated Restore at Checkpoint) runtime hooks. I'll also provide extensive benchmarking of Lambda functions, trying out various deployment package sizes, Lambda memory settings, Java compilation options and HTTP (a)synchronous clients, and measuring their impact on cold and warm start times.
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an... (Jason Yip)
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
The Microsoft 365 Migration Tutorial For Beginner.pptx (operationspcvita)
This presentation will help you understand the power of Microsoft 365. It walks through every productivity app included in Office 365, outlines common Office 365 migration scenarios, and explains how we can help.
You can also read: https://www.systoolsgroup.com/updates/office-365-tenant-to-tenant-migration-step-by-step-complete-guide/
ScyllaDB is making a major architecture shift. We’re moving from vNode replication to tablets – fragments of tables that are distributed independently, enabling dynamic data distribution and extreme elasticity. In this keynote, ScyllaDB co-founder and CTO Avi Kivity explains the reason for this shift, provides a look at the implementation and roadmap, and shares how this shift benefits ScyllaDB users.
Must Know Postgres Extension for DBA and Developer during Migration (Mydbops)
Mydbops Opensource Database Meetup 16
Topic: Must-Know PostgreSQL Extensions for Developers and DBAs During Migration
Speaker: Deepak Mahto, Founder of DataCloudGaze Consulting
Date & Time: 8th June | 10 AM - 1 PM IST
Venue: Bangalore International Centre, Bangalore
Abstract: Discover how PostgreSQL extensions can be your secret weapon! This talk explores how key extensions enhance database capabilities and streamline the migration process for users moving from other relational databases like Oracle.
Key Takeaways:
* Learn about crucial extensions like oracle_fdw, pgtt, and pg_audit that ease migration complexities.
* Gain valuable strategies for implementing these extensions in PostgreSQL to achieve license freedom.
* Discover how these key extensions can empower both developers and DBAs during the migration process.
* Don't miss this chance to gain practical knowledge from an industry expert and stay updated on the latest open-source database trends.
Mydbops Managed Services specializes in taking the pain out of database management while optimizing performance. Since 2015, we have been providing top-notch support and assistance for the top three open-source databases: MySQL, MongoDB, and PostgreSQL.
Our team offers a wide range of services, including assistance, support, consulting, 24/7 operations, and expertise in all relevant technologies. We help organizations improve their database's performance, scalability, efficiency, and availability.
Contact us: info@mydbops.com
Visit: https://www.mydbops.com/
Follow us on LinkedIn: https://in.linkedin.com/company/mydbops
For more details and updates, please follow up the below links.
Meetup Page : https://www.meetup.com/mydbops-databa...
Twitter: https://twitter.com/mydbopsofficial
Blogs: https://www.mydbops.com/blog/
Facebook(Meta): https://www.facebook.com/mydbops/
The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge - - Capture & Transfer
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...Alex Pruden
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty, is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
"NATO Hackathon Winner: AI-Powered Drug Search", Taras KlobaFwdays
This is a session that details how PostgreSQL's features and Azure AI Services can be effectively used to significantly enhance the search functionality in any application.
In this session, we'll share insights on how we used PostgreSQL to facilitate precise searches across multiple fields in our mobile application. The techniques include using LIKE and ILIKE operators and integrating a trigram-based search to handle potential misspellings, thereby increasing the search accuracy.
We'll also discuss how the azure_ai extension on PostgreSQL databases in Azure and Azure AI Services were utilized to create vectors from user input, a feature beneficial when users wish to find specific items based on text prompts. While our application's case study involves a drug search, the techniques and principles shared in this session can be adapted to improve search functionality in a wide range of applications. Join us to learn how PostgreSQL and Azure AI can be harnessed to enhance your application's search capability.
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...Fwdays
Direct losses from downtime in 1 minute = $5-$10 thousand dollars. Reputation is priceless.
As part of the talk, we will consider the architectural strategies necessary for the development of highly loaded fintech solutions. We will focus on using queues and streaming to efficiently work and manage large amounts of data in real-time and to minimize latency.
We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Discover top-tier mobile app development services, offering innovative solutions for iOS and Android. Enhance your business with custom, user-friendly mobile applications.
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving
Manufacturing custom quality metal nameplates and badges involves several standard operations. Processes include sheet prep, lithography, screening, coating, punch press and inspection. All decoration is completed in the flat sheet with adhesive and tooling operations following. The possibilities for creating unique durable nameplates are endless. How will you create your brand identity? We can help!
4. 4
CUDA POWERS ACCELERATED COMPUTING
DEFINING THE NEXT GIANT WAVE IN HPC
[Chart: GoogLeNet training performance vs. K80, over 80x in 3 years, scaling from 1x K80 to 4x M40, 8x P100, and 8x V100]
DL frameworks and HPC applications accelerated: 400+ applications
127 systems on the Top 500:
  World's #1 Summit: 144 PF
  World's #2 Sierra: 95 PF
  Europe's #1 Piz Daint: 21 PF
  Japan's #1 ABCI: 20 PF
  Industrial #1 ENI: 12 PF
5. 5
SMALL CHANGES, BIG SPEED-UP
[Diagram: the compute-intensive functions, roughly 5% of the application code, run on the GPU; the rest of the sequential CPU code is unchanged]
6. 6
3 WAYS TO ACCELERATE APPLICATIONS
Libraries: "drop-in" acceleration
OpenACC directives: easily accelerate applications
Programming languages: maximum flexibility
7. 7
3 WAYS TO ACCELERATE APPLICATIONS
Libraries: "drop-in" acceleration
OpenACC directives: easily accelerate applications
Programming languages: maximum flexibility
8. 8
LIBRARIES: EASY, HIGH-QUALITY ACCELERATION
Ease of use: using libraries enables GPU acceleration without in-depth knowledge of GPU programming
"Drop-in": many GPU-accelerated libraries follow standard APIs, enabling acceleration with minimal code changes
Quality: libraries offer high-quality implementations of functions encountered in a broad range of applications
Performance: NVIDIA libraries are tuned by experts
9. 9
GPU ACCELERATED LIBRARIES
"Drop-in" Acceleration for Your Applications
Deep learning: cuDNN, TensorRT, DeepStream SDK
Linear algebra: cuBLAS, cuSPARSE, cuSOLVER
Signal, image & video: cuFFT, NVIDIA NPP, Codec SDK
Parallel algorithms: nvGRAPH, NCCL
Math: CUDA Math library, cuRAND
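As a concrete illustration of what "drop-in" library use looks like, here is a minimal sketch built around cuFFT (the helper name forward_fft and the in-place transform are illustrative choices; error checking is omitted):

/* Minimal cuFFT sketch: forward FFT of N complex points, in place. */
#include <cufft.h>
#include <cuda_runtime.h>

void forward_fft(cufftComplex *h_data, int N)   /* hypothetical helper */
{
    cufftComplex *d_data;
    cudaMalloc((void**)&d_data, sizeof(cufftComplex) * N);
    cudaMemcpy(d_data, h_data, sizeof(cufftComplex) * N, cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);               /* 1D complex-to-complex plan */
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD); /* run the transform on the GPU */

    cudaMemcpy(h_data, d_data, sizeof(cufftComplex) * N, cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(d_data);
}

The transform itself is one plan call and one exec call; the surrounding code only moves data, which is exactly the pattern the next slide formalizes.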
10. 10
3 STEPS TO CUDA-ACCELERATED APPLICATION
Step 1: Substitute library calls with equivalent CUDA library calls
  saxpy ( … )  →  cublasSaxpy ( … )
Step 2: Manage data locality
  - with CUDA: cudaMalloc(), cudaMemcpy(), etc.
  - with cuBLAS: cublasAlloc(), cublasSetVector(), etc.
Step 3: Rebuild and link the CUDA-accelerated library
  gcc myobj.o -lcublas
11. 11
3 WAYS TO ACCELERATE APPLICATIONS
Libraries: "drop-in" acceleration
OpenACC directives: easily accelerate applications
Programming languages: maximum flexibility
12. 12
OpenACC is a directives-based programming approach to parallel computing, designed for performance and portability on CPUs and GPUs for HPC.

Add a simple compiler directive:

main()
{
  <serial code>
  #pragma acc kernels
  {
    <parallel code>
  }
}
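For reference, a complete compilable program in the same shape might look as follows (a sketch: the file name, vector size, and pgcc flags are illustrative):

/* scale.c: minimal OpenACC sketch.
   Compile with e.g.: pgcc -acc -Minfo=accel scale.c */
#include <stdio.h>

#define N (1 << 20)
static float x[N];

int main(void)
{
    for (int i = 0; i < N; ++i) x[i] = 1.0f;   /* serial code on the CPU */

    #pragma acc kernels                        /* compiler offloads this region */
    for (int i = 0; i < N; ++i)
        x[i] = 2.0f * x[i];

    printf("x[0] = %f\n", x[0]);
    return 0;
}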
13. 13
Janus Juul Eriksen, PhD Fellow, qLEAP Center for Theoretical Chemistry, Aarhus University:
"OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation."

LSDALTON: a large-scale application for calculating high-accuracy molecular energies

Minimal effort:
  Lines of code modified: <100 lines
  Weeks required: 1 week
  Codes to maintain: 1 source

Big performance (LS-DALTON CCSD(T) module, speedup vs. CPU):
  Alanine-1 (13 atoms): 7.9x
  Alanine-2 (23 atoms): 8.9x
  Alanine-3 (33 atoms): 11.7x

Benchmarked on Titan Supercomputer (AMD CPU vs. Tesla K20X)
https://developer.nvidia.com/openacc/success-stories
14. 14
SINGLE CODE FOR MULTIPLE PLATFORMS
OpenACC: a performance-portable programming model for HPC
Targets: x86 CPU, x86 Xeon Phi, NVIDIA GPU, POWER, Sunway, PEZY-SC

[Chart: AWE Hydrodynamics CloverLeaf mini-app, bm32 data set; speedup vs. a single Haswell core. PGI OpenACC, Intel OpenMP, and IBM OpenMP reach roughly 9x on dual Haswell, 10x on dual Broadwell, and 11x on dual POWER8; the same OpenACC source achieves 52x on one Tesla P100 and 77x on one Tesla V100]

Systems: Haswell: 2x16-core Haswell server, four K80s, CentOS 7.2 (perf-hsw10); Broadwell: 2x20-core Broadwell server, eight P100s (dgx1-prd-01); Minsky: POWER8+NVLINK, four P100s, RHEL 7.3 (gsn1).
Compilers: Intel 17.0, IBM XL 13.1.3, PGI 16.10.
Benchmark: CloverLeaf v1.3 downloaded from http://uk-mac.github.io/CloverLeaf the week of November 7, 2016; CloverLeaf_Serial; CloverLeaf_ref (MPI+OpenMP); CloverLeaf_OpenACC (MPI+OpenACC).
Data compiled by PGI November 2016; Volta data collected June 2017.
15. 15
TOP HPC APPS ADOPTING OPENACC
OpenACC: performance portability and ease of programming

3 of the top 10 apps: ANSYS Fluent, Gaussian, VASP
5 CSCS codes: COSMO, ELEPHANT, RAMSES, ICON, ORB5
5 ORNL CAAR codes: GTC, XGC, ACME, FLASH, LSDalton

[Chart: ANSYS Fluent R18.0 radiation solver; time (s) vs. CPU cores (T4, T8, T14, T28), comparing the Fluent native solver with the Fluent HTC solver on a K80 GPU.
CPU: Intel Xeon E5-2695 v3 (Haswell EP) @ 2.30GHz, 2 sockets, 28 cores. GPU: Tesla K80 12+12 GB, driver 346.46]
17. 17
OpenACC DIRECTIVES EXAMPLE

!$acc data copy(A,Anew)      ! copy arrays into GPU memory within data region
iter=0
do while ( err > tol .and. iter < iter_max )
  iter = iter + 1
  err = 0._fp_kind
!$acc kernels                ! parallelize code inside region
  do j=1,m
    do i=1,n
      Anew(i,j) = .25_fp_kind * ( A(i+1,j ) + A(i-1,j ) &
                                + A(i ,j-1) + A(i ,j+1) )
      err = max( err, Anew(i,j) - A(i,j) )
    end do
  end do
!$acc end kernels            ! close off parallel region
  if (mod(iter,100)==0 .or. iter == 1) print *, iter, err
  A = Anew
end do
!$acc end data               ! close off data region, copy data back
18. 18
OPENACC FOR EVERYONE
New PGI Community Edition Now Available

PROGRAMMING MODELS (all editions): OpenACC, CUDA Fortran, OpenMP, C/C++/Fortran compilers and tools
PLATFORMS (all editions): x86, OpenPOWER, NVIDIA GPU

            Community        Professional     Enterprise
UPDATES     1-2 times/year   6-9 times/year   6-9 times/year
SUPPORT     User forums      PGI Support      PGI Professional Services
LICENSE     Annual           Perpetual        Volume/Site
PRICE       Free
20. 20
3 WAYS TO ACCELERATE APPLICATIONS
Libraries: "drop-in" acceleration
OpenACC directives: easily accelerate applications
Programming languages: maximum flexibility
21. 21
GPU PROGRAMMING LANGUAGES
Fortran: CUDA Fortran, OpenACC
C, C++: CUDA C++, OpenACC
Python: CUDA Python, PyCUDA
Numerical analytics: MATLAB, Mathematica, LabVIEW
C#: Altimesh Hybridizer, Alea GPU
22. 22
CUDA C
Standard C code:

void saxpy_serial(int n, float a, float *x, float *y)
{
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

// Perform SAXPY on 1M elements
saxpy_serial(4096*256, 2.0, x, y);

Parallel C code:

__global__
void saxpy_parallel(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

// Perform SAXPY on 1M elements
saxpy_parallel<<<4096,256>>>(n, 2.0, x, y);

http://developer.nvidia.com/cuda-toolkit
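The slide shows only the kernel and its launch; a host-side driver for the parallel version might look like this sketch (names follow the slide; error checking omitted):

/* Hypothetical host driver for saxpy_parallel above. */
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *x = (float*)malloc(bytes), *y = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);                           /* device allocations */
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice); /* host-to-device copies */
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    saxpy_parallel<<<4096,256>>>(n, 2.0f, d_x, d_y);   /* kernel from the slide */

    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost); /* result back to host */
    cudaFree(d_x); cudaFree(d_y); free(x); free(y);
    return 0;
}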
23. 23
CUDA C++: DEVELOP GENERIC PARALLEL CODE
CUDA C++ features enable sophisticated and flexible applications and middleware: class hierarchies, __device__ methods, templates, operator overloading, functors (function objects), device-side new/delete, and more.

template <typename T>
struct Functor {
  __device__ Functor(T _a) : a(_a) {}
  __device__ T operator()(T x) const { return a*x; }
  T a;
};

template <typename T, typename Oper>
__global__ void kernel(T *output, int n) {
  Oper op(3.7);
  // device-side new/delete is also available for dynamic allocation
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    output[i] = op(i);  // apply functor
}

http://developer.nvidia.com/cuda-toolkit
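A host-side launch of the templated kernel might then look like the following sketch (grid and block sizes are illustrative):

/* Hypothetical instantiation of the templated kernel with Functor<float>. */
int n = 1 << 20;
float *d_out;
cudaMalloc(&d_out, n * sizeof(float));
kernel<float, Functor<float> ><<<(n + 255)/256, 256>>>(d_out, n);
cudaDeviceSynchronize();
cudaFree(d_out);

Because the functor type is a template parameter, the multiply in operator() can be inlined into the kernel at compile time.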
24. 24
RAPID PARALLEL C++ DEVELOPMENT
- Resembles the C++ STL: high-level interface, enhances developer productivity
- Enables performance portability between GPUs and multicore CPUs
- Flexible: CUDA, OpenMP, and TBB backends; extensible and customizable; integrates with existing software
- Open source

// generate 32M random numbers on host
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), rand);

// transfer data to device (GPU)
thrust::device_vector<int> d_vec = h_vec;

// sort data on device
thrust::sort(d_vec.begin(), d_vec.end());

// transfer data back to host
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

http://developer.nvidia.com/thrust or http://thrust.googlecode.com
25. 25
CUDA FORTRAN
- Program GPUs using Fortran, a key language for HPC
- Simple language extensions: kernel functions, thread/block IDs, device and data management, parallel loop directives
- Familiar syntax: use allocate/deallocate; copy CPU-to-GPU with assignment (=)

module mymodule
contains
  attributes(global) subroutine saxpy(n, a, x, y)
    real :: x(:), y(:), a
    integer :: n, i
    attributes(value) :: a, n
    i = threadIdx%x + (blockIdx%x-1)*blockDim%x
    if (i <= n) y(i) = a*x(i) + y(i)
  end subroutine saxpy
end module mymodule

program main
  use cudafor; use mymodule
  real, device :: x_d(2**20), y_d(2**20)
  real :: y(2**20)
  x_d = 1.0; y_d = 2.0
  call saxpy<<<4096,256>>>(2**20, 3.0, x_d, y_d)
  y = y_d
  write(*,*) 'max error=', maxval(abs(y-5.0))
end program main

http://developer.nvidia.com/cuda-fortran
26. 26
GET STARTED TODAY
These languages are supported on all CUDA-capable GPUs.
You might already have a CUDA-capable GPU in your laptop or desktop PC!

CUDA C/C++: http://developer.nvidia.com/cuda-toolkit
Thrust C++ Template Library: http://developer.nvidia.com/thrust
CUDA Fortran: http://developer.nvidia.com/cuda-toolkit
CUDA Python: http://developer.nvidia.com/how-to-cuda-python
MATLAB: http://www.mathworks.com/discovery/matlab-gpu.html
Mathematica: http://www.wolfram.com/mathematica/new-in-8/cuda-and-opencl-support/
28. 28
NSIGHT PRODUCT FAMILY
Nsight Systems: system-wide application algorithm tuning
Nsight Compute: CUDA kernel profiling and debugging
Nsight Graphics: graphics shader profiling and debugging
IDE plugins: Nsight Eclipse Edition / Visual Studio (editor, debugger)
29. 29
NSIGHT SYSTEMS
System-wide performance analysis

Observe application behavior: CPU threads, GPU traces, memory bandwidth, and more
Locate optimization opportunities: CUDA and OpenGL APIs, Unified Memory transfers, user annotations using NVTX
Ready for big data: fast GUI capable of visualizing in excess of 10 million events on laptops, container support, minimum user privileges

https://developer.nvidia.com/nsight-systems
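The NVTX user annotations mentioned above take only a few calls; a sketch (the function name, range label, and compile/profile commands are illustrative):

/* Hypothetical NVTX annotation for a region of interest.
   Compile: nvcc app.cu -lnvToolsExt
   Profile: nsys profile ./a.out */
#include <nvToolsExt.h>

void solver_step(void)                 /* hypothetical function */
{
    nvtxRangePushA("solver step");     /* named range appears on the timeline */
    /* ... work to be profiled ... */
    nvtxRangePop();
}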
30. 30
[Screenshot: Nsight Systems timeline showing processes and threads, CUDA and OpenGL API trace, multi-GPU kernel and memory transfer activities, cuDNN and cuBLAS trace, thread/core migration, and thread state]
31. 31
NVIDIA NSIGHT COMPUTE
Next-generation kernel profiler

- Interactive CUDA API debugging and kernel profiling
- Fast data collection
- Improved workflow, fully customizable (baselining, programmable UI/rules)
- Command line, standalone, IDE integration

Platform support
  OS: Linux (x86, ARM), Windows
  GPUs: Pascal, Volta, Turing

[Screenshot: kernel profile comparisons with baseline; metric data source correlation]
32. 32
SIX WAYS TO SAXPY
Programming Languages for GPU Computing
33. 33
SINGLE PRECISION ALPHA X PLUS Y (SAXPY)
Part of the Basic Linear Algebra Subroutines (BLAS) library

z = αx + y, where x, y, z are vectors and α is a scalar

GPU SAXPY in multiple languages and libraries: a menagerie* of possibilities, not a tutorial

*technically, a program chrestomathy: http://en.wikipedia.org/wiki/Chrestomathy
34. 34
OpenACC COMPILER DIRECTIVES
Parallel Fortran code:

subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
!$acc kernels
  do i=1,n
    y(i) = a*x(i) + y(i)
  enddo
!$acc end kernels
end subroutine saxpy

...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

Parallel C code:

void saxpy(int n, float a, float *x, float *y)
{
  #pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...

http://developer.nvidia.com/openacc or http://openacc.org
35. 35
cuBLAS LIBRARY
Serial BLAS code:

int N = 1<<20;
...
// Use your choice of BLAS library
// Perform SAXPY on 1M elements
blas_saxpy(N, 2.0, x, 1, y, 1);

Parallel cuBLAS code:

int N = 1<<20;
cublasInit();
cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);
cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);
// Perform SAXPY on 1M elements
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1);
cublasShutdown();

You can also call cuBLAS from Fortran, C++, Python, and other languages.
http://developer.nvidia.com/cublas
36. 36
CUDA C
Standard C:

void saxpy(int n, float a, float *x, float *y)
{
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

int N = 1<<20;
// Perform SAXPY on 1M elements
saxpy(N, 2.0, x, y);

Parallel C:

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int N = 1<<20;
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
// Perform SAXPY on 1M elements
saxpy<<<4096,256>>>(N, 2.0, d_x, d_y);
cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

http://developer.nvidia.com/cuda-toolkit
37. 37
THRUST C++ TEMPLATE LIBRARY
Serial C++ code with STL and Boost:

int N = 1<<20;
std::vector<float> x(N), y(N);
...
// Perform SAXPY on 1M elements (Boost.Lambda placeholders _1, _2)
std::transform(x.begin(), x.end(), y.begin(), y.begin(),
               2.0f * _1 + _2);

Parallel C++ code:

int N = 1<<20;
thrust::host_vector<float> x(N), y(N);
...
thrust::device_vector<float> d_x = x;
thrust::device_vector<float> d_y = y;
// Perform SAXPY on 1M elements (thrust::placeholders _1, _2)
thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(),
                  2.0f * _1 + _2);

http://thrust.github.com
www.boost.org/libs/lambda
38. 38
CUDA FORTRAN
Standard Fortran:

module mymodule
contains
  subroutine saxpy(n, a, x, y)
    real :: x(:), y(:), a
    integer :: n, i
    do i=1,n
      y(i) = a*x(i) + y(i)
    enddo
  end subroutine saxpy
end module mymodule

program main
  use mymodule
  real :: x(2**20), y(2**20)
  x = 1.0; y = 2.0
  ! Perform SAXPY on 1M elements
  call saxpy(2**20, 2.0, x, y)
end program main

Parallel Fortran:

module mymodule
contains
  attributes(global) subroutine saxpy(n, a, x, y)
    real :: x(:), y(:), a
    integer :: n, i
    attributes(value) :: a, n
    i = threadIdx%x + (blockIdx%x-1)*blockDim%x
    if (i <= n) y(i) = a*x(i) + y(i)
  end subroutine saxpy
end module mymodule

program main
  use cudafor; use mymodule
  real, device :: x_d(2**20), y_d(2**20)
  x_d = 1.0; y_d = 2.0
  ! Perform SAXPY on 1M elements
  call saxpy<<<4096,256>>>(2**20, 2.0, x_d, y_d)
end program main

http://developer.nvidia.com/cuda-fortran
39. 39
PYTHON
Standard Python:

import numpy as np

def saxpy(a, x, y):
    return [a * xi + yi
            for xi, yi in zip(x, y)]

x = np.arange(2**20, dtype=np.float32)
y = np.arange(2**20, dtype=np.float32)
cpu_result = saxpy(2.0, x, y)

http://numpy.scipy.org

Numba parallel Python:

import numpy as np
from numba import vectorize

@vectorize(['float32(float32, float32, float32)'], target='cuda')
def saxpy(a, x, y):
    return a * x + y

N = 1048576
# Initialize arrays
A = np.ones(N, dtype=np.float32)
B = np.ones(A.shape, dtype=A.dtype)

# Compute SAXPY on the GPU
C = saxpy(2.0, A, B)

https://numba.pydata.org
40. 40
ENABLING ENDLESS WAYS TO SAXPY
The CUDA compiler has been contributed to open-source LLVM:
- CUDA C, C++, and Fortran feed the LLVM compiler for CUDA, which targets NVIDIA GPUs and x86 CPUs
- Build front-ends for Java, Python, R, and DSLs (new language support)
- Target other processors like ARM, FPGA, GPUs, x86 (new processor support)
42. 42
DROP-IN ACCELERATION (STEP 1)

int N = 1 << 20;
// Perform SAXPY on 1M elements: y[] = a*x[] + y[]
saxpy(N, 2.0, x, 1, y, 1);
43. 43
DROP-IN ACCELERATION (STEP 1)
Add the "cublas" prefix and use device variables:

int N = 1 << 20;
// Perform SAXPY on 1M elements: d_y[] = a*d_x[] + d_y[]
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
44. 44
DROP-IN ACCELERATION (STEP 2)

int N = 1 << 20;
cublasInit();                             // initialize cuBLAS
// Perform SAXPY on 1M elements: d_y[] = a*d_x[] + d_y[]
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
cublasShutdown();                         // shut down cuBLAS
46. 46
DROP-IN ACCELERATION (STEP 2)

int N = 1 << 20;
cublasInit();
cublasAlloc(N, sizeof(float), (void**)&d_x);
cublasAlloc(N, sizeof(float), (void**)&d_y);

// Transfer data to the GPU
cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);
cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);

// Perform SAXPY on 1M elements: d_y[] = a*d_x[] + d_y[]
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);

// Read data back from the GPU
cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1);

cublasFree(d_x);
cublasFree(d_y);
cublasShutdown();
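Note that these slides use the legacy cuBLAS API. In current toolkits the same program would go through the handle-based v2 API instead; a sketch of the equivalent calls (error checking omitted):

/* Same SAXPY with the handle-based cuBLAS v2 API (cublas_v2.h). */
#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);

cublasSetVector(N, sizeof(float), x, 1, d_x, 1);
cublasSetVector(N, sizeof(float), y, 1, d_y, 1);

float alpha = 2.0f;                               /* scalars are passed by pointer */
cublasSaxpy(handle, N, &alpha, d_x, 1, d_y, 1);

cublasGetVector(N, sizeof(float), d_y, 1, y, 1);
cublasDestroy(handle);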
47. 47
2 BASIC STEPS TO GET STARTED
Step 1: Annotate source code with directives:

!$acc data copy(util1,util2,util3) copyin(ip,scp2,scp2i)
!$acc parallel loop
…
!$acc end parallel
!$acc end data

Step 2: Compile and run:

pgf90 -ta=nvidia -Minfo=accel file.f
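A complete C analogue of the annotated fragment might look like this sketch (the array name, size, and compile line are illustrative):

/* smooth.c: OpenACC data region plus parallel loop, in the shape of the
   fragment above. Compile with e.g.: pgcc -acc -Minfo=accel smooth.c */
#define N 1000000
static float u[N];

void smooth(void)
{
    #pragma acc data copy(u)          /* u stays resident on the GPU */
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            u[i] = 0.5f * u[i];
    }
}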