The document summarizes early experiences using the Summit supercomputer at Oak Ridge National Laboratory. Summit is the world's fastest supercomputer and has been used by several early science projects. Two example applications, GTC and CoMet, have achieved good scaling and performance on Summit. Some initial issues were encountered but addressed. Overall, Summit is a very powerful system but continued software improvements are needed to optimize applications for its complex hardware architecture.
This document proposes a highly parallel semi-dataflow FPGA architecture for accelerating large-scale N-body simulations. The key aspects of the proposed design are: 1) A hardware/software partitioning that accelerates the computationally intensive force calculation step on the FPGA; 2) An optimized data transfer approach to reduce memory traffic; 3) A semi-dataflow architecture providing high parallelism through 48 computation pipelines; and 4) A tiling approach to further improve performance and resource utilization. Experimental results show the design achieves up to 4400 million particle-pairs per second, outperforming CPU and GPU implementations in terms of performance and performance-per-watt.
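The computationally intensive step that such designs offload is the all-pairs force kernel; a plain-Python sketch of what each computation pipeline evaluates (a softened-gravity model for illustration; the `soft` parameter and constants are not taken from the paper):

```python
import math

def all_pairs_forces(pos, mass, soft=1e-3):
    """Direct-summation gravitational forces over O(N^2) particle pairs.

    pos:  list of (x, y, z) tuples
    mass: list of particle masses
    soft: softening length to avoid the singularity as r -> 0
    """
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = pos[i]
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - xi
            dy = pos[j][1] - yi
            dz = pos[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + soft * soft
            inv_r3 = mass[j] / (r2 * math.sqrt(r2))
            forces[i][0] += dx * inv_r3
            forces[i][1] += dy * inv_r3
            forces[i][2] += dz * inv_r3
    return forces
```

Every (i, j) pair is independent, which is exactly what lets a semi-dataflow design keep 48 pipelines busy evaluating many particle pairs per cycle.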
This document summarizes an approach called Gossip-based Resource Allocation for Green Computing in Large Clouds. The approach uses a distributed middleware architecture and gossip-based algorithms to dynamically consolidate virtual machines on the minimum number of active servers for energy efficiency. It aims to maximize utility and minimize power consumption and reconfiguration costs in large cloud environments with over 100,000 machines. The Generic Resource Management Protocol (GRMP) is presented as a scalable solution that can be instantiated in different ways, such as GRMP-Q, to achieve server consolidation under low load and fair allocation under high load. Simulation results show GRMP-Q reduces power consumption while maintaining satisfied demand and fairness across large systems.
Acceleration of the Longwave Rapid Radiative Transfer Module using GPGPU (Mahesh Khadatare)
This poster presents the Weather Research and Forecasting (WRF) model, a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research communities. WRF offers multiple physics options, one of which is the Long-Wave Rapid Radiative Transfer Model (RRTM). Even with the advent of large-scale parallelism in weather models, much of the performance increase has come from rising processor speeds rather than increased parallelism. We present an alternative way to scale model performance by exploiting emerging architectures such as GPGPUs and their fine-grain parallelism, and report a performance gain of more than 23.71x through asynchronous data transfers, texture memory, and techniques such as loop unrolling.
A presentation looking at the parallelisation of the Vienna ab initio Simulation Package (VASP) code, how to optimise performance and scaling by tuning the input control tags, and some initial experience with the NVIDIA GPU (CUDA) port of the code, including compiling, running jobs, and some initial benchmark results.
Multi-phase-field simulations with OpenPhase (PFHub)
The document describes OpenPhase, an open-source phase field modeling toolbox for simulating microstructure evolution. OpenPhase uses a multi-phase field approach and includes modules for simulating processes like coarsening, diffusion, deformation, plasticity, damage, and fluid flow. It has been under development for over 10 years. The document provides an overview of OpenPhase capabilities and includes an example of using it to simulate Mg-Al alloy solidification, showing the effect of cooling rate on microstructure. It also gives details about setting up and running a simulation using the OpenPhase modules in C++.
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines (Intel® Software)
Orbital representations based on B-splines are widely used in quantum Monte Carlo (QMC) simulations of solids, where their evaluation has historically taken as much as 50 percent of the total runtime. Random access to a large four-dimensional array makes it challenging to use the caches and wide vector units of modern CPUs efficiently. We therefore present node-level optimizations of B-spline evaluations on multicore and manycore shared-memory processors.
To increase single instruction, multiple data (SIMD) efficiency and bandwidth utilization, we first apply a data layout transformation from an array of structures (AoS) to a structure of arrays (SoA). Then, by blocking SoA objects, we optimize cache reuse and obtain sustained throughput across a range of problem sizes. We implement efficient nested threading in the B-spline orbital evaluation kernels, paving the way toward strong scaling of QMC simulations and yielding further performance gains. Finally, we employ roofline performance analysis to model the impact of our optimizations.
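The AoS-to-SoA transformation is easy to picture even without SIMD intrinsics: after transposing, each field becomes one contiguous array, which is what lets a vectorized kernel stream unit-stride memory. A minimal plain-Python sketch (the field names here are hypothetical, not from the paper):

```python
# AoS: one object per particle; values of the same field are scattered
# across memory, so a kernel reading only 'x' strides over y and z too.
aos = [{"x": 1.0, "y": 2.0, "z": 3.0},
       {"x": 4.0, "y": 5.0, "z": 6.0}]

def aos_to_soa(records):
    """Transpose an array of structures into a structure of arrays."""
    keys = records[0].keys()
    return {k: [r[k] for r in records] for k in keys}

soa = aos_to_soa(aos)
# SoA: each field is now contiguous, so a loop over soa["x"] touches
# unit-stride memory -- the access pattern SIMD units and caches prefer.
sum_x = sum(soa["x"])
```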
Multi-core GPU – Fast parallel SAR image generation (Mahesh Khadatare)
This document summarizes a poster presentation about using GPUs to parallelize synthetic aperture radar (SAR) image generation. It describes a proposed parallel algorithm that divides SAR data into blocks and processes the range and azimuth dimensions concurrently on GPU cores. Experimental results show that an Nvidia Fermi S2050 GPU delivered the highest performance, processing a 1024x1024 SAR image more than 74 times faster than a CPU.
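The two-pass block decomposition described above can be mimicked with ordinary thread pools: process each line of the range (row) dimension in parallel, then each line of the azimuth (column) dimension. This sketch uses a trivial per-line running sum as a stand-in for the actual SAR compression math, which the poster does not spell out:

```python
from concurrent.futures import ThreadPoolExecutor

def process_line(line):
    # Stand-in for per-line range/azimuth compression (in a real SAR
    # processor this would be a matched filter or FFT-based step);
    # a running sum keeps the results easy to check.
    out, acc = [], 0
    for v in line:
        acc += v
        out.append(acc)
    return out

def process_image(img, workers=4):
    """Process rows (range), then columns (azimuth), each dimension in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(process_line, img))         # range dimension
        cols = list(pool.map(process_line, zip(*rows)))  # azimuth dimension
    return [list(r) for r in zip(*cols)]                 # back to row-major
```

Each line is independent within a pass, so on a GPU every line (or block of lines) maps to its own group of cores; the thread pool here only illustrates that independence.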
Penn State RCC has been a CUDA research center for the last year; this talk provides success stories and challenges. GPU case studies are given, including algorithm details and performance results.
In this deck from the HPC User Forum at Argonne, Andrew Siegel from Argonne presents: ECP Application Development.
"The Exascale Computing Project is accelerating delivery of a capable exascale computing ecosystem for breakthroughs in scientific discovery, energy assurance, economic competitiveness, and national security. ECP is chartered with accelerating delivery of a capable exascale computing ecosystem to provide breakthrough modeling and simulation solutions to address the most critical challenges in scientific discovery, energy assurance, economic competitiveness, and national security. This role goes far beyond the limited scope of a physical computing system. ECP’s work encompasses the development of an entire exascale ecosystem: applications, system software, hardware technologies and architectures, along with critical workforce development."
Watch the video: https://wp.me/p3RLHQ-kSL
Learn more: https://www.exascaleproject.org
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses CERN's use of Oracle's In-Memory Column Store to perform real-time analysis of physics experiment data from the Large Hadron Collider. Benchmark tests showed significant performance improvements over traditional row-based storage, with analytic queries running 10-100x faster. The columnar format also improved data compression rates. Additionally, OLTP workloads saw no negative impacts. CERN plans to consider the technology for future projects given its ability to enable real-time analysis that was previously not possible.
Performance Optimization of HPC Applications: From Hardware to Source Code (Fisnik Kraja)
The document summarizes optimization techniques for HPC applications from hardware selection to application code tuning. It describes analyzing application performance, choosing an appropriate system, efficiently using resources, tuning system parameters, and optimizing code. Examples are provided for AVL Fire and OpenFOAM simulations, analyzing scalability, hardware dependencies, and reducing runtime through MPI and system tuning.
The Barcelona Supercomputing Center (BSC) was established in 2005 and hosts MareNostrum, one of the most powerful supercomputers in Spain. We are the pioneering supercomputing center in Spain. Our specialty is high-performance computing (HPC), and our mission is twofold: to provide supercomputing infrastructure and services to Spanish and European scientists, and to generate knowledge and technology and transfer them to society. We are a Severo Ochoa Center of Excellence, a tier-0 member of the European research infrastructure PRACE (Partnership for Advanced Computing in Europe), and we manage the Spanish Supercomputing Network (RES). As a research center, we have more than 456 experts from 45 countries, organized into four major research areas: computer sciences, life sciences, earth sciences, and computational applications in science and engineering.
Customization of a deep learning accelerator, based on NVDLA (Shien-Chun Luo)
This document discusses customizing a deep learning accelerator. It begins with a demonstration of object detection using a Tiny YOLO v1 model on an FPGA-based prototype. It then discusses designing a high-efficiency accelerator with three steps: 1) increasing MAC processing elements and utilization, 2) increasing data supply, and 3) improving energy efficiency. Various neural network models are profiled to analyze memory bandwidth and computational power tradeoffs. The document proposes a customizable architecture and discusses solutions like layer fusion, quantization-aware training, and post-training quantization. Performance estimates using an equation-based profiler for sample models are provided to demonstrate the customized accelerator design.
Benchmark Analysis of Multi-core Processor Memory Contention, April 2009 (James McGalliard)
This document summarizes benchmark testing of a cubed sphere climate modeling application on a multi-core cluster. The testing showed that using fewer cores per node improved performance. Runtime was reduced by 38% when using 2 cores per node instead of 8 cores. MPI performance and cache access times also degraded with increased core density per node. Overall, the results indicate that job scheduling should aim to use fewer cores per node to optimize runtime in multi-core environments where resource contention can occur.
This update covers our DLA system, from the design through add-on functions and applications. During 2018–2019, we developed the tools needed for IC simulation and verification, constructed a quantization-aware and hardware-aware training flow, and improved the automation of verification. We have verified the system on FPGA and in a silicon SoC.
GTC Japan 2016 Chainer feature introduction (Kenta Oono)
This document introduces Chainer's new trainer and dataset abstraction features which provide a standardized way to implement training loops and access datasets. The key aspects are:
- Trainer handles the overall training loop and allows extensions to customize checkpoints, logging, evaluation etc.
- Updater handles fetching mini-batches and model optimization within each loop.
- Iterators handle accessing datasets and returning mini-batches.
- Extensions can be added to the trainer for tasks like evaluation, visualization, and saving snapshots.
This abstraction makes implementing training easier and more customizable while still allowing manual control when needed. Common iterators, updaters, and extensions are provided to cover most use cases.
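The division of labor described above (iterator feeds mini-batches, updater applies them, trainer drives the loop and calls extensions) is framework-independent. A minimal stand-in makes the roles concrete; note these are not Chainer's actual class names or signatures:

```python
class Iterator:
    """Yields mini-batches from a dataset, cycling as needed."""
    def __init__(self, dataset, batch_size):
        self.dataset, self.batch_size, self.pos = dataset, batch_size, 0
    def next_batch(self):
        batch = self.dataset[self.pos:self.pos + self.batch_size]
        self.pos = (self.pos + self.batch_size) % max(len(self.dataset), 1)
        return batch

class Updater:
    """Pulls one batch per step and applies the optimization update."""
    def __init__(self, iterator, update_fn):
        self.iterator, self.update_fn = iterator, update_fn
    def update(self):
        return self.update_fn(self.iterator.next_batch())

class Trainer:
    """Runs the training loop, invoking extensions (logging, snapshots,
    evaluation, ...) after each step."""
    def __init__(self, updater, extensions=()):
        self.updater, self.extensions, self.log = updater, list(extensions), []
    def run(self, steps):
        for step in range(steps):
            loss = self.updater.update()
            for ext in self.extensions:
                ext(step, loss, self)

# Usage: `sum` stands in for a real loss-computing update function,
# and the lone extension records (step, loss) pairs like a log report.
it = Iterator(list(range(8)), batch_size=4)
trainer = Trainer(Updater(it, update_fn=sum),
                  extensions=[lambda s, l, t: t.log.append((s, l))])
trainer.run(2)
```

Because each role is a separate object, any one of them can be swapped out (a shuffling iterator, a multi-model updater, a new extension) without touching the loop itself, which is the customizability the abstraction is after.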
HPC, grid and cloud computing - the past, present, and future challenge (Jason Shih)
This document discusses trends in high performance computing (HPC), grid computing, and cloud computing. It provides an overview of HPC cluster performance and interconnects. Grid computing enabled large-scale scientific collaboration through infrastructures like EGEE. The LHC requires petascale computing capabilities. Cloud computing hype is discussed alongside observations of performance and virtualization challenges. The future of computing may involve more sophisticated tools and dynamic, small computing elements.
Designing HPC & Deep Learning Middleware for Exascale Systems (inside-BigData.com)
DK Panda from Ohio State University presented this deck at the 2017 HPC Advisory Council Stanford Conference.
"This talk will focus on challenges in designing runtime environments for exascale systems with millions of processors and accelerators to support various programming models. We will focus on MPI, PGAS (OpenSHMEM, CAF, UPC and UPC++) and Hybrid MPI+PGAS programming models by taking into account support for multi-core, high-performance networks, accelerators (GPGPUs and Intel MIC), virtualization technologies (KVM, Docker, and Singularity), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries will be presented."
Watch the video: http://wp.me/p3RLHQ-glW
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Frank Ham from Cascade Technologies presented this deck at the Stanford HPC Conference.
"A spin-off of the Center for Turbulence Research at Stanford University, Cascade Technologies grew out of a need to bridge between fundamental research from institutions like Stanford University and its application in industries. In a continual push to improve the operability and performance of combustion devices, high-fidelity simulation methods for turbulent combustion are emerging as critical elements in the design process. Multiphysics based methodologies can accurately predict mixing, study flame structure and stability, and even predict product and pollutant concentrations at design and off-design conditions."
Watch the video: http://insidehpc.com/2017/02/best-practices-large-scale-multiphysics/
Learn more: http://www.cascadetechnologies.com
and
http://www.hpcadvisorycouncil.com/events/2017/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Parallel Implementation of K-Means Clustering on CUDA (prithan)
K-Means clustering is a popular clustering algorithm in data mining. Clustering large data sets can be time consuming, and in an attempt to minimize this time, our project is a parallel implementation of the K-Means clustering algorithm on CUDA using C. We present the performance analysis and implementation of our approach to parallelizing K-Means clustering.
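The step that parallelizes naturally on CUDA is the assignment of each point to its nearest centroid, since every point is independent. A serial Python sketch of one Lloyd iteration shows the structure (in the CUDA version, each iteration of the outer assignment loop would map to one GPU thread):

```python
def assign(points, centroids):
    """Nearest-centroid label for each point (the data-parallel step)."""
    labels = []
    for p in points:  # each iteration is independent -> one GPU thread
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

def update(points, labels, k):
    """Recompute each centroid as the mean of its assigned points."""
    dim = len(points[0])
    centroids = []
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            centroids.append([0.0] * dim)  # keep empty clusters well-defined
    return centroids
```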
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...) (Fisnik Kraja)
This document summarizes the results of performance analysis and optimizations done on the STAR-CCM+ application run on different Intel CPU configurations. The analysis showed that the application's performance was highly dependent on CPU frequency (85-88%) and benefited from optimizations like CPU binding, huge pages, and scatter task placement. Comparing CPU types showed the 12-core CPU was 8-9% faster. Hyperthreading had a minimal impact on performance. Turbo Boost was effective but its benefits reduced as fewer cores were utilized.
This document discusses optimizations for deep learning frameworks on Intel CPUs and Fugaku processors. It introduces oneDNN, an Intel performance library for deep neural networks. JIT assembly using Xbyak is proposed to generate optimized code depending on parameters at runtime. Xbyak has been extended to AArch64 as Xbyak_aarch64 to support Fugaku. AVX-512 SIMD instructions are briefly explained.
The document provides a status report on testing the Helix Nebula Science Cloud for interactive data analysis by end users of the TOTEM experiment. It summarizes the deployment of a "Science Box" platform on the Helix Nebula Cloud using technologies like EOS, CERNBox, SWAN and SPARK. Initial tests of the platform were successful in 2017 using a single VM. Current tests involve a scalable deployment with Kubernetes and using SPARK as the computing engine. Synthetic benchmarks and a TOTEM data analysis example show the platform is functioning well with room to scale out storage and computing resources for larger datasets and analyses.
Gossip-based resource allocation for green computing in large clouds (Rerngvit Yanggratoke)
This document summarizes a research paper on a gossip-based resource allocation protocol called GRMP-Q for server consolidation in large cloud environments. The protocol aims to minimize active servers and allocate resources fairly while adapting dynamically to load changes. It uses a distributed middleware architecture and gossip algorithms to provide scalability without single points of failure. Simulation results show GRMP-Q reduces power usage by shutting down servers, satisfies demand fairly, and reconfigures with low cost compared to optimal solutions. Future work areas include analyzing convergence, supporting heterogeneity, and expanding the architecture.
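The consolidation behavior can be illustrated with a toy gossip loop (a simplified stand-in, not the GRMP-Q protocol itself): each round, every server contacts one random peer, and the emptier of the pair sheds load onto the fuller one when capacity allows, so demand concentrates on fewer active servers without any central coordinator.

```python
import random

def gossip_round(load, capacity, rng):
    """One gossip round: each server pairs with a random peer; the
    emptier server sheds load onto the fuller one, capacity permitting."""
    n = len(load)
    for i in range(n):
        j = rng.randrange(n)
        if i == j:
            continue
        lo, hi = (i, j) if load[i] <= load[j] else (j, i)
        move = min(load[lo], capacity - load[hi])
        load[hi] += move
        load[lo] -= move

def consolidate(load, capacity, rounds=50, seed=0):
    """Run gossip rounds in place; return the number of active servers."""
    rng = random.Random(seed)
    for _ in range(rounds):
        gossip_round(load, capacity, rng)
    return sum(1 for l in load if l > 0)
```

Each round only touches random pairs, so the cost per server stays constant as the system grows; that pairwise locality is what makes the gossip approach scale to the 100,000-machine clouds the paper targets.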
- Python has become a robust platform for scientific and engineering work, from data analysis to modeling and visualization. It has clear syntax, many open source libraries, and can be used across operating systems.
- This document discusses the history and advantages of using Python for earth sciences, including modeling hydrodynamics and earthquakes. It also provides examples of using Python with libraries like NumPy, SciPy, and matplotlib for tasks like data analysis, visualization, and GIS processing.
- Python is now widely used for tasks that previously required other languages or programs, offering an integrated environment while maintaining high performance via compiled extensions.
Memory hardware such as DRAM and NAND flash faces scaling challenges as density increases, which could impact performance and cost. New non-volatile memory (NVM) technologies may provide opportunities to address these challenges but require software and system architecture changes to realize their full potential. Key considerations include persistence, performance, and programming models.
NERSC is the production high-performance computing (HPC) center for the United States Department of Energy (DOE) Office of Science. The center supports over 6,000 users in 600 projects, using a variety of applications in materials science, chemistry, biology, astrophysics, high energy physics, climate science, fusion science, and more.
NERSC deployed the Cori system with over 9,000 Intel® Xeon Phi™ processors. This session describes the optimization strategy for porting codes that target traditional multicore architectures to these processors. We also discuss highlights and lessons learned from the optimization process on 20 applications associated with the NERSC Exascale Science Application Program (NESAP).
byteLAKE's expertise across NVIDIA architectures and configurations (byteLAKE)
AI Solutions for Industries | Quality Inspection | Data Insights | AI-accelerated CFD | Self-Checkout | byteLAKE.com
byteLAKE: Empowering Industries with AI Solutions. Embrace cutting-edge technology for advanced quality inspection, data insights, and more. Harness the potential of our CFD Suite, accelerating Computational Fluid Dynamics for heightened productivity. Unlock new possibilities with Cognitive Services: image analytics for precise visual inspection for Manufacturing, sound analytics enabling proactive maintenance for Automotive, and wet line analytics for the Paper Industry. Seamlessly convert data into actionable insights using Data Insights' AI module, enabling advanced predictive maintenance and risk detection. Simplify Restaurant and Retail operations with our efficient self-checkout solution, recognizing meals and groceries and elevating customer satisfaction. Custom AI Development services available for tailored solutions. Discover more at www.byteLAKE.com.
Multi-core GPU – Fast parallel SAR image generationMahesh Khadatare
This document summarizes a poster presentation about using GPUs to parallelize synthetic aperture radar (SAR) image generation. It describes a proposed parallel algorithm that divides SAR data into blocks and processes the range and azimuth dimensions concurrently on GPU cores. Experimental results show that an Nvidia Fermi S2050 GPU achieved the highest performance and fastest speeds, processing a 1024x1024 SAR image over 74 times faster than a CPU.
Penn State RCC has been a CUDA research center for the last year; this talk provides success stories and challenges. GPU case studies are given, including algorithm details and performance results.
In this deck from the HPC User Forum at Argonne, Andrew Siegel from Argonne presents: ECP Application Development.
"The Exascale Computing Project is accelerating delivery of a capable exascale computing ecosystem for breakthroughs in scientific discovery, energy assurance, economic competitiveness, and national security. ECP is chartered with accelerating delivery of a capable exascale computing ecosystem to provide breakthrough modeling and simulation solutions to address the most critical challenges in scientific discovery, energy assurance, economic competitiveness, and national security. This role goes far beyond the limited scope of a physical computing system. ECP’s work encompasses the development of an entire exascale ecosystem: applications, system software, hardware technologies and architectures, along with critical workforce development."
Watch the video: https://wp.me/p3RLHQ-kSL
Learn more: https://www.exascaleproject.org
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses CERN's use of Oracle's In-Memory Column Store to perform real-time analysis of physics experiment data from the Large Hadron Collider. Benchmark tests showed significant performance improvements over traditional row-based storage, with analytic queries running 10-100x faster. The columnar format also improved data compression rates. Additionally, OLTP workloads saw no negative impacts. CERN plans to consider the technology for future projects given its ability to enable real-time analysis that was previously not possible.
Performance Optimization of HPC Applications: From Hardware to Source CodeFisnik Kraja
The document summarizes optimization techniques for HPC applications from hardware selection to application code tuning. It describes analyzing application performance, choosing an appropriate system, efficiently using resources, tuning system parameters, and optimizing code. Examples are provided for AVL Fire and OpenFOAM simulations, analyzing scalability, hardware dependencies, and reducing runtime through MPI and system tuning.
El Barcelona Supercomputing Center (BSC) fue establecido en 2005 y alberga el MareNostrum, uno de los superordenadores más potentes de España. Somos el centro pionero de la supercomputación en España. Nuestra especialidad es la computación de altas prestaciones - también conocida como HPC o High Performance Computing- y nuestra misión es doble: ofrecer infraestructuras y servicio de supercomputación a los científicos españoles y europeos, y generar conocimiento y tecnología para transferirlos a la sociedad. Somos Centro de Excelencia Severo Ochoa, miembros de primer nivel de la infraestructura de investigación europea PRACE (Partnership for Advanced Computing in Europe), y gestionamos la Red Española de Supercomputación (RES). Como centro de investigación, contamos con más de 456 expertos de 45 países, organizados en cuatro grandes áreas de investigación: Ciencias de la computación, Ciencias de la vida, Ciencias de la tierra y aplicaciones computacionales en ciencia e ingeniería.
customization of a deep learning accelerator, based on NVDLAShien-Chun Luo
This document discusses customizing a deep learning accelerator. It begins with a demonstration of object detection using a Tiny YOLO v1 model on an FPGA-based prototype. It then discusses designing a high-efficiency accelerator with three steps: 1) increasing MAC processing elements and utilization, 2) increasing data supply, and 3) improving energy efficiency. Various neural network models are profiled to analyze memory bandwidth and computational power tradeoffs. The document proposes a customizable architecture and discusses solutions like layer fusion, quantization-aware training, and post-training quantization. Performance estimates using an equation-based profiler for sample models are provided to demonstrate the customized accelerator design.
Benchmark Analysis of Multi-core Processor Memory Contention April 2009James McGalliard
This document summarizes benchmark testing of a cubed sphere climate modeling application on a multi-core cluster. The testing showed that using fewer cores per node improved performance. Runtime was reduced by 38% when using 2 cores per node instead of 8 cores. MPI performance and cache access times also degraded with increased core density per node. Overall, the results indicate that job scheduling should aim to use fewer cores per node to optimize runtime in multi-core environments where resource contention can occur.
We updated the DLA system introductions here, from design, add-on functions, and applications. During the 2018~2019, we developed the tools needed for IC simulation and verification, constructed a quantize-aware & HW-aware training flow, and improved the automation of the verification. We have verified this system through FPGA and solid-state SoC.
GTC Japan 2016 Chainer feature introductionKenta Oono
This document introduces Chainer's new trainer and dataset abstraction features which provide a standardized way to implement training loops and access datasets. The key aspects are:
- Trainer handles the overall training loop and allows extensions to customize checkpoints, logging, evaluation etc.
- Updater handles fetching mini-batches and model optimization within each loop.
- Iterators handle accessing datasets and returning mini-batches.
- Extensions can be added to the trainer for tasks like evaluation, visualization, and saving snapshots.
This abstraction makes implementing training easier and more customizable while still allowing manual control when needed. Common iterators, updaters, and extensions are provided to cover most use cases.
Hpc, grid and cloud computing - the past, present, and future challengeJason Shih
This document discusses trends in high performance computing (HPC), grid computing, and cloud computing. It provides an overview of HPC cluster performance and interconnects. Grid computing enabled large-scale scientific collaboration through infrastructures like EGEE. The LHC requires petascale computing capabilities. Cloud computing hype is discussed alongside observations of performance and virtualization challenges. The future of computing may involve more sophisticated tools and dynamic, small computing elements.
Designing HPC & Deep Learning Middleware for Exascale Systemsinside-BigData.com
DK Panda from Ohio State University presented this deck at the 2017 HPC Advisory Council Stanford Conference.
"This talk will focus on challenges in designing runtime environments for exascale systems with millions of processors and accelerators to support various programming models. We will focus on MPI, PGAS (OpenSHMEM, CAF, UPC and UPC++) and Hybrid MPI+PGAS programming models by taking into account support for multi-core, high-performance networks, accelerators (GPGPUs and Intel MIC), virtualization technologies (KVM, Docker, and Singularity), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries will be presented."
Watch the video: http://wp.me/p3RLHQ-glW
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Frank Ham from Cascade Technologies presented this deck at the Stanford HPC Conference.
"A spin-off of the Center for Turbulence Research at Stanford University, Cascade Technologies grew out of a need to bridge between fundamental research from institutions like Stanford University and its application in industries. In a continual push to improve the operability and performance of combustion devices, high-fidelity simulation methods for turbulent combustion are emerging as critical elements in the design process. Multiphysics based methodologies can accurately predict mixing, study flame structure and stability, and even predict product and pollutant concentrations at design and off-design conditions."
Watch the video: http://insidehpc.com/2017/02/best-practices-large-scale-multiphysics/
Learn more: http://www.cascadetechnologies.com
and
http://www.hpcadvisorycouncil.com/events/2017/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Parallel Implementation of K Means Clustering on CUDA (prithan)
K-Means clustering is a popular clustering algorithm in data mining. Clustering large data sets can be time consuming, and in an attempt to minimize this time, our project is a parallel implementation of the K-Means clustering algorithm on CUDA using C. We present the performance analysis and implementation of our approach to parallelizing K-Means clustering.
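The two data-parallel phases of a K-Means iteration — assigning each point to its nearest centroid, then recomputing centroids as cluster means — are what a CUDA port maps onto threads, with the O(n·k·d) assignment phase dominating runtime. A sequential Python sketch of one iteration (the project's implementation is in CUDA C; this only illustrates the structure being parallelized):

```python
def kmeans_step(points, centroids):
    """One iteration of Lloyd's K-Means: assign, then update.
    In a CUDA version, each thread would handle one point in the
    assignment phase, which is the computational bottleneck."""
    k = len(centroids)
    # Assignment phase: nearest centroid by squared Euclidean distance.
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    # Update phase: each centroid becomes the mean of its members.
    new_centroids = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            dim = len(members[0])
            new_centroids.append(tuple(
                sum(m[i] for m in members) / len(members) for i in range(dim)))
        else:
            new_centroids.append(centroids[j])  # keep empty clusters in place
    return labels, new_centroids

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, cents = kmeans_step(pts, [(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 1, 1]
print(cents)   # [(0.0, 0.5), (10.0, 10.5)]
```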
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_... (Fisnik Kraja)
This document summarizes the results of performance analysis and optimizations done on the STAR-CCM+ application run on different Intel CPU configurations. The analysis showed that the application's performance was highly dependent on CPU frequency (85-88%) and benefited from optimizations like CPU binding, huge pages, and scatter task placement. Comparing CPU types showed the 12-core CPU was 8-9% faster. Hyperthreading had a minimal impact on performance. Turbo Boost was effective but its benefits reduced as fewer cores were utilized.
This document discusses optimizations for deep learning frameworks on Intel CPUs and Fugaku processors. It introduces oneDNN, an Intel performance library for deep neural networks. JIT assembly using Xbyak is proposed to generate optimized code depending on parameters at runtime. Xbyak has been extended to AArch64 as Xbyak_aarch64 to support Fugaku. AVX-512 SIMD instructions are briefly explained.
The document provides a status report on testing the Helix Nebula Science Cloud for interactive data analysis by end users of the TOTEM experiment. It summarizes the deployment of a "Science Box" platform on the Helix Nebula Cloud using technologies like EOS, CERNBox, SWAN and SPARK. Initial tests of the platform were successful in 2017 using a single VM. Current tests involve a scalable deployment with Kubernetes and using SPARK as the computing engine. Synthetic benchmarks and a TOTEM data analysis example show the platform is functioning well with room to scale out storage and computing resources for larger datasets and analyses.
Gossip-based resource allocation for green computing in large clouds (Rerngvit Yanggratoke)
This document summarizes a research paper on a gossip-based resource allocation protocol called GRMP-Q for server consolidation in large cloud environments. The protocol aims to minimize active servers and allocate resources fairly while adapting dynamically to load changes. It uses a distributed middleware architecture and gossip algorithms to provide scalability without single points of failure. Simulation results show GRMP-Q reduces power usage by shutting down servers, satisfies demand fairly, and reconfigures with low cost compared to optimal solutions. Future work areas include analyzing convergence, supporting heterogeneity, and expanding the architecture.
- Python has become a robust platform for scientific and engineering work, from data analysis to modeling and visualization. It has clear syntax, many open source libraries, and can be used across operating systems.
- This document discusses the history and advantages of using Python for earth sciences, including modeling hydrodynamics and earthquakes. It also provides examples of using Python with libraries like NumPy, SciPy, and matplotlib for tasks like data analysis, visualization, and GIS processing.
- Python is now widely used for tasks that previously required other languages or programs, offering an integrated environment while maintaining high performance via compiled extensions.
- Hardware such as DRAM and NAND flash are facing scaling challenges as density increases, which could impact performance and cost. New non-volatile memory (NVM) technologies may provide opportunities to address these challenges but require software and system architecture changes to realize their full potential. Key considerations include persistence, performance, and programming models.
NERSC is the production high-performance computing (HPC) center for the United States Department of Energy (DOE) Office of Science. The center supports over 6,000 users in 600 projects, using a variety of applications in materials science, chemistry, biology, astrophysics, high energy physics, climate science, fusion science, and more.
NERSC deployed the Cori system on over 9,000 Intel® Xeon Phi™ processors. This session describes the optimization strategy for porting codes that target traditional manycore architectures to the processors. We also discuss highlights and lessons learned from the optimization process on 20 applications associated with the NERSC Exascale Science Application Program (NESAP).
byteLAKE's expertise across NVIDIA architectures and configurations (byteLAKE)
AI Solutions for Industries | Quality Inspection | Data Insights | AI-accelerated CFD | Self-Checkout | byteLAKE.com
byteLAKE: Empowering Industries with AI Solutions. Embrace cutting-edge technology for advanced quality inspection, data insights, and more. Harness the potential of our CFD Suite, accelerating Computational Fluid Dynamics for heightened productivity. Unlock new possibilities with Cognitive Services: image analytics for precise visual inspection for Manufacturing, sound analytics enabling proactive maintenance for Automotive, and wet line analytics for the Paper Industry. Seamlessly convert data into actionable insights using Data Insights' AI module, enabling advanced predictive maintenance and risk detection. Simplify Restaurant and Retail operations with our efficient self-checkout solution, recognizing meals and groceries and elevating customer satisfaction. Custom AI Development services available for tailored solutions. Discover more at www.byteLAKE.com.
The document summarizes benchmarking results for four magnetic fusion simulation codes: GTS, TGYRO, BOUT++, and VORPAL. It was performed on the Cray XE6 "Hopper" supercomputer at NERSC to evaluate performance, scalability, memory usage, and communication overhead at large scales. For GTS, weak scaling tests showed computation time remained constant while communication time increased slightly with up to 49,152 cores. Testing also examined the codes' sensitivity to reduced memory bandwidth by increasing core count per node. Overall results provide insight to improve fusion code design and inform exascale co-design efforts.
Abstractions and Directives for Adapting Wavefront Algorithms to Future Archi... (inside-BigData.com)
In this deck from PASC18, Robert Searles from the University of Delaware presents: Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures.
"Architectures are rapidly evolving, and exascale machines are expected to offer billion-way concurrency. We need to rethink algorithms, languages and programming models among other components in order to migrate large scale applications and explore parallelism on these machines. Although directive-based programming models allow programmers to worry less about programming and more about science, expressing complex parallel patterns in these models can be a daunting task especially when the goal is to match the performance that the hardware platforms can offer. One such pattern is wavefront. This paper extensively studies a wavefront-based miniapplication for Denovo, a production code for nuclear reactor modeling.
We parallelize the Koch-Baker-Alcouffe (KBA) parallel-wavefront sweep algorithm in the main kernel of Minisweep (the miniapplication) using CUDA, OpenMP and OpenACC. Our OpenACC implementation running on NVIDIA's next-generation Volta GPU boasts an 85.06x speedup over serial code, which is larger than CUDA's 83.72x speedup over the same serial implementation. Our experimental platform includes SummitDev, an ORNL representative architecture of the upcoming Summit supercomputer. Our parallelization effort across platforms also motivated us to define an abstract parallelism model that is architecture independent, with a goal of creating software abstractions that can be used by applications employing the wavefront sweep motif."
Watch the video: https://wp.me/p3RLHQ-iPU
Read the Full Paper: https://doi.org/10.1145/3218176.3218228
and
https://pasc18.pasc-conference.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Architecture Aware Algorithms and Software for Peta and Exascale (inside-BigData.com)
Jack Dongarra from the University of Tennessee presented these slides at Ken Kennedy Institute of Information Technology on Feb 13, 2014.
Listen to the podcast review of this talk: http://insidehpc.com/2014/02/13/week-hpc-jack-dongarra-talks-algorithms-exascale/
Towards Exascale Simulations of Stellar Explosions with FLASH (Ganesan Narayanasamy)
- ORNL is managed by UT-Battelle for the US Department of Energy and conducts research including simulations of stellar explosions using the FLASH code.
- The research aims to prepare FLASH to run on the upcoming Summit supercomputer by accelerating components like the nuclear kinetics module using GPUs.
- Preliminary results show significant speedups from using GPUs for large nuclear reaction networks that were previously too computationally expensive.
AI optimizing HPC simulations (presentation from 6th EULAG Workshop) (byteLAKE)
See our presentation from the 6th International EULAG Users Workshop. We talked about taking HPC to the "Industry 4.0" by implementing smart techniques to optimize the codes in terms of performance and energy consumption. It explains how Machine Learning can dynamically optimize HPC simulations and byteLAKE's software autotuning solution.
Find out more about byteLAKE at: www.byteLAKE.com
Assisting User’s Transition to Titan’s Accelerated Architecture (inside-BigData.com)
Oak Ridge National Lab is home to Titan, the largest GPU-accelerated supercomputer in the world. This alone can be intimidating for users new to leadership computing facilities. Our facility has collected over four years of experience helping users port applications to Titan. This talk will explain common paths and tools for successfully porting applications, and expose common difficulties experienced by new users. Lastly, learn how our free and open training program can assist your organization in this transition.
Artificial Neural Networks for Storm Surge Prediction in North Carolina (Anton Bezuglov)
A feedforward artificial neural network (FF ANN) for storm surge prediction in North Carolina. Presentation at the Coastal Resilience Center by Anton Bezuglov, Ph.D. Uses TensorFlow and Python, with links to the code on GitHub.
In this video from the 2015 Stanford HPC Conference, Pavel Shamis from ORNL presents: Preparing OpenSHMEM for Exascale.
"OpenSHMEM is a partitioned global address space (PGAS) one-sided communications library that enables remote memory access (RMA) across processing elements (PEs). Its API allows data to be transferred from one PE memory space to another PE’s symmetric memory space; decoupling the data transfers from synchronizations. OpenSHMEM is useful for applications that are latency driven or that have irregular communication patterns, because its one-sided API can be mapped very efficiently to hardware (e.g. RDMA interconnects, etc), and its one-sided programming model helps the overlapping of communication with computation. Summit is Oak Ridge National Laboratory’s next high performance supercomputer system that will be based on a many core/GPU hybrid architecture. In order to prepare OpenSHMEM for future systems, it is important to enhance its programming model to enable efficient utilization of the new hardware capabilities (e.g. massive multithreaded systems, accesses different type memories, next generation of interconnects, etc). This session will present recent advances in the area of OpenSHMEM extensions, implementations, and tools.”
Watch the video: http://insidehpc.com/2015/02/video-preparing-openshmem-for-exascale/
See more talks in the Stanford HPC Conference Video Gallery: http://wp.me/P3RLHQ-dOO
1. Building exascale computers requires moving to sub-nanometer scales and steering individual electrons to solve problems more efficiently.
2. Moving data is a major challenge, as moving data off-chip uses 200x more energy than computing with it on-chip.
3. Future computers should optimize for data movement at all levels, from system design to microarchitecture, to minimize energy usage.
Evaluating Classification Algorithms Applied To Data Streams (Esteban Donato)
This document summarizes and evaluates several algorithms for classification of data streams: VFDTc, UFFT, and CVFDT. It describes their approaches for handling concept drift, detecting outliers and noise. The algorithms were tested on synthetic data streams generated with configurable attributes like drift frequency and noise percentage. Results show VFDTc and UFFT performed best in accuracy, while CVFDT and UFFT were fastest. The study aims to help choose algorithms suitable for different data stream characteristics like gradual vs sudden drift or frequent vs infrequent drift.
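The VFDT family of stream classifiers evaluated here (VFDTc, CVFDT) decides when a node has seen enough examples to split by comparing attribute gains against the Hoeffding bound. A small sketch of that bound (standard formula; the split logic around it is only outlined in comments):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the observed mean of n samples of a
    variable with range value_range lies within eps of the true mean.
    VFDT-style learners split a node once the gain difference between the
    two best attributes exceeds this eps."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# The bound shrinks as more stream examples arrive, so a split that is
# statistically unclear after 100 examples can become safe after 10,000.
eps_small = hoeffding_bound(value_range=1.0, delta=1e-7, n=100)
eps_large = hoeffding_bound(value_range=1.0, delta=1e-7, n=10_000)
print(round(eps_small, 3), round(eps_large, 3))  # 0.284 0.028
```

This is why such learners can process unbounded streams in one pass: the decision to split is made from sufficient statistics, never by revisiting stored examples.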
A Source-To-Source Approach to HPC Challenges (Chunhua Liao)
This document discusses using source-to-source compilers to address high performance computing challenges. It presents a source-to-source approach using the ROSE compiler infrastructure to generate programming models for heterogeneous computing via directives, enable performance optimization through autotuning, and improve resilience by adding source-level redundancy. Preliminary results are shown for accelerating common kernels like AXPY, matrix multiplication, and Jacobi iteration using a prototype source-to-source OpenMP compiler called HOMP to target GPUs.
Talk @ APT Group, University of Manchester, 06 August 2014
Abstract:
Today's HPC systems, such as those in the Top500, are equipped with a range of different processors, from multi-core CPUs to GPUs. Programming them can be a tough job, especially if we want to squeeze every last FLOP of performance out of them.
As a PhD student, I am currently on a brief research visit to the APT group, working on topics related to the programmability and efficient use of GPUs and many-core coprocessors. In particular, I am implementing a large database operation using OpenCL on these state-of-the-art systems. In this talk I will summarize my work in Manchester and discuss future work on this topic.
The lecture discusses manycore GPU architectures and programming using OpenMP and HOMP. It introduces OpenMP directives for offloading computation to accelerators and covers data mapping between the host and device. It also discusses HOMP for automated distribution of parallel loops and data across multiple accelerators to improve load balancing and performance. The document provides examples of using OpenMP target directives and data mapping for problems like AXPY and Jacobi iteration on a GPU. It evaluates performance of different loop scheduling algorithms in HOMP on a system with CPUs, GPUs and MICs.
The document summarizes a research project on multi-resolution data fusion using agent-based sensors. The project aims to develop collaborative signal processing techniques that are energy-aware, fault-tolerant, and progressively improve accuracy. Key accomplishments include developing mobile agent-based collaborative signal processing, energy-aware task scheduling algorithms, analytical battery modeling, and sensor deployment algorithms. The project has resulted in several publications and integrated some techniques successfully, while other integration efforts faced challenges.
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G... (Databricks)
In this talk, we evaluate training of deep recurrent neural networks with half-precision floats on Pascal and Volta GPUs. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow and CUDA-aware MPI to enable execution across multiple GPU nodes and making use of high-speed interconnects. We introduce a learning rate schedule facilitating neural network convergence at up to O(100) workers.
Strong scaling tests performed on GPU clusters show linear runtime scaling and logarithmic communication time scaling for both single and mixed precision training modes. Performance is evaluated on a scientific dataset taken from the Joint European Torus (JET) tokamak, containing multi-modal time series of sensory measurements leading up to deleterious events called plasma disruptions. Half-precision significantly reduces memory and network bandwidth, allowing training of state-of-the-art models with over 70 million trainable parameters while achieving a comparable test set performance as single precision.
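A learning-rate schedule of the kind described — one that keeps convergence stable at up to O(100) workers — is commonly built from the linear-scaling rule with a warmup ramp. The talk's exact schedule is not specified here; this sketch follows that common recipe, and all parameter values are illustrative:

```python
def scaled_lr(base_lr, workers, step, warmup_steps):
    """Linear-scaling rule: the target LR grows with the number of
    data-parallel workers (larger effective batch), and is reached via a
    linear warmup so early large-batch updates do not diverge."""
    target = base_lr * workers
    if step < warmup_steps:
        return target * (step + 1) / warmup_steps
    return target

# With 100 workers the target LR is 100x the base, reached over 500 steps.
lrs = [scaled_lr(0.001, workers=100, step=s, warmup_steps=500)
       for s in (0, 249, 499, 1000)]
print(lrs)
```

After warmup the schedule is flat in this sketch; in practice it is typically composed with a decay (step, cosine, etc.) appropriate to the model.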
This document discusses a lecture on hardware acceleration. It begins by providing background on Moore's law and how increasing transistor density led to issues with power consumption and thermal constraints. This motivated the evolution of specialized hardware acceleration to improve performance. The lecture then covers topics like coprocessors vs accelerators, common acceleration techniques, and examples of hardware acceleration. It also discusses challenges like debugging and coherency when designing accelerated systems.
Many Task Applications for Grids and Supercomputers (Ian Foster)
The document discusses how new supercomputing applications are increasingly focused on "logistical" issues like executing many communication-intensive tasks over large shared datasets, rather than "heroic" computations of a single task. It argues that new programming models and tools are needed to efficiently manage large numbers of tasks, complex data dependencies, and failures at extreme scales of petascale and exascale computers. Examples of applications that could benefit include parameter studies, ensemble simulations, data analysis, and scientific workflows involving millions of tasks.
Similar to Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and Effective Fortran Experience
The document describes a 5-day residency program hosted by the OpenPOWER Academic Discussion Group (ADG) at NIE Mysore from June 6-10, 2022. The program aims to bridge industry and academia knowledge in chip design by developing curriculum on OpenPOWER technology and training lab assistants. Engineers and academicians with 5+ years experience in chip design/verification are eligible to participate. They will collaborate on developing course materials and lab exercises to teach undergraduate students in fields like ECE and CSE. The program seeks to help fulfill India's goals in chip design manpower and self-reliance through initiatives like Make in India and the India Semiconductor Mission.
This document provides an overview of digital design and Verilog. It discusses binary numbers and boolean algebra as the foundation of digital systems. It also describes logic gates, combinational and sequential circuits, finite state machines, and datapath and control units. Finally, it introduces Verilog, describing different modeling types like gate level, behavioral, dataflow, and switch level modeling. It positions Verilog as a hardware description language used to more easily design digital circuits compared to manual drawing.
The Libre-SOC Project aims to create an entirely Libre-Licensed, transparently-developed fully auditable Hybrid 3D CPU-GPU-VPU, using the Supercomputer-class OpenPOWER ISA as the foundation.
Our first test ASIC is a 180nm "Fixed-Point" Power ISA v3.0B processor, 5.1mm x 5.9mm, as a proof-of-concept for the team, whose primary expertise is in Software Engineering. Software Engineering training brings a radically different approach to Hardware development: extensive unit tests, source code revision control, automated development tools are normal. Libre Project Management brings even more: bug trackers, mailing lists, auditable IRC logs and a wiki are standard fare for Libre Projects that are simply not normal Industry-Standard practice.
This talk therefore goes through the workflow, from the original HDL through to the GDS-II layout, showing how we were able to keep track of the development that led to the IMEC 180nm tape-out in July 2021. In particular, through a parallel development process involving "Real" and "Symbolic" Cell Libraries developed by Chips4Makers, we will show how our developers did not need to sign a Foundry NDA, but were still able to work side-by-side with a University that did. With this parallel development process, the University upheld their NDA obligations, and Libre-SOC was simultaneously able to honour its Transparency Objectives.
Workload Transformation and Innovations in POWER Architecture (Ganesan Narayanasamy)
The IT industry is going through two major transformations. The first is the adoption of AI and its tight integration into commercial applications and enterprise workflows. The second is the transformation of software architecture through concepts like microservices and cloud-native design. These transformations, alongside the aggressive adoption of IoT, mobile, and 5G in our day-to-day activities, are making the world operate in a more real-time manner, which opens up a new challenge: improving hardware architecture to meet these requirements. Together they push the boundary of the entire systems stack, making designers rethink hardware. This talk presents a picture of how the industry-leading enterprise POWER architecture is transforming to fulfill the performance demands of these newer-generation workloads, with a primary focus on on-chip AI acceleration.
Join us on Friday, July 16th, 2021 for our newest workshop with DoMS, IIT Roorkee: Concept to Solutions using the OpenPOWER Stack. It's time to discover advances in #DeepLearning tools and techniques from the world's leading innovators across industries and research, and from public speakers.
Register here:
https://lnkd.in/ggxMq2N
This presentation covers two use cases using OpenPOWER systems:
1. Diabetic retinopathy detection using AI on NVIDIA Jetson Nano: the objective is to classify the diabetic level solely from a retina image in a remote area with minimal doctor intervention. The model uses the VGG16 network architecture and is trained from scratch on POWER9. The trained model was deployed on the Jetson Nano board.
2. Classifying COVID positivity using lung X-ray images: the idea is to build ML models to detect positive cases from X-ray images. The model was trained on POWER9, and the application was developed in Python.
IBM Bayesian Optimization Accelerator (BOA) is a do-it-yourself toolkit to apply state-of-the-art Bayesian inferencing techniques and obtain optimal solutions for complex, real-world design simulations without requiring deep machine learning skills. This talk will describe IBM BOA, its differentiation and ease of use, and how researchers can take advantage of it for optimizing any arbitrary HPC simulation.
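At the core of any Bayesian-optimization step is an acquisition function computed over a probabilistic surrogate; expected improvement under a Gaussian posterior is the most common choice. The sketch below illustrates that general technique only — it is not IBM BOA's API, and the parameter values are made up for the example:

```python
import math

def expected_improvement(mu, sigma, best, xi=0.01):
    """Expected improvement (for maximization) of a candidate whose
    surrogate posterior is N(mu, sigma^2), relative to the best observed
    value so far; xi trades off exploration vs. exploitation."""
    if sigma == 0.0:
        return max(0.0, mu - best - xi)
    z = (mu - best - xi) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best - xi) * cdf + sigma * pdf

# A candidate predicted well above the incumbent scores higher than one
# predicted at the incumbent; the optimizer evaluates the simulation at
# whichever candidate maximizes this acquisition value.
hi = expected_improvement(mu=1.5, sigma=0.2, best=1.0)
lo = expected_improvement(mu=1.0, sigma=0.2, best=1.0)
print(hi > lo)  # True
```

The appeal for expensive HPC simulations is that each new run is chosen where the surrogate predicts the most information gain, rather than on a blind grid.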
This presentation covers the partners and collaborators currently working with the OpenPOWER Foundation, use cases of OpenPOWER systems in multiple industries, OpenPOWER workgroups, and OpenCAPI features.
The IBM POWER10 processor represents the 10th generation of the POWER family of enterprise computing engines. Its performance is a result of both powerful processing cores and high-bandwidth intra- and inter-chip interconnect. POWER10 systems can be configured with up to 16 processor chips and 1920 simultaneous threads of execution. Cross-system memory sharing, through the new Memory Inception technology, and 2 Petabytes of addressing space support an expansive memory system. The POWER10 processing core has been significantly enhanced over its POWER9 predecessor, including a doubling of vector units and the addition of an all-new matrix math engine. Throughput gains from POWER9 to POWER10 average 30% at the core level and three-fold at the socket level. Those gains can reach ten- or twenty-fold at the socket level for matrix-intensive computations.
Everything is changing, from healthcare to the automotive and financial markets to every type of engineering: products are no longer created by an individual or, at best, a team, but are developed and perfected using AI and hundreds of computers. Even AI is something we can no longer run on a single computer, no matter how powerful it is. What drives everything today is HPC, or high-performance computing, heavily linked to AI. In this session we will discuss AI, HPC, the IBM Power architecture, and how it can help develop better healthcare, better automobiles, better financials, and better everything that we run on them.
Macromolecular crystallography is an experimental technique for exploring the 3D atomic structure of proteins, used by academics for research in biology and by pharmaceutical companies in rational drug design. While development of the technique was until recently limited by the performance of scientific instruments, computing performance is now becoming a key limitation. In my presentation I will describe the computing challenge of handling the 18 GB/s data stream coming from the new X-ray detector. I will show PSI's experiences in applying conventional hardware to the task and why this attempt failed. I will then present how the IC 922 server with OpenCAPI-enabled FPGA boards allowed us to build a sustainable and scalable solution for high-speed data acquisition. Finally, I will give a perspective on how advancements in hardware development will enable better science for users of the Swiss Light Source.
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems (Ganesan Narayanasamy)
As the adoption of AI technologies increases and matures, the focus will shift from exploration to time to market, productivity, and integration with existing workflows. Governing enterprise data, scaling AI model development, and selecting a complete, collaborative hybrid platform and tools for rapid solution deployment are key focus areas for growing data scientist teams tasked with responding to business challenges. This talk will cover the challenges and innovations of AI at scale for industries such as healthcare and automotive, the AI ladder and AI lifecycle, and infrastructure architecture considerations.
This talk gives an introduction to healthcare use cases, the AI ladder and lifecycle, and AI-at-scale themes. The iterative nature of the workflow and some of the important components to be aware of in developing AI healthcare solutions are discussed, along with the different types of algorithms and when machine learning might be more appropriate than deep learning, or the other way around. Example use cases are also shared as part of this presentation.
Healthcare has become one of the most important aspects of everyone's life. Its importance has surged due to the latest outbreaks, and with this latest pandemic it has become mandatory to collaborate to improve everyone's healthcare as soon as possible.
IBM has reacted quickly, sharing not only its knowledge but also its artificial intelligence supercomputers all around the world.
Those supercomputers are helping to prevail over this outbreak, and future ones as well.
They have completely different features compared to proposals from other players in the supercomputer market.
We will take a quick look at the differences between those AI-focused supercomputers and how they can help in the R&D of healthcare solutions for everyone, from those with access to a big IBM AI supercomputer to those with access to only a single small IBM AI-focused server.
Moving object recognition (MOR) corresponds to the localization and classification of moving objects in videos. Discriminating moving objects from static objects and background in videos is an essential task for many computer vision applications. MOR has widespread applications in intelligent visual surveillance, intrusion detection, anomaly detection and monitoring, industrial site monitoring, detection-based tracking, autonomous vehicles, etc. In this session, Murari presented a poster on deep learning algorithms that identify both the locations and the corresponding categories of moving objects with a convolutional network, and discussed the challenges in developing such algorithms.
The document discusses AI in the enterprise, including use cases, infrastructure considerations, and the AI lifecycle. It provides examples of how AI can be applied in various industries and common patterns of analytics using AI. It also outlines the data science model development workflow and considerations for AI infrastructure, software, and data management throughout the AI lifecycle.
5th LF Energy Power Grid Model Meet-up Slides (DanBrown980551)
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join either through an online Microsoft Teams session or in person at TU/e, located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
- Insightful presentations covering two practical applications of the Power Grid Model.
- An update on the latest advancements in Power Grid Model technology during the first and second quarters of 2024.
- An interactive brainstorming session to discuss and propose new feature requests.
- An opportunity to connect with fellow Power Grid Model enthusiasts and users.
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application... (Alex Pruden)
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
Trusted Execution Environment for Decentralized Process Mining (LucaBarbaro3)
Presentation of the paper "Trusted Execution Environment for Decentralized Process Mining" given during the CAiSE 2024 Conference in Cyprus on June 7, 2024.
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your costs through an optimized configuration and keep them low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX model have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefits it brings you. Above all, you surely want to stay within your budget and save costs wherever possible. We understand this, and we would like to help!
We explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some practices that can lead to unnecessary spending, for example using a person document instead of a mail-in for shared mailboxes. We show you such cases and their solutions. And of course we explain the new licensing model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and know-how to keep track of what is going on. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
These topics will be covered
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Fueling AI with Great Data with Airbyte Webinar (Zilliz)
This talk will focus on how to collect data from a variety of sources, leverage this data for RAG and other GenAI use cases, and finally chart your course to production.
Digital Marketing Trends in 2024 | Guide for Staying Ahead (Wask)
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
FREE A4 Cyber Security Awareness Posters - Social Engineering part 3 (Data Hops)
Free A4 downloadable and printable cyber security and social engineering safety training posters ("Lock Them Out"). Promote security awareness in the home or workplace. From training provider datahops.com.
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe (Precisely)
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip, presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and Effective Fortran Experience
1. ORNL is managed by UT-Battelle
for the US Department of Energy
Targeting GPUs using OpenMP
Directives on Summit with
GenASiS: A Simple and Effective
Fortran Experience
Reuben D. Budiardja
Scientific Computing Group,
Oak Ridge Leadership Computing Facility,
Oak Ridge National Laboratory
Christian Cardall
Physics Division,
Oak Ridge National Laboratory
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak
Ridge National Laboratory, which is supported by the Office of Science of the U.S.
Department of Energy under Contract No. DE-AC05-00OR22725.
2. The Application
General Astrophysics Simulation System (GenASiS)
• Designed for parallel, large scale simulations
– weak-scales to ~100 thousand MPI processes
• Written entirely in modern Fortran (2003, 2008)
• Modular, object-oriented design, and extensible
• Multi-physics solvers:
– (Magneto)-hydrodynamics (HLL, HLLC solvers)
– Explicit 2nd order time-integration
– Self-gravity, polytropic & nuclear EoS
– Grey and spectral neutrino transport
• CPU only code with OpenMP for threading (prior to this work)
3. The Application
• Studied the role of fluid instabilities (convection and the Standing
Accretion Shock Instability, SASI) in supernova dynamics
• Discovered exponential magnetic field
amplification by the SASI in the progenitor star
→ origin of neutron star magnetic fields
• Refactored to three major subdivisions:
Basics, Mathematics, Physics → allowing
unit testing, ad-hoc/standalone tests,
mini-apps
4. Paths to Targeting GPU
• CUDA
– requires rewrite of all computational kernels
– loss of Fortran semantics (multi-d arrays, pointer/array remapping)
– requires interfacing with the rest of the (Fortran) code
• CUDA Fortran
– non-standard extension to Fortran (XL, PGI)
– cannot easily fall back to standard Fortran
• Directives (OpenMP)
– retain Fortran semantics
– OpenMP 4.5 has excellent support from IBM XL (Summit), CCE (Titan)
• with excellent support for modern Fortran
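As a minimal illustration (not from the slides) of why directives preserve Fortran semantics, a standard array-based loop can be offloaded with a single OpenMP 4.5 construct and still compiles as plain Fortran when OpenMP is disabled:

```fortran
subroutine AddArrays ( A, B, C )
  real, dimension ( : ), intent ( in )  :: A, B
  real, dimension ( : ), intent ( out ) :: C
  integer :: i
  ! assumed-shape arrays and bounds are untouched; only the directive is added
  !$OMP target teams distribute parallel do map ( to : A, B ) map ( from : C )
  do i = 1, size ( A )
    C ( i ) = A ( i ) + B ( i )
  end do
  !$OMP end target teams distribute parallel do
end subroutine AddArrays
```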
5. Lower-Level GenASiS Functionality
• Fortran wrappers to OpenMP APIs
– call AllocateDevice ( Value, D_Value ) → omp_target_alloc ( )
– call AssociateHost ( D_Value, Value ) → omp_target_associate_ptr ( )
– call UpdateDevice ( Value, D_Value ), call UpdateHost ( Value, D_Value ) → omp_target_memcpy ( )
• Value : Fortran array; D_Value : type ( c_ptr ), GPU address
• Affirmative control of data movement
• Persistent memory allocation on the device
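The wrapper bodies are not shown on the slide; a hedged sketch of how such a wrapper might look in OpenMP 4.5, where the device-memory routines are C-only and must be bound through iso_c_binding (names and details assumed, not from the slides):

```fortran
! Sketch of a GenASiS-style AllocateDevice wrapper (assumed implementation):
! bind the C-only OpenMP 4.5 device-memory API into Fortran.
module DeviceMemory
  use iso_c_binding
  implicit none
  interface
    type ( c_ptr ) function omp_target_alloc ( nBytes, Device ) &
                      bind ( c, name = 'omp_target_alloc' )
      import :: c_ptr, c_size_t, c_int
      integer ( c_size_t ), value :: nBytes
      integer ( c_int ), value :: Device
    end function omp_target_alloc
  end interface
contains
  subroutine AllocateDevice ( Value, D_Value )
    real ( c_double ), dimension ( : ), intent ( in ) :: Value
    type ( c_ptr ), intent ( out ) :: D_Value
    ! persistent device allocation sized to the host array; device 0 assumed
    D_Value = omp_target_alloc &
                ( size ( Value, kind = c_size_t ) * c_sizeof ( 0.0_c_double ), &
                  0_c_int )
  end subroutine AllocateDevice
end module DeviceMemory
```

In OpenMP 5.0 these routines gained official Fortran bindings in omp_lib, which simplifies wrappers like this.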
6. Higher-level GenASiS Functionality
• StorageForm :
– a class for data and metadata; the ‘heart’ of data storage facility in GenASiS
– metadata includes units, variable names (for I/O, visualization)
– used to group together a set of related physical variables (e.g. Fluid)
– render more generic and simplified code for I/O, ghost exchange,
prolongation & restriction (AMR mesh)
• Data: StorageForm % Value ( nCells, nVariables )
• Methods:
– call StorageForm % Initialize ( ) ← allocate data on host
– call StorageForm % AllocateDevice ( ) ← allocate data on GPU
– call StorageForm % Update{Device,Host} ( ) ← transfer data
7. Sidebar: OpenMP Memory for Offload
• OpenMP maps host (CPU) variables to device (GPU) (explicitly or
implicitly)
– default copy to-from
• Presence check: if it fails, a new variable is created on the device
– sometimes requires explicit association to avoid unintentional data
movement
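An illustrative sketch (not from the slides) of the mapping rules above: an enclosing `target data` region maps the array once, so the presence check inside the loop finds it already on the device and no per-step copy occurs.

```fortran
! V is mapped once by 'target data'; the inner 'target' regions reuse the
! device copy instead of triggering the default to-from copy each step.
real, dimension ( nCells ) :: V
integer :: i, iStep

!$OMP target data map ( tofrom : V )
do iStep = 1, nSteps
  !$OMP target teams distribute parallel do
  do i = 1, nCells
    V ( i ) = 0.5 * V ( i )
  end do
  !$OMP end target teams distribute parallel do
end do
!$OMP end target data
```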
8. Offloading Computational Kernel
call F % Initialize ( [ nCells, nVariables ] )
call F % AllocateDevice ( )
call F % UpdateDevice ( )
call AddKernel &
       ( F % Value ( :, 1 ), F % Value ( :, 2 ), &
         F % D_Value ( 1 ), F % D_Value ( 2 ), F % D_Value ( 3 ), &
         F % Value ( :, 3 ) )
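A plausible body for AddKernel (assumed; the slides show only the call site). With AssociateHost having already bound the host arrays to their device addresses, the implicit presence check finds the data on the GPU and the target region performs no copies:

```fortran
subroutine AddKernel ( A, B, D_A, D_B, D_C, C )
  use iso_c_binding
  integer, parameter :: KDR = kind ( 1.0d0 )  ! GenASiS real kind (assumed)
  real ( KDR ), dimension ( : ), intent ( in ) :: A, B
  type ( c_ptr ), intent ( in ) :: D_A, D_B, D_C  ! device addresses, for explicit control
  real ( KDR ), dimension ( : ), intent ( out ) :: C
  integer :: i
  ! presence check succeeds for A, B, C, so no implicit data movement
  !$OMP target teams distribute parallel do
  do i = 1, size ( C )
    C ( i ) = A ( i ) + B ( i )
  end do
  !$OMP end target teams distribute parallel do
end subroutine AddKernel
```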
9. Example of Kernel with Pointer Remapping
real ( KDR ), dimension ( :, :, : ), pointer :: V, dV
V ( -1:nX+2, -1:nY+2, -1:nZ+2 ) => F % Value ( :, iV )
dV ( -1:nX+2, -1:nY+2, -1:nZ+2 ) => dF % Value ( :, iV )
call ComputeDifferences_X ( V, F % D_Value ( iV ), ... )
19. Beyond OpenMP: Using Pinned Memory
To optimize data transfers:
• pinned memory: page-locked host memory allocated using
cudaMallocHost() or cudaHostAlloc()
• created another Fortran wrapper in GenASiS, used in StorageForm
initialization method as an option
StorageForm % Initialize ( …, PinnedOption = .true.)
• No mechanism to do this with OpenMP 4.5 (but perhaps in 5.x)
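The slides name the option but not the wrapper itself; a hedged sketch of how such a binding might look, calling cudaMallocHost through iso_c_binding so an initializer could allocate page-locked host memory when PinnedOption = .true. (module and routine names assumed):

```fortran
module PinnedMemory
  use iso_c_binding
  implicit none
  interface
    ! cudaError_t cudaMallocHost ( void **ptr, size_t size )
    integer ( c_int ) function cudaMallocHost ( HostPtr, nBytes ) &
                        bind ( c, name = 'cudaMallocHost' )
      import :: c_int, c_ptr, c_size_t
      type ( c_ptr ) :: HostPtr            ! receives the pinned allocation
      integer ( c_size_t ), value :: nBytes
    end function cudaMallocHost
  end interface
contains
  subroutine AllocatePinned ( Value, nValues )
    real ( c_double ), dimension ( : ), pointer, intent ( out ) :: Value
    integer, intent ( in ) :: nValues
    type ( c_ptr ) :: P
    integer ( c_int ) :: Error
    Error = cudaMallocHost &
              ( P, int ( nValues, c_size_t ) * c_sizeof ( 0.0_c_double ) )
    call c_f_pointer ( P, Value, [ nValues ] )  ! view pinned buffer as a Fortran array
  end subroutine AllocatePinned
end module PinnedMemory
```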
20. Performance Results: Using Pinned Memory
Speedups of 1.7 - 2.0X when pinned memory is used → overall speedups of over 9X relative to 7 CPU threads
21. Implications (Then and Now)
• c. 2010: 1024³ cells with 64,000 processors (Jaguar), ~3 s per timestep
• Now: 64 GPUs (11 nodes) on Summit, ~1.2 s per timestep
• Enable us to do higher-fidelity simulations, ensemble studies for
trends in observables
– plan to perform ~200 2D grey transport supernova simulations, tens of 3D
grey transport, and a handful of 3D spectral transport simulations
• First step towards full Boltzmann radiation transport (6-D problem +
time) with exascale computing
22. Remaining Issues and Future Work
• A single code-base with OpenMP for multi-threading and offload
– falling back to multi-threading with the target if-clause is problematic
– the teams distribute directive introduces deleterious effects for multi-threading
• Kernel-launch parallelism and CUDA streams
– no mechanism within OpenMP to select or otherwise affect CUDA streams
• Better compilers support for OpenMP 4.5 - 5.x
• Using CUDA-aware MPI for GPU Direct
– benefit (vs. manual staging on host) depends on message sizes
23. Conclusion
• Using OpenMP allows for a simple and effective porting of Fortran
code to target GPU
– 2 - 3 months of effort for this project