This document discusses the challenges of ensuring resilience in exascale computing. It notes that hardware will fail more frequently at exascale due to factors like more transistors, near-threshold logic, radiation effects, and manufacturing variations. Current systems rely on checkpointing and restarting after failures, but this approach will not work at exascale due to the anticipated increase in failure rates. The document recommends studying current failure rates systematically and investigating approaches like "RAID" techniques for faster in-memory checkpointing to enable resilience at exascale.
Resilience at exascale
1. Resilience at Exascale
Marc Snir
Director, Mathematics and Computer Science Division, Argonne National Laboratory
Professor, Dept. of Computer Science, UIUC
2. The Problem
• "Resilience is the black swan of the exascale program"
• DOE & DoD commissioned several reports
– Inter-Agency Workshop on HPC Resilience at Extreme Scale, http://institute.lanl.gov/resilience/docs/Inter-AgencyResilienceReport.pdf (Feb 2012)
– U.S. Department of Energy Fault Management Workshop, http://shadow.dyndns.info/publications/geist12department.pdf (June 2012)
– ICIS workshop on "addressing failures in exascale computing" (report forthcoming)
• Talk:
– Discuss the little we understand
– Discuss how we could learn more
– Seek your feedback on the report's content
4. Failures Today
• Latest large systematic study (Schroeder & Gibson) published in 2005 – quite obsolete
• Failures per node per year: 2-20
• Root cause of failures
– Hardware (30%-60%), SW (5%-25%), Unknown (20%-30%)
• Mean-Time to Repair: 3-24 hours
• Current anecdotal numbers:
– Application crash: > 1 day
– Global system crash: > 1 week
– Application checkpoint/restart: 15-20 minutes (checkpoint often "free")
– System restart: > 1 hour
– Soft HW errors suspected
5. Recommendation
• Study in a systematic manner current failure types and rates
– Using error logs at major DOE labs
– Pushing for a common ontology
• Spend time hunting for soft HW errors
• Study causes of errors (?)
– Most HW faults (>99%) have no effect (processors are hugely inefficient?)
– The common effect of soft HW bit flips is SW failure
– HW faults can (often?) be due to design bugs
– Vendors are cagey
• Time for a consortium effort?
6. Current Error-Handling
• Application: global checkpoint & global restart
• System:
– Repair persistent state (file system, databases)
– Clean slate restart for everything else
• Quick analysis:
– Assume failures have a Poisson distribution; checkpoint time C = 1; recover+restart time = R; MTBF = M
– Optimal checkpoint interval τ satisfies e^(−τ/M) = (M − τ + 1)/M
– System utilization U is U = (M − R − τ + 1)/M
8. Utilization as Function of MTBF and R
• E.g.
– C = 15 mins (= 1)
– R = 1 hour (= 4)
– MTBF = 25 hours (= 100)
• U ≈ 80% (a numerical check is sketched below)
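The numbers on this slide follow directly from the slide-6 formulas. The short Python sketch below (added for illustration; it is not part of the deck) solves e^(−τ/M) = (M − τ + 1)/M for τ by bisection and evaluates U = (M − R − τ + 1)/M with C = 1 (15 minutes), R = 4, and M = 100. It gives τ of roughly 14-15 checkpoint units (about 3.6 hours) and U in the low 80% range, in line with the slide's ≈80%.

```python
# Minimal sketch (added for illustration; not from the original slides):
# solve the slide-6 checkpoint model for the optimal interval tau and the
# resulting utilization U. One time unit = the checkpoint time C (15 min).
import math

def optimal_tau(M, iters=100):
    """Solve e^(-tau/M) = (M - tau + 1)/M for tau by bisection.

    f(tau) = exp(-tau/M) - (M - tau + 1)/M is increasing in tau, negative at
    tau = 0 and positive at tau = M, so bisection on [0, M] finds the root.
    """
    def f(tau):
        return math.exp(-tau / M) - (M - tau + 1) / M

    lo, hi = 0.0, M
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def utilization(M, R, tau):
    """U = (M - R - tau + 1)/M, as on slide 6."""
    return (M - R - tau + 1) / M

if __name__ == "__main__":
    M, R = 100.0, 4.0   # MTBF = 25 h, repair = 1 h, in units of C = 15 min
    tau = optimal_tau(M)
    print(f"optimal tau ~= {tau:.1f} units ({tau * 15:.0f} min)")
    print(f"utilization U ~= {utilization(M, R, tau):.2f}")  # roughly 0.8
```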
9. Projecting Ahead
• "Comfort zone:"
– Checkpoint time < 1% MTBF
– Repair time < 5% MTBF
• Assume MTBF = 1 hour
– Checkpoint time ≈ 0.5 minute
– Repair time ≈ 3 minutes
• Is this doable?
– Yes, if done in memory
– E.g., using "RAID" techniques
10. Exascale Design Point

Systems | 2012 (BG/Q Computer) | 2020-2024 | Difference (Today & 2019)
System peak | 20 Pflop/s | 1 Eflop/s | O(100)
Power | 8.6 MW | ~20 MW |
System memory | 1.6 PB (16*96*1024) | 32-64 PB | O(10)
Node performance | 205 GF/s (16*1.6GHz*8) | 1.2 or 15 TF/s | O(10) – O(100)
Node memory BW | 42.6 GB/s | 2-4 TB/s | O(1000)
Node concurrency | 64 Threads | O(1k) or 10k | O(100) – O(1000)
Total node interconnect BW | 20 GB/s | 200-400 GB/s | O(10)
System size (nodes) | 98,304 (96*1024) | O(100,000) or O(1M) | O(100) – O(1000)
Total concurrency | 5.97 M | O(billion) | O(1,000)
MTTI | 4 days | O(<1 day) | -O(10)

Both price and power envelopes may be too aggressive!
11. Time to checkpoint
• Assume "RAID 5" across memories, checkpoint size ~50% of memory size (a toy parity sketch follows this slide)
– Checkpoint time ≈ time to transfer 50% of memory to another node: few seconds!
– Memory overhead ~50%
– Energy overhead small with NVRAM
– (But need many write cycles)
– We are in the comfort zone (re checkpoint)
• How about recovery?
• Time to restart application same as time to checkpoint
• Problem: time to recover if the system failed
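The "RAID" idea above can be made concrete with a toy sketch (my illustration, not code from the deck): each node's checkpoint stays in memory within a small group of peers, and one XOR parity block per group lets the survivors rebuild the checkpoint of any single failed node. Group size and names here are arbitrary.

```python
# Illustrative sketch (not from the deck): RAID-5-style parity for in-memory
# checkpoints. A group of N peers each holds its own checkpoint; one parity
# block (the XOR of all checkpoints) lets the group rebuild the checkpoint of
# a single failed node from the N-1 survivors plus the parity.
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda acc, blk: [a ^ b for a, b in zip(acc, blk)],
                        blocks, bytearray(len(blocks[0]))))

def make_parity(checkpoints):
    """Parity block protecting a group of node checkpoints."""
    return xor_blocks(checkpoints)

def rebuild(lost_index, checkpoints, parity):
    """Recover the checkpoint of the failed node from survivors + parity."""
    survivors = [c for i, c in enumerate(checkpoints) if i != lost_index]
    return xor_blocks(survivors + [parity])

if __name__ == "__main__":
    # Toy 4-node group with 8-byte "checkpoints".
    ckpts = [bytes([i] * 8) for i in range(4)]
    parity = make_parity(ckpts)
    assert rebuild(2, ckpts, parity) == ckpts[2]   # node 2 "failed"
    print("recovered node 2 checkpoint:", rebuild(2, ckpts, parity).hex())
```

With the slide-10 interconnect figures (200-400 GB/s per node), shipping roughly half of node memory to a partner or parity holder is what keeps the in-memory checkpoint time to a few seconds, as claimed above.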
12. Key Assumptions:
1. Errors are detected (before the erroneous checkpoint is committed)
2. System failures are rare (much less than one per day), or recovery from them is very fast (minutes)
13. Hardware Will Fail More Frequently
• More, smaller transistors
• Near-threshold logic
☛ More frequent bit upsets due to radiation
☛ More frequent multiple-bit upsets due to radiation
☛ Larger manufacturing variation
☛ Faster aging
14. Hardware Error Detection: Assumptions

Table 2: Summary of assumptions on the components of a 45nm node and estimates of scaling to 11nm

Component | 45nm | 11nm
Cores | 8 | 128
Scattered latches per core | 200,000 | 200,000
Scattered latches in uncore, relative to cores | 1.25/√ncores = 0.44 | 1.25/√ncores = 0.11
FIT per latch | 10^-1 | 10^-1
Arrays per core (MB) | 1 | 1
FIT per SRAM cell | 10^-4 | 10^-4
Logic FIT / latch FIT | 0.1 – 0.5 | 0.1 – 0.5
DRAM FIT (per node) | 50 | 50
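To show how the FIT entries in Table 2 combine (1 FIT = 1 failure per 10^9 device-hours), here is a naive aggregation sketch. It is my construction, not the deck's analysis: it simply sums raw per-node FIT contributions from core and uncore latches, combinational logic, SRAM arrays, and DRAM, ignoring ECC, parity, and derating, so it estimates a raw fault rate rather than the undetected-error rate discussed on the next slide.

```python
# Naive aggregation of the raw FIT numbers in Table 2 (1 FIT = 1 failure per
# 10^9 device-hours). Ignores ECC, parity, and derating, so this is a raw
# upset-rate estimate, NOT the deck's detected/undetected-error analysis.

MB_BITS = 8 * 2**20  # bits per MB

def node_fit(cores, latches_per_core, uncore_factor, fit_latch,
             sram_mb_per_core, fit_sram_cell, logic_vs_latch, fit_dram):
    latch_fit = cores * latches_per_core * fit_latch        # core latches
    uncore_fit = uncore_factor * latch_fit                  # uncore latches
    logic_fit = logic_vs_latch * latch_fit                  # combinational logic
    sram_fit = cores * sram_mb_per_core * MB_BITS * fit_sram_cell
    return latch_fit + uncore_fit + logic_fit + sram_fit + fit_dram

if __name__ == "__main__":
    # logic_vs_latch = 0.5 takes the pessimistic end of the 0.1-0.5 range.
    for label, cores, uncore in (("45nm", 8, 0.44), ("11nm", 128, 0.11)):
        fit = node_fit(cores, 200_000, uncore, 1e-1, 1, 1e-4, 0.5, 50)
        mtbf_hours = 1e9 / fit
        print(f"{label}: ~{fit:.2e} raw FIT per node "
              f"-> raw MTBF ~ {mtbf_hours:.0f} h per node")
```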
16. Summary of (Rough) Analysis
• If no new technology is deployed, can have up to one undetected error per hour
• With additional circuitry, could get down to one undetected error per 100-1,000 hours (week – months)
– The cost could be as high as 20% additional circuits and 25% additional power
– Main problems are in combinatorial logic and latches
– The price could be significantly higher (small market for high-availability servers)
• Need software error detection as an option!
17. Application-Level Data Error Detection
• Check the checkpoint for correctness before it is committed.
– Look for outliers (assume smooth fields) – handle high-order bit errors (a minimal check is sketched below)
– Use robust solvers – handle low-order bit errors
• May not work for discrete problems, particle simulations, discontinuous phenomena, etc.
• May not work if the outlier is rapidly smoothed (error propagation)
– Check for global invariants (e.g., energy preservation)
• Ignore error
• Duplicate computation of critical variables
• Can we reduce checkpoint size (or avoid checkpoints altogether)?
– Tradeoff: how much is saved, how often data is checked (big/small transactions)
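To make the outlier and invariant checks concrete, here is a minimal NumPy sketch (illustrative only, not code from the deck; the window size, the 8-sigma threshold, and the "energy" invariant are arbitrary choices): it flags grid points that deviate from their local neighborhood by many standard deviations, the typical signature of a high-order bit flip in a smooth field, and it compares a global invariant against the previous checkpoint.

```python
# Illustrative sketch (not from the deck): sanity-check a checkpointed field
# before committing it. Assumes a smooth 2-D field; thresholds are arbitrary.
import numpy as np

def has_outliers(field, window=1, nsigma=8.0):
    """Flag points that differ from their local neighborhood mean by many
    standard deviations -- typical signature of a high-order bit flip."""
    padded = np.pad(field, window, mode="edge")
    neigh = np.zeros_like(field)
    count = 0
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            if dy == 0 and dx == 0:
                continue
            neigh += padded[window + dy : window + dy + field.shape[0],
                            window + dx : window + dx + field.shape[1]]
            count += 1
    neigh /= count
    resid = field - neigh
    return bool(np.any(np.abs(resid) > nsigma * (resid.std() + 1e-30)))

def invariant_drift(field, prev_total, rel_tol=1e-6):
    """Check a global invariant (here: the field's 'total energy')."""
    return abs(field.sum() - prev_total) > rel_tol * abs(prev_total)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    field = np.sin(np.linspace(0, 1, 256))[:, None] * np.ones((256, 256))
    field += 1e-3 * rng.standard_normal(field.shape)
    total = field.sum()
    field_bad = field.copy()
    field_bad[100, 100] = 1e6                      # simulated high-order bit flip
    print(has_outliers(field), has_outliers(field_bad))   # expect: False, True
    print(invariant_drift(field_bad, total))               # expect: True
```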
18. How About Control Errors?
• Code is corrupted, jump to wrong address, etc.
– Rarer (more data state than control state)
– More likely to cause a fatal exception
– Easier to protect against
19. How About System Failures (Due to Hardware)?
• Kernel data corrupted
– Page tables, routing tables, file system metadata, etc.
• Hard to understand and diagnose
– Break abstraction layers
• Makes sense to avoid such failures
– Duplicate system computations that affect kernel state (price can be tolerated for HPC)
– Use transactional methods
– Harden kernel data
• Highly reliable DRAM
• Remotely accessible NVRAM
– Predict and avoid failures
20. Failure Prediction from Event Logs
Use a combination of signal analysis (to identify outliers) and data mining (to find correlations).
Gainaru, Cappello, Snir, Kramer (SC12)
[Fig. 2: Methodology overview of the hybrid approach]
21. Can Hardware Failure Be Predicted?

Prediction method | Precision | Recall | Seq used | Pred failures
ELSA hybrid | 91.2% | 45.8% | 62 (96.8%) | 603
ELSA signal | 88.1% | 40.5% | 117 (92.8%) | 534
Data mining | 91.9% | 15.7% | 39 (95.1%) | 207

The metrics used for evaluating prediction performance are precision and recall:
• Precision is the fraction of failure predictions that turn out to be correct.
• Recall is the fraction of failures that are predicted. (A small worked example follows below.)

Table: Percentage waste improvement in checkpointing strategies
C (checkpoint time) | Precision (%) | Recall (%) | MTTF for the whole system | Waste gain
1 min | 92 | 20 | one day | 9.13%
1 min | 92 | 36 | one day | 17.33%
10 s | 92 | 36 | one day | 12.09%
10 s | 92 | 45 | one day | 15.63%
1 min | 92 | 50 | 5 h | 21.74%
10 s | 92 | 65 | 5 h | 24.78%

Migrating processes when node failure is predicted can significantly improve utilization.
22. How About Software Bugs?
• Parallel code (transformational, mostly deterministic) vs. concurrent code (event-driven, very nondeterministic)
– Hard to understand concurrent code (cannot comprehend more than 10 concurrent actors)
– Hard to avoid performance bugs (e.g., overcommitted resources causing time-outs)
– Hard to test for performance bugs
• Concurrency performance bugs (e.g., in parallel file systems) are a major source of failures on current supercomputers
– Problem will worsen as we scale up – performance bugs become more frequent
• Need to become better at avoiding performance bugs (learn from control theory and real-time systems)
– Make system code more deterministic
23. System Recovery
• Local failures (e.g., node kernel crashed) are not a major problem -- can replace
• Global failures (e.g., global file system crashed) are the hardest problem -- need to avoid
• Issue: fault containment
– Local hardware failure corrupts global state
– Localized hardware error causes global performance bugs
24. Quis Custodiet Ipsos Custodes?
• Who watches the watchmen?
• Need robust (scalable, fault-tolerant) infrastructure for error reporting and recovery orchestration
– Current approach of out-of-band monitoring and control is too restricted
25. Summary
• Resilience is a major problem in the exascale era
• Major todos:
– Need to understand much better the failures in current systems and future trends
– Need to develop good application error detection methodology; error correction is desirable, but less critical
– Need to significantly reduce system errors and reduce system recovery time
– Need to develop robust infrastructure for error handling