The Barcelona Supercomputing Center (BSC) was established in 2005 and hosts MareNostrum, one of the most powerful supercomputers in Spain. We are the pioneering supercomputing center in Spain. Our specialty is high-performance computing (HPC), and our mission is twofold: to provide supercomputing infrastructure and services to Spanish and European scientists, and to generate knowledge and technology for transfer to society. We are a Severo Ochoa Center of Excellence, a first-level member of the European research infrastructure PRACE (Partnership for Advanced Computing in Europe), and we manage the Spanish Supercomputing Network (RES). As a research center, we have more than 456 experts from 45 countries, organized into four major research areas: computer sciences, life sciences, earth sciences, and computational applications in science and engineering.
In this deck from the Perth HPC Conference, Rob Farber from TechEnablement presents: AI is Impacting HPC Everywhere.
"The convergence of AI and HPC has created a fertile venue that is ripe for imaginative researchers — versed in AI technology — to make a big impact in a variety of scientific fields. From new hardware to new computational approaches, the true impact of deep- and machine learning on HPC is, in a word, “everywhere”. Just as technology changes in the personal computer market brought about a revolution in the design and implementation of the systems and algorithms used in high performance computing (HPC), so are recent technology changes in machine learning bringing about an AI revolution in the HPC community. Expect new HPC analytic techniques including the use of GANs (Generative Adversarial Networks) in physics-based modeling and simulation, as well as reduced precision math libraries such as NLAFET and HiCMA to revolutionize many fields of research. Other benefits of the convergence of AI and HPC include the physical instantiation of data flow architectures in FPGAs and ASICs, plus the development of powerful data analytic services."
Learn more: http://www.techenablement.com/
and
http://hpcadvisorycouncil.com/events/2019/australia-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the HPC User Forum at Argonne, Andrew Siegel from Argonne presents: ECP Application Development.
"The Exascale Computing Project is accelerating delivery of a capable exascale computing ecosystem for breakthroughs in scientific discovery, energy assurance, economic competitiveness, and national security. ECP is chartered with accelerating delivery of a capable exascale computing ecosystem to provide breakthrough modeling and simulation solutions to address the most critical challenges in scientific discovery, energy assurance, economic competitiveness, and national security. This role goes far beyond the limited scope of a physical computing system. ECP’s work encompasses the development of an entire exascale ecosystem: applications, system software, hardware technologies and architectures, along with critical workforce development."
Watch the video: https://wp.me/p3RLHQ-kSL
Learn more: https://www.exascaleproject.org
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exascale Performance - inside-BigData.com
In this video from Linaro Connect 2019, Satoshi Matsuoka from Riken presents: A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exascale Performance.
"Fugaku is the flagship next generation national supercomputer being developed by Riken R-CCS and Fujitsu in collaboration. Fugaku will have hyperscale datacenter class resource in a single exascale machine, with more than 150,000 nodes of sever-class Fujitsu A64fx many-core Arm CPUs with the new SVE (Scalable Vector Extension) with low precision math for the first time in the world, accelerating both HPC and AI workloads, augmented with HBM2 memory paired with each CPU, exhibiting nearly a Terabyte/s memory bandwidth for both HPC and AI rapid data movements."
Watch the video: https://wp.me/p3RLHQ-kYn
Learn more: https://postk-web.r-ccs.riken.jp/
and
https://connect.linaro.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI - inside-BigData.com
Satoshi Matsuoka from RIKEN gave this talk at the HPC User Forum in Santa Fe.
"With rapid rise and increase of Big Data and AI as a new breed of high-performance workloads on supercomputers, we need to accommodate them at scale, and thus the need for R&D for HW and SW Infrastructures where traditional simulation-based HPC and BD/AI would converge, in a BYTES-oriented fashion. Post-K is the flagship next generation national supercomputer being developed by Riken and Fujitsu in collaboration. Post-K will have hyperscale class resource in one exascale machine, with well more than 100,000 nodes of sever-class A64fx many-core Arm CPUs, realized through extensive co-design process involving the entire Japanese HPC community.
Rather than to focus on double precision flops that are of lesser utility, rather Post-K, especially its Arm64fx processor and the Tofu-D network is designed to sustain extreme bandwidth on realistic applications including those for oil and gas, such as seismic wave propagation, CFD, as well as structural codes, besting its rivals by several factors in measured performance. Post-K is slated to perform 100 times faster on some key applications c.f. its predecessor, the K-Computer, but also will likely to be the premier big data and AI/ML infrastructure. Currently, we are conducting research to scale deep learning to more than 100,000 nodes on Post-K, where we would obtain near top GPU-class performance on each node."
Watch the video: https://wp.me/p3RLHQ-k6G
Learn more: https://en.wikichip.org/wiki/supercomputers/post-k
and
http://hpcuserforum.com
RISC-V and OpenPOWER open-ISA and open-HW - a Swiss army knife for HPC - Ganesan Narayanasamy
To cope with Moore’s law running out of steam and the end of Dennard scaling, the world of High-Performance Computing is rapidly evolving toward high-throughput architectures with specialized hardware for vector and tensor operations, in conjunction with sophisticated power management subsystems. The RISC-V ISA and open hardware can prove as effective in fostering innovation in the HPC market as they have in the embedded one. In this talk, I will introduce a set of building blocks for future HPC systems that we have been designing at ETH Zurich and the University of Bologna.
In this deck from the Argonne Training Program on Extreme-Scale Computing 2019, Howard Pritchard from LANL and Simon Hammond from Sandia present: NNSA Explorations: ARM for Supercomputing.
"The Arm-based Astra system at Sandia will be used by the National Nuclear Security Administration (NNSA) to run advanced modeling and simulation workloads for addressing areas such as national security, energy and science.
"By introducing Arm processors with the HPE Apollo 70, a purpose-built HPC architecture, we are bringing powerful elements, like optimal memory performance and greater density, to supercomputers that existing technologies in the market cannot match,” said Mike Vildibill, vice president, Advanced Technologies Group, HPE. “Sandia National Laboratories has been an active partner in leveraging our Arm-based platform since its early design, and featuring it in the deployment of the world’s largest Arm-based supercomputer, is a strategic investment for the DOE and the industry as a whole as we race toward achieving exascale computing.”
Watch the video: https://wp.me/p3RLHQ-l29
Learn more: https://insidehpc.com/2018/06/arm-goes-big-hpe-builds-petaflop-supercomputer-sandia/
and
https://extremecomputingtraining.anl.gov/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This slide deck gives a detailed view of the hardware architecture of the Summit supercomputer, including its CPUs, GPUs, interconnect networks, and the applications it runs.
In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com
and
http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Lightweight DNN Processor Design (based on NVDLA) - Shien-Chun Luo
https://sites.google.com/view/itri-icl-dla/
(Public Information Share) This is our lightweight DNN inference processor presentation, including a system solution (from Caffe prototxt to HW control files), hardware features, and an example of object detection (Tiny YOLO) RTL simulation results. We modified the open-source NVDLA (small configuration) and developed a RISC-V MCU for this accelerator system.
IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data... - IBM Research
IBM and the Netherlands Institute for Radio Astronomy ASTRON have unveiled the world’s first water-cooled 64-bit microserver. The prototype, which is roughly the size of a smartphone, is part of the proposed IT roadmap for the Square Kilometre Array (SKA), an international consortium to build the world’s largest and most sensitive radio telescope. Scientists estimate that the processing power required to operate the telescope will be equal to several millions of today’s fastest computers.
The microserver team has designed and demonstrated a prototype 64-bit microserver using a PowerPC-based chip from Freescale Semiconductor running Fedora Linux and IBM DB2. At 133 × 55 mm², the microserver contains all of the essential functions of today’s servers, which are 4 to 10 times larger in size.
Not only is the microserver compact, it is also very energy-efficient. One of its innovations is hot-water cooling, which, in addition to keeping the chip operating temperature below 85 degrees C, will also transport electrical power by means of a copper plate. The concept is based on the same technology IBM developed for the SuperMUC supercomputer located outside of Munich, Germany. IBM scientists hope to keep each microserver operating between 35 and 40 watts including the system on a chip (SoC); the current design is 60 watts.
The next step for the scientists is to take 128 of the microserver boards, using the newest T4240 chips, and create a 2U rack unit with 1536 cores and 3072 threads and up to 6 terabytes of DRAM. In addition, they will be adding an Ethernet switch and a power module to the integrated water cooling.
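As a quick sanity check, the rack-level figures follow directly from the T4240's published core counts (12 cores and 24 threads per chip); the per-board DRAM split below is an inference for illustration, not a figure from IBM.

```python
# Sanity-check the proposed 2U rack unit's headline numbers,
# assuming one T4240 chip per microserver board.
boards = 128
cores_per_chip = 12       # T4240: 12 dual-threaded e6500 cores
threads_per_chip = 24

total_cores = boards * cores_per_chip      # 128 * 12
total_threads = boards * threads_per_chip  # 128 * 24

dram_total_tb = 6
dram_per_board_gb = dram_total_tb * 1024 / boards  # inferred split

print(total_cores, total_threads, dram_per_board_gb)
# 1536 3072 48.0
```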
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC - inside-BigData.com
In this deck from the Perth HPC Conference, Werner Scholz from XENON Systems presents: Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC.
"A decade ago, 100 watts per CPU was devastating to thermal design. Today, Intel’s highest performing CPUs (e.g. Intel Cascade Lake-AP 9282 processor) have a thermal design envelope of 400 watts. There really is no end in sight, and accommodating more power is critical to advancing performance. The ability to dissipate the resulting heat is the hard ceiling that systems face in terms of performance – giving greater importance to liquid cooling breakthroughs. With liquid cooling, less energy is expended to cool systems – a significant savings in HPC deployments with arrays of servers drawing energy and generating heat. Electrical current drives the CPU and enables it to function. This electrical power is converted into thermal energy (heat). To maintain a stable temperature, the CPU needs to be cooled by efficiently removing this heat and releasing it. Liquid cooling is the best way to cool a system because liquid transfers heat much more efficiently than air. From an environmental perspective, liquid cooling reduces both those characteristics to create a smarter and more ecological approach on a grand scale. The cascade of value continues, as ambient heat removed from systems can then be used to heat buildings and augment or replace traditional heating systems. It’s an intelligent approach to thermal management, distributing the economic value of reduced energy use and transforming heat into an enterprise asset."
Watch the video: https://wp.me/p3RLHQ-kZa
Learn more: https://www.xenon.com.au/
and
http://hpcadvisorycouncil.com/events/2019/australia-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Yuichiro Ajima from Fujitsu presents: The Tofu Interconnect D.
"Through the development of post-K, which will be equipped with this CPU, Fujitsu will contribute to the resolution of social and scientific issues in such computer simulation fields as cutting-edge research, health and longevity, disaster prevention and mitigation, energy, as well as manufacturing, while enhancing industrial competitiveness and contributing to the creation of Society 5.0 by promoting applications in big data and AI fields."
Learn more: https://insidehpc.com/2018/08/fujitsu-unveils-details-post-k-supercomputer-processor-powered-arm/
and
http://www.fujitsu.com/jp/solutions/business-technology/tc/catalog/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/arm/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit-iodice
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Gian Marco Iodice, Software Engineer at ARM, presents the "Using SGEMM and FFTs to Accelerate Deep Learning" tutorial at the May 2016 Embedded Vision Summit.
Matrix Multiplication and the Fast Fourier Transform are numerical foundation stones for a wide range of scientific algorithms. With the emergence of deep learning, they are becoming even more important, particularly as use cases extend into mobile and embedded devices. In this presentation, Iodice discusses and analyzes how these two key, computationally intensive algorithms can be used to gain significant performance improvements in convolutional neural network (CNN) implementations.
After a brief introduction to the nature of CNN computations, Iodice explores the use of GEMM (General Matrix Multiplication) and mixed-radix FFTs to accelerate 3D convolution. He shows examples of OpenCL implementations of these functions and highlights their advantages, limitations and trade-offs. Central to the techniques explored is an emphasis on cache-efficient memory accesses and the crucial role of reduced-precision data types.
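The GEMM approach Iodice describes can be sketched in a few lines of NumPy: unroll the input into an "im2col" matrix so that convolution with every kernel becomes a single matrix multiply (shapes and names here are illustrative, not taken from the talk, and this computes cross-correlation, the usual deep learning convention).

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll every kh x kw patch of a single-channel image into a column."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

def conv2d_gemm(x, kernels):
    """'Valid' 2-D convolution of one image with a stack of kernels,
    computed as a single GEMM over the im2col matrix."""
    n, kh, kw = kernels.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    cols = im2col(x, kh, kw)              # (kh*kw, oh*ow)
    w = kernels.reshape(n, kh * kw)       # (n, kh*kw)
    return (w @ cols).reshape(n, oh, ow)  # one matrix multiply does it all

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((1, 3, 3))        # a single 3x3 box filter
out = conv2d_gemm(x, k)
print(out.shape)              # (1, 2, 2)
```

The payoff is that the whole convolution runs through one highly tuned GEMM call with cache-friendly, contiguous memory access, at the cost of duplicating overlapping patches in the im2col matrix.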
Architecture Aware Algorithms and Software for Peta and Exascale - inside-BigData.com
Jack Dongarra from the University of Tennessee presented these slides at Ken Kennedy Institute of Information Technology on Feb 13, 2014.
Listen to the podcast review of this talk: http://insidehpc.com/2014/02/13/week-hpc-jack-dongarra-talks-algorithms-exascale/
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Big Data - Rakuten Group, Inc.
Rakuten Technology Conference 2013
"TSUBAME2.5 to 3.0 and Convergence with Extreme Big Data"
Satoshi Matsuoka
Professor
Global Scientific Information and Computing (GSIC) Center
Tokyo Institute of Technology
Fellow, Association for Computing Machinery (ACM)
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility - inside-BigData.com
In this deck from the Swiss HPC Conference, Mark Wilkinson presents: 40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility.
"DiRAC is the integrated supercomputing facility for theoretical modeling and HPC-based research in particle physics, and astrophysics, cosmology, and nuclear physics, all areas in which the UK is world-leading. DiRAC provides a variety of compute resources, matching machine architecture to the algorithm design and requirements of the research problems to be solved. As a single federated Facility, DiRAC allows more effective and efficient use of computing resources, supporting the delivery of the science programs across the STFC research communities. It provides a common training and consultation framework and, crucially, provides critical mass and a coordinating structure for both small- and large-scale cross-discipline science projects, the technical support needed to run and develop a distributed HPC service, and a pool of expertise to support knowledge transfer and industrial partnership projects. The on-going development and sharing of best-practice for the delivery of productive, national HPC services with DiRAC enables STFC researchers to produce world-leading science across the entire STFC science theory program."
Watch the video: https://wp.me/p3RLHQ-k94
Learn more: https://dirac.ac.uk/
and
http://hpcadvisorycouncil.com/events/2019/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Designing HPC & Deep Learning Middleware for Exascale Systems (inside-BigData.com)
DK Panda from Ohio State University presented this deck at the 2017 HPC Advisory Council Stanford Conference.
"This talk will focus on challenges in designing runtime environments for exascale systems with millions of processors and accelerators to support various programming models. We will focus on MPI, PGAS (OpenSHMEM, CAF, UPC and UPC++) and Hybrid MPI+PGAS programming models by taking into account support for multi-core, high-performance networks, accelerators (GPGPUs and Intel MIC), virtualization technologies (KVM, Docker, and Singularity), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries will be presented."
Watch the video: http://wp.me/p3RLHQ-glW
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
We live in an era where the atomic building elements of silicon computers, e.g., transistors and wires, are no longer visible under traditional optical microscopes and their sizes are measured in tens of Angstroms. In addition, power dissipation per unit volume is bounded by the laws of physics, which has resulted, among other things, in stagnating processor clock frequencies. Adding more and more processor cores that perform simpler and simpler tasks, in an attempt to efficiently fill the available on-chip area, appears to be the current trend taken by the industry.
In this video from the Argonne Training Program on Extreme-Scale Computing 2019, Jeffrey Vetter from ORNL presents: The Coming Age of Extreme Heterogeneity.
"In this talk, I'm going to talk about the high-level trends guiding our industry. Moore’s Law as we know it is definitely ending for either economic or technical reasons by 2025. Our community must aggressively explore emerging technologies now!"
Watch the video: https://wp.me/p3RLHQ-lic
Learn more: https://ft.ornl.gov/~vetter/
and
https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The increasing demand for computing power in fields such as biology, finance, and machine learning is pushing the adoption of reconfigurable hardware to keep up with the required performance at a sustainable power consumption. In this context, FPGA devices represent an interesting solution, as they combine power efficiency, performance, and flexibility. Nevertheless, the steep learning curve and the experience needed to develop efficient FPGA-based systems remain among the main limiting factors for the broad adoption of such devices.
In this talk, we present CAOS, a framework that helps the application designer identify acceleration opportunities and guides them through the implementation of the final FPGA-based system. The CAOS platform targets the full stack of the application optimization process, from identifying the kernel functions to accelerate, to optimizing those kernels, to generating the runtime management and configuration files needed to program the FPGA.
In this deck from the HPC User Forum in Austin, Yutaka Ishikawa from Riken AICS presents: Japan's post K Computer.
Watch the video presentation: http://wp.me/p3RLHQ-fJ6
Learn more: http://hpcuserforum.com
A Library for Emerging High-Performance Computing Clusters (Intel® Software)
Deployed next-generation architectures and systems are characterized by high concurrency, low memory per core, and multiple levels of hierarchy and heterogeneity. These characteristics bring new challenges in energy efficiency, fault tolerance, and scalability. Next-generation programming models and their associated middleware and runtimes have a responsibility to tackle these challenges.
This talk focuses on challenges and opportunities in designing efficient runtimes using a formula (MPI+X) to accelerate applications for emerging high-performance computing (HPC) systems with millions of processors and featuring next-generation interconnects. Energy-aware designs and codesign schemes for such environments are also emphasized. View features and sample performance numbers from the MVAPICH2 libraries.
"Linear algebra is a fundamental tool in many fields of science and technology. It is particularly important in physics, engineering, computer science, and statistics. The ability to efficiently manipulate large amounts of data and complex matrices is essential in these areas for problem solving and decision making.
At first glance, linear algebra may seem far removed from our daily lives. However, techniques such as singular value decomposition and linear regression, used to train models and make accurate predictions, lie behind artificial intelligence and machine learning. Does ChatGPT ring a bell? It may not look like it, but linear algebra is also behind some of its processes. For this reason we must keep working in this field, since its importance will only grow as ever larger amounts of data are generated and analyzed in today's world."
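As a concrete illustration of the linear algebra behind machine-learning predictions, here is a minimal ordinary-least-squares line fit via the closed-form normal equations (an illustrative sketch, not material from the talk):

```python
# Fit y = a*x + b by minimizing the sum of squared residuals,
# using the closed-form normal equations for the 1-D case.

def fit_line(xs, ys):
    """Return slope a and intercept b minimizing sum((a*x + b - y)^2)."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Points lying exactly on y = 2x + 1 are recovered exactly.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 2.0 1.0
```

In practice libraries solve the same problem with matrix factorizations (QR, SVD), which is exactly where the decompositions mentioned above enter the picture.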
The COVID-19 pandemic has brought a proliferation of maps and counter-maps. Civil society organizations and social movements have produced their own interpretations and representations of data about the crisis, helping to make visible aspects, subjects, and topics that have been neglected or underrepresented in hegemonic, dominant visualizations. Against this background, this talk analyzes the social imaginaries associated with map-making during the pandemic: the importance of maps for digital activism, the potential attributed to this technology, and the values associated with the visualizations created with it. The ultimate goal is to reflect on the emerging avenue of data activism, as well as on the intersection between social imaginaries and digital geography.
Designing RISC-V-based Accelerators for next generation Computers (DRAC) is a 3-year project (2019-2022) funded by the ERDF Operational Program of Catalonia 2014-2020. DRAC will design, verify, implement and fabricate a high performance general purpose processor that will incorporate different accelerators based on the RISC-V technology, with specific applications in the field of post-quantum security, genomics and autonomous navigation. In this talk, we will provide an overview of the main achievements in the DRAC project, including the fabrication of Lagarto, the first RISC-V processor developed in Spain.
This talk will begin by introducing the uElectronics section of ESA at ESTEC and the general activities the group is responsible for. It will then go through some of the ongoing R&D activities the group is involved in, hand in hand with universities and/or companies. A major one concerns the European rad-hard FPGAs that have been partially funded by ESA for several years and that will play a major role in the sector in the upcoming years. Also worth mentioning are the RTL soft IPs currently under development, which will allow us to keep providing the European ecosystem with some key capabilities. The last part will be an overview of ongoing RISC-V space-hardening activities that might replace the current SPARC-based processors available for our missions.
The goal of this talk is to present the latest additions to the ARM architecture and to describe trends in the microarchitecture of ARM processors. ARM is a relatively small company compared with other giants of the technology sector. However, the broad adoption of its architecture, which is widely dominant in some segments, and of its microarchitectures places ARM technology at the center of today's technological development. ARM technology is present across virtually the entire technology spectrum, from the simplest devices to HPC and cloud computing, including smartphones, automotive, consumer electronics, and more.
"Formal verification has been used by computer scientists for decades to prevent
software bugs. However, with a few exceptions, it has not been used by researchers
working in most areas of mathematics (geometry, algebra, analysis, etc.). In this
talk, we will discuss how this has changed in the past few years, and the possible
implications to the future of mathematical research, teaching and communication.
We will focus on the theorem prover Lean and its mathematical library
mathlib, since this is currently the system most widely used by mathematicians.
Lean is a functional programming language and interactive theorem prover based
on dependent type theory, with proof irrelevance and non-cumulative universes.
The mathlib library, open-source and designed as a basis for research level
mathematics, is one of the largest collections of formalized mathematics. It allows
classical reasoning, uses large- and small-scale automation, and is characterized
by its decentralized nature with over 200 contributors, including both computer
scientists and mathematicians."
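As a small taste of what Lean code looks like, here is a minimal example in Lean 4 syntax (illustrative only, not from the talk; `Nat.add_comm` is a lemma in Lean's core library):

```lean
-- A proof term applying an existing lemma:
-- addition on natural numbers commutes.
example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```

Research-level formalization in mathlib builds statements and proofs of this shape up to entire theories, with tactics automating the routine steps.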
"Part of the research community thinks that it is still too early to tackle the development of quantum software engineering techniques, the reason being that what the quantum computers of the future will look like is still unknown. However, there are some facts we can affirm today: 1) quantum and classical computers will coexist, each dedicated to the tasks at which it is most efficient; 2) quantum computers will be part of the cloud infrastructure and will be accessible through the Internet; 3) complex software systems will be made up of smaller pieces that collaborate with each other; 4) some of those pieces will be quantum, so the systems of the future will be hybrid; 5) the coexistence and interaction between the components of such hybrid systems will be supported by service composition: quantum services.
This talk analyzes the challenges that the integration of quantum services poses to Service Oriented Computing."
In this talk, after a brief overview of AI concepts in particular Machine Learning (ML) techniques, some of the well-known computer design concepts for high performance and power efficiency are presented. Subsequently, those techniques that have had a promising impact for computing ML algorithms are discussed. Deep learning has emerged as a game changer for many applications in various fields of engineering and medical sciences. Although the primary computation function is matrix vector multiplication, many competing efficient implementations of this primary function have been proposed and put into practice. This talk will review and compare some of those techniques that are used for ML computer design.
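Since the talk singles out matrix-vector multiplication as the primary kernel, a bare-bones version makes the computation concrete; a crude quantization helper stands in for the reduced-precision tricks many efficient implementations use (both are illustrative sketches, not from the talk):

```python
def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def quantize(v, step=0.25):
    """Crude fixed-point quantization of a vector (illustrative
    stand-in for reduced-precision arithmetic)."""
    return [round(val / step) * step for val in v]

A = [[1, 2],
     [3, 4]]
print(matvec(A, [1, 1]))      # [3, 7]
print(quantize([0.3, 1.9]))   # [0.25, 2.0]
```

Hardware accelerators differ mainly in how they schedule, tile, and reduce the precision of exactly this loop nest.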
After a brief introduction to medical informatics and a few practical notes on Artificial Intelligence (a possible consensus definition, strong vs. weak AI, and commonly used techniques and methods), the core of the talk presents practical examples (as success stories) of developments by the Next Generation Computer Systems group (SING: http://sing-group.org/) in the areas of (i) clinical informatics (InNoCBR, PolyDeep), (ii) informatics for clinical research (PathJam, WhichGenes), (iii) translational bioinformatics (genomics: ALTER; proteomics: DPD, BI, BS, Mlibrary, Mass-Up; and omics data integration: PunDrugs), and (iv) public health informatics (CURMIS4th). Finally, it briefly discusses the importance that interpretable AI (XAI, Explainable Artificial Intelligence) and human participation (HITL, Human-In-The-Loop) are expected to have in the near future. The talk closes with a short reflection on the lessons the speaker has learned after more than 16 years building intelligent systems for medical informatics.
Many emerging applications require methods tailored towards high-speed data acquisition and filtering of streaming data followed by offline event reconstruction and analysis. In this case, the main objective is to relieve the immense pressure on the storage and communication resources within the experimental infrastructure. In other applications, ultra low latency real time analysis is required for autonomous experimental systems and anomaly detection in acquired scientific data in the absence of any prior data model for unknown events. At these data rates, traditional computing approaches cannot carry out even cursory analyses in a time frame necessary to guide experimentation. In this talk, Prof. Ogrenci will present some examples of AI hardware architectures. She will discuss the concept of co-design, which makes the unique needs of an application domain transparent to the hardware design process and present examples from three applications: (1) An in-pixel AI chip built using the HLS methodology; (2) A radiation hardened ASIC chip for quantum systems; (3) An FPGA-based edge computing controller for real-time control of a High Energy Physics experiment.
This talk reviews the concept of autonomy for mobile field robots, identifies the challenges to achieving a truly autonomous system, and suggests possible research directions. Intelligent robotic systems usually acquire knowledge of their functions and working environment at design and development time. This approach is not always efficient, especially in semi-structured, complex environments such as crop fields. A truly autonomous robotic system should develop skills that let it succeed in such environments without a priori ontological knowledge of the work area or a predefined set of tasks or behaviors. The talk will therefore present AI-based strategies for improving the navigation capabilities of mobile robots so that they offer a level of autonomy high enough to execute all the tasks of a home-to-home mission.
Quantum computing has become a noteworthy topic in academia and industry. Multinational companies around the world have made impressive advances in all areas of quantum technology during the last two decades, and are trying to build real quantum computers in order to exploit their theoretical advantages over today's classical computers in practical applications. However, building a full-scale quantum computer remains challenging because of its increased susceptibility to errors due to decoherence and other quantum noise. Therefore, quantum error correction (QEC) and fault-tolerance protocols will be essential for running quantum algorithms on large-scale quantum computers.
The overall effect of noise is modeled in terms of a set of Pauli operators and the identity acting on the physical qubits (bit flip, phase flip, and a combination of bit and phase flips). In addition to Pauli errors, there are leakage errors, which occur when a qubit leaves the defined computational subspace. Since the location of leakage errors is unknown, they can damage quantum computations even more. This talk will briefly cover these quantum error models.
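The Pauli error channel described here can be made concrete on a single qubit, whose state is a pair of complex amplitudes (an illustrative sketch; real QEC tracks such errors across many physical qubits):

```python
# Single-qubit Pauli errors on a state [amp_of_|0>, amp_of_|1>].

def bit_flip(state):
    """Pauli X: swaps the |0> and |1> amplitudes."""
    a, b = state
    return [b, a]

def phase_flip(state):
    """Pauli Z: negates the |1> amplitude."""
    a, b = state
    return [a, -b]

plus = [2 ** -0.5, 2 ** -0.5]        # the |+> state
print(phase_flip(plus))              # becomes |->: the phase error shows up
print(bit_flip([1, 0]))              # |0> -> |1>: a bit error
print(phase_flip(bit_flip([1, 0])))  # combined bit-and-phase (Y-type, up to phase)
```

A leakage error, by contrast, cannot be written this way at all: the qubit leaves the two-dimensional subspace entirely, which is why it is harder to detect and correct.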
Chatbots are a key element in the digital transformation of our society. They are everywhere: eCommerce, digital health, customer support, tourism... But if you have used one, it has probably disappointed you. I confess: most existing chatbots are very bad. Building a chatbot that is genuinely useful and intelligent is far from easy. A chatbot combines all the complexity of software engineering with that of natural language processing. Consider that many chatbots must be deployed across several channels (web, Telegram, Slack, ...), and often have to use external APIs and services, access internal databases, or integrate pretrained language models (e.g., toxicity detectors). And the problem is not only building the bot, but also testing it and evolving it. In this talk we will look at the biggest challenges that arise when a development project includes a chatbot, and which techniques and strategies we can apply, depending on the project's needs, to end up with a chatbot that, this time, knows what it is talking about.
Many HPC applications are massively parallel and can benefit from the spatial parallelism offered by reconfigurable logic. While modern memory technologies can offer high bandwidth, designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. Addressing these challenges requires combining compiler optimizations, high-level synthesis, and hardware design.
In this talk, I will present challenges, solutions, and trends for generating massively parallel accelerators on FPGA for high-performance computing. These architectures can provide performance comparable to software implementations on high-end processors, and much higher energy efficiency thanks to logic customization.
The main challenge of concurrent software verification has always been in achieving modularity, i.e., the ability to divide and conquer the correctness proofs with the goal of scaling the verification effort. Types are a formal method well-known for its ability to modularize programs, and in the case of dependent types, the ability to modularize and scale complex mathematical proofs.
In this talk I will present our recent work towards reconciling dependent types with shared-memory concurrency, with the goal of achieving modular proofs for the latter. Applying the type-theoretic paradigm to concurrency has led us to view separation logic as a type theory of state, and has motivated novel abstractions for expressing concurrency proofs based on the algebraic structure of a resource and on structure-preserving functions (i.e., morphisms) between resources.
Microarchitectural attacks, such as Spectre and Meltdown, are a class of
security threats that affect almost all modern processors. These attacks exploit the side-effects resulting from processor optimizations to leak sensitive information and compromise a system’s security.
Over the years, a large number of hardware and software mechanisms for
preventing microarchitectural leaks have been proposed. Intuitively, more
defensive mechanisms are less efficient, while more permissive mechanisms may offer more performance but require more defensive programming. Unfortunately, there are no
hardware-software contracts that would turn this intuition into a basis for
principled co-design.
In this talk, we present a framework for specifying hardware/software security
contracts, an abstraction that captures a processor’s security guarantees in a
simple, mechanism-independent manner by specifying which program executions a
microarchitectural attacker can distinguish.
The appearance of vulnerabilities caused by missing security controls is one of the reasons new frameworks are in demand to produce software that is secure by default. The talk addresses how to transform the software development process by giving security the importance it deserves from the start of the life cycle. To that end, it proposes a new development model, the Viewnext-UEx model, which incorporates security practices preventively and systematically into every phase of the software life cycle. The purpose of this new model is to detect vulnerabilities earlier by applying security from the earliest phases, while also streamlining software construction processes. Results from a preventive scenario, after applying the Viewnext-UEx model, are presented against the traditional reactive scenario of applying security only from the testing phase onward.
This lecture will address issues of trust in computer systems and artificial intelligence, and attacks on these types of systems, with practical examples. Artificial intelligence has gained ground in several areas with different application scenarios, but from the perspective of this lecture the fundamental questions are: what should an artificial intelligence system do from a security perspective, and how does an intelligent system produce results on a given subject? Few people are really concerned about the behavior of these types of systems from a security point of view. If you like machine learning and security, I believe this lecture will show you interesting security problems in artificial intelligence systems.
The use of renewable energy is key to meeting the sustainable development goals of the 2030 Agenda. Among renewables, wind is the second most used thanks to its high efficiency, and some studies suggest it will be the main source of generation by 2050. It is therefore worth continuing to research the application of advanced control techniques to these systems.
Among these advanced techniques, neural networks and reinforcement learning combined with classical control strategies stand out. They have already been used successfully to model and control complex systems.
This talk will present the application of neural networks and reinforcement learning to wind turbine control, focusing on pitch control. Different configurations of neural networks and other techniques applied to pitch control will be described. Finally, some hybrid techniques combining fuzzy logic, lookup tables, and neural networks will be proposed, with results that demonstrate their usefulness for improving wind turbine efficiency.
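One of the ingredients mentioned in the talk, a lookup table, can be sketched as a gain schedule with linear interpolation between breakpoints (the numbers below are invented for illustration and are not turbine data):

```python
def interp(table, x):
    """Piecewise-linear lookup; table is a sorted list of (x, y) pairs."""
    if x <= table[0][0]:
        return table[0][1]
    if x >= table[-1][0]:
        return table[-1][1]
    for (x0, y0), (x1, y1) in zip(table, table[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Hypothetical pitch angle (degrees) as a function of wind speed (m/s).
pitch_table = [(4, 0.0), (12, 0.0), (16, 8.0), (25, 22.0)]
print(interp(pitch_table, 14))  # 4.0, halfway between the 12 and 16 m/s entries
```

Hybrid schemes of the kind the talk proposes would let a neural network or fuzzy layer adjust or replace such a table online rather than leaving it fixed.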
As the world's energy demand rises, so does the share of renewable energy, particularly wind energy, in the supply. The life cycle of a wind farm, from manufacturing the components to decommissioning, involves significant cost, and applications of AI and data analytics to reducing these costs are still limited. This talk introduces some interesting applications of AI and data analytics in offshore wind and highlights future challenges and opportunities. It should be useful for students, academics, and researchers who want to build a career in offshore wind but do not yet know where to start.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
This presentation describes the working procedure of Shahjalal Fertilizer Company Limited (SFCL), a government-owned company of Bangladesh Chemical Industries Corporation under the Ministry of Industries.
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Hybrid optimization of pumped hydro system and solar - Engr. Abdul-Azeez.pdf
Advancements in technology unveil a myriad of electrical and electronic breakthroughs geared towards efficiently harnessing limited resources to meet human energy demands. The optimization of hybrid solar PV panels and pumped hydro energy supply systems plays a pivotal role in utilizing natural resources effectively. This initiative not only benefits humanity but also fosters environmental sustainability. The study investigated the design optimization of these hybrid systems, focusing on understanding solar radiation patterns, identifying geographical influences on solar radiation, formulating a mathematical model for system optimization, and determining the optimal configuration of PV panels and pumped hydro storage. Through a comparative analysis approach and eight weeks of data collection, the study addressed key research questions related to solar radiation patterns and optimal system design. The findings highlighted regions with heightened solar radiation levels, showcasing substantial potential for power generation and emphasizing the system's efficiency. Optimizing system design significantly boosted power generation, promoted renewable energy utilization, and enhanced energy storage capacity. The study underscored the benefits of optimizing hybrid solar PV panels and pumped hydro energy supply systems for sustainable energy usage. Optimizing the design of solar PV panels and pumped hydro energy supply systems as examined across diverse climatic conditions in a developing country, not only enhances power generation but also improves the integration of renewable energy sources and boosts energy storage capacities, particularly beneficial for less economically prosperous regions. Additionally, the study provides valuable insights for advancing energy research in economically viable areas. 
Recommendations included conducting site-specific assessments, utilizing advanced modeling tools, implementing regular maintenance protocols, and enhancing communication among system components.
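The kind of configuration search such a study performs can be caricatured as a brute-force sweep over panel and storage counts; every coefficient below (generation per panel, storage per unit, prices, demand) is invented purely for illustration:

```python
def total_cost(panels, tanks, demand_kwh=100):
    """Capital cost of a configuration, or None if it cannot meet demand."""
    gen = panels * 1.5           # kWh/day per panel (assumed)
    store = tanks * 20           # kWh per pumped-hydro storage unit (assumed)
    if gen < demand_kwh or store < 0.3 * demand_kwh:
        return None              # infeasible: generation or storage too small
    return panels * 120 + tanks * 900   # assumed unit prices

# Exhaustive sweep over a small configuration space; infeasible
# configurations are pushed to infinite cost so min() skips them.
best = min(
    ((total_cost(p, t), p, t) for p in range(1, 101) for t in range(1, 11)),
    key=lambda c: c[0] if c[0] is not None else float("inf"),
)
print(best)  # cheapest feasible (cost, panels, tanks)
```

Real sizing studies replace this toy cost model with measured solar radiation profiles and hydraulic models, but the structure of the optimization is the same.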
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD analysis is incredibly effective at solving mysteries and improving the performance of complex systems!
Here's a great example: At a large natural gas-fired power plant, where they use waste heat to generate steam and energy, they were puzzled that their boiler wasn't producing as much steam as expected.
R&R and Tetra Engineering Group Inc. were asked to solve the issue with reduced steam production.
An inspection had shown that a significant amount of hot flue gas was bypassing the boiler tubes, where the heat was supposed to be transferred.
R&R Consult conducted a CFD analysis, which revealed that 6.3% of the flue gas was bypassing the boiler tubes without transferring heat. The analysis also showed that the flue gas was instead being directed along the sides of the boiler and between the modules that were supposed to capture the heat. This was the cause of the reduced performance.
Based on our results, Tetra Engineering installed covering plates to reduce the bypass flow. This improved the boiler's performance and increased electricity production.
It is always satisfying when we can help solve complex challenges like this. Do your systems also need a check-up or optimization? Give us a call!
Work done in cooperation with James Malloy and David Moelling from Tetra Engineering.
More examples of our work https://www.r-r-consult.dk/en/cases-en/
Sachpazis: Terzaghi Bearing Capacity Estimation in simple terms with Calculati... (Dr. Costas Sachpazis)
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
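Terzaghi's headline formula for a strip footing, q_u = c·N_c + γ·D·N_q + 0.5·γ·B·N_γ, is straightforward to code; the bearing capacity factors depend on the friction angle and are tabulated in the geotechnical literature (this is a generic sketch, not the HTML calculator the deck includes):

```python
def terzaghi_qu(c, gamma, D, B, Nc, Nq, Ngamma):
    """Terzaghi ultimate bearing capacity (kPa) for a strip footing.

    c      -- soil cohesion (kPa)
    gamma  -- unit weight of soil (kN/m^3)
    D      -- footing depth (m); B -- footing width (m)
    Nc, Nq, Ngamma -- bearing capacity factors for the friction angle
    """
    return c * Nc + gamma * D * Nq + 0.5 * gamma * B * Ngamma

# Undrained clay (phi = 0): Terzaghi's factors are Nc = 5.7, Nq = 1, Ngamma = 0.
qu = terzaghi_qu(c=25, gamma=18, D=1.5, B=2.0, Nc=5.7, Nq=1.0, Ngamma=0.0)
print(qu)  # 169.5 kPa
```

Dividing q_u by a factor of safety (commonly around 3) gives the allowable bearing pressure used in design.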
Immunizing Image Classifiers Against Localized Adversary Attacks
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks
(CNN)s, to adversarial attacks and presents a proactive training technique designed to counter them. We
introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations.
When combined with 3D convolution and deep curriculum learning optimization (CLO), it significantly improves
the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach
using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10
and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing
accuracy improvements over previous techniques. The results indicate that the combination of the volumetric
input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating
adversary training.
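The paper's exact volumization algorithm is not reproduced here; the sketch below shows one simple way a 2D image can be lifted to a 3D binary volume, by one-hot encoding each pixel's intensity along a depth axis (purely illustrative, under our own assumptions):

```python
def volumize(image, depth=4, max_val=255):
    """Map each pixel to a one-hot depth index; returns volume[h][w][d]."""
    volume = []
    for row in image:
        vrow = []
        for pix in row:
            # Bucket the intensity into one of `depth` slabs.
            d = min(pix * depth // (max_val + 1), depth - 1)
            vrow.append([1 if k == d else 0 for k in range(depth)])
        volume.append(vrow)
    return volume

img = [[0, 64],
       [128, 255]]
print(volumize(img))
```

A 3D convolution over such a volume sees intensity as a spatial coordinate rather than a value, which is one intuition for why localized 2D perturbations become easier to absorb.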
8. Our origins (Plan Nacional de Investigación)
High-performance Computing group @ Computer Architecture Department (UPC): a timeline (1987-2015) of competitive national projects:
- High-speed Low-cost Parallel Architecture Design (PA85-0314)
- Parallelism Exploitation in High Speed Architectures (TIC89-299)
- Architectures and Compilers for Supercomputers (TIC92-880)
- High Performance Computing (TIC95-429)
- High Performance Computing II (TIC98-511-C02-01)
- High Performance Computing III (TIC2001-995-C02-01)
- High Performance Computing IV (TIN2004-07739-C02-01)
- High Performance Computing V (TIN2007-60625, 2008-2011)
- High Performance Computing VI (TIN2012-34557, 2012-2015)
Hosted successively at CEPBA, CIRI and BSC, with industrial collaborations including COMPAQ, MICROSOFT, IBM, INTEL (Exascale), NVIDIA, REPSOL, SAMSUNG and IBERDROLA.
10. Barcelona Supercomputing Center - Centro Nacional de Supercomputación
BSC-CNS is a consortium that includes: Spanish Government (60%), Catalonian Government (30%) and Univ. Politècnica de Catalunya (UPC) (10%).
BSC-CNS objectives: supercomputing services to Spanish and EU researchers; R&D in Computer, Life, Earth and Engineering Sciences; PhD programme, technology transfer and public engagement.
11. Barcelona Supercomputing Center
Centro Nacional de Supercomputación
475 people from 44 countries (*as of 31st December 2016)
Competitive project funding secured (2005 to 2017): total 144.8 M€ (information compiled 16/01/2017)
• Europe 71.9 M€
• National 34 M€
• Companies 38.9 M€
12. The MareNostrum 3 Supercomputer
• Over 10^15 floating-point operations per second
• Usage share: 70% PRACE, 24% RES, 6% BSC-CNS
• 3 PB of disk storage
• 100.8 TB of main memory
• Nearly 50,000 cores
13. The MareNostrum 4 Supercomputer
Total peak performance: 13.7 Pflop/s, 12 times more powerful than MareNostrum 3
Compute
• General purpose, for the current BSC workload: more than 11 Pflop/s, with 3,456 nodes of Intel Xeon v5 processors
• Emerging technologies, for evaluation of 2020 exascale systems: 3 systems, each of more than 0.5 Pflop/s, with KNL/KNH, POWER+NVIDIA, and ARMv8
Storage
• More than 10 PB of GPFS (Elastic Storage Server)
Network
• IB EDR/OPA, Ethernet
Operating system: SuSE
14. Mission of BSC Scientific Departments
• Computer Sciences: to influence the way machines are built, programmed and used: computer architecture, programming models, performance tools, Big Data, Artificial Intelligence
• Earth Sciences: to develop and implement global and regional state-of-the-art models for short-term air quality forecast and long-term climate applications
• Life Sciences: to understand living organisms by means of theoretical and computational methods (molecular modeling, genomics, proteomics)
• CASE: to develop scientific and engineering software to efficiently exploit supercomputing capabilities (biomedical, geophysics, atmospheric, energy, social and economic simulations)
15. Design of Superscalar Processors
• Simple interface: a sequential program
• The hardware extracts ILP
• The ISA keeps programs "decoupled" from the hardware
• Applications decoupled from the software stack
16. Latency Has Been a Problem from the Beginning...
• Feeding the pipeline with the right instructions:
• HW/SW trace cache (ICS'99)
• Prophet/Critic Hybrid Branch Predictor (ISCA'04)
• Locality/reuse:
• Cache Memory with Hybrid Mapping (IASTED'87), Victim Cache
• Dual Data Cache (ICS'95)
• A novel renaming mechanism that boosts software prefetching (ICS'01)
• Virtual-Physical Registers (HPCA'98)
• Kilo-Instruction Processors (ISHPC'03, HPCA'06, ISCA'08)
[Pipeline diagram: Fetch, Decode, Rename, Instruction Window, Wakeup+Select, Register File, Bypass, Data Cache, Register Write, Commit]
17. ... and the Power Wall Appeared Later
• Better technologies
• Two-level organization (locality exploitation):
• Register file for superscalar (ISCA'00)
• Instruction queues (ICCD'05)
• Load/Store queues (ISCA'08)
• Direct Wakeup, pointer-based instruction queue design (ICCD'04, ICCD'05)
• Content-aware register file (ISCA'09)
• Fuzzy computation (ICS'01, IEEE CAL'02, IEEE TC'05), currently known as Approximate Computing
[Pipeline diagram as on slide 16]
18. Fuzzy Computation
Trading accuracy for size and performance at low power: binary formats (bmp), compression protocols (jpeg), fuzzy computation.
[Image comparison: one image is the original; the fuzzy-computed one was produced in ~85% of the time while consuming ~75% of the power.]
19. SMT and Memory Latency ...
• Simultaneous Multithreading (SMT)
• Benefits of SMT processors:
• Increased core resource utilization
• Basic pipeline unchanged: few replicated resources, the rest shared
• Some of our contributions:
• Dynamically Controlled Resource Allocation (MICRO 2004)
• Quality of Service (QoS) in SMTs (IEEE TC 2006)
• Runahead Threads for SMTs (HPCA 2008)
[Pipeline diagram as on slide 16, shared by threads 1..N]
20. Time Predictability (in multicore and SMT processors)
Definition: the ability to provide a minimum performance to a task; requires biasing processor resource allocation (the QoS space).
• Where is it required:
• Increasingly in handheld/desktop devices
• Also in embedded hard real-time systems (cars, planes, trains, ...)
• How to achieve it: controlling how resources are assigned to co-running tasks
• Soft real-time systems:
• SMT: DCRA resource allocation policy (MICRO 2004, IEEE Micro 2004)
• Multicores: cache partitioning (ACM OSR 2009, IEEE Micro 2009)
• Hard real-time systems:
• Deterministic resource 'securing' (ISCA 2009)
• Time-randomised designs (DAC 2014 best paper award)
23. The MultiCore Era
Moore's Law + Memory Wall + Power Wall led to Chip MultiProcessors (CMPs)
Examples: POWER4 (2001), Intel Xeon 7100 (2006), UltraSPARC T2 (2007)
24. How Were Multicores Designed at the Beginning?
• IBM Power4 (2001): 2 cores, ST; 0.7 MB/core L2, 16 MB/core L3 (off-chip); 115 W TDP; 10 GB/s memory BW
• IBM Power7 (2010): 8 cores, SMT4; 256 KB/core L2, 16 MB/core L3 (on-chip); 170 W TDP; 100 GB/s memory BW
• IBM Power8 (2014): 12 cores, SMT8; 512 KB/core L2, 8 MB/core L3 (on-chip); 250 W TDP; 410 GB/s memory BW
25. How To Parallelize Future Applications?
• From sequential to parallel codes
• Efficient runs on manycore processors imply handling:
• A massive number of cores and abundant available parallelism
• Heterogeneous systems: same or multiple ISAs; accelerators, specialization
• A deep and heterogeneous memory hierarchy: Non-Uniform Memory Access (NUMA), multiple address spaces
• Stringent energy budgets
• Load balancing
The result: a Programmability Wall.
[Diagram: clusters of cores and accelerators with local interconnects, L2/L3 caches, DRAM and MRAM memories]
26. Living in the Programming Revolution
Multicores made the ISA/API interface "leak":
• Below it: parallel hardware with multiple address spaces (hierarchy, transfers), control flows, ...
• Above it: parallel application logic plus platform specificities
27. Vision in the Programming Revolution
Today the efforts are focused on efficiently using the underlying hardware through the ISA/API. We need to decouple again:
• General-purpose hardware with a single address space below
• Architecture-independent application logic above
Power to the runtime: the programming model is a high-level, clean, abstract interface.
33. Decouple How We Write from How It Is Executed
... and Executed in a Data-Flow Model
Write:
#pragma css task input(A, B) output(C)
void vadd3 (float A[BS], float B[BS], float C[BS]);
#pragma css task input(sum, A) inout(B)
void scale_add (float sum, float A[BS], float B[BS]);
#pragma css task input(A) inout(sum)
void accum (float A[BS], float *sum);

for (i=0; i<N; i+=BS) // C=A+B
  vadd3 (&A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS) // sum(C[i])
  accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS) // B=sum*E
  scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS) // A=C+D
  vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS) // E=G+F
  vadd3 (&G[i], &F[i], &E[i]);
Execute:
[Task dependence graph; the color/number of each task indicates a possible order of task execution]
34. OmpSs: Potential of Data Access Information
• Flat global address space seen by the programmer
• Flexibility to dynamically traverse the dataflow graph, "optimizing":
• Concurrency: critical path
• Memory accesses: data transfers performed by the runtime
• Opportunities for automatic:
• Prefetch
• Reuse
• Elimination of antidependences (renaming)
• Replication management
• Coherency/consistency handled by the runtime
• Layout changes
[Diagram: CPU with on-chip cache, off-chip bandwidth, main memory]
35. CellSs Implementation
The main thread on the PPU runs the user main program and the CellSs PPU library (data-dependence analysis, data renaming, task-graph construction, scheduling); a helper thread performs work assignment and stage-in/stage-out of user data in memory. Each SPE thread (SPU0, SPU1, SPU2, ...) runs the CellSs SPU library around the original task code: DMA in, task execution, DMA out, synchronization and finalization signal.
P. Bellens et al., "CellSs: A Programming Model for the Cell BE Architecture", SC'06
P. Bellens et al., "CellSs: Programming the Cell/B.E. made easier", IBM JR&D 2007
36. Renaming @ Cell
• Experiments on CellSs (predecessor of OmpSs)
• Renaming to avoid anti-dependences:
• Eager (as done in superscalar designs): at task instantiation time
• Lazy (similar to virtual registers): just before task execution
• Results: reduction in execution time for the SMPSs Stream benchmark and in the number of renamings for SMPSs Jacobi (main memory transfers broken down into cold, capacity and killed transfers)
P. Bellens et al., "CellSs: Scheduling Techniques to Better Exploit Memory Hierarchy", Sci. Prog. 2009
37. Data Reuse @ Cell
• Experiments on CellSs
• Data reuse via locality arcs in the dependence graph
• Good locality but high overhead, so no overall time improvement (matrix-matrix multiply)
P. Bellens et al., "CellSs: Scheduling Techniques to Better Exploit Memory Hierarchy", Sci. Prog. 2009
38. Reducing Data Movement @ Cell
• Experiments on CellSs (predecessor of OmpSs)
• Bypassing via a global software cache
• Distributed implementation: one at each SPE
• Object descriptors managed atomically with specific hardware support (line-level LL/SC)
• DMA reads broken down into: main memory (cold), main memory (capacity), global software cache, local software cache
P. Bellens et al., "Making the Best of Temporal Locality: Just-In-Time Renaming and Lazy Write-Back on the Cell/B.E.", IJHPC 2010
39. GPUSs Implementation
• Architecture implications:
• Large local store, O(GB), allowing large task granularity (good)
• Data transfers: slow, not overlapped (bad)
• Cache management: write-through or write-back
• Runtime implementation:
• Powerful main processor with multiple cores
• "Dumb" accelerator (not able to perform data transfers, implement a software cache, ...)
E. Ayguade et al., "An Extension of the StarSs Programming Model for Platforms with Multiple GPUs", Euro-Par 2009
40. Prefetching @ Multiple GPUs
• Improvements in runtime mechanisms (OmpSs + CUDA):
• Use of multiple streams
• High asynchrony and overlap (transfers and kernels)
• Overlapping kernels
• Taking overheads out of the critical path
• Improvements in schedulers:
• Late binding of locality-aware decisions
• Priority propagation
• Benchmarks: N-body, Cholesky
J. Planas et al., "Optimizing Task-based Execution Support on Asynchronous Devices" (submitted)
41. Runtime-Aware Architectures
The runtime drives the hardware design:
• Task-based programming model, annotated by the user
• Data dependencies detected at runtime
• Dynamic scheduling
• "Reuse" of architectural ideas under new constraints
The programming model remains a high-level, clean, abstract interface between the applications and the runtime, below the ISA/API.
42. Superscalar Vision at the Multicore Level
Common walls: Programmability, Resilience, Memory, Power.
Superscalar world: out-of-order, kilo-instruction processors, distant parallelism; branch prediction, speculation; fuzzy computation; dual data cache, sack for VLIW; register renaming, virtual registers; cache reuse, prefetching, victim caches; in-memory computation; accelerators, different ISAs, SMT; critical-path exploitation; resilience.
Multicore world: task-based, data-flow graph, dynamic parallelism; task output prediction, speculation; hybrid memory hierarchy, NVM; late task memory allocation; data reuse, prefetching; in-memory FUs; heterogeneity of tasks and HW; task criticality; resilience; plus load balancing and scheduling, the interconnection network, and data movement.
43. Architecture Proposals in RoMoL
[Diagram: clusters of cores with L1 caches and local memories (LM), L2 and L3 caches, stacked and external DRAM, PICOS, and a cluster interconnect]
• Cluster interconnect: priority-based arbitration, by-pass routing
• Runtime Support Unit: DVFS, light-weight dependence tracking, task memoization, reduced data motion
• Vectors: databases, sorting, B-trees
• Cache hierarchy: LM usage, coherence, eviction policies, reductions
44. Runtime Management of Local Memories (LM)
LM management in OmpSs:
• Task inputs and outputs mapped to the LMs
• The runtime manages the DMA transfers
Results (hybrid hierarchy vs. cache-only, on jacobi, kmeans, md5, tinyjpeg, vec_add, vec_red):
• 8.7% speedup in execution time
• 14% reduction in power
• 20% reduction in network-on-chip traffic
Ll. Alvarez et al., "Transparent Usage of Hybrid on-Chip Memory Hierarchies in Multicores", ISCA 2015
Ll. Alvarez et al., "Runtime-Guided Management of Scratchpad Memories in Multicore Architectures", PACT 2015
[Same cluster diagram as slide 43]
45. OmpSs in Heterogeneous Systems
Heterogeneous systems (big-little processors, accelerators) are hard to program.
Task-based programming models can adapt to these scenarios:
• Detect tasks on the critical path and run them on fast cores
• Non-critical tasks can run on slower cores
• Assign tasks to the most energy-efficient HW component
• The runtime takes care of balancing the load
• Same performance with less power consumption
46. Architectural Support for DVFS
Reduce the overheads of the software solution:
• Serialization in DVFS reconfigurations
• User/kernel mode switches
The Runtime Support Unit (RSU) tracks:
• The power budget
• The state of the cores
• The criticality of running tasks
The runtime system notifies the RSU at task start (criticality, running core) and at task end; the RSU applies the same DVFS reconfiguration algorithm in hardware.
[Diagram: runtime scheduler with high/low-priority ready queues (HPRQ/LPRQ) in SW; the RSU with per-core active state (A/NA), criticality (C/NC) and power budget driving the DVFS controller of cores 0-3 in HW]
47. Architectural Support for DVFS
[Same SW/HW diagram as slide 46]
[Charts: speedup and EDP of the FIFO, CATS, CATA, CATA+RSU and TurboMode schedulers]
E. Castillo et al., "CATA: Criticality Aware Task Acceleration for Multicore Processors", IPDPS'16
48. TaskSuperscalar (TaskSs) Pipeline
• Hardware design for a distributed task superscalar pipeline frontend (MICRO'10)
• Can be embedded into any manycore fabric
• Drives hundreds of threads
• Work windows of thousands of tasks
• Fine-grain task parallelism
• TaskSs components:
• Gateway (GW): allocates resources for task metadata
• Object Renaming Table (ORT): maps memory objects to producer tasks
• Object Versioning Table (OVT): maintains multiple object versions
• Task Reservation Stations (TRS): store and track in-flight task metadata
• Implementing TaskSs @ Xilinx Zynq
[Diagram: GW, ORT, OVT, TRS and ready queue feeding a scheduler over a 4x4 multicore fabric]
Y. Etsion et al., "Task Superscalar: An Out-of-Order Task Pipeline", MICRO-43, 2010
49. Architectural Support for Task Dependence Management (TDM) with Flexible Software Scheduling
• Task creation is a bottleneck since it involves dependence tracking
• Our hardware proposal (TDM):
• Takes care of dependence tracking
• Exposes scheduling to the SW
• Our results demonstrate that this flexibility allows TDM to beat the state of the art
E. Castillo et al., "Architectural Support for Task Dependence Management with Flexible Software Scheduling" (submitted to MICRO'17)
50. Approximate Task Memoization (ATM)
• ATM aims at eliminating redundant computations.
• ATM leverages runtime-system metadata to identify tasks that can be memoized.
• ATM achieves a 1.4x average speedup when applying only memoization techniques (Static ATM).
• ATM achieves a 2.5x average speedup, with an average 0.7% accuracy loss, when adding task approximation (Dynamic ATM).
I. Brumar et al., "ATM: Approximate Task Memoization in the Runtime System", IPDPS'17
51. Exploiting the Task Dependency Graph (TDG) to Reduce Coherence Traffic
• To reduce coherence traffic, the state of the art applies round-robin mechanisms at the runtime level.
• Exploiting the information contained in the TDG is effective to:
• Improve performance
• Dramatically reduce coherence traffic (2.26x reduction with respect to the state of the art)
State-of-the-art partition (DEP) of the Gauss-Seidel TDG: DEP requires ~200 GB of data transfer across a 288-core system.
52. Exploiting the Task Dependency Graph (TDG) to Reduce Coherence Traffic
Graph-algorithms-driven partition (RIP-DEP) of the Gauss-Seidel TDG: RIP-DEP requires only ~90 GB of data transfer across a 288-core system.
I. Sánchez et al., "Reducing Data Movements on Shared Memory Architectures" (submitted to SC'17)
53. Dealing with a New Form of Heterogeneity
• Manufacturing variability of CPUs: different power consumption
• Power variability becomes performance heterogeneity in power-constrained environments
• Typical load balancing may not be sufficient
• Redistributing power and the number of active cores among sockets can improve performance (optimal vs. even distribution)
54. Dynamic Analysis and Exploration
• Statically trying all configurations is not practical:
• Huge overhead (one execution per configuration)
• Has to be performed on each node
• Online analysis: try multiple configurations in a single run, yielding both performance benefits and energy savings
D. Chasapis et al., "Runtime-Guided Mitigation of Manufacturing Variability in Power-Constrained Multi-Socket NUMA Nodes", ICS'16
55. Introduction: A New Form of Heterogeneity
• Platform: 2 sockets with 12-core Intel Xeon E5-2695 v2
• Power variability becomes performance heterogeneity in power-constrained environments
56. Hash Join, Sorting, Aggregation, DBMS
• Goal: vector acceleration of databases
• "Real vector" extensions to x86:
• Operands pipelined to the functional unit (like Cray machines, not like SSE/AVX)
• Scatter/gather, masking, a vector-length register
• Implemented in PTLSim + DRAMSim2
• Hash-join work published in MICRO 2012:
• 1.94x (large data sets) and 4.56x (cache-resident data sets) speedup for TPC-H
• Memory bandwidth is the bottleneck
• Sorting paper published in HPCA 2015:
• Compares existing vectorized quicksort, bitonic mergesort and radix sort on a consistent platform
• Proposes a novel approach (VSR) for vectorizing radix sort with 2 new instructions
• Similar to the AVX512-CD instructions (but Intel's instructions cannot be used because the algorithm requires strict ordering)
• Small CAM
• 3.4x speedup over the next-best vectorised algorithm with the same hardware configuration, due to:
• Transforming strided accesses into unit-stride ones
• Eliminating replicated data structures
• Ongoing work on aggregations (ISCA 2016):
• Reduction to a group of values, not a single scalar value
• Building on the VSR work
[Chart: speedup over the scalar baseline for quicksort, bitonic, radix and VSR at maximum vector lengths 8-64 with 1, 2 and 4 lanes]
57. Overlap Communication and Computation
• Hybrid MPI/OmpSs: Linpack example
• Extends asynchronous data-flow execution to the outer level
• Taskifies MPI communication primitives
• Automatic lookahead
• Improved performance:
• Tolerance to network bandwidth
• Tolerance to OS noise
V. Marjanovic et al., "Overlapping Communication and Computation by using a Hybrid MPI/SMPSs Approach", ICS 2010
58. Effects on Bandwidth
Overlapping flattens the communication pattern, thus reducing bandwidth requirements (*simulation on an application with a ring communication pattern).
V. Subotic et al., "Overlapping communication and computation by enforcing speculative data-flow", HiPEAC, January 2008
59. Related Work
• Rigel Architecture (ISCA 2009):
• No L1D; non-coherent L2; read-only, private and cluster-shared data
• Global accesses bypass the L2 and go directly to L3
• SARC Architecture (IEEE Micro 2010):
• Throughput-aware architecture
• TLBs used to access remote LMs and migrate data across LMs
• Runnemede Architecture (HPCA 2013):
• Coherence islands (SW-managed) plus a hierarchy of LMs
• Dataflow execution (codelets)
• Carbon (ISCA 2007): hardware scheduling for task-based programs
• Holistic run-time parallelism management (ICS 2013)
• Runtime-guided coherence protocols (IPDPS 2014)
60. RoMoL … papers
• V. Marjanovic et al., "Effective communication and computation overlap with hybrid MPI/SMPSs". PPoPP 2010
• Y. Etsion et al., "Task Superscalar: An Out-of-Order Task Pipeline". MICRO 2010
• N. Vujic et al., "Automatic Prefetch and Modulo Scheduling Transformations for the Cell BE Architecture". IEEE TPDS 2010
• V. Marjanovic et al., "Overlapping communication and computation by using a hybrid MPI/SMPSs approach". ICS 2010
• T. Hayes et al., "Vector Extensions for Decision Support DBMS Acceleration". MICRO 2012
• L. Alvarez et al., "Hardware-software coherence protocol for the coexistence of caches and local memories". SC 2012
• M. Valero et al., "Runtime-Aware Architectures: A First Approach". SuperFRI 2014
• L. Alvarez et al., "Hardware-Software Coherence Protocol for the Coexistence of Caches and Local Memories". IEEE TC 2015
61. RoMoL … papers
• M. Casas et al., "Runtime-Aware Architectures". Euro-Par 2015
• T. Hayes et al., "VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors". HPCA 2015
• K. Chronaki et al., "Criticality-Aware Dynamic Task Scheduling for Heterogeneous Architectures". ICS 2015
• L. Alvarez et al., "Coherence Protocol for Transparent Management of Scratchpad Memories in Shared Memory Manycore Architectures". ISCA 2015
• L. Alvarez et al., "Run-Time Guided Management of Scratchpad Memories in Multicore Architectures". PACT 2015
• L. Jaulmes et al., "Exploiting Asynchrony from Exact Forward Recoveries for DUE in Iterative Solvers". SC 2015
• D. Chasapis et al., "PARSECSs: Evaluating the Impact of Task Parallelism in the PARSEC Benchmark Suite". ACM TACO 2016
• E. Castillo et al., "CATA: Criticality Aware Task Acceleration for Multicore Processors". IPDPS 2016
62. RoMoL … papers
• T. Hayes et al., "Future Vector Microprocessor Extensions for Data Aggregations". ISCA 2016
• D. Chasapis et al., "Runtime-Guided Mitigation of Manufacturing Variability in Power-Constrained Multi-Socket NUMA Nodes". ICS 2016
• P. Caheny et al., "Reducing cache coherence traffic with hierarchical directory cache and NUMA-aware runtime scheduling". PACT 2016
• T. Grass et al., "MUSA: A multi-level simulation approach for next-generation HPC machines". SC 2016
• I. Brumar et al., "ATM: Approximate Task Memoization in the Runtime System". IPDPS 2017
• K. Chronaki et al., "Task Scheduling Techniques for Asymmetric Multi-Core Systems". IEEE TPDS 2017
• C. Ortega et al., "libPRISM: An Intelligent Adaptation of Prefetch and SMT Levels". ICS 2017
63. Roadmaps to Exaflop
• China: from Tianhe-2 to Tianhe-2A, with domestic technology
• Japan: from the K computer to Post-K, with domestic technology
• Europe: from the PPP for HPC to future PRACE systems... with domestic technology? An IPCEI on HPC?
64. HPC Is a Global Competition
"The country with the strongest computing capability will host the world's next scientific breakthroughs."
Lamar Smith (R-TX), Chairman, US House Science, Space and Technology Committee
"Our goal is for Europe to become one of the top 3 world leaders in high-performance computing by 2020."
Jean-Claude Juncker, European Commission President (27 October 2015)
"Europe can develop an exascale machine with ARM technology. Maybe we need an … consortium for HPC and Big Data."
Mateo Valero, Seymour Cray Award Ceremony, Nov. 2015
65. HPC: A Disruptive Technology for Industry
"...Europe has a unique opportunity to act and invest in the development and deployment of High Performance Computing (HPC) technology, Big Data and applications to ensure the competitiveness of its research and its industries."
Günther Oettinger, Digital Economy & Society Commissioner
"The transformational impact of excellent science in research and innovation", final plenary panel at the ICT: Innovate, Connect, Transform conference, 22 October 2015, Lisbon
66. BSC and the EC
"Europe needs to develop an entire domestic exascale stack, from the processor all the way to the system and application software."
Mateo Valero, Director of Barcelona Supercomputing Center
"The transformational impact of excellent science in research and innovation", final plenary panel at the ICT: Innovate, Connect, Transform conference, 22 October 2015, Lisbon, Portugal
67. Mont-Blanc HPC Stack for ARM
• Industrial applications
• Applications
• System software
• Hardware
68. BSC Accelerator (RISC-V ISA)
• 512 RISC-V cores in 64 clusters, 16 GF/core: 8 TF
• 4 HBM stacks (16 GB, 1 TB/s each): 64 GB @ 4 TB/s
• 16 custom SCM/Flash channels (1 TB, 25 GB/s each): 16 TB @ 0.4 TB/s
Per core:
• Vector unit: 2048-bit vectors, 512-bit ALU (4 clk/op), 1 GHz @ Vmin
• Out-of-order, 4-wide fetch: 64 KB I$, decoupled I$/BP, 2-level branch predictor, loop stream detector
• 4-wide rename/retire
• D$: 64 KB, 64 B/line, 128 in-flight misses, hardware prefetch
• 1 MB L2 per core; D$-to-L2 and L2-to-mesh: one 512-bit read plus one 512-bit write port each
• The cluster holds the snoop filter
[Package diagram: the compute die ("cyclone"), HBM stacks and Flash channels on an interposer over the package substrate; the on-die mesh mixes core (C), HBM (H), DDR (D) and accelerator (A) tiles]
69. Do We Need an …-type Consortium for HPC and Big Data?
A window of opportunity is open:
• Basic industrial and scientific know-how is available
• Excellent funding opportunities exist in H2020 at the European level and in the member states' structural funds
It's time to invest in large Flagship projects for HPC to gain critical mass.
HPC European strategy & innovation:
http://ec.europa.eu/commission/2014-2019/oettinger/blog/mateo-valero-director-barcelona-supercomputing-center_en
70. A New Era of Information Technology
The current infrastructure is sagging under its own weight.
Prof. Mateo Valero – Big Data
• Pervasive connectivity; in 60 seconds today: 400,710 ad requests; 2,000 lyrics played on Tunewiki; 1,500 pings sent on PingMe; 208,333 minutes of Angry Birds played; 23,148 apps downloaded; 98,000 tweets
• Smart device expansion: 30 billion devices by 2020 (1)
• Explosion of information: 40 trillion GB of data... for 8 billion people (2)(3)
• 10 million mobile apps (4)
• The Internet of Things
Sources: (1) IDC Directions 2013: Why the Datacenter of the Future Will Leverage a Converged Infrastructure, March 2013, Matt Eastwood; (2) & (3) IDC Predictions 2012: Competing for 2020, Document 231720, December 2011, Frank Gens; (4) http://en.wikipedia.org
71. The Data Deluge
Digital universe size (source: IDC): 0.1 ZB in 2005, 1.2 ZB in 2010, 2.8 ZB in 2012, 8.5 ZB in 2015, and 40 ZB in 2020 (a figure that exceeds prior forecasts by 5 ZB).
72. How Big Is Big?
• Megabyte (10^6)
• Gigabyte (10^9)
• Terabyte (10^12): 500 TB of new data per day are ingested in Facebook databases
• Petabyte (10^15): the CERN Large Hadron Collider generates 1 PB per second
• Exabyte (10^18): 1 EB of data is created on the internet each day (250 million DVDs worth of information); the proposed Square Kilometer Array telescope will generate an EB of data per day
• Zettabyte (10^21): 1.3 ZB of network traffic by 2016
• Yottabyte (10^24): our digital universe today (250 trillion DVDs)
• Brontobyte (10^27): this will be our digital universe tomorrow...
• Geopbyte (10^30): this will take us beyond our decimal system
And beyond: Saganbyte, Jotabyte, ...
73. Higgs and Englert's Nobel Prize for Physics 2013
Last year one of the most computer-intensive scientific experiments ever undertaken confirmed Peter Higgs and François Englert's theory by making the Higgs boson (the so-called "God particle") in an $8bn atom smasher, the Large Hadron Collider at CERN outside Geneva.
"The LHC produces 600 TB/sec... and after filtering needs to store 25 PB/year"... 15 million sensors...
74. Big Data in Biology
• High resolution imaging
• Clinical records
• Simulations
• Omics
75. Sequencing Costs
Source: National Human Genome Research Institute (NHGRI), http://www.genome.gov/sequencingcosts/
(1) "Cost per Megabase of DNA Sequence": the cost of determining one megabase (Mb; a million bases) of DNA sequence of a specified quality.
(2) "Cost per Genome": the cost of sequencing a human-sized genome. For each, a graph is provided showing the data since 2001.
In both graphs, the data from 2001 through October 2007 represent the costs of generating DNA sequence using Sanger-based chemistries and capillary-based instruments ('first-generation' sequencing platforms). Beginning in January 2008, the data represent the costs of generating DNA sequence using 'second-generation' (or 'next-generation') sequencing platforms. The change in instruments represents the rapid evolution of DNA sequencing technologies in recent years.
76. Cognitive Computing
Collaboration agreement to jointly promote the development of advanced deep-learning systems with applications to banking services.
77. Example: Cognitive Computing Is Already in Business
• In 2011 the IBM Watson computer defeated two of Jeopardy!'s greatest champions.
• Since then, the Watson supercomputer has become 24 times faster and smarter, 90% smaller, with a 2,400% improvement in performance.
• The Watson Group has collaborated with partners to build 6,000 apps.
78. Neural Networks
• A computational model in computer science based on a collection of simple neural units.
• Each neural unit is connected to many others; the strengths of these connections are expressed as weights.
• Neural units compute summation functions.
• NNs are self-learning and can be trained.
• NNs are particularly good at feature detection.
• In practice, NNs can be expressed in terms of matrix-matrix multiplications.
82. BSC Strategy for Artificial Intelligence
Application domains: social & personal data, organ simulation, Earth sciences, industrial CASE apps, medical imaging, genomic analytics, text analytics; precision medicine and other domains.
Stack:
• Programming models and runtimes (PyCOMPSs, TIRAMISU, interoperability with current approaches)
• Data models and algorithms (approximate computing / reduced precision, adaptive layers, DL/graph analytics, ...)
• Data platforms + standards
• HW acceleration of DL workloads (novel architectures for NNs, FPGA acceleration)
• Projects with public/private institutions and companies
83. NVIDIA Tesla P4 and P40 GPUs (2016)
Tesla P4:
• CUDA cores: 2560 @ 1063 MHz
• Peak single precision: 5.5 TFLOPS
• Peak INT8: 22 TOPS
• Low precision: 8-bit dot product with 32-bit accumulate
• VRAM: 8 GB GDDR5 @ 192 GB/s
• TDP: ~75 W
Tesla P40:
• CUDA cores: 3840 @ 1531 MHz
• Peak single precision: 12.0 TFLOPS
• Peak INT8: 47 TOPS
• Low precision: 8-bit dot product with 32-bit accumulate
• VRAM: 24 GB GDDR5 @ 346 GB/s
• TDP: ~250 W
Source: NVIDIA
84. Google Tensor Processing Unit (2015, published 2017)
• 34 GB/s off-chip memory bandwidth
• 28 MB on-chip memory
• Frequency: 700 MHz
• TDP: 75 W
• Matrix Multiply Unit:
• 256x256 MAC units
• 8-bit multiplies and adds, 32-bit accumulators
• Peak throughput: 92 TOPS
• Power efficiency: 132 GOPS/W
• GPUs for training, TPUs for inference
• Gameplay to beat the world Go champion
• Used internally at Google for Streetview and the RankBrain search optimizer
Source: Google
85. Nervana's Lake Crest Deep Learning Architecture (2017)
• The Lake Crest chip will operate as a Xeon co-processor
• Tensor-based (i.e. dense linear algebra computations)
• 4 8-GB HBM2 stacks on the same chip interposer @ 1 TB/s, each with its own memory controller
• 12 Inter-Chip Links (ICL), 20x faster than PCI
• 12 computing nodes featuring several cores
• Intel's new "Flexpoint" architecture within the nodes: enables a 10x ILP increase and low power consumption
Source: elektroniknet
87. Quantitative Analysis of Deep Learning
Architectures
• Each N-neuron Hidden Layer (HL) requires a NxN GEMM
• 2D NxN Systolic Array carries out NxN GEMM in 2N+1 cycles.
[Charts: seconds per GEMM, memory bandwidth (GB/s), and throughput (OP/s) versus matrix size, each plotted for 166 MHz, 1000 MHz, 1500 MHz, and 2500 MHz clocks]
An HL of ~1024 neurons can identify simple images:
– 28x28 pixel images
– Each image contains a digit 0-9 (e.g., MNIST)
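The 2N+1-cycle cost model above generates the quantities plotted on these slides (seconds per GEMM, effective OP/s) in a few lines:

```python
# The slide's cost model: an NxN systolic array completes an NxN
# GEMM in 2N+1 cycles, so time per GEMM is (2N+1)/frequency and
# effective throughput is 2*N^3 ops (mul + add) over that time.

def gemm_time(n, freq_hz):
    """Seconds per NxN GEMM on an NxN systolic array."""
    return (2 * n + 1) / freq_hz

def gemm_ops_per_s(n, freq_hz):
    """Effective OP/s, counting 2*N^3 ops per GEMM."""
    return 2 * n ** 3 / gemm_time(n, freq_hz)

for n in (1024, 4096):
    for mhz in (166, 1000, 2500):
        t = gemm_time(n, mhz * 1e6)
        print(f"N={n} @ {mhz} MHz: {t:.2e} s/GEMM, "
              f"{gemm_ops_per_s(n, mhz * 1e6):.2e} OP/s")
```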
88. Quantitative Analysis of Deep Learning Architectures
• Each N-neuron Hidden Layer (HL) requires an NxN GEMM
• A 2D NxN systolic array carries out an NxN GEMM in 2N+1 cycles.
[Charts: seconds per GEMM, memory bandwidth (GB/s), and throughput (OP/s) versus matrix size, each plotted for 166 MHz, 1000 MHz, 1500 MHz, and 2500 MHz clocks]
An HL of ~4096 neurons can identify images containing a single concept:
– 32x32 pixel images
– Each image is classified into categories like “ship”, “cat” or “deer” (e.g., CIFAR-10)
89. Quantitative Analysis of Deep Learning Architectures
• Each N-neuron Hidden Layer (HL) requires an NxN GEMM
• A 2D NxN systolic array carries out an NxN GEMM in 2N+1 cycles.
[Charts: seconds per GEMM, memory bandwidth (GB/s), and throughput (OP/s) versus matrix size, each plotted for 166 MHz, 1000 MHz, 1500 MHz, and 2500 MHz clocks]
Lake Crest’s memory bandwidth (~TB/s) targets very large HLs with O(10,000-100,000) neurons. Such networks are used for complex image analysis.
90. BSC Proposal for Deep Learning
16 2D systolic arrays (4096x4096 @ 1 GHz): 134 TOP/s
4 HBM stacks (16 GB @ 1 TB/s each): 64 GB @ 4 TB/s
DDR5 SDRAM (384 GB @ 180 GB/s): 384 GB @ 0.18 TB/s
[Block diagram: a general-purpose SoC and a systolic-array SoC connected via switches on an interposer; four HBM stacks, each with its own memory controller, plus off-package DDR channels]
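The aggregate memory figures on the proposal slide can be checked directly:

```python
# Aggregating the proposal's memory subsystem: four HBM stacks of
# 16 GB at 1 TB/s each, versus a 0.18 TB/s DDR channel.

stacks = 4
gb_per_stack, tbps_per_stack = 16, 1.0

total_gb = stacks * gb_per_stack       # 64 GB on-package
total_tbps = stacks * tbps_per_stack   # 4 TB/s aggregate
print(total_gb, total_tbps, total_tbps / 0.18)  # HBM is ~22x DDR bandwidth
```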
91. Human Brain Project
• 10-year, €1,000M FET flagship project
• Goal: to pull together all existing knowledge about the human brain and to reconstruct the brain in supercomputer-based models and simulations.
Expected outcomes: new treatments for brain disease and new brain-like computing technologies.
BSC role: provision and optimisation of programming models to allow simulations to be developed efficiently. MareNostrum is part of the HPC platform for simulations.
93. IBM TrueNorth Processor
• 64*64=4096 cores
• 256 neurons/core, 64K synapses/core
• 104Kb/core memory
• 65Kb for synapse states
• 32Kb for neuron states/parameters
• 6Kb for router destination addresses
• 1Kb for axonal delays
• 20 mW/cm² power density
• 72 mW at 0.75 V
• 46 billion SOPS/W (Synaptic Operations Per Second), typical
• 400 billion SOPS/W, max
• Compared to a state-of-the-art supercomputer at 4.5 billion FLOPS/W
Source: Science magazine
94. View from Europe: Heidelberg HICANN
• Wafer-scale analogue neuromorphic system
• 8” 180nm wafer:
• 200,000 neurons
• 50M synapses
• 10^4x (10,000x) faster than biology
95. Quantum Computing: Brave New World of Post-Moore Architecture
Quantum Processors
– D-Wave
– IBM
– Microsoft
– Google
– View from Europe: Delft University Prototypes
96. D-Wave Quantum Processor
Environment colder than space
Leverages superconducting quantum effects
1000 qubits, 128K Josephson junctions
Installed at NASA, Google, UCSB
10^8x faster than a Quantum Monte Carlo algorithm on a single core*
* Source: Denchev et al., “What is the Computational Value of Finite-Range Tunneling?”, Phys. Rev. X, August 2016
97. IBM
Building a “universal quantum computer”
Developed a quantum computing API to make developing quantum applications easier
Promotes experimentation on a publicly available 5-qubit quantum processor
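As an illustration of the kind of experiment one submits to such a processor, here is a self-contained, hypothetical sketch (no vendor SDK assumed) that prepares a 2-qubit Bell state, the “hello world” of quantum circuits, using plain statevector math:

```python
# Minimal statevector simulation of a Bell-state circuit:
# H on qubit 0, then CNOT(control=0, target=1).

import math

def apply_h(state, q):
    """Hadamard on qubit q of a little-endian statevector."""
    s = 1 / math.sqrt(2)
    new = [0.0] * len(state)
    for i, amp in enumerate(state):
        j = i ^ (1 << q)              # basis index with qubit q flipped
        if (i >> q) & 1 == 0:         # |0> -> (|0> + |1>) / sqrt(2)
            new[i] += s * amp
            new[j] += s * amp
        else:                          # |1> -> (|0> - |1>) / sqrt(2)
            new[i] -= s * amp
            new[j] += s * amp
    return new

def apply_cnot(state, ctrl, tgt):
    """Flip the target qubit wherever the control qubit is 1."""
    new = [0.0] * len(state)
    for i, amp in enumerate(state):
        j = i ^ (1 << tgt) if (i >> ctrl) & 1 else i
        new[j] += amp
    return new

state = [1.0, 0.0, 0.0, 0.0]          # start in |00>
state = apply_cnot(apply_h(state, 0), 0, 1)
probs = [a * a for a in state]
print(probs)  # ~[0.5, 0, 0, 0.5]: only |00> and |11> are ever measured
```

The correlated outcomes (always 00 or 11, never 01 or 10) are the entanglement signature such public processors let anyone reproduce on real hardware.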
98. Microsoft and Google
Microsoft is looking into topological quantum computing in its global “Station Q” research consortium
Microsoft has a “QuArC” lab working actively on quantum computer architecture in Redmond
Google manufactured a 9-qubit quantum computer in its Quantum AI Lab
Google’s ambition is to produce a viable quantum computer in the next five years*
* Mohseni et al., “Commercialize quantum technologies in five years”, Nature (Comment), March 2017
99. View from Europe: Delft Quantum Prototypes
€50M grant from Intel
Building a hybrid CMOS/quantum processor
Working on algorithms, compilers, and architecture*
* Riesebos et al., “Pauli Frames for Quantum Computer Architectures”, DAC 2017 (to appear)