This document summarizes a presentation given by Jeffrey S. Vetter at an international symposium in Kobe on preparing for extreme heterogeneity in high performance computing. The presentation highlights that contemporary HPC systems provide evidence that power constraints are driving rapid changes to processor, node, memory, and I/O architectures. Applications will not be portable across these diverse new architectures, and programming models and performance prediction tools are needed to address this challenge. The presentation also discusses emerging technologies like FPGAs, GPUs, and non-volatile memory and the need for portable programming models to support heterogeneous processing.
In this ACM Tech Talk, Doug Kothe from ORNL presents: The Exascale Computing Project and the future of HPC.
"The mission of the US Department of Energy (DOE) Exascale Computing Project (ECP) was initiated in 2016 as a formal DOE project and extends through 2022. The ECP is designing the software infrastructure to enable the next generation of supercomputers—systems capable of more than 1018 operations per second—to effectively and efficiently run applications that address currently intractable problems of strategic importance. The ECP is creating and deploying an expanded and vertically integrated software stack on US Department of Energy (DOE) HPC exascale and pre-exascale systems, thereby defining the enduring US exascale ecosystem."
Watch the video: https://wp.me/p3RLHQ-kep
Learn more: https://www.exascaleproject.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Paul Messina presented this deck at the HPC User Forum in Austin. "The Exascale Computing Project (ECP) is a collaborative effort of two US Department of Energy (DOE) organizations – the Office of Science (DOE-SC) and the National Nuclear Security Administration (NNSA). As part of President Obama’s National Strategic Computing Initiative, ECP was established to develop a new class of high-performance computing systems that will be a thousand times more powerful than today’s petaflop machines. ECP’s work encompasses applications, system software, hardware technologies and architectures, and workforce development to meet the scientific and national security mission needs of DOE."
Watch the video presentation: http://wp.me/p3RLHQ-fIC
Learn more: http://insidehpc.com/ecp
Exascale Computing Project - Driving a HUGE Change in a Changing World (inside-BigData.com)
In this video from the OpenFabrics Workshop in Austin, Al Geist from ORNL presents: Exascale Computing Project - Driving a HUGE Change in a Changing World.
"In this keynote, Mr. Geist will discuss the need for future Department of Energy supercomputers to solve emerging data science and machine learning problems in addition to running traditional modeling and simulation applications. In August 2016, the Exascale Computing Project (ECP) was approved to support a huge lift in the trajectory of U.S. High Performance Computing (HPC). The ECP goals are intended to enable the delivery of capable exascale computers in 2022 and one early exascale system in 2021, which will foster a rich exascale ecosystem and work toward ensuring continued U.S. leadership in HPC. He will also share how the ECP plans to achieve these goals and the potential positive impacts for OFA."
Learn more: https://exascaleproject.org/
and
https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the HPC User Forum at Argonne, Doug Kothe from the Exascale Computing Project presents an ECP update.
"The Exascale Computing Project (ECP) is focused on accelerating the delivery of a capable exascale computing ecosystem that delivers 50 times more computational science and data analytic application power than possible with DOE HPC systems such as Titan (ORNL) and Sequoia (LLNL). With the goal to launch a US exascale ecosystem by 2021, the ECP will have profound effects on the American people and the world."
Watch the video: https://wp.me/p3RLHQ-kPG
Learn more: https://exascaleproject.org
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr... (Christian Plessl)
Numerous results in reconfigurable computing research suggest that FPGAs are able to deliver greatly improved performance or energy efficiency for many computationally demanding applications. This potential is being exploited by hyperscale cloud providers, which have recently deployed large-scale installations with FPGAs. In contrast, FPGAs have not had any significant impact on general purpose HPC installations so far.
In this presentation, I will try to shed some light on the reasons for this development and the apparent gap between the promise and reality for FPGAs in HPC. I will discuss what the reconfigurable computing research community can and needs to provide to attract more interest from HPC users and suppliers. To highlight practical challenges, I will share some of our experiences at the Paderborn Center for Parallel Computing, where we have recently commissioned two HPC testbed clusters with FPGAs and where we are currently planning to deploy FPGAs at a larger scale in our production HPC systems.
In this deck from the University of Houston CACDS HPC Workshop, Jeff Larkin from Nvidia presents: The Past, Present, and Future of OpenACC.
"OpenACC is an open specification for programming accelerators with compiler directives. It aims to provide a simple path for accelerating existing applications for a wide range of devices in a performance portable way. This talk with discuss the history and goals of OpenACC, how it is being used today, and what challenges it will address in the future."
Watch the video presentation: http://wp.me/p3RLHQ-dTm
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights cover the newly released PGI 19.7, the upcoming 2019 OpenACC Annual Meeting, GPU Bootcamp at RIKEN R-CCS, a complete schedule of GPU hackathons and more!
Semantic Web Technologies for Intelligent Engineering Applications (Marta Sabou)
Presentation at the closing event of the Christian Doppler Laboratory „Software Engineering Integration for Flexible Automation Systems“ (CDL-Flex) (http://cdl.ifs.tuwien.ac.at/).
Invited talk at SSSW'16 (http://sssw.org/2016/?page_id=232) introducing the Fourth Industrial Revolution and discussing how Semantic Web technologies can support this movement. Also a teaser for the upcoming Springer book "Semantic Web for Intelligent Engineering Applications" (http://www.springer.com/us/book/9783319414881).
This is a presentation by Prof. Anne Elster at the International Workshop on Open Source Supercomputing held in conjunction with the 2017 ISC High Performance Computing Conference.
Stay up-to-date with the OpenACC Monthly Highlights. July's edition covers the OpenACC Summit 2021, upcoming GPU Hackathons and Bootcamps, PEARC21 panel review, recent research, new resources and more!
The growing interest in FPGA-based solutions for accelerating compute-demanding algorithms is pushing the need for new tools and methods to improve productivity. High-Level Synthesis (HLS) tools already provide a handy way to describe FPGA-based hardware implementations starting from a software description of an algorithm. However, HLS directives improve the hardware design only from a computational perspective, requiring manual code restructuring when memory transfers need optimizing. This aspect limits the effectiveness of Design Space Exploration (DSE) approaches that only target HLS directives. Therefore, we present a comprehensive methodology to support the designer in the generation of optimal HLS-based hardware implementations. First, we propose an automated roofline model generation that directly operates on a C/C++ description of the target algorithm. The approach enables a fast evaluation of the operational intensity of the target function and visualizes the main bottlenecks of the current HLS implementation, providing guidance on how to improve it. Second, we introduce a DSE methodology for quickly evaluating different HLS directives to identify an optimal implementation. We report the DSE performance when running on the PolyBench test suite, outperforming previous automated solutions in the literature. Finally, we illustrate the process of accelerating a complex application, the N-body physics simulation algorithm, by means of our framework, achieving results comparable to bespoke state-of-the-art implementations.
Stay up-to-date with the OpenACC Monthly Highlights. July's edition covers the OpenACC Summit 2021, GCC, upcoming GPU Hackathons and Bootcamps, Sunita Chandrasekaran named as PI for SOLLVE Project, recent research and more!
Stay up-to-date with the OpenACC Monthly Highlights. June's edition covers the OpenACC Summit 2021, NVIDIA GTC'21 on-demand sessions, upcoming GPU Hackathons and Bootcamps, Intersect360 Research HPC market forecast, recent research, new resources and more!
The rush to the edge and new applications around AI are causing a shift in design strategies toward the highest performance per watt, rather than the highest performance or lowest power.
In this deck from the 2018 Rice Oil & Gas Conference, Doug Kothe from ORNL provides an update on the Exascale Computing Project.
“The quest to develop a capable exascale ecosystem is a monumental effort that requires the collaboration of government, academia, and industry. Achieving exascale will have profound effects on the American people and the world—improving the nation’s economic competitiveness, advancing scientific discovery, and strengthening our national security.”
Watch the video: https://wp.me/p3RLHQ-idv
Learn more: https://www.exascaleproject.org/
and
http://rice2018oghpc.rice.edu/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this video from the Argonne Training Program on Extreme-Scale Computing 2019, Jeffrey Vetter from ORNL presents: The Coming Age of Extreme Heterogeneity.
"In this talk, I'm going to talk about the high-level trends guiding our industry. Moore’s Law as we know it is definitely ending for either economic or technical reasons by 2025. Our community must aggressively explore emerging technologies now!"
Watch the video: https://wp.me/p3RLHQ-lic
Learn more: https://ft.ornl.gov/~vetter/
and
https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Overview of the Exascale Additive Manufacturing Project (inside-BigData.com)
In this video from the HPC User Forum in Santa Fe, John Turner from ORNL presents: Overview of the Exascale Additive Manufacturing Project.
"Fully exploiting future exascale architectures will require a rethinking of the algorithms used in the large scale applications that advance many science areas vital to DOE and NNSA, such as global climate modeling, turbulent combustion in internal combustion engines, nuclear reactor modeling, additive manufacturing, subsurface flow, and national security applications. The newly established Center for Efficient Exascale Discretizations (CEED) in DOE’s Exascale Computing Project (ECP) aims to help these DOE/NNSA applications to take full advantage of exascale hardware by using state-of-the-art ‘high-order discretizations’ that provide an order of magnitude performance improvement over traditional methods."
Watch the video: http://wp.me/p3RLHQ-gHb
Learn more: https://exascaleproject.org/
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI (inside-BigData.com)
Satoshi Matsuoka from RIKEN gave this talk at the HPC User Forum in Santa Fe.
"With rapid rise and increase of Big Data and AI as a new breed of high-performance workloads on supercomputers, we need to accommodate them at scale, and thus the need for R&D for HW and SW Infrastructures where traditional simulation-based HPC and BD/AI would converge, in a BYTES-oriented fashion. Post-K is the flagship next generation national supercomputer being developed by Riken and Fujitsu in collaboration. Post-K will have hyperscale class resource in one exascale machine, with well more than 100,000 nodes of sever-class A64fx many-core Arm CPUs, realized through extensive co-design process involving the entire Japanese HPC community.
Rather than to focus on double precision flops that are of lesser utility, rather Post-K, especially its Arm64fx processor and the Tofu-D network is designed to sustain extreme bandwidth on realistic applications including those for oil and gas, such as seismic wave propagation, CFD, as well as structural codes, besting its rivals by several factors in measured performance. Post-K is slated to perform 100 times faster on some key applications c.f. its predecessor, the K-Computer, but also will likely to be the premier big data and AI/ML infrastructure. Currently, we are conducting research to scale deep learning to more than 100,000 nodes on Post-K, where we would obtain near top GPU-class performance on each node."
Watch the video: https://wp.me/p3RLHQ-k6G
Learn more: https://en.wikichip.org/wiki/supercomputers/post-k
and
http://hpcuserforum.com
In this deck from the HPC User Forum in Austin, Yutaka Ishikawa from Riken AICS presents: Japan's post K Computer.
Watch the video presentation: http://wp.me/p3RLHQ-fJ6
Learn more: http://hpcuserforum.com
Paul Messina from Argonne presented this deck at the HPC User Forum in Santa Fe.
"The Exascale Computing Project (ECP) was established with the goals of maximizing the benefits of high-performance computing (HPC) for the United States and accelerating the development of a capable exascale computing ecosystem. Exascale refers to computing systems at least 50 times faster than the nation’s most powerful supercomputers in use today.The ECP is a collaborative effort of two U.S. Department of Energy organizations – the Office of Science (DOE-SC) and the National Nuclear Security Administration (NNSA)."
Watch the video: http://insidehpc.com/2017/04/update-exascale-computing-project-ecp/
Learn more: https://exascaleproject.org/
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the HPC User Forum at Argonne, Andrew Siegel from Argonne presents: ECP Application Development.
"The Exascale Computing Project is accelerating delivery of a capable exascale computing ecosystem for breakthroughs in scientific discovery, energy assurance, economic competitiveness, and national security. ECP is chartered with accelerating delivery of a capable exascale computing ecosystem to provide breakthrough modeling and simulation solutions to address the most critical challenges in scientific discovery, energy assurance, economic competitiveness, and national security. This role goes far beyond the limited scope of a physical computing system. ECP’s work encompasses the development of an entire exascale ecosystem: applications, system software, hardware technologies and architectures, along with critical workforce development."
Watch the video: https://wp.me/p3RLHQ-kSL
Learn more: https://www.exascaleproject.org
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the HPC User Forum in Milwaukee, Bob Sorensen from Hyperion Research describes an ongoing study on the Development Trends of Next-Generation Supercomputers.
Project Requirements:
* Gather information on pre-exascale and exascale systems today and through 2028
* Concentrate on major HPC developer countries: US, China, EU, Japan, others?
* Build database of technical information on the research and development efforts on these next-generation machines
* Collect information on the flow of funding (amount from the country to the companies, etc.)
Hyperion Research is the new name for the former IDC high performance computing (HPC) analyst team. As Hyperion Research, we continue all the worldwide activities that spawned the world’s most respected HPC industry analyst group. For more than 25 years, we’ve helped IT professionals, business executives, and the investment community make fact-based decisions on technology purchases and business strategy.
Watch the video: https://wp.me/p3RLHQ-hlY
Learn more: http://www.hpcathyperion.com/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This lecture aims to give some food for thought regarding how current High Performance Computing systems (hardware and software) tend to merge with Big Data ones (Machine Learning, Analytics, and Enterprise workloads) in order to meet both workloads' demands while sharing the same clusters.
Case Study: Credit Card Core System with Exalogic, Exadata, Oracle Cloud Mach... (Hirofumi Iwasaki)
To increase business opportunities, financial industry companies require the power, flexibility, and scalability of the latest enterprise technologies for their 24/7 services. Rakuten Card, one of the largest credit card companies in Japan, recently renewed their credit card core processing systems utilizing Java EE. Among the myriad of available technologies, why did we choose Exalogic and Exadata, with a distributed Apache Spark configuration? How did we port from one of the oldest COBOL-based mainframes in Japan? What were the key success factors in launching and operating this mission-critical service? This session unveils our great results and how our selections are effective for financial enterprise systems.
High Performance Computing - A key factor for the competitiveness of the country, ... (Igor José F. Freitas)
Video: https://www.youtube.com/watch?v=8cFqNwhQ7uE
A key factor for the competitiveness of the country, science, and industry.
Talk given during Intel Innovation Week 2015.
05 Preparing for Extreme Heterogeneity in HPC
1. Preparing for Extreme Heterogeneity in High Performance Computing
Jeffrey S. Vetter, with many contributions from the FTG Group and colleagues
R-CCS International Symposium, Kobe, 18 Feb 2020
ORNL is managed by UT-Battelle, LLC for the US Department of Energy
http://ft.ornl.gov | vetter@computer.org
2. Highlights
• Recent trends in extreme-scale HPC paint an ambiguous future
– Contemporary systems provide evidence that power constraints are driving architectures to change rapidly
– Multiple architectural dimensions are being (dramatically) redesigned: processors, node design, memory systems, I/O
– Complexity is our main challenge
• Applications and software systems are all reaching a state of crisis
– Applications will not be functionally or performance portable across architectures
– Programming and operating systems need major redesign to address these architectural changes
– Procurements, acceptance testing, and operations of today’s new platforms depend on performance prediction and benchmarking
• We need portable programming models and performance prediction now more than ever!
– Heterogeneous processing
• OpenACC -> FPGAs
• Intelligent runtime system (IRIS)
• Clacc – OpenACC support in LLVM (not covered today)
• OpenACC dialect of MLIR for Flang Fortran (not covered today)
– Emerging memory hierarchies (NVM)
• DRAGON – transparent NVM access from GPUs (not covered today)
• NVL-C – user management of nonvolatile memory in C (not covered today)
• Papyrus – parallel aggregate persistent storage (not covered today)
• Performance prediction is critical for design and optimization (not covered today)
4. History
Q: Think back 10 years. How many of you would have predicted that many of our top HPC systems would be GPU-based architectures?
• Yes
• No
• Revisionists
5. Future
Q: Think forward 10 years. How many of you predict that most of our top HPC systems will have the following architectural features? (Assume a general purpose multicore CPU.)
• GPU
• FPGA/Reconfigurable processor
• Neuromorphic processor
• Deep learning processor
• Quantum processor
• RISC-V processor
• Some new unknown processor
• All/some of the above in one SoC
6. Implications
Q: Now, imagine you are building a new application with an expected ~3M LOC and 20 team members over the next 10 years. What on-node programming model/system do you use?
• C, C++ XX, Fortran XX
• Metaprogramming, etc. (e.g., AMP, Kokkos, RAJA, SYCL)
• CUDA, cu***, HIP, OpenCL
• Directives: OpenMP XX, OpenACC XX
• R, Python, Matlab, etc.
• A Domain Specific Language (e.g., Claw, PySL)
• A Domain Specific Framework (e.g., PETSc)
• Some new unknown programming approach
• All/some of the above
7. The FTG Vision
Applications: Science and Engineering (e.g., CFD, Materials, Fusion), Streaming (e.g., SW Radio, experimental instruments), Sensing (e.g., SAR, vision), Deep Learning (e.g., CNN), Analytics (e.g., graphs), Robotics (e.g., sense and react)
Programming Systems: Compilers, Domain Specific Languages, Just-in-time Compilation, Metaprogramming, Scripting, Libraries, Autotuning
Runtime and Operating Systems: Discovery, Task Scheduling and Mapping, Data Orchestration, I/O, Synchronization, Load Balancing
Architectures: Multicore CPU, GPU, FPGA, AI Accelerator, SoC, DSP, Deep Memory, Persistent Memory, Neuromorphic
Cross-cutting goals: Performance, Productivity, Energy Efficiency
8. The FTG Vision | Applications
(Same FTG vision stack as slide 7, with the Applications layer highlighted: Science and Engineering, Streaming, Sensing, Deep Learning, Analytics, Robotics.)
9. ECP applications target national problems in 6 strategic areas
National security: stockpile stewardship; next-generation electromagnetics simulation of hostile environments and virtual flight testing for hypersonic re-entry vehicles.
Energy security: turbine wind plant efficiency; high-efficiency, low-emission combustion engine and gas turbine design; materials design for the extreme environments of nuclear fission and fusion reactors; design and commercialization of Small Modular Reactors; subsurface use for carbon capture, petroleum extraction, and waste disposal; scale-up of clean fossil fuel combustion; biofuel catalyst design.
Scientific discovery: find, predict, and control materials and properties; cosmological probe of the standard model of particle physics; validate fundamental laws of nature; demystify the origin of the chemical elements; light source-enabled analysis of protein and molecular structure and design; whole-device model of magnetically confined fusion plasmas.
Earth system: accurate regional impact assessments in Earth system models; stress-resistant crop analysis and catalytic conversion of biomass-derived alcohols; metagenomics for analysis of biogeochemical cycles, climate change, and environmental remediation.
Economic security: additive manufacturing of qualifiable metal parts; reliable and efficient planning of the power grid; seismic hazard risk assessment; urban planning.
Health care: accelerate and translate cancer research.
https://exascaleproject.org/
10. DARPA Domain Specific System on Chip Program is investigating Performance Portability of Software Defined Radio
• Signal processing: an open-source implementation of IEEE-802.11 WiFi a/b/g with GR OOT modules
• Input/Output file support via Socket PDU (UDP server) blocks
• Image/Video transcoding with OpenCL/OpenCV
(Diagram: video/image files feed a GR IEEE-802.11 transmit (TX) chain on Xavier SoC #1, which sends UDP packets over the antenna to an IEEE-802.11 receive (RX) chain on Xavier SoC #2.)
11. The FTG Vision | Architectures
(Same FTG vision stack as slide 7, with the Architectures layer highlighted: Multicore CPU, GPU, FPGA, AI Accelerator, SoC, DSP, Deep Memory, Persistent Memory, Neuromorphic.)
12. Contemporary devices are approaching fundamental limits
I.L. Markov, “Limits on fundamental limits to computation,” Nature, 512(7513):147-54, 2014, doi:10.1038/nature13570.
Economist, Mar 2016
R.H. Dennard, F.H. Gaensslen, V.L. Rideout, E. Bassous, and A.R. LeBlanc, “Design of ion-implanted MOSFET's with very small physical dimensions,” IEEE Journal of Solid-State Circuits, 9(5):256-68, 1974.
Dennard scaling has already ended. Dennard observed that voltage and current should be proportional to the linear dimensions of a transistor: 2x the transistor count implies 40% faster and 50% more efficient.
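As a rough sketch of the arithmetic behind those figures (my gloss of classic Dennard scaling, not text from the slide): if every linear dimension and the supply voltage shrink by a factor $\kappa \approx \sqrt{2}$ per generation, then
\[
\text{area per transistor} \propto \tfrac{1}{\kappa^{2}} \Rightarrow 2\times \text{transistors per die},
\qquad
f \propto \kappa \approx 1.4 \Rightarrow \sim 40\%\ \text{faster},
\]
\[
P_{\text{transistor}} \propto C\,V^{2} f \propto \tfrac{1}{\kappa}\cdot\tfrac{1}{\kappa^{2}}\cdot\kappa
 = \tfrac{1}{\kappa^{2}} \Rightarrow \sim 50\%\ \text{less power per transistor, i.e., constant power density}.
\]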
13. End of Moore’s Law: what’s your prediction?
Economist, Mar 2016
“The number of people predicting the death of Moore’s Law doubles every two years.” – Peter Lee, Microsoft
15. Sixth Wave of Computing
http://www.kurzweilai.net/exponential-growth-of-computing
(Chart of the exponential growth of computing, annotated with the current transition period and the coming 6th wave.)
16. Predictions for Transition Period
• Optimize Software and Expose New Hierarchical Parallelism: redesign software to boost performance on upcoming architectures; exploit new levels of parallelism and efficient data movement.
• Architectural Specialization and Integration: use CMOS more effectively for specific workloads; integrate components to boost performance and eliminate inefficiencies; workload-specific memory+storage system design.
• Emerging Technologies: investigate new computational paradigms (quantum, neuromorphic, advanced digital, emerging memory devices).
17. Predictions for Transition Period (same three themes as slide 16)
18. Predictions for Transition Period (same three themes as slide 16)
19. Quantum computing: qubit design and fabrication have made recent progress but still face challenges
Science 354, 1091 (2016), 2 December
http://nap.edu/25196
20. Fun question: when was the field effect transistor patented?
https://www.edn.com/electronics-blogs/edn-moments/4422371/Lilienfeld-patents-field-effect-transistor--October-8--1926
21. Predictions for Transition Period (same three themes as slide 16)
22. Pace of Architectural Specialization is Quickening
• Industry, lacking Moore’s Law, will need to continue to differentiate products (to stay in business)
– Use the same transistors differently to enhance performance
• Architectural design will become extremely important, even critical
– Dark silicon
– Address new parameters for benefits/curse of Moore’s Law
• 50+ new companies focusing on hardware for machine learning
Examples and sources:
https://www.thebroadcastbridge.com/content/entry/1094/altera-announces-arria-10-2666mbps-ddr4-memory-fpga-interface
http://www.wired.com/2016/05/google-tpu-custom-chips/
D.E. Shaw, M.M. Deneroff, R.O. Dror et al., “Anton, a special-purpose machine for molecular dynamics simulation,” Communications of the ACM, 51(7):91-7, 2008.
http://www.theinquirer.net/inquirer/news/2477796/intels-nervana-ai-platform-takes-aim-at-nvidias-gpu-techology
https://fossbytes.com/nvidia-volta-gddr6-2018/
Xilinx ACAP (Hot Chips 2018)
23. Analysis of Apple A-* SoCs
http://vlsiarch.eecs.harvard.edu/accelerators/die-photo-analysis
24. Intel Stratix 10 FPGA
Experimental Computing Lab (ExCL) managed by the ORNL Future Technologies Group
• Intel Stratix 10 FPGA and four banks of DDR4 external memory
– Board configuration: Nallatech 520 Network Acceleration Card
• Up to 10 TFLOPS of peak single precision performance
• 25 MBytes of L1 cache @ up to 94 TBytes/s peak bandwidth
• 2X core performance gains over Arria® 10
• Quartus and OpenCL software (Intel SDK v18.1) for using the FPGA
• Provides researchers access to an advanced FPGA/SoC environment
Mar 2019. For more information or to apply for an account, visit https://excl.ornl.gov/
25. NVIDIA Jetson AGX Xavier SoC
Experimental Computing Lab (ExCL) managed by the ORNL Future Technologies Group
• NVIDIA Jetson AGX Xavier: high-performance system on a chip for autonomous machines
• Heterogeneous SoC contains:
– Eight-core 64-bit ARMv8.2 CPU cluster (Carmel)
– 1.4 CUDA TFLOPS (FP32) GPU with additional inference optimizations (Volta)
– 11.4 DL TOPS (INT8) deep learning accelerator (NVDLA)
– 1.7 CV TOPS (INT8) 7-slot VLIW dual-processor vision accelerator (PVA)
– A set of multimedia accelerators (stereo, LDC, optical flow)
• Provides researchers access to an advanced high-performance SoC environment
Mar 2019. For more information or to apply for an account, visit https://excl.ornl.gov/
29. Summary: Transition Period will be Disruptive – Opportunities and Pitfalls Abound
• New devices and architectures may not be hidden in traditional levels of abstraction
• Examples
– A new type of CNT transistor may be completely hidden from higher levels
– A new paradigm like quantum may require new architectures, programming models, and algorithmic approaches

Layer       | Switch, 3D | NVM | Approximate | Neuro | Quantum
Application |     1      |  1  |      2      |   2   |    3
Algorithm   |     1      |  1  |      2      |   3   |    3
Language    |     1      |  2  |      2      |   3   |    3
API         |     1      |  2  |      2      |   3   |    3
Arch        |     1      |  2  |      2      |   3   |    3
ISA         |     1      |  2  |      2      |   3   |    3
Microarch   |     2      |  3  |      2      |   3   |    3
FU          |     2      |  3  |      2      |   3   |    3
Logic       |     3      |  3  |      2      |   3   |    3
Device      |     3      |  3  |      2      |   3   |    3
Adapted from IEEE Rebooting Computing chart
30. Department of Energy (DOE) Roadmap to Exascale Systems
An impressive, productive lineup of accelerated node systems supporting DOE’s mission.
(Roadmap chart, Jan 2018. 2012: Titan (9) at ORNL (Cray/AMD/NVIDIA), Mira (21) at ANL (IBM BG/Q), Sequoia (10) at LLNL (IBM BG/Q). 2016: Trinity (6) at LANL/SNL (Cray/Intel Xeon/KNL), Theta (24) at ANL (Cray/Intel KNL), Cori (12) at LBNL (Cray/Intel Xeon/KNL). 2018: Summit (1) at ORNL (IBM/NVIDIA), Sierra (2) at LLNL (IBM/NVIDIA). 2020: NERSC-9 Perlmutter at LBNL (Cray/AMD/NVIDIA). 2021-2023, the first U.S. exascale systems: Aurora at ANL (Intel/Cray), Frontier at ORNL (AMD/Cray), plus systems at LLNL (TBD) and LANL/SNL (TBD). Pre-exascale systems: aggregate Linpack (Rmax) = 323 PF. Trends: heterogeneous cores, deep memory including NVM, plateauing I/O performance.)
31. Frontier Continues the Accelerated Node Design
• Partnership between ORNL, Cray, and AMD
• The Frontier system will be delivered in 2021
• Peak performance greater than 1.5 EF
• Composed of more than 100 Cray Shasta cabinets
– Connected by Slingshot™ interconnect with adaptive routing, congestion control, and quality of service
• Accelerated node architecture:
– One purpose-built AMD EPYC™ processor
– Four HPC- and AI-optimized Radeon Instinct™ GPU accelerators
– Fully connected with high-speed AMD Infinity Fabric links
– Coherent memory across the node
– 100 GB/s injection bandwidth
– Near-node NVM storage
32. Comparison of Titan, Summit, and Frontier Systems

System Specs         | Titan                                    | Summit                                   | Frontier
Peak                 | 27 PF                                    | 200 PF                                   | ~1.5 EF
# cabinets           | 200                                      | 256                                      | > 100
Node                 | 1 AMD Opteron CPU, 1 NVIDIA Kepler GPU   | 2 IBM POWER9™ CPUs, 6 NVIDIA Volta GPUs  | 1 AMD EPYC CPU, 4 AMD Radeon Instinct GPUs
On-node interconnect | PCI Gen2, no coherence across the node   | NVIDIA NVLink, coherent memory across the node | AMD Infinity Fabric, coherent memory across the node
System interconnect  | Cray Gemini network, 6.4 GB/s            | Mellanox dual-port EDR IB network, 25 GB/s | Cray four-port Slingshot network, 100 GB/s
Topology             | 3D torus                                 | Non-blocking fat tree                    | Dragonfly
Storage              | 32 PB, 1 TB/s, Lustre filesystem         | 250 PB, 2.5 TB/s, IBM Spectrum Scale™ with GPFS™ | 2-4x performance and capacity of Summit’s I/O subsystem
On-node NVM          | No                                       | Yes                                      | Yes
Power                | 9 MW                                     | 13 MW                                    | 29 MW
34. During this Sixth Wave transition, complexity is our major challenge!
Design: How do we design future systems so that they are better than current systems on important applications?
• Simulation and modeling are more difficult
• Entirely possible that the new system will be slower than the old system!
• Expect ‘disaster’ procurements
Programmability: How do we design applications with some level of performance portability?
• Software lasts much longer than transient hardware platforms
• Proper abstractions for flexibility and efficiency
• Adapt or die
35. The FTG Vision | Programming Systems
(Same FTG vision stack as slide 7, with the Programming Systems layer highlighted: Compilers, Domain Specific Languages, Just-in-time Compilation, Metaprogramming, Scripting, Libraries, Autotuning.)
37. Directive-based Strategy with OpenARC: Open Accelerator Research Compiler
• Open-sourced, High-Level Intermediate Representation (HIR)-based, extensible compiler framework.
– Performs source-to-source translation from OpenACC C to target accelerator models.
• Supports the full feature set of OpenACC V1.0 (plus array reductions and function calls)
• Supports both CUDA and OpenCL as target accelerator models
– Provides common runtime APIs for various back-ends
– Can be used as a research framework for various studies on directive-based accelerator computing.
• Built on top of the Cetus compiler framework, equipped with various advanced analysis/transformation passes and built-in tuning tools.
• OpenARC’s IR provides an AST-like syntactic view of the source program, making it easy to understand, access, and transform the input program.
S. Lee and J.S. Vetter, “OpenARC: Open Accelerator Research Compiler for Directive-Based, Efficient Heterogeneous Computing,” in ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC). Vancouver: ACM, 2014.
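To make the directive-based approach concrete, here is a minimal, illustrative OpenACC C loop of the kind such a source-to-source compiler translates into a device kernel plus runtime calls; the example is mine, not taken from the slides:

  #include <stdio.h>
  #define N 1024

  int main(void) {
      float a[N], b[N];
      for (int i = 0; i < N; i++) a[i] = (float)i;

      /* The annotated loop becomes a CUDA or OpenCL kernel after
       * source-to-source translation; the copyin/copyout clauses become
       * host-device data-transfer calls in the generated runtime code. */
      #pragma acc parallel loop copyin(a) copyout(b)
      for (int i = 0; i < N; i++) {
          b[i] = a[i] * a[i];
      }

      printf("b[10] = %f\n", b[10]);
      return 0;
  }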
38. FPGAs | Approach
• Design and implement an OpenACC-to-FPGA translation framework, which is the first work to use a standard and portable directive-based, high-level programming system for FPGAs.
• Propose FPGA-specific optimizations and novel pragma extensions to improve performance.
• Evaluate the functional and performance portability of the framework across diverse architectures (Altera FPGA, NVIDIA GPU, AMD GPU, and Intel Xeon Phi).
S. Lee, J. Kim, and J.S. Vetter, “OpenACC to FPGA: A Framework for Directive-based High-Performance Reconfigurable Computing,” Proc. IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2016, 10.1109/IPDPS.2016.28.
39. FPGA OpenCL Architecture
(Block diagram: a host processor connects over PCIe to the FPGA, which accesses external DDR memory through external memory controllers and PHYs. On the FPGA, replicated compute units built from kernel pipelines are fed by local memory interconnects attached to a global memory interconnect. The key tuning knobs are pipeline depth, vector width, and the number of replicated compute units.)
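For reference, two of the three knobs in the diagram are typically expressed with kernel attributes in the Intel/Altera FPGA OpenCL SDK; the snippet below is an illustrative sketch only (attribute names to the best of my knowledge, not taken from the slides), and the pipeline depth itself is derived by the offline compiler from the kernel datapath:

  // Assumed Intel FPGA OpenCL attributes: replicate the compute unit twice and
  // run 4 work-items in SIMD (the "vector width"); work-group size must divide
  // evenly by the SIMD width.
  __attribute__((num_compute_units(2)))
  __attribute__((num_simd_work_items(4)))
  __attribute__((reqd_work_group_size(64, 1, 1)))
  __kernel void square(__global const float *restrict a,
                       __global float *restrict b) {
      int i = get_global_id(0);
      b[i] = a[i] * a[i];
  }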
40. Kernel-Pipelining Transformation Optimization
• Kernel execution model in OpenACC
– Device kernels can communicate with each other only through the device global memory.
– Synchronizations between kernels are at the granularity of a kernel execution.
• Altera OpenCL channels
– Allow passing data between kernels and synchronizing kernels with high efficiency and low latency
(Diagrams: kernel communications through global memory in OpenACC vs. kernel communications with Altera channels, where Kernel 1 feeds Kernel 2 directly through a channel.)
41. Kernel-Pipelining Transformation Optimization (2)
(a) Input OpenACC code:
  #pragma acc data copyin(a) create(b) copyout(c)
  {
    #pragma acc kernels loop gang worker present(a, b)
    for (i = 0; i < N; i++) { b[i] = a[i]*a[i]; }
    #pragma acc kernels loop gang worker present(b, c)
    for (i = 0; i < N; i++) { c[i] = b[i]; }
  }
(b) Altera OpenCL code with channels:
  channel float pipe_b;
  __kernel void kernel1(__global float* a) {
    int i = get_global_id(0);
    write_channel_altera(pipe_b, a[i]*a[i]);
  }
  __kernel void kernel2(__global float* c) {
    int i = get_global_id(0);
    c[i] = read_channel_altera(pipe_b);
  }
(Diagram: kernel communications through global memory vs. through an Altera channel.)
42. Kernel-Pipelining Transformation Optimization (3)
(a) Input OpenACC code:
  #pragma acc data copyin(a) create(b) copyout(c)
  {
    #pragma acc kernels loop gang worker present(a, b)
    for (i = 0; i < N; i++) { b[i] = a[i]*a[i]; }
    #pragma acc kernels loop gang worker present(b, c)
    for (i = 0; i < N; i++) { c[i] = b[i]; }
  }
(c) Modified OpenACC code for kernel-pipelining (the transformation is valid under specific conditions):
  #pragma acc data copyin(a) pipe(b) copyout(c)
  {
    #pragma acc kernels loop gang worker pipeout(b) present(a)
    for (i = 0; i < N; i++) { b[i] = a[i]*a[i]; }
    #pragma acc kernels loop gang worker pipein(b) present(c)
    for (i = 0; i < N; i++) { c[i] = b[i]; }
  }
(Diagram: kernel communications through global memory vs. through a channel.)
44. Overall Performance of OpenARC FPGA Evaluation
• FPGAs prefer applications with deep execution pipelines (e.g., FFT-1D and FFT-2D), performing much better than other accelerators.
• For traditional HPC applications with abundant parallel floating-point operations, it seems difficult for FPGAs to beat the performance of other accelerators, even though FPGAs can be much more power-efficient.
– The tested FPGA does not contain dedicated, embedded floating-point cores, while the others have fully optimized floating-point computation units.
• Current and upcoming high-end FPGAs are equipped with hardened floating-point operators, whose performance will be comparable to other accelerators while remaining power-efficient.
The FTG Vision | Runtime and Operating Systems
[Figure: the FTG software-stack vision.
Applications: science and engineering (e.g., CFD, materials, fusion), streaming (e.g., SW radio, experimental instruments), sensing (e.g., SAR, vision), deep learning (e.g., CNN), analytics (e.g., graphs), robotics (e.g., sense and react).
Programming systems: compilers, domain-specific languages, just-in-time compilation, metaprogramming, scripting, libraries, autotuning.
Runtime and operating systems: discovery, task scheduling and mapping, data orchestration, I/O, synchronization, load balancing.
Architectures: multicore CPU, GPU, FPGA, AI accelerator, SoC, DSP, deep memory, persistent memory, neuromorphic.
Cross-cutting goals: performance, productivity, energy efficiency.]
IRIS: Mapping Strategy for Heterogeneous Architectures and Native Programming Models
[Figure: the IRIS common runtime API maps native programming models (OpenMP, OpenACC, CUDA, HIP, SYCL, OpenCL, Intel OpenCL) onto heterogeneous devices (ARM CPU, CPU/Xeon Phi, NVIDIA GPU, AMD GPU, Intel FPGA, and other general accelerators).]
IRIS offers a common API for diverse heterogeneous devices and also allows intermixing of multiple programming models (e.g., CUDA, OpenMP, and OpenCL) in one application, with support for more programming models planned.
IRIS: An Intelligent Runtime System for Extremely Heterogeneous
Architectures
• Provide programmers a unified programming
environment to write portable code across
heterogeneous architectures (and preferred
programming systems)
• Orchestrate diverse programming systems
(OpenCL, CUDA, HIP, OpenMP for CPU) in a single
application
– OpenCL
• NVIDIA GPU, AMD GPU, ARM GPU, Qualcomm GPU, Intel
CPU, Intel Xeon Phi, Intel FPGA, Xilinx FPGA
– CUDA
• NVIDIA GPU
– HIP
• AMD GPU
– OpenMP for CPU
• Intel CPU, AMD CPU, PowerPC CPU, ARM CPU,
Qualcomm CPU
The IRIS Architecture
• Platform Model
– A single-node system equipped with host CPUs
and multiple compute devices (GPUs, FPGAs,
Xeon Phis, and multicore CPUs)
• Memory Model
– Host memory + shared device memory
– All compute devices share the device memory
• Execution Model
– DAG-style task parallel execution across all
available compute devices
• Programming Model
– High-level OpenACC, OpenMP4, SYCL* (*
planned)
– Low-level: C/Fortran/Python IRIS host-side runtime API + OpenCL/CUDA/HIP/OpenMP kernels (no compiler support required; see the sketch below and the full SAXPY example on the following slides)
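To make the low-level path concrete, the following is a minimal sketch distilled from the SAXPY example a few slides later, using only host-API calls that appear there (iris.init, iris.mem, iris.kernel, kernel.setmem, iris.task, task.h2d_full, task.kernel, task.d2h_full, task.submit, iris.finalize); the kernel name "vecsq" is hypothetical and would need a matching OpenCL/CUDA/HIP/OpenMP kernel implementation.

#!/usr/bin/env python
# Minimal IRIS host-side sketch (hypothetical kernel "vecsq": x[i] = x[i] * x[i])
import iris
import numpy as np

iris.init()

x = np.arange(1024, dtype=np.float32)
mem_x = iris.mem(x.nbytes)              # shared device memory object managed by IRIS

kernel = iris.kernel("vecsq")           # kernel implemented in OpenCL/CUDA/HIP/OpenMP
kernel.setmem(0, mem_x, iris.iris_rw)   # argument 0: read-write memory object

task = iris.task()                      # a task is a list of in-order commands
task.h2d_full(mem_x, x)                 # host-to-device copy command
task.kernel(kernel, 1, [0], [1024])     # 1-D kernel launch over 1024 work-items
task.d2h_full(mem_x, x)                 # device-to-host copy command
task.submit(iris.iris_gpu)              # submit with a device-selection policy

iris.finalize()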
Supported Architectures and Programming Systems by IRIS
ExCL* systems and the architectures/programming systems IRIS supports on each:
• Oswald: CPU Intel Xeon (Intel OpenMP, Intel OpenCL); GPU NVIDIA P100 (NVIDIA CUDA, NVIDIA OpenCL); FPGA Intel/Altera Stratix 10 (Intel OpenCL)
• Summit-node: CPU IBM Power9 (IBM XL OpenMP); GPU NVIDIA V100 (NVIDIA CUDA)
• Radeon: CPU Intel Xeon (Intel OpenMP, Intel OpenCL); GPU AMD Radeon VII (AMD HIP, AMD OpenCL)
• Xavier: CPU ARMv8 (GNU GOMP); GPU NVIDIA Volta (NVIDIA CUDA)
• Snapdragon: CPU Qualcomm Kryo (Android NDK OpenMP); GPU Qualcomm Adreno 640 (Qualcomm OpenCL)
* ORNL Experimental Computing Laboratory (ExCL) https://excl.ornl.gov/
Task Scheduling in IRIS
• A task
– A scheduling unit
– Contains multiple in-order commands
• Kernel launch command
• Memory copy command (device-to-host, host-to-device)
– May have DAG-style dependencies with other tasks
– Enqueued to the application task queue with a device
selection policy
• Available device selection policies
– Specific Device (compute device #)
– Device Type (CPU, GPU, FPGA, XeonPhi)
– Profile-based
– Locality-aware
– Ontology-based
– Performance models (Aspen)
– Any, All, Random, 3rd-party users’ custom policies
• The task scheduler dispatches the tasks in the application task queue to available compute devices
– Selects the optimal target compute device according to each task's device-selection policy (sketched below)
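As a rough illustration of attaching a device-selection policy to a task, here is a sketch that reuses only the host-API calls from the SAXPY example on the next slides; the kernel name "scale0" is hypothetical, and only the policy constants that actually appear in this deck (iris_gpu, iris_cpu, iris_data) are used.

import iris
import numpy as np

iris.init()

x = np.arange(4096, dtype=np.float32)
mem_x = iris.mem(x.nbytes)

kernel = iris.kernel("scale0")          # hypothetical kernel: x[i] = 2 * x[i]
kernel.setmem(0, mem_x, iris.iris_rw)

task = iris.task()                      # a scheduling unit of in-order commands
task.h2d_full(mem_x, x)                 # memory copy command (host-to-device)
task.kernel(kernel, 1, [0], [4096])     # kernel launch command
task.d2h_full(mem_x, x)                 # memory copy command (device-to-host)

# The policy passed to submit() tells the task scheduler how to pick a device:
#   iris.iris_gpu  - device-type policy: any GPU
#   iris.iris_cpu  - device-type policy: any CPU
#   iris.iris_data - locality-aware policy: the device needing the least data movement
task.submit(iris.iris_data)

iris.finalize()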
SAXPY Example on Xavier
• Computation
– S[] = A * X[] + Y[]
• Two tasks
– S[] = A * X[] on NVIDIA GPU (CUDA)
– S[] += Y[] on ARM CPU (OpenMP)
• S[] is shared between two tasks
• Read-after-write (RAW), true dependency
• Low-level Python IRIS host code +
CUDA/OpenMP kernels
– saxpy.py
– kernel.cu
– kernel.openmp.h
SAXPY: Python host code & CUDA kernel code
saxpy.py (1/2)
#!/usr/bin/env python
import iris
import numpy as np
import sys

iris.init()

SIZE = 1024
A = 10.0
x = np.arange(SIZE, dtype=np.float32)
y = np.arange(SIZE, dtype=np.float32)
s = np.arange(SIZE, dtype=np.float32)
print('X', x)
print('Y', y)

# IRIS device memory objects shared across tasks
mem_x = iris.mem(x.nbytes)
mem_y = iris.mem(y.nbytes)
mem_s = iris.mem(s.nbytes)

saxpy.py (2/2)
# task0: S[] = A * X[] on the GPU (CUDA kernel saxpy0)
kernel0 = iris.kernel("saxpy0")
kernel0.setmem(0, mem_s, iris.iris_w)
kernel0.setint(1, A)
kernel0.setmem(2, mem_x, iris.iris_r)
off = [ 0 ]
ndr = [ SIZE ]
task0 = iris.task()
task0.h2d_full(mem_x, x)
task0.kernel(kernel0, 1, off, ndr)
task0.submit(iris.iris_gpu)

# task1: S[] += Y[] on the CPU (OpenMP kernel saxpy1); reads mem_s written by task0
kernel1 = iris.kernel("saxpy1")
kernel1.setmem(0, mem_s, iris.iris_rw)
kernel1.setmem(1, mem_y, iris.iris_r)
task1 = iris.task()
task1.h2d_full(mem_y, y)
task1.kernel(kernel1, 1, off, ndr)
task1.d2h_full(mem_s, s)
task1.submit(iris.iris_cpu)

print('S =', A, '* X + Y', s)
iris.finalize()

kernel.cu (CUDA)
extern "C" __global__ void saxpy0(float* S, float A, float* X) {
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  S[id] = A * X[id];
}
extern "C" __global__ void saxpy1(float* S, float* Y) {
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  S[id] += Y[id];
}
SAXPY: Python host code & OpenMP kernel code
kernel.openmp.h (OpenMP)
#include <iris/iris_openmp.h>

static void saxpy0(float* S, float A, float* X, IRIS_OPENMP_KERNEL_ARGS) {
  int id;
#pragma omp parallel for shared(S, A, X) private(id)
  IRIS_OPENMP_KERNEL_BEGIN
  S[id] = A * X[id];
  IRIS_OPENMP_KERNEL_END
}

static void saxpy1(float* S, float* Y, IRIS_OPENMP_KERNEL_ARGS) {
  int id;
#pragma omp parallel for shared(S, Y) private(id)
  IRIS_OPENMP_KERNEL_BEGIN
  S[id] += Y[id];
  IRIS_OPENMP_KERNEL_END
}
(The saxpy.py host code is the same as on the previous slide.)
Memory Consistency Management
(The saxpy.py host code is again the same as above.)
mem_s is shared between the GPU task (task0, which writes it) and the CPU task (task1, which reads and updates it); IRIS manages the consistency of mem_s across the two devices so that task1 sees the values produced by task0.
Locality-aware Device Selection Policy
(The saxpy.py host code is the same as above, except that task1 is now submitted with the locality-aware policy instead of iris_cpu:)
task1.submit(iris.iris_data)
iris_data selects the device that requires the minimum data transfer to execute the task.
The FTG Vision
[Figure: the same FTG software-stack vision shown earlier (applications / programming systems / runtime and operating systems / architectures), now annotated to place OpenARC in the programming-systems layer and IRIS in the runtime and operating systems layer.]
Recap
• Motivation: Recent trends in computing
paint an ambiguous future
– Multiple architectural dimensions are being
(dramatically) redesigned: Processors, node
design, memory systems, I/O
– Complexity is our main challenge
• Applications and software systems across
many areas are all reaching a state of crisis
– Need a focus on performance portability
• ORNL FTG investigating design and
programming challenges for these trends
– Performance modeling and ontologies
– Performance portable compilation to many
different heterogeneous architectures/SoCs
– Intelligent scheduling system to automate
discovery, device selection, and data movement
– Targeting a wide variety of existing and future architectures (DSSoC and others)
• Visit us
– We host interns and other visitors year
round
• Faculty, grad, undergrad, high school,
industry
• Jobs in FTG
– Postdoctoral Research Associate in
Computer Science
– Software Engineer
– Computer Scientist
– Visit https://jobs.ornl.gov
• Contact me vetter@ornl.gov
Final Report on Workshop on Extreme Heterogeneity
• Maintaining and improving programmer productivity
– Flexible, expressive, programming models and languages
– Intelligent, domain-aware compilers and tools
– Composition of disparate software components
• Managing resources intelligently
– Automated methods using introspection and machine learning
– Optimize for performance, energy efficiency, and availability
• Modeling & predicting performance
– Evaluate impact of potential system designs and application mappings
– Model-automated optimization of applications
• Enabling reproducible science despite non-determinism & asynchrony
– Methods for validation on non-deterministic architectures
– Detection and mitigation of pervasive faults and errors
• Facilitating Data Management, Analytics, and Workflows
– Mapping of science workflows to heterogeneous hardware and software services
– Adapting workflows and services to meet facility-level objectives through learning approaches
Workshop: https://orau.gov/exheterogeneity2018/
Final report: https://doi.org/10.2172/1473756