The document summarizes a presentation about AMD's new "Zen" x86 CPU core architecture. Zen delivers a 40% increase in instructions per clock over the previous "Excavator" core through improvements in the core engine, the cache system, and floating-point capability, plus the addition of simultaneous multithreading. The core was designed from the ground up to balance performance and power efficiency across applications from fanless notebooks to supercomputers.
AMD has been away from the HPC space for a while, but now the company is coming back in a big way with an open software approach to GPU computing. The Radeon Open Compute Platform (ROCm) was born from the Boltzmann Initiative announced last year at SC15. Now available on GitHub, the ROCm Platform brings a rich foundation to advanced computing by better integrating the CPU and GPU to solve real-world problems.
"We are excited to present ROCm, the first open-source HPC/ultrascale-class platform for GPU computing that’s also programming-language independent. We are bringing the UNIX philosophy of choice, minimalism and modular software development to GPU computing. The new ROCm foundation lets you choose or even develop tools and a language run time for your application."
Watch the video presentation: http://wp.me/p3RLHQ-fJT
Learn more: https://radeonopencompute.github.io/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Evaluating GPU Programming Models for the LUMI Supercomputer (George Markomanolis)
It is common in the HPC community that the performance achievable with CPUs alone is insufficient for many computational cases. The EuroHPC pre-exascale and coming exascale systems are mainly focused on accelerators, and some of the largest upcoming supercomputers, such as LUMI and Frontier, will be powered by AMD Instinct accelerators. However, these new systems create many challenges for developers who are not familiar with the new ecosystem or with the programming models required to target heterogeneous architectures. In this paper, we present some of the better-known programming models for current and future GPU systems. We then measure the performance of each approach using a benchmark and a mini-app, test various compilers, and tune the codes where necessary. Finally, we compare performance, where possible, across the NVIDIA Volta (V100) and Ampere (A100) GPUs and the AMD MI100 GPU.
Presentation of a paper accepted at Supercomputing Frontiers Asia 2022.
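As a taste of what such an evaluation involves, here is a minimal sketch of one portable GPU programming model of the kind such comparisons cover: a BabelStream-style triad written with OpenMP target offload. This is an illustrative example, not code from the paper, and the compiler flags and offload targets named in the comments are assumptions.

// Minimal sketch of a BabelStream-style triad (a[i] = b[i] + scalar * c[i])
// using OpenMP target offload. Requires an offload-capable compiler, e.g.
// clang++ -fopenmp --offload-arch=gfx908 for MI100, or an nvptx target for
// V100/A100. Sizes and flags are illustrative assumptions.
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    const double scalar = 0.4;
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);
    double *pa = a.data(), *pb = b.data(), *pc = c.data();

    // Map the arrays to the device, run the triad there, copy 'a' back.
    #pragma omp target teams distribute parallel for \
        map(tofrom: pa[0:n]) map(to: pb[0:n], pc[0:n])
    for (int i = 0; i < n; ++i)
        pa[i] = pb[i] + scalar * pc[i];

    printf("a[0] = %f\n", pa[0]);   // expect 1.0 + 0.4 * 2.0 = 1.8
    return 0;
}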
Shared Memory Centric Computing with CXL & OMI (Allan Cantle)
Discusses how CXL can be better utilized as a Fabric Cache domain separate from a processor's own Local Cache domain. This is done by leveraging a shared-memory-centric architecture that uses both the Open Memory Interface (OMI) and Compute Express Link (CXL) for the memory ports.
MIPI DevCon 2021: Meeting the Needs of Next-Generation Displays with a High-P... (MIPI Alliance)
Presented by Alain Legault, Hardent Inc.; Joe Rodriguez, Rambus Inc.; and Justin Endo, Mixel, Inc.
Next-generation display applications have an insatiable appetite for bandwidth. Using a combination of VESA Display Stream Compression (DSC) and MIPI DSI-2℠ technology, designers can achieve display resolutions up to 8K without compromise to video quality, battery life or cost. This presentation discusses a fully integrated, off-the-shelf display IP subsystem solution, consisting of Mixel (MIPI C-PHY℠/D-PHY℠ combo), Rambus (MIPI DSI-2® controller) and Hardent (VESA DSC) IP, that can deliver this state-of-the-art performance in a power-efficient and compact footprint.
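To see why compression matters at these resolutions, here is a back-of-the-envelope bandwidth calculation. The parameters (8K at 60 Hz, 30 bpp, roughly 20% blanking/protocol overhead, 3:1 DSC) are illustrative assumptions of mine, not MIPI or VESA figures.

// Rough link-bandwidth estimate for an 8K display, uncompressed vs. DSC.
// All timing parameters below are illustrative assumptions.
#include <cstdio>

int main() {
    const double h = 7680, v = 4320;   // active pixels (8K)
    const double fps = 60.0;
    const double bpp = 30.0;           // 10-bit RGB
    const double overhead = 1.2;       // assumed blanking/protocol overhead
    const double dsc_ratio = 3.0;      // typical DSC compression ratio

    double raw_gbps = h * v * fps * bpp * overhead / 1e9;
    printf("uncompressed: %.1f Gbit/s\n", raw_gbps);              // ~71.7
    printf("with 3:1 DSC: %.1f Gbit/s\n", raw_gbps / dsc_ratio);  // ~23.9
    return 0;
}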
During the CXL Forum at OCP Global Summit 23, Rick Kutcipal and Sreeni Bagalkote of Broadcom presented their PCIe/CXL Roadmap and announced their Atlas 4 CXL switch.
Facebook presented, "Chiplets in Data Centers," at the ODSA Workshop. The charter of the ODSA (Open Domain-Specific Architecture) Workgroup is to define an open specification that enables building Domain Specific Accelerator silicon using best-of-breed components from across the industry, made available as chiplet dies that can be integrated together like Lego blocks on an organic substrate packaging layer. The resulting multi-chip module (MCM) silicon can be produced at significantly lower development and manufacturing costs, and will deliver much-needed performance-per-watt and performance-per-dollar efficiencies in networking, security, machine learning, and other applications. The ODSA Workgroup also intends to deliver implementations of the specification as board-level prototypes, RTL code, and libraries.
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/xilinx/embedded-vision-training/videos/pages/may-2019-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Nick Ni, Director of Product Marketing at Xilinx, presents the "Xilinx AI Engine: High Performance with Future-proof Architecture Adaptability" tutorial at the May 2019 Embedded Vision Summit.
AI inference demands orders-of-magnitude more compute capacity than what today's SoCs offer. At the same time, neural network topologies are changing too quickly to be addressed by ASICs that take years to go from architecture to production. In this talk, Ni introduces the Xilinx AI Engine, which complements the dynamically programmable FPGA fabric to enable ASIC-like performance via custom data flows and a flexible memory hierarchy. This combination provides an orders-of-magnitude boost in AI performance along with the hardware architecture flexibility needed to quickly adapt to rapidly evolving neural network topologies.
Dustin Franklin (GPGPU Applications Engineer, GE Intelligent Platforms) presents:
"GPUDirect support for RDMA provides low-latency interconnectivity between NVIDIA GPUs and various networking, storage, and FPGA devices. Discussion will include how the CUDA 5 technology increases GPU autonomy and promotes multi-GPU topologies with high GPU-to-CPU ratios. In addition to improved bandwidth and latency, the resulting increase in GFLOPS/watt poses a significant impact to both HPC and embedded applications. We will dig into scalable PCIe switch hierarchies, as well as software infrastructure to manage device interopability and GPUDirect streaming. Highlighting emerging architectures composed of Tegra-style SoCs that further decouple GPUs from discrete CPUs to achieve greater computational density."
Learn more at: http://www.gputechconf.com/page/home.html
If AMD Adopted OMI in their EPYC Architecture (Allan Cantle)
AMD's EPYC architecture has paved the way toward heterogeneous, data-centric computing, but it is still limited by its parallel DDR interfaces. This presentation shows the potential of the EPYC architecture if it adopted the Open Memory Interface (OMI) for its near-memory interface.
In this deck from the HPC User Forum in Tucson, Jeff Stuecheli from IBM presents: POWER9 for AI & HPC.
"Built from the ground-up for data intensive workloads, POWER9 is the only processor with state-of-the-art I/O subsystem technology, including next generation NVIDIA NVLink, PCIe Gen4, and OpenCAPI."
Watch the video: https://wp.me/p3RLHQ-isJ
Learn more: https://www.ibm.com/it-infrastructure/power/power9
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Today Fujitsu published specifications for the A64FX CPU to be featured in the post-K computer, a future machine designed to be 100 times faster than the legendary K computer that dominated the TOP500 for years.
A64FX is the world's first CPU to adopt the Scalable Vector Extension (SVE), an extension of Armv8-A instruction set architecture for supercomputers. Building on over 60 years' worth of Fujitsu-developed microarchitecture, this chip offers peak performance of over 2.7 TFLOPS, demonstrating superior HPC and AI performance. A64FX offers a number of features, including broad utility supporting a wide range of applications, massive parallelization through the Tofu interconnect, low power consumption, and mainframe-class reliability.
Fujitsu collaborated with Arm on the SVE extension to the Armv8-A instruction set architecture, contributing to its development as a lead partner, and adopted the results in the A64FX.
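The defining property of SVE is that code is vector-length agnostic: the same binary runs on any implementation width, including the A64FX's 512-bit vectors. Below is a minimal sketch using the Arm C Language Extensions (ACLE) intrinsics; it assumes an SVE-enabled toolchain and hardware, and is illustrative rather than tuned A64FX code.

// y[i] += a * x[i] with no hard-coded vector width: svcntd() reports how
// many doubles fit in one vector on whatever SVE machine runs the code.
// Build with, e.g., g++ -O2 -march=armv8.2-a+sve (an assumption; A64FX
// implements Armv8.2-A with SVE).
#include <arm_sve.h>
#include <cstdint>

void daxpy(double a, const double *x, double *y, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, n);       // predicate masks the tail
        svfloat64_t vx = svld1_f64(pg, &x[i]);
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        // vy + vx * a, computed only in active lanes
        svst1_f64(pg, &y[i], svmla_f64_x(pg, vy, vx, svdup_n_f64(a)));
    }
}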
Learn more: https://wp.me/p3RLHQ-iYt
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
IBM Power9 Servers are here! Launched this week, the AC922 POWER9 servers will form the basis of the world's fastest "Coral" supercomputers coming to ORNL and LLNL. Built specifically for compute-intensive AI workloads, the new POWER9 systems are capable of improving the training times of deep learning frameworks by nearly 4x, allowing enterprises to build more accurate AI applications, faster.
Listen to the Radio Free HPC podcast on Power9: https://insidehpc.com/2017/12/radio-free-hpc-looks-new-power9-titan-v-snapdragon-845/
Learn more: https://www.ibm.com/us-en/marketplace/power-systems-ac922
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Advanced High-Performance Computing Features of the OpenPOWER ISA (Ganesan Narayanasamy)
Power ISA processors have a long history of offering superior features for HPC applications. Well known examples include POWER3, used in the ASCI White supercomputer, various PowerPC processors used in the Blue Gene family of massively parallel computers, and POWER9, present in the leading supercomputers of today, Summit and Sierra. OpenPOWER ISA has enabled open access to many of these features. IBM's most recent contribution to OpenPOWER ISA, in the form of Power ISA Version 3.1, includes the Matrix-Multiply Assist (MMA) instructions. The MMA instructions are designed to deliver additional performance both for classical high-performance computing, in the space of scientific and technical computing, and for the increasingly important space of business analytics. In addition, the Open Memory Interface (OMI), also developed by IBM, opens new levels of memory bandwidth and capacity for the most demanding applications. Our goal is to raise awareness of and interest in these new features, which we believe can lead to further research in processor architecture and programming environments. Some of the most promising application areas include graph algorithms, classical machine learning and deep learning.
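For a flavor of how MMA is exposed to programmers, here is a minimal sketch using GCC's MMA built-ins (available in GCC 10+ with -mcpu=power10, to my understanding). It accumulates a 4x4 float block with rank-1 "ger" updates; the data layout is simplified for illustration, and real kernels tile and pack operands quite differently.

// 4x4 float block C += A * B via POWER10 MMA rank-1 updates into a
// __vector_quad accumulator. Illustrative sketch, not a tuned kernel.
#include <altivec.h>
#include <cstring>

void sgemm_4x4(const float (*A)[4], const float (*B)[4],
               float C[4][4], int k) {
    __vector_quad acc;
    __builtin_mma_xxsetaccz(&acc);              // zero the 4x4 accumulator
    for (int i = 0; i < k; ++i) {
        __vector unsigned char a, b;            // 16 bytes = 4 floats each
        memcpy(&a, A[i], 16);                   // 4 A values at depth i
        memcpy(&b, B[i], 16);                   // 4 B values at depth i
        __builtin_mma_xvf32gerpp(&acc, a, b);   // acc += outer(a, b)
    }
    __builtin_mma_disassemble_acc(C, &acc);     // spill accumulator to C
}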
Heterogeneous Computing: The Future of Systems (Anand Haridass)
Charts from NITK-IBM Computer Systems Research Group (NCSRG)
- Dennard Scaling, Moore's Law, OpenPOWER, Storage Class Memory, FPGA, GPU, CAPI, OpenCAPI, NVIDIA NVLink, Google and Microsoft heterogeneous system usage
Race to Reality: The Next Billion-People Market Opportunity (AMD)
On September 3rd, 2016 at IFA Berlin, Mark Papermaster, Chief Technology Officer of AMD, provided unique insights into the new era of Virtual Reality: "Race to Reality - The Next Billion-People Market Opportunity".
Until recently, GPU compute leveraged discrete GPUs for a fairly limited set of academic and supercomputing workloads. With the increased performance of the integrated GPU inside an Accelerated Processing Unit (APU), the introduction of Heterogeneous System Architecture (HSA) devices, and the proliferation of programming tools, GPU compute is making its way into mainstream applications. In this presentation we cover GPU compute and HSA, focusing on the application of GPU compute in the medical and print imaging segments. Examples of performance data are reviewed, and the case is made for how GPU compute can deliver tangible benefits.
Kubernetes & AI - Beauty and the Beast !?! @ KCD Istanbul 2024 (Tobias Schneck)
As AI technology pushes into IT, I wondered, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations view. Is it possible to apply our beloved cloud-native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and offer a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premises strategy we may need to apply them to our own infrastructure and make them work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already gotten working for real.
Search and Society: Reimagining Information Access for Radical Futures (Bhaskar Mitra)
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. I have also often seen developers implement front-end features by simply following the standard rules of a framework, assuming that this is enough to launch the project successfully, and then the project fails. How can you prevent this, and which approach should you choose? I have launched dozens of complex projects, and during the talk we will analyze which approaches have worked for me and which have not.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis at the DASA Connect conference on 30 May 2024. We discuss what testing is, what agile testing is, and finally what testing in DevOps is. We finished with a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality (Inflectra)
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
AMD and the new “Zen” High Performance x86 Core at Hot Chips 28
1. A NEW X86 CORE ARCHITECTURE FOR THE NEXT GENERATION OF COMPUTING
MIKE CLARK, SENIOR FELLOW
2. AGENDA
THE ROAD TO ZEN
HIGH LEVEL ARCHITECTURE
‐ IMPROVEMENTS IN CORE ENGINE
‐ FLOATING POINT
‐ IMPROVEMENTS IN CACHE SYSTEM
‐ SMT DESIGN TO MAXIMIZE THROUGHPUT
‐ NEW ISA EXTENSIONS
SUMMARY
NEXT STEP UP
3. AMD X86 CORES: DRIVING COMPETITIVE PERFORMANCE
[Chart: instructions per clock rising from the “Bulldozer” core to the “Excavator” core to “Zen”, with “Zen” delivering 40% more instructions per clock*]
*Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.
4. AMD CPU DESIGN OPTIMIZATION POINTS
ONE CORE FROM FANLESS NOTEBOOKS TO SUPERCOMPUTERS
[Chart: design space running from smaller, lower power to more area and power with more performance; low-power “Jaguar” sits at one end, high-performance “Excavator” at the other, and “Zen” spans the full range]
*Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.
5. DEFYING CONVENTION: A WIDE, HIGH PERFORMANCE, EFFICIENT CORE
[Chart: energy per cycle vs. instructions per clock for “Bulldozer”, “Piledriver”, “Steamroller”, “Excavator”, and “Zen”; “Zen” delivers +40% work per cycle* at comparable energy per cycle, for a total efficiency gain]
*Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.
6. ZEN PERFORMANCE & POWER IMPROVEMENTS: 40% IPC PERFORMANCE UPLIFT
LOWER POWER
‐ Aggressive clock gating with multi-level regions
‐ Write-back L1 cache
‐ Large op cache
‐ Stack engine
‐ Move elimination
‐ Power focus from project inception
‐ Low-power design methodologies
BETTER CACHE SYSTEM
‐ Write-back L1 cache
‐ Faster L2 cache
‐ Faster L3 cache
‐ Faster load to FPU: 7 vs. 9 cycles
‐ Better L1 and L2 data prefetchers
‐ Close to 2x the L1 and L2 bandwidth
‐ Total L3 bandwidth up 5x
BETTER CORE ENGINE
‐ Two threads per core
‐ Improved branch mispredict recovery
‐ Better branch prediction with 2 branches per BTB entry
‐ Large op cache
‐ Wider micro-op dispatch: 6 vs. 4
‐ Larger instruction schedulers: integer 84 vs. 48, FP 96 vs. 60
‐ Larger retire width: 8 ops vs. 4 ops
‐ Quad-issue FPU
‐ Larger retire queue: 192 vs. 128
‐ Larger load queue: 72 vs. 44
‐ Larger store queue: 44 vs. 32
7. ZEN MICROARCHITECTURE
Fetch four x86 instructions per cycle
4 integer units
‐ Large rename space: 168 registers
‐ 192 instructions in flight, 8-wide retire
2 load/store units
‐ 72 out-of-order loads supported
2 floating point units with 128-bit FMACs
‐ Built as 4 pipes: 2 FADD, 2 FMUL
I-cache 64K, 4-way; D-cache 32K, 8-way; L2 cache 512K, 8-way; large shared L3 cache
2 threads per core
[Block diagram: branch prediction feeds the 64K 4-way I-cache; decode handles 4 instructions/cycle, with the op cache supplying micro-ops directly to the micro-op queue; 6 ops are dispatched per cycle to the integer side (rename, schedulers, physical register file, 4 ALUs, 2 AGUs, load/store queues) and the floating-point side (rename, schedulers, FP register file, MUL/ADD pipes); the 32K 8-way D-cache sustains 2 loads + 1 store per cycle, backed by a 512K 8-way L2 (I+D) cache]
8. FETCH
Decoupled branch prediction
TLB in the branch-prediction pipe
‐ 8-entry L0 TLB, all page sizes
‐ 64-entry L1 TLB, all page sizes
‐ 512-entry L2 TLB, no 1G pages
2 branches per BTB entry
Large L1/L2 BTB
32-entry return stack
Indirect Target Array (ITA)
64K, 4-way instruction cache
Micro-tags for I-cache & op cache
32-byte fetch
[Block diagram: next-PC selection driven by a hash-perceptron predictor, L1/L2 BTB, return stack, ITA, and L0/L1/L2 TLB; fetch requests queue in the physical request queue; the 64K micro-tagged instruction cache, filled at 32 bytes/cycle from L2, sends 32 bytes/cycle to decode and feeds the op cache, with redirects coming back from decode/execute]
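The slide names a hash perceptron as the direction predictor. Below is a toy sketch of the generic hashed-perceptron technique from the branch-prediction literature (Jimenez & Lin), not AMD's actual design: AMD's table sizes, hash functions, and history lengths are not public, so every constant here is illustrative.

// Toy hashed-perceptron branch predictor: a weight row is selected by a
// hash of the PC; prediction is the sign of bias + sum of weights signed
// by the global-history bits; training happens on mispredicts and on
// low-confidence correct predictions.
#include <array>
#include <cstdint>

class HashedPerceptron {
    static constexpr int kHist = 16;     // global-history length (toy value)
    static constexpr int kRows = 1024;   // weight-table rows (toy value)
    static constexpr int kTheta = 45;    // training threshold, ~1.93*kHist+14
    std::array<std::array<int, kHist + 1>, kRows> w_{};  // w_[r][0] = bias
    uint32_t hist_ = 0;                  // taken/not-taken history bits

    unsigned row(uint64_t pc) const {
        return static_cast<unsigned>((pc ^ (pc >> 12)) % kRows);
    }
    int dot(uint64_t pc) const {
        const auto &w = w_[row(pc)];
        int sum = w[0];
        for (int i = 0; i < kHist; ++i)
            sum += ((hist_ >> i) & 1) ? w[i + 1] : -w[i + 1];
        return sum;
    }

public:
    bool predict(uint64_t pc) const { return dot(pc) >= 0; }

    void update(uint64_t pc, bool taken) {
        int out = dot(pc);
        if (((out >= 0) != taken) || (out > -kTheta && out < kTheta)) {
            auto &w = w_[row(pc)];
            int d = taken ? 1 : -1;
            w[0] += d;
            for (int i = 0; i < kHist; ++i)
                w[i + 1] += ((hist_ >> i) & 1) ? d : -d;
        }
        hist_ = (hist_ << 1) | (taken ? 1u : 0u);   // shift in the outcome
    }
};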
9. DECODE
Inline instruction-length decoder
Decode 4 x86 instructions per cycle
Op cache
Micro-op queue
Stack engine
Branch fusion
Memory file for store-to-load forwarding
[Block diagram: instructions enter the instruction byte buffer from the I-cache, are picked and decoded (with the microcode ROM alongside), or hit in the micro-tagged op cache, which supplies micro-ops directly; the micro-op queue, assisted by the stack engine and memfile, dispatches 6 micro-ops to EX and 4 micro-ops to FP]
11. LOAD/STORE AND L2
72 out-of-order loads
44-entry store queue
Split TLB/data pipes, plus a store pipe
64-entry L1 TLB, all page sizes
1.5K-entry L2 TLB, no 1G pages
32K, 8-way data cache
‐ Supports two 128-bit accesses per cycle
Optimized L1 and L2 prefetchers
512K, private (shared by the core's 2 threads), inclusive L2
[Block diagram: AGU0/AGU1 feed the load and store queues; L0/L1 picks and a store-pipe pick go through the L1/L2 TLBs and D-cache tags into the 32K data cache (DAT0/DAT1 ports), with prefetch logic, miss address buffers (MAB), a store-commit path, and a write-combining buffer (WCB); 32 bytes/cycle to and from L2]
12. FLOATING POINT
2-level scheduling queue
160-entry physical register file
8-wide retire
1 pipe for 1x128b store
Accelerated recovery on flushes
SSE, AVX1, AVX2, AES, SHA, and legacy MMX/x87 compliant
2 AES units
[Block diagram: 4 micro-ops dispatched per cycle through a non-scheduling queue (NSQ) into the scheduler; the 160-entry FP register file with forwarding muxes feeds four pipes (MUL0, MUL1, ADD0, ADD1); 128-bit loads arrive via LDCVT, with int-to-FP and FP-to-int transfer paths, a store queue (SQ) port, and the 192-entry retire queue retiring 8 micro-ops per cycle]
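Because the FP unit is built from 128-bit pipes, 256-bit AVX2 operations on Zen execute as pairs of 128-bit micro-ops, but source code does not change. Here is a plain AVX2 FMA loop of the kind these pipes run; it is a generic illustration, not AMD-published code, and assumes an x86 compiler with -mavx2 -mfma.

// y[i] = a * x[i] + y[i] for n divisible by 8, using 256-bit FMA.
#include <immintrin.h>

void saxpy_avx2(float a, const float *x, float *y, int n) {
    __m256 va = _mm256_set1_ps(a);                  // broadcast the scalar
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(&x[i]);
        __m256 vy = _mm256_loadu_ps(&y[i]);
        _mm256_storeu_ps(&y[i], _mm256_fmadd_ps(va, vx, vy));
    }
}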
13. ZEN CACHE HIERARCHY
Fast private 512K L2 cache
Fast shared L3 cache
High bandwidth enables prefetch improvements
L3 is filled from L2 victims
Fast cache-to-cache transfers
Large queues for handling L1 and L2 misses
[Block diagram, per core: the 64K 4-way I-cache (32B fetch) and 32K 8-way D-cache (2x16B load + 1x16B store per cycle) exchange 32B/cycle with the 512K 8-way L2 (I+D) cache, which exchanges 32B/cycle with the shared 8M 16-way L3 (I+D) cache]
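For intuition, the per-cycle widths on this slide can be converted into sustained rates at an assumed core clock. The slide gives only bytes per cycle; the 3.0 GHz figure below is my assumption, not a number from the deck.

// Theoretical per-core bandwidths implied by the slide's per-cycle widths.
#include <cstdio>

int main() {
    const double hz = 3.0e9;                                 // assumed clock
    printf("L1D load : %4.0f GB/s\n", 2 * 16 * hz / 1e9);    // 2 x 16B/cycle
    printf("L1D store: %4.0f GB/s\n", 1 * 16 * hz / 1e9);    // 1 x 16B/cycle
    printf("L2 <-> L1: %4.0f GB/s\n", 32.0 * hz / 1e9);      // 32B/cycle
    printf("L3 <-> L2: %4.0f GB/s\n", 32.0 * hz / 1e9);      // 32B/cycle
    return 0;
}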
14. CPU COMPLEX
A CPU complex (CCX) is four cores connected to an L3 cache.
The L3 cache is 16-way associative, 8MB, mostly exclusive of L2.
The L3 cache is made of 4 slices, by low-order address interleave.
Every core can access every cache slice with the same average latency.
[Block diagram: CORE 0-3 arranged around the L3, each core paired with its 512K L2 macro (L2M) and L2 control (L2CTL); the 8MB L3 is built from eight 1MB L3 macros (L3M) with L3 control (L3CTL) logic per slice]
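A sketch of what "made of 4 slices, by low-order address interleave" means in practice: the slice is chosen from address bits just above the cache-line offset, so consecutive lines rotate across slices. The 64-byte line size and the exact bit positions below are assumptions for illustration; AMD does not publish the real mapping.

// Map a physical address to one of 4 L3 slices by low-order interleave.
#include <cstdint>

constexpr unsigned kLineBits = 6;            // 64-byte cache line (assumed)

unsigned l3_slice(uint64_t phys_addr) {
    return (phys_addr >> kLineBits) & 0x3;   // 4 slices -> 2 address bits
}
// Consecutive cache lines then land on slices 0,1,2,3,0,... so the four
// slices share bandwidth evenly and no single slice becomes a hotspot.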
15. SMT OVERVIEW
All structures fully available in 1T mode
Front-end queues are round-robin with priority overrides
Increased throughput from SMT
Structures are shared between the two threads in one of four ways: competitively shared; statically partitioned; competitively shared with algorithmic priority; or competitively shared and SMT-tagged.
[Block diagram: the full pipeline drawn vertically threaded (branch prediction, op cache, micro-op queue, decode, rename, schedulers, register files, retire queue, load/store queues, TLBs, and caches), with each structure color-coded by its sharing policy]
16. NEW INSTRUCTIONS
New instructions in Zen (not present in Excavator):
‐ ADX: extending multi-precision arithmetic support
‐ RDSEED: complement to RDRAND random number generation
‐ SMAP: Supervisor Mode Access Prevention
‐ SHA1/SHA256: secure hash implementation instructions
‐ CLFLUSHOPT: CLFLUSH ordered by SFENCE
‐ XSAVEC/XSAVES/XRSTORS: new compact and supervisor save/restore
‐ CLZERO: clear cache line (AMD exclusive)
‐ PTE Coalescing: combines 4K page tables into 32K page size (AMD exclusive)
We support all the standard ISA including AVX & AVX2, BMI1 & BMI2, AES, RDRAND, SMEP.
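Two of these are easy to show with compiler intrinsics: RDSEED, which can transiently fail and must be retried, and CLFLUSHOPT, whose SFENCE ordering is exactly what the table notes. A minimal sketch, assuming a compiler with -mrdseed -mclflushopt and a CPU reporting both features:

// RDSEED retry loop plus CLFLUSHOPT flushes ordered by SFENCE.
#include <immintrin.h>
#include <cstdio>

int main() {
    // RDSEED can transiently fail, so retry until it reports success.
    unsigned long long seed;
    while (!_rdseed64_step(&seed))
        _mm_pause();
    printf("hardware seed: %llu\n", seed);

    // Flush a buffer's cache lines; CLFLUSHOPT is weakly ordered, so an
    // SFENCE is needed before relying on the flushes having completed.
    alignas(64) static char buf[256];
    for (int i = 0; i < 256; i += 64)
        _mm_clflushopt(&buf[i]);
    _mm_sfence();
    return 0;
}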
17. “ZEN”: TOTALLY NEW HIGH-PERFORMANCE CORE DESIGN
DESIGNED FROM THE GROUND UP FOR OPTIMAL BALANCE OF PERFORMANCE AND POWER
‐ Simultaneous multithreading (SMT) for high throughput
‐ New high-bandwidth, low-latency cache system
‐ Energy-efficient FinFET design scales from enterprise to client products
18. A COMMITTED ROADMAP TO X86 PERFORMANCE
[Chart: instructions per clock rising from the “Bulldozer” core to the “Excavator” core to “Zen” (40% more instructions per clock*), with “Zen+” next on the roadmap]
*Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.