Welcome to our Sunday session of AMD Next Horizon Gaming, and thank you for joining us.
We are in the heart of the entertainment industry here in LA. This is where so many experiences are imagined and created in the studios. What better place for us to unveil what is at the heart of our new 3rd Generation Ryzen and Navi products, and the game-changing experiences they can create.
We can’t wait to share the details behind these products, and there is no better place to start than right at the center: our central processing unit and our CPU roadmap journey.
This journey is all about high performance.
Over 5 years ago we launched the design effort of the new Zen processor, leveraging 14nm FinFET transistors and delivering an incredible 52% IPC increase over its predecessor. This brought us back to high performance.
At that time, I committed to strong roadmap execution, with both a current and a next-generation CPU team leapfrogging one another to ensure a continuous flow of innovation and performance gains in our CPU roadmap. We remain committed and have a maniacal focus on executing the roadmap.
Zen rolled out our high-performance x86 products across consumer and commercial client. We increased client performance with the introduction of a 12nm “kicker” to further improve upon 14nm. Zen2 in 7nm is rolling out NOW, as promised. There was a huge focus on engineering a solution that could incorporate the latest in semiconductor technology with the 7nm node, but implement it creatively so that it would be easy to manufacture and ship on time. Our next generation beyond that, Zen3, is fully on track at the design completion phase. With each generation we will combine process and design to drive performance forward on or ahead of historical norms. We called this play. Our roadmap is stable and we are executing. This trajectory keeps AMD firmly positioned for high-performance leadership in the most demanding applications for future product generations. Moreover, we have become a fully “bankable” supplier of x86 performance to the industry. Let’s take a quick look at the products we’ve brought to market with Zen.
How has Zen been received in the market? It has been a resounding success: more than 100 million Zen CORES have shipped in Ryzen alone.
The flexibility and modularity of the Zen core allowed us to span it across multiple products and markets. The Infinity Fabric connects not only the Zen CPUs in Ryzen and EPYC, but also the integrated CPU and Radeon GPU in Ryzen Mobile. As such, AMD could leverage the single “Zen” architecture to create products that span desktops and laptops for both commercial and consumer use, as well as servers. “Zen”-based products have been very well accepted and highly recognized in the market, and the Ryzen product family has won more than 700 awards globally since the initial launch in 2017. Ryzen raised the roof on high-performance desktop, including our recently announced Threadripper2, which brings 32 cores to high-performance desktop. Ryzen Mobile has married, on a single die, the performance leadership of Zen with Vega-class graphics that can run AAA games. Our EPYC server is creating excellent datacenter momentum. We accomplished the scalability we set out to achieve with our prior generation. Let’s shift now to our next-generation Zen and how the 7nm process is a great enabler.
Zen brought AMD back to high performance; Zen2 was about taking the next step in leadership performance. We combined new process technology with new design, improving every execution unit for more performance.
When the first Zen 2 products come to market, they will be the world’s first x86 7nm high-performance CPUs. We also improved the CPU design to deliver more computing performance, with even more security enhancements in Zen 2. And we amped up our interconnect, creating a next generation of our versatile Infinity Fabric, pushing both performance and configurability.
7nm was a tough lift. It’s a challenging node, but a full node jump from where we were initially with Zen in 14nm. 7nm required significant investment. We have over five 7nm products taped out, and they are all doing well. We are very happy with 7nm, but there was no magic: lots of hard prep work and collaboration. We had to work almost flawlessly with TSMC and our EDA partners. This was a huge aspect of our success, because to get the benefit of 7nm your design has to account for higher resistances in connecting devices and much tighter rules on the implementation of the transistors to be sure they are manufacturable.
We leveraged the significant product ramp from the industry’s mobile phone customers to mature the process.
7nm could enable half the die size, higher perf/watt, and half the power vs. 14nm, but that is not what our customers want.
Customers want more performance in the same power envelope that they have already designed their computers to handle. This is where 7nm delivers.
After the huge gain of 52% IPC, I was often asked whether AMD had used up all the tricks in the book, or whether we kept any levers at the ready for future generations. I am happy to say we have no lack of great ideas to keep our Zen roadmap running ahead of the historical pace of generational IPC improvement.
We hit all the levers: branch prediction, integer execution unit, floating point, and memory efficiencies.
When you enhance execution you have to focus on the details. We made a number of changes to the front-end execution pipeline in Zen, and we haven’t stopped. We improved branch prediction capabilities, being able to correctly prefetch the instruction stream to lower latency, as well as improving the instruction cache. Our TAGE branch predictor uses multiple tables and branch address bits to build a better conditional predictor. Enhanced instruction prefetching works collaboratively to make sure the right instructions are in the cache to be delivered. Computer architecture is always about finding the right balance of resources. Zen 2 rebalances the branch predictor, instruction cache, and op cache to deliver more instructions per cycle to the machine (“feed the machine”).
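To make the TAGE idea concrete, here is a heavily simplified Python sketch of a TAGE-style predictor. It is an illustration of the general technique only, not AMD's implementation: the class name `ToyTage`, the history lengths (2, 4, 8), the 2-bit counters, and the use of dictionary keys in place of hashed indices and partial tags are all assumptions made for this example. It keeps several tables keyed by the branch address plus increasingly long slices of global branch history, predicts from the longest-history table that has a matching entry, and falls back to a simple per-branch counter otherwise.

```python
class ToyTage:
    """Heavily simplified TAGE-style conditional branch predictor (illustrative only)."""

    def __init__(self, history_lengths=(2, 4, 8)):
        self.history_lengths = history_lengths        # geometric series of history lengths
        self.base = {}                                # pc -> 2-bit counter (fallback table)
        self.tables = [{} for _ in history_lengths]   # (pc, history slice) -> 2-bit counter
        self.history = ()                             # global outcomes, newest last

    def _key(self, pc, table_index):
        # The (pc, history slice) pair stands in for a hashed index plus tag.
        return (pc, self.history[-self.history_lengths[table_index]:])

    def _lookup(self, pc):
        """Return (provider_index, prediction); provider_index is -1 for the base table."""
        for i in reversed(range(len(self.tables))):   # longest history first
            key = self._key(pc, i)
            if key in self.tables[i]:
                return i, self.tables[i][key] >= 2
        return -1, self.base.get(pc, 2) >= 2

    def predict(self, pc):
        return self._lookup(pc)[1]

    def update(self, pc, taken):
        provider, prediction = self._lookup(pc)
        # Train the counter that supplied the prediction.
        if provider >= 0:
            table, key = self.tables[provider], self._key(pc, provider)
        else:
            table, key = self.base, pc
        counter = table.get(key, 2)
        table[key] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        # On a misprediction, allocate an entry in the next longer-history table.
        if prediction != taken and provider + 1 < len(self.tables):
            self.tables[provider + 1][self._key(pc, provider + 1)] = 2 if taken else 1
        # Record the outcome, keeping only as much history as the longest table needs.
        self.history = (self.history + (taken,))[-max(self.history_lengths):]


def accuracy(predictor, pc, outcomes):
    """Fraction of outcomes predicted correctly, training as we go."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(outcomes)


predictor = ToyTage()
alternating = [True, False] * 50                      # taken, not taken, taken, ...
warmup = accuracy(predictor, 0x40, alternating)       # 0x40 is an arbitrary example address
steady = accuracy(predictor, 0x40, alternating)
print(warmup, steady)
```

A single per-branch counter cannot learn a strictly alternating branch, while the history-tagged tables pick it up after a couple of mispredictions. That is the kind of pattern the extra tables are there to capture.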
This helps with IPC improvements and other things. We went from 6 to 7 on issue rate. Increasing loads and stores to 256 bits increases the speed at which the processor can move data. Since we can do 2 loads and 1 store to the cache per cycle, adding the 3rd address generation unit allows us to sustain the maximum bandwidth. Our instruction schedulers improved their operation-picking ability to balance performance between the 2 threads. Increasing the dispatch instruction width allows us to feed more instructions into the machine per cycle.
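As a back-of-the-envelope sketch of the load/store point (Python, illustrative only): every load or store needs an address generated for it, so the number of address generation units caps how many memory operations can start in a cycle. The function name and the two-AGU comparison case are assumptions made for illustration, not AMD figures. The 2 loads, 1 store, and 256-bit width come from the notes above.

```python
def sustained_bytes_per_cycle(loads, stores, agus, width_bits):
    """Bytes moved to/from the cache per cycle, limited by available AGUs."""
    # Each memory operation needs one generated address per cycle.
    ops = min(loads + stores, agus)
    return ops * width_bits // 8

# Figures quoted in the talk: 2 loads + 1 store per cycle, 256 bits each.
with_three_agus = sustained_bytes_per_cycle(2, 1, agus=3, width_bits=256)

# Hypothetical comparison: the same data paths with only two AGUs.
with_two_agus = sustained_bytes_per_cycle(2, 1, agus=2, width_bits=256)

print(with_three_agus, with_two_agus)
```

Under these assumptions the third AGU is what lets all three 256-bit operations issue together, rather than leaving one of the data paths idle each cycle.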
Zen delivered leadership efficiency in floating-point performance, as evidenced by its success in HPC applications. Zen2 doubles the width and the bandwidth feeding it, so it is a true doubling of floating point, key for workloads like fluid dynamics analysis, weather forecasting, molecular dynamics, and ray tracing. With the added density of 7nm, widening the floating point from 128 bits to 256 bits allows for greater efficiency in vectorized workloads without taking more area or power. With the wider floating-point data path, we also increased the dispatch and retire bandwidth so that non-vectorized code can also be handled efficiently. Also, to feed that data path, we increased the cache interface to 256 bits. This results in real-world application performance. A DGEMM (double-precision matrix math) will truly achieve a doubling on a Zen2 core. Keeping a balanced design delivers true application performance, and the Zen2 floating point will be an industry disruptor.
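The doubling claim can be sketched with simple peak-rate arithmetic in Python. This is a rough illustration, not a measured result: the helper name and the assumption of two fused multiply-add pipes per core are ours, and only the 128-bit and 256-bit widths come from the notes.

```python
def peak_dp_flops_per_cycle(vector_bits, fma_pipes=2):
    """Peak double-precision FLOPs per cycle per core for a given vector width."""
    lanes = vector_bits // 64        # number of 64-bit doubles per vector
    return fma_pipes * lanes * 2     # one FMA counts as a multiply plus an add

# fma_pipes=2 is an assumption for illustration; the widths are from the talk.
narrow = peak_dp_flops_per_cycle(128)
wide = peak_dp_flops_per_cycle(256)

print(narrow, wide, wide / narrow)
```

Whatever the actual pipe count, the peak rate scales linearly with vector width, which is why a compute-bound kernel such as DGEMM can approach a 2x gain once the cache interface is widened to match.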
Caches are critical to keep the data needed close to the execution units, and we took a big step in doubling the size of the Level 3 cache. This increases the rate at which the data is local and effectively reduces the latency to memory. The new cache instructions:
Previously we had instructions that, when you push data out of the CPU caches, invalidate the copy in the CPU cache as well. We have added 2 instructions that allow the CPU to push dirty data out of the cache while still retaining a clean copy in the cache. This can be useful when the cache contains a combination of CPU data and data to be communicated to the GPU. The GPU data can be forced to memory while the CPU can still retain and hit on its data in the cache. We also added a Quality of Service (QoS) mechanism for both the L3 cache and memory. This allows software to both monitor and control how different software threads interact with the hardware. The operating system can identify threads that it prefers to have more L3 cache and bandwidth and allocate that to them, and it can provide physical isolation between threads to avoid a “noisy neighbor” problem.
All 6-core parts: 35 MB L2+L3
All 8-core parts: 36 MB L2+L3
12-core: 70 MB L2+L3
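The combined L2+L3 totals quoted above can be reproduced with a small Python sketch. The per-core L2 size (512 KB) and the L3 sizes (32 MB for the 6-core and 8-core parts, 64 MB for the 12-core part) are not stated in these notes; they are assumptions chosen for this example, and the names are hypothetical.

```python
L2_PER_CORE_MB = 0.5  # assumption: 512 KB of L2 per core (not stated in these notes)

def total_cache_mb(cores, l3_mb):
    """Combined L2 + L3 capacity in MB for a part with the given core count."""
    return cores * L2_PER_CORE_MB + l3_mb

# Core count -> assumed L3 size in MB (assumptions, see the lead-in).
assumed_l3 = {6: 32, 8: 32, 12: 64}

totals = {cores: total_cache_mb(cores, l3) for cores, l3 in assumed_l3.items()}
print(totals)
```

Under these assumptions the sums come out to 35, 36, and 70 MB, matching the figures quoted for the 6-core, 8-core, and 12-core parts.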
From Mahesh: Zen was impressive, but Zen+ and Zen2 delivered consistent and aggressive performance gains. Zen, Zen+, and Zen2 show 1T CBr20 performance gains of 9% (Zen to Zen+) and 21% (Zen+ to Zen2), for an overall effective performance increase of +32% over two years (2018-2019). The 9% increase from Zen to Zen+ was split as 2% IPC and 7% Fmax improvements (design + foundry). The 21% increase from Zen+ to Zen2 is split as 13% IPC and 8% Fmax improvements (design + foundry). This improvement is delivered on the same platform, with no change or penalty on the user end (AM4 compatibility). And we keep marching (Zen3 early look): simulation data show that with Vermeer (Zen3) we plan to add another 18-20% in 1T performance on the same platform in 2020, roughly split as 11-12% IPC + 6-8% Fmax improvement (targets).
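Generational gains multiply rather than add, since single-thread performance is roughly IPC times frequency. A minimal Python sketch that rechecks the quoted figures (the helper name `compound` is ours):

```python
def compound(*gains_pct):
    """Combine successive percentage gains multiplicatively; returns a percentage."""
    total = 1.0
    for gain in gains_pct:
        total *= 1 + gain / 100
    return (total - 1) * 100

overall = compound(9, 21)       # Zen -> Zen+ -> Zen2
zen_plus = compound(2, 7)       # IPC and Fmax components, Zen -> Zen+
zen2 = compound(13, 8)          # IPC and Fmax components, Zen+ -> Zen2

print(round(overall, 1), round(zen_plus, 1), round(zen2, 1))
```

The 9% and 21% steps compound to about 32%, and 2% with 7% gives about 9%, both in line with the quoted numbers. The 13% and 8% components compound to about 22%, slightly above the quoted 21%, which is presumably a rounding effect in the component figures.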
No patches for several of these; low performance hit.
“Zen” delivered industry-leading memory encryption with increased flexibility.
“Zen” software mitigations.
Robust hardware-enhanced Spectre mitigations with “Zen 2”:
Faster barrier implementation (IBPB)
IBRS support so that RETPOLINE is optional
More efficient indirect branch thread separation (STIBP)
For Zen2 we greatly reduced the latency of the IBPB operation, we built a new IBRS mechanism, and we improved the operating performance when STIBP is enabled.
We saw the slowing of Moore’s law and engineered an incredibly flexible and configurable design approach. The Infinity Fabric allowed us to implement a scalable multi-chip solution across both single-socket and dual-socket configurations, as well as leveraging the same building block for high-performance desktop. This gave us scalability in core counts, memory bandwidth, and IO connectivity that would be uneconomical with a traditional monolithic die approach.
In fact, it would cost 70% more. Partitioning the die into chiplets, and building systems on multi-chiplet packages, enables reconfigurable heterogeneous systems capable of economies of scale as well as faster time to market. We took this approach to yet higher levels of integration in our Zen2 design.
Over 5 years ago, we pulled together the lead engineers from across the company and launched the effort to create a scalable on-chip interconnect, with a development cycle equal to that of the new “Zen” architecture. It is a hidden gem: all of AMD’s new high-performance products released since 2017 use Infinity Fabric to deliver much improved computing performance, power efficiency, and security. We are introducing the second-generation Infinity Fabric in Zen2. The new NOC enabled a state-of-the-art chiplet design, breaking out the CPU from the IO and memory. The second-generation fabric more than doubled the interconnect data rate, creating a super-fast highway between the chiplets at very low energy. This was a key factor in earliest access to leading-edge technology for server-scale processors, enabled by our unprecedentedly small CPU die (CPU Compute Die) designs connected by 2nd-generation Infinity Fabric. Small die yield dramatically better than big die, making them perfectly suited to a new technology that delivers industry-leading power efficiency and density. There are up to 8 cores on each compute die. Using the right technology for the right job: 14nm. Central I/O and memory is more efficient; it both reduces latency to memory and adds uniformity of access. Finally, we get unprecedented configurability.
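The point that small die yield better than big die can be illustrated with the textbook Poisson yield model, in which the chance that a die has no defects falls off exponentially with its area. The Python sketch below is illustrative only: the defect density and both die areas are hypothetical numbers picked for the example, not AMD or foundry data, and real yield models are more elaborate.

```python
import math

def poisson_yield(area_mm2, defects_per_mm2):
    """Fraction of defect-free dies under a simple Poisson defect model."""
    return math.exp(-area_mm2 * defects_per_mm2)

DEFECT_DENSITY = 0.002      # hypothetical defects per mm^2 on an immature node

small_die = poisson_yield(75, DEFECT_DENSITY)     # hypothetical small compute chiplet
large_die = poisson_yield(600, DEFECT_DENSITY)    # hypothetical monolithic die

print(round(small_die, 3), round(large_die, 3))
```

With these made-up inputs the small die yields about 86% while the large die yields about 30%. The exact values depend entirely on the assumptions, but the exponential dependence on area is what makes small chiplets attractive on a brand-new process.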
We don’t create technology for technology’s sake. It is all about performance that matters. That is the end result of all of the enhancements around Zen2: a new core with a 15% jump in IPC; bringing the memory closer with a doubled Level 3 cache; a tightly integrated system on chip; and OS and application partnerships, including a deep engagement with Microsoft and Linux to optimize schedulers and core affinity to fully leverage Zen2 core density.
Zen2 shows what we do best at AMD: innovate to deliver value. We saw the opportunity of 7nm to give much more performance and more cores in the same power envelope. We drove more instructions-per-clock improvements across the design. We seized the opportunity, and now Zen2 will deliver for both traditional and emerging workloads. We have improved upon our modularity and flexible configurability with a second-generation Infinity Fabric. The industry has an insatiable demand for more performance that can be delivered securely. Zen2 will indeed deliver leadership performance. Zen2 is a great example of the kind of innovation AMD is known for. Rather than accept the slowing of Moore’s law, we took a holistic design approach to leverage both process and design: a revolutionary chiplet approach to drive high performance. Zen3 has completed its design phase, and we have the next generation beyond that already on the drawing board. Each new Zen generation will continue to leverage BOTH process and innovative design to maximize performance. This is our passion. Our customers have our commitment to deliver what we promise. Our team worked incredibly hard to bring AMD back to high performance. We will not let up; the momentum is now on our side.
Comparing AMD "Vega10" and "Vega20" with the same GCN architecture, 64 CUs and at 250 watts, running the SGEMM benchmark. Vega10 achieves 10 TFLOPS of FP32 and Vega20 achieves 13 TFLOPS of FP32.
Testing by AMD Performance Labs using an AMD Ryzen™ 7 3800X with 16MB L3 cache and 32MB L3 cache at both 2667 MT/s and 3600 MT/s memory speeds. Results may vary.
Testing by AMD Performance Labs as of 06/03/2019 utilizing 3rd Gen AMD Ryzen™ Processors: 3900X, 3800X, 3700X, 3600X, 3600 and Ryzen™ 7 2700X in Cinebench R20 1T. Results may vary. RZ3-25
Testing by AMD Performance Labs as of 06/03/2019 utilizing an AMD Ryzen™ 7 1800X and 2700X in Cinebench R20 1T. Results may vary. RZ3-45
AMD has not been able to reproduce the issue nor is AMD aware of a third party being able to do so.
Updated Feb 28, 2017: Generational IPC uplift for the “Zen” architecture vs. “Piledriver” architecture is +52% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. Generational IPC uplift for the “Zen” architecture vs. “Excavator” architecture is +64% as measured with Cinebench R15 1T, and also +64% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2, at a fixed 3.4GHz. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”)/8GB DDR3-2133 (“Excavator”)/8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint_base2006 estimate) and Windows® 10 x64 RS1 (Cinebench R15). SPECint_base2006 estimates: “Zen” vs. “Piledriver” (31.5 vs. 20.7 | +52%), “Zen” vs. “Excavator” (31.5 vs. 19.2 | +64%). Cinebench R15 1T scores: “Zen” vs. “Piledriver” (139 vs. 79 both at 3.4G | +76%), “Zen” vs. “Excavator” (160 vs. 97.5 both at 4.0G | +64%). GD-108
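The uplift percentages in the GD-108 note can be rechecked directly from the scores it lists. A minimal Python sketch (the helper name `uplift_pct` is ours):

```python
def uplift_pct(new_score, old_score):
    """Percentage uplift of new_score over old_score at a fixed clock."""
    return (new_score / old_score - 1) * 100

spec_vs_piledriver = uplift_pct(31.5, 20.7)   # SPECint_base2006 estimate
spec_vs_excavator = uplift_pct(31.5, 19.2)    # SPECint_base2006 estimate
cb_vs_piledriver = uplift_pct(139, 79)        # Cinebench R15 1T at 3.4G
cb_vs_excavator = uplift_pct(160, 97.5)       # Cinebench R15 1T at 4.0G

print(round(spec_vs_piledriver), round(spec_vs_excavator),
      round(cb_vs_piledriver), round(cb_vs_excavator))
```

Rounded to whole percentages, the listed scores give +52%, +64%, +76%, and +64%, consistent with the note.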