“Bulldozer” and “Bobcat”: AMD’s Latest x86 Core Innovations (Hot Chips 22)
Two x86 Cores Tuned for Target Markets
  • “Bulldozer”: Performance & Scalability. Mainstream Client and Server Markets
  • “Bobcat”: Flexible, Low Power & Small. Low Power Markets, Small Die Area, Cloud Clients Optimized
The Bulldozer Architecture
  • An innovative design that delivers true core functionality by pairing two integer execution cores with components that can be shared as needed
  • Instruction set extensions to increase capability of the design
  • Extensive new power efficiency innovations
  • Manufactured on the latest 32nm SOI technology
  [Block diagram labels: Fetch, Decode, two Integer Schedulers, FP Scheduler, eight Pipelines, two 128-bit FMACs, two L1 DCaches, Shared L2 Cache]
Approaches for Supporting Multiple Threads
  • SMT: Force two threads into one core. Threads compete for resources. Relies on under-utilization.
  • CMP: Dedicated cores for each thread. Traditional brute force approach. Each core is over-provisioned.
  • However, there is another way . . .
Bulldozer: Two Strong Threads
  [Diagram comparing a Hyperthreaded, single-core chip with “Bulldozer”]
  • Hyperthreaded, single-core chip (CORE 1): Fetch, Decode, one Integer Scheduler, FP Scheduler, four Pipelines, two 128-bit FMACs, L1 DCache, L2 Cache
  • “Bulldozer”: Fetch, Decode, two Integer Schedulers, FP Scheduler, eight Pipelines, two 128-bit FMACs, two L1 DCaches, Shared L2 Cache
Sharing Resources
  • The Bulldozer architecture has shared and dedicated components
  • The shared components: help reduce power consumption; help reduce die space (cost)
  • The dedicated components: help increase performance and scalability
  • Bulldozer dynamically switches between shared and dedicated components to maximize performance per watt
  [Block diagram legend: Dedicated Components; Shared at the module level; Shared at the chip level. Labels: Fetch, Decode, FP Scheduler, two Int Schedulers (Core 1, Core 2), two L1 DCaches, two 128-bit FMACs, eight Pipelines, Shared L2 Cache, Shared L3 Cache and NB]
Building a Bulldozer-Based Chip
  • Each chip is composed of multiple Bulldozer modules
  • Module divisions are transparent to shared hardware, operating system or application
  • The modular architecture speeds chip development and increases product flexibility
  [Block diagram labels: Fetch, Decode, Int Schedulers, FP Scheduler, Shared L3 Cache and NB, Integrated Memory Controller, Integrated Northbridge Controller]
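The sharing scheme on the "Sharing Resources" and "Building a Bulldozer-Based Chip" slides can be sketched as a small inventory model. This is an illustrative reading of the block diagram labels, not AMD's own description: the component names, the `SHARING` table, and the `chip_inventory` helper are all hypothetical, and the two-cores-per-module figure is taken from the diagram (Core 1, Core 2).

```python
# Sharing levels from the slide legend.
DEDICATED = "dedicated"               # one instance per core
MODULE = "shared at the module level"  # one instance per two-core module
CHIP = "shared at the chip level"      # one instance per chip

# Assumed mapping, read off the slide's block diagram labels.
SHARING = {
    "integer scheduler": DEDICATED,
    "L1 data cache": DEDICATED,
    "fetch": MODULE,
    "decode": MODULE,
    "FP scheduler": MODULE,
    "FMAC pair (2 x 128-bit)": MODULE,
    "L2 cache": MODULE,
    "L3 cache and northbridge": CHIP,
}


def chip_inventory(cores: int, cores_per_module: int = 2) -> dict[str, int]:
    """Count how many instances of each component a chip would carry."""
    if cores <= 0 or cores % cores_per_module != 0:
        raise ValueError("cores must be a positive multiple of cores_per_module")
    modules = cores // cores_per_module
    per_level = {DEDICATED: cores, MODULE: modules, CHIP: 1}
    return {name: per_level[level] for name, level in SHARING.items()}


if __name__ == "__main__":
    # The 8-core design mentioned in the speaker notes: four modules.
    for name, count in chip_inventory(8).items():
        print(f"{count} x {name}")
```

Under these assumptions an 8-core chip carries eight integer schedulers but only four fetch, decode and FP units, which is where the slide's power and die-area savings would come from.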
Bulldozer Summary
  • Bulldozer is the next generation of AMD high-performance processor core technology
  • This new core is a completely new design from the ground up
  • Bulldozer will be utilized in client and server designs in 2011
  • AMD delivers 33% more cores and an estimated 50% increase in throughput in the same power envelope as Magny-Cours*
  *Based on internal AMD modeling using benchmark simulations
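The summary's two headline figures can be combined into an implied per-core number. The sketch below is a back-of-the-envelope calculation only: the slide does not give core counts, so the 12-core baseline and 16-core successor are assumptions chosen because they match the stated 33% increase, and `per_core_gain` is a hypothetical helper.

```python
from fractions import Fraction


def per_core_gain(core_ratio: Fraction, throughput_ratio: Fraction) -> Fraction:
    """Throughput per core, relative to the baseline chip."""
    return throughput_ratio / core_ratio


BASELINE_CORES = 12  # assumption: Magny-Cours part used as the baseline
NEW_CORES = 16       # assumption: 33% more cores than the baseline

core_ratio = Fraction(NEW_CORES, BASELINE_CORES)  # 4/3, i.e. about 33% more cores
throughput_ratio = Fraction(3, 2)                 # the slide's estimated 50% increase
gain = per_core_gain(core_ratio, throughput_ratio)

print(f"{float(core_ratio - 1):.0%} more cores")
print(f"{float(gain - 1):.1%} more throughput per core")
```

With these assumed counts, the estimate implies roughly one eighth more throughput per core at the same power, on top of the extra cores. The result inherits the caveat in the slide's footnote: it rests on AMD's internal modeling, not measured hardware.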
Two x86 Cores Tuned for Target Markets
  • “Bulldozer”: Performance & Scalability. Mainstream Client and Server Markets
  • “Bobcat”: Flexible, Low Power & Small. Low Power Markets, Small Die Area, Cloud Clients Optimized
Bobcat Design Goals
  • A small, efficient, low power x86 core
  • Excellent performance
  • Synthesizable with small number of custom arrays
  • Easily portable across process technologies
“Bobcat” x86 Core: Small, Efficient and Strong
  • “Bobcat” Core
  • Sub one-watt capable core

AMD Hot Chips Bulldozer & Bobcat Presentation


Editor's Notes

  • #2 Before we start: There is a lot of technical detail available below what we are about to show you; this presentation is intended to give you a high-level overview of both designs and AMD’s expectations for each. The engineering detail will be presented by the two chief architects for the designs at the upcoming HotChips conference on the Stanford campus next week. Please feel free to ask detailed questions along the way if you would like to hear more about a specific feature or operation. At a higher level, this shows innovation at AMD remains alive and well. Please think of these core architectures within the context of the new, revitalized AMD built around our focus as a design company since the spin-off of GlobalFoundries, our new VISION platforms and marketing program, and our Fusion APU strategy. “Bobcat” and “Bulldozer” are the latest chapters in that story and form a solid foundation for AMD products for years to come.
  • #3 The two cores, although both x86 compatible, are completely different for a reason. The workloads, end equipment markets and usage scenarios require different approaches, and that’s what AMD recognized at the outset of this effort. Think of “Bulldozer”, just as the name implies, as the heavy lifter. It will appear in server, as well as mainstream and high performance client products. “Bobcat” is small and highly efficient. It utilizes those characteristics to address the highly portable netbook / notebook markets. So, two different designs, with different goals in mind.
  • #4 So starting with Bulldozer, here’s a block diagram that shows its distinguishing features. We are taking two of the most frequently used parts of a processor, the integer cores, and adding a hefty, shared floating point capability to deliver two robust threads much more efficiently than Hyper-threading, where a single integer core is used. We have also added a number of instruction set extensions to increase the design’s capabilities and done extensive work on power management to improve performance per watt even further. The 32nm process technology delivers additional savings in terms of area and power consumption; this is our first process technology to utilize high-K metal gate.
  • #5 The previous slide hinted at a key differentiator of Bulldozer that bears more explanation. A big conversation in the industry these last few years is how to continue to increase processing performance as we reach plateaus in clock speed. Essentially there have been two approaches used: SMT, which stands for Simultaneous Multi-Threading, and CMP, which stands for Chip Multi-Processing. CMP is probably the easiest to understand, because it can be described as “if one core is good, two must be better”, and it is. So CMP architectures take a complete core and replicate it. SMT is a little more complex to picture, but because of the way instructions are decoded and executed, it’s possible to have two concurrent tasks running on a single core. Bulldozer takes a third approach.
  • #7 On the first Bulldozer slide we mentioned “true core functionality”, so what exactly does that mean? There are two complete integer units in the Bulldozer design for the most common type of compute tasks, so it functions like a dual-core design, allowing maximum performance rather than pushing two threads through a single core. However, we don’t replicate everything on the core like a CMP either. Floating point operations on Bulldozer use a shared scheduler and two 128-bit Multiply and Accumulate Units. Extensive research went into analyzing workloads ahead of this design, so we feel the division between shared and discrete components is the right one. And by the way, the idea of sharing hardware is hardly new, right? Shared cache, the Northbridge, etc. have been shared across multi-core designs for years already.
  • #8 You can see that larger view of shared hardware components here as we raise our view up to the chip level. On an 8-core Bulldozer design you can see how Bulldozer “modules” are grouped together to share L3 cache and Northbridge, and combined with a memory controller and Northbridge controller to form the major components of the chip. And again, the OS and applications see true cores; the shared floating point components and L2 cache are transparent to the code.
  • #10 So that covers Bulldozer. Now let’s cover AMD’s new core designed specifically for the low-power x86 market. “Bobcat” is small and highly efficient. It utilizes those characteristics to address the highly portable netbook / notebook markets.
  • #11 Bobcat is a little more straightforward to understand than Bulldozer, but it too has some highly differentiated features. And these were stated from the very beginning because of AMD’s understanding of the final products’ requirements.
  • #12 So those were the goals. Where did we end up? Bobcat can operate below one watt (with a resulting reduction in performance). That’s not a statement about any resulting products, but it does give you some sense of the core’s power envelope. The next bullets here are critical: out-of-order execution means higher performance than an in-order execution core like Atom, pure and simple. Synthesizable means it uses few custom logic arrays, which are more dependent on the specifics of the underlying manufacturing technology for optimal performance, and that it can be more easily integrated into SoC designs for faster turnaround of new variations. No limitations on the instruction set either, including support for virtualization. AMD estimates 90% of today’s mainstream CPU performance in less than half the silicon area and a fraction of the power. Bobcat will appear early next year in Ontario, which is ahead of schedule.
  • #13 Technical details if needed.
  • #14 The need for an optimal energy-efficient balance of CPU and GPU represents the beginning of a new era of computing in 2011, the era of the accelerated processing unit or APU, which combines both on a single piece of silicon. The Fusion of CPU and GPU compute power is what the next chapter in visual computing requires: a powerful visual computing experience at home or on the go without compromise. Our AMD Fusion™ design is driven by mobility and is based on a low-power visual compute architecture that will enhance active and resting battery life while increasing both CPU and GPU performance. This is the culmination of the vision of ‘One AMD’, and only AMD can deliver the GPU and CPU combination that will be the future of computing.