© 2013 IBM Corporation
Power Roadmap
POWER8
POWER7 Systems Announcements
2010
• Power 770 / 780 B Models
• Power 795 (9119-FHB)
• Power 750 (8233-E8B)
• Power 710 / 730 B Models
• Power 720 / 740 B Models
• Power Blades
2011
• Power 775 (9119-F2C)
• Power 710 / 730 C Models
• Power 720 / 740 C Models
• Power 770 / 780 C Models
2012
• p260+ (7895-22X)
• p460 (7895-42X)
• 7R1 / 7R2
• p24L
• p260 (7895-22X)
• Power 770 / 780 D Models
• Pure Systems
2013
• Power 760
• Power 750 D Models
• Power 710 / 730 D Models
• Power 720 / 740 D Models
• 7R1 / 7R2
• 7R4
• p260+ (7895-23A)
• p460 (7895-43X)
• p270+ (7895-24X)
POWER7 Portfolio (including POWER7+)
• Power 710+ / 730+
• Power 720+ / 740+
• Power 750+
• Power 760+
• Power 770+
• Power 780+
• Power 795
• PowerLinux 7R1 / 7R2 / 7R4
• PureSystems: PureFlex, PureApps, PureData
• p260+, p270+, p460+, p24L
• Virtualization & Mgmt.
Power Processor Technology Roadmap
POWER5/5+ (2004, 130/90 nm)
• Dual Core
• Enhanced Scaling
• SMT
• Distributed Switch +
• Core Parallelism +
• FP Performance +
• Memory Bandwidth +
• Virtualization
POWER6/6+ (2007, 65/65 nm)
• Dual Core
• High Frequencies
• Virtualization +
• Memory Subsystem +
• Altivec
• Instruction Retry
• Dynamic Energy Mgmt
• SMT +
• Protection Keys
POWER7/7+ (2010, 45/32 nm)
• Eight Cores
• On-Chip eDRAM
• Power-Optimized Cores
• Memory Subsystem ++
• SMT++
• Reliability +
• VSM & VSX
• Protection Keys+
POWER8 (2014-2015)
• More Cores
• SMT+++
• Reliability ++
• CAPI Support
• Transactional Memory
• Operating System booted
Future
Processor Directions
Process technology
• POWER4/4+: 180 / 130 nm
• POWER5/5+: 130 / 90 nm
• POWER6/6+: 65 nm
• POWER7/7+: 45 / 32 nm
• POWER8: 22 nm
Earlier generations
• Dual Cores
• Dual Threads
• External L3
POWER7/7+
• 8 Cores
• 3rd Gen SMT
• L3+ On Chip
POWER8
• More Cores
• 4th Gen SMT
• Encryption Logic
• CAPI
• PCIe Acceleration
• Transactional memory
• Enhanced Caches
Processor Roadmap

                POWER5       POWER6       POWER7             POWER7+
                2004         2007         2010               2012
Technology      130nm SOI    65nm SOI     45nm SOI eDRAM     32nm SOI eDRAM
Compute
  Cores         2            2            8                  8
  Threads       SMT2         SMT2         SMT4               SMT4
Caching
  On-chip       1.9 MB       8 MB         2 + 32 MB          2 + 80 MB
  Off-chip      36 MB        32 MB        None               None
Bandwidth
  Sust. Mem.    15 GB/s      30 GB/s      100 GB/s           100 GB/s
  Peak I/O      6 GB/s       20 GB/s      40 GB/s            40 GB/s

Next on the roadmap: POWER8, 2014.
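As a rough illustration of how the roadmap figures relate to each other, the sketch below transcribes the core, thread, and bandwidth values from this slide and derives hardware threads per chip and sustained memory bandwidth per thread. The dictionary layout and the function names are hypothetical, chosen only for this example; the derived ratios are simple arithmetic, not figures from the slides.

```python
# Values transcribed from the Processor Roadmap slide.
# The structure and helper names are illustrative assumptions.
ROADMAP = {
    "POWER5":  {"year": 2004, "cores": 2, "smt": 2, "mem_gb_s": 15,  "io_gb_s": 6},
    "POWER6":  {"year": 2007, "cores": 2, "smt": 2, "mem_gb_s": 30,  "io_gb_s": 20},
    "POWER7":  {"year": 2010, "cores": 8, "smt": 4, "mem_gb_s": 100, "io_gb_s": 40},
    "POWER7+": {"year": 2012, "cores": 8, "smt": 4, "mem_gb_s": 100, "io_gb_s": 40},
}


def threads_per_chip(name):
    """Hardware threads per chip: cores times SMT ways."""
    chip = ROADMAP[name]
    return chip["cores"] * chip["smt"]


def mem_bandwidth_per_thread(name):
    """Sustained memory bandwidth (GB/s) divided across hardware threads."""
    return ROADMAP[name]["mem_gb_s"] / threads_per_chip(name)


for name in ROADMAP:
    print(f"{name}: {threads_per_chip(name)} threads, "
          f"{mem_bandwidth_per_thread(name):.2f} GB/s per thread")
```

The per-thread figure is only a back-of-the-envelope ratio; it says nothing about how bandwidth is actually shared between threads.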
POWER8 Vision
Leadership Performance
• Increase core throughput at single thread, SMT2, SMT4, and SMT8 level
• Large step in per socket performance
• Enable more robust multi-socket scaling
System Innovation
• Higher capacity cache hierarchy and highly threaded processor
• Enhanced memory bandwidth, capacity, and expansion
• Dynamic code optimization
• Hardware-accelerated virtual memory management
Open System Innovation
• Coherent Accelerator Processor Interface (CAPI)
• Agnostic Memory interface
• Open system software
POWER8 Architecture
POWER8 Core
[Figure: core floorplan with units labeled IFU, ISU, FXU, LSU, VSU, DFU]
Larger Caching Structures vs. POWER7
• 2x L1 data cache (64 KB)
• 2x outstanding data cache misses
• 4x translation cache
Wider Load/Store
• 32B → 64B L2 to L1 data bus
• 2x data cache to execution dataflow
Enhanced Prefetch
• Instruction speculation awareness
• Data prefetch depth awareness
• Adaptive bandwidth awareness
• Topology awareness
Execution Improvement vs. POWER7
• SMT4 → SMT8
• 8 dispatch
• 10 issue
• 16 execution pipes: 2 FXU, 2 LSU, 2 LU, 4 FPU, 2 VMX, 1 Crypto, 1 DFU, 1 CR, 1 BR
• Larger issue queues (4 x 16-entry)
• Larger global completion, Load/Store reorder
• Improved branch prediction
• Improved unaligned storage access
Core Performance vs. POWER7
• ~1.6x Single Thread
• ~2x Max SMT
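The execution pipe mix listed on this slide can be checked against the quoted total of 16. The short sketch below does only that; the dictionary name and layout are assumptions made for this example, and the unit counts are copied from the slide.

```python
# Execution pipes per core as listed on the POWER8 Core slide.
# The name EXECUTION_PIPES is a hypothetical label for this sketch.
EXECUTION_PIPES = {
    "FXU": 2,     # fixed-point units
    "LSU": 2,     # load/store units
    "LU": 2,      # load units
    "FPU": 4,     # floating-point units
    "VMX": 2,     # vector units
    "Crypto": 1,
    "DFU": 1,     # decimal floating-point unit
    "CR": 1,      # condition register unit
    "BR": 1,      # branch unit
}

total_pipes = sum(EXECUTION_PIPES.values())
print(f"Total execution pipes: {total_pipes}")
```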
POWER8 Chip Packaging
Caches
• 512 KB SRAM L2 / core
• 96 MB eDRAM shared L3
• Up to 128 MB eDRAM L4 (off-chip)
Memory
• Up to 230 GB/s sustained bandwidth
Bus Interfaces
• Durable open memory attach interface
• Integrated PCIe Gen3
• SMP Interconnect
• CAPI (Coherent Accelerator Processor Interface)
Cores
• 12 cores (SMT8)
• 8 dispatch, 10 issue, 16 exec pipes
• 2X internal data flows/queues
• Enhanced prefetching
• 64K data cache, 32K instruction cache
Accelerators
• Crypto & memory expansion
• Transactional Memory
• VMM assist
• Data Move / VM Mobility
Energy Management
• On-chip Power Management Micro-controller
• Integrated Per-core VRM
• Critical Path Monitors
Technology
• 22 nm SOI, eDRAM, 15 metal levels (ML), 650 mm²
[Figure: die photo labeled L3 Cache & Chip Interconnect, 8M L3 Region, Mem. Ctrl. (x2), SMP Links (x2), Accelerators, PCIe]
POWER8 on Chip Caches
• L2: 512 KB 8 way per core
• L3: 96 MB (12 x 8 MB 8 way Bank)
• “NUCA” Cache policy (Non-Uniform Cache Architecture)
– Scalable bandwidth and latency
– Migrate “Hot” lines to local L2, then local L3 (replicate L2 contained footprint)
• Chip Interconnect: 150 GB/sec x 12 segments per direction = 3.6 TB/sec
[Figure: chip floorplan with 12 cores, each with an L2 and an L3 bank, around the chip interconnect, plus Memory, SMP, Acc, and PCIe blocks]
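The capacity and bandwidth totals on this slide follow from simple multiplication. The sketch below reproduces them; it assumes that the 3.6 TB/sec figure counts two directions of 12 segments each, which is one reading of "per direction" and is not stated explicitly on the slide. All constant names are hypothetical.

```python
# Figures from the on-chip cache slide; names are illustrative only.
CORES = 12
L2_KB_PER_CORE = 512
L3_BANKS = 12
L3_BANK_MB = 8
SEGMENT_GB_PER_S = 150
SEGMENTS_PER_DIRECTION = 12
DIRECTIONS = 2  # assumption: the 3.6 TB/sec total sums both directions

total_l2_mb = CORES * L2_KB_PER_CORE / 1024
total_l3_mb = L3_BANKS * L3_BANK_MB
per_direction_gb_per_s = SEGMENT_GB_PER_S * SEGMENTS_PER_DIRECTION
aggregate_tb_per_s = per_direction_gb_per_s * DIRECTIONS / 1000

print(f"Total L2: {total_l2_mb} MB")
print(f"Total L3: {total_l3_mb} MB")
print(f"Interconnect: {per_direction_gb_per_s} GB/s per direction, "
      f"{aggregate_tb_per_s} TB/s aggregate")
```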
POWER8 Memory Buffer Chip
…with 16MB of Cache…
[Figure: POWER8 Link into the Memory Buffer (Scheduler & Management, 16MB Memory Cache), which drives the DDR Interfaces to the DRAM Chips]
Intelligence Moved into Memory
• Scheduling logic, caching structures
• Energy Mgmt, RAS decision point
– Formerly on Processor
– Moved to Memory Buffer
Processor Interface
• 9.6 GB/s high speed interface
• More robust RAS
• “On-the-fly” lane isolation/repair
• Extensible for innovation build-out
Performance Value
• End-to-end fastpath and data retry (latency)
• Cache latency/bandwidth, partial updates
• Cache write scheduling, prefetch, energy
• 22nm SOI for optimal performance / energy
• 15 metal levels (latency, bandwidth)
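The 16 MB cache per memory buffer can be related to the "up to 128 MB eDRAM L4 (off-chip)" figure quoted earlier in the deck. The sketch below assumes that the L4 is simply the sum of the buffer caches attached to one processor; that reading, and the names used, are assumptions for illustration.

```python
# Figures from the deck; the relationship between them is an assumption:
# total L4 = number of attached memory buffers x cache per buffer.
L4_MAX_MB = 128        # "Up to 128 MB eDRAM L4 (off-chip)"
BUFFER_CACHE_MB = 16   # "16MB Memory Cache" per memory buffer chip

buffers_at_max = L4_MAX_MB // BUFFER_CACHE_MB


def l4_capacity_mb(buffer_count):
    """L4 capacity in MB for a given number of memory buffer chips."""
    return buffer_count * BUFFER_CACHE_MB


print(f"Buffers implied by the maximum L4 size: {buffers_at_max}")
print(f"L4 with 4 buffers: {l4_capacity_mb(4)} MB")
```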
Transactional Memory
Definition
• A technique that allows a group of instructions, including updates to the memory image, to execute speculatively and atomically. This group of instructions is called a transaction.
POWER8 Support
• New instructions mark the beginning and end of a transaction
– Hardware ensures the region is performed atomically using speculation
• Speculation recovery is performed in hardware, for both registers and memory
• “Flattened” nesting
– Hardware tracks nesting of transactions
– Treats them all as a single large transaction
• Application-level instruction interface
– Transaction Begin/End instructions
– Explicit abort
• Diagnostic register: Transaction Exception and Summary Register
– Indicates cause of transaction failure
Value
• Reducing programming development effort
• Reducing customer cost (higher SLA, fewer images, and higher scalability)
• Improving performance of legacy software with large sequential components
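To make the flattened-nesting behavior concrete, here is a small software toy model. It is not how the hardware works and does not use any POWER8 instruction: the class `ToyTransactionalMemory`, its methods, and the `failure_cause` attribute (a stand-in for the diagnostic register) are all hypothetical names for this sketch. It only mimics the visible semantics: nested begins are counted, nothing commits until the outermost end, and an abort discards every speculative write.

```python
class TransactionAborted(Exception):
    """Raised by the toy model when a transaction is explicitly aborted."""


class ToyTransactionalMemory:
    """Software toy mimicking flattened nesting; not real hardware TM."""

    def __init__(self):
        self.memory = {}           # committed state
        self._pending = {}         # speculative writes of the flat transaction
        self._depth = 0            # current nesting depth
        self.failure_cause = None  # stand-in for the diagnostic register

    def begin(self):
        if self._depth == 0:
            self._pending = {}
            self.failure_cause = None
        self._depth += 1           # nested begins only bump the counter

    def write(self, key, value):
        if self._depth == 0:
            self.memory[key] = value      # ordinary store
        else:
            self._pending[key] = value    # speculative store

    def read(self, key):
        if self._depth and key in self._pending:
            return self._pending[key]
        return self.memory.get(key)

    def end(self):
        if self._depth == 0:
            raise RuntimeError("end() without matching begin()")
        self._depth -= 1
        if self._depth == 0:              # only the outermost end commits
            self.memory.update(self._pending)
            self._pending = {}

    def abort(self, cause):
        self._pending = {}                # discard all speculative writes
        self._depth = 0                   # the whole flat transaction fails
        self.failure_cause = cause
        raise TransactionAborted(cause)


tm = ToyTransactionalMemory()
tm.write("balance", 100)          # plain store outside any transaction

tm.begin()                        # outer transaction
tm.write("balance", 50)
tm.begin()                        # nested begin is flattened into the outer one
tm.write("fee", 1)
tm.end()                          # inner end commits nothing
committed_mid = dict(tm.memory)
tm.end()                          # outer end commits both writes together
committed_end = dict(tm.memory)

tm.begin()
tm.write("balance", -1)
try:
    tm.abort("insufficient funds")
except TransactionAborted:
    pass
after_abort = dict(tm.memory)
```

In this toy an abort surfaces as a Python exception; on real hardware, recovery of registers and memory is handled by the processor, as the slide states.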
POWER8 Integrated PCIe Gen 3
[Figure: POWER7 reaches PCI devices through the GX Bus and an I/O Bridge (PCIe G2); POWER8 attaches PCI devices directly over PCIe G3]
Native PCIe Gen 3 Support
• Direct processor integration
• Replaces proprietary GX/Bridge
• Low latency
• Gen3 x16 bandwidth (about 16 GB/s per direction)
Transport Layer for CAPI Protocol
• Coherently attached devices connect to the processor via PCIe
• Protocol encapsulated in PCIe
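The Gen3 x16 bandwidth can be estimated from the standard PCIe signaling rates: 5 GT/s per lane with 8b/10b encoding for Gen2, and 8 GT/s per lane with 128b/130b encoding for Gen3. The sketch below computes the raw per-direction link bandwidth from those rates; the function name and parameters are hypothetical, and protocol overhead above the line encoding is ignored.

```python
def pcie_bandwidth_gb_per_s(gt_per_s, lanes, payload_bits, encoded_bits):
    """Per-direction link bandwidth in GB/s after line encoding.

    gt_per_s      transfer rate per lane in GT/s
    lanes         link width (for example 16 for x16)
    payload_bits  data bits per encoded symbol group (8 or 128)
    encoded_bits  bits on the wire per symbol group (10 or 130)
    """
    return gt_per_s * lanes * (payload_bits / encoded_bits) / 8


gen2_x16 = pcie_bandwidth_gb_per_s(5.0, 16, 8, 10)
gen3_x16 = pcie_bandwidth_gb_per_s(8.0, 16, 128, 130)

print(f"PCIe Gen2 x16: {gen2_x16:.2f} GB/s per direction")
print(f"PCIe Gen3 x16: {gen3_x16:.2f} GB/s per direction")
```

Under these rates a Gen3 x16 link carries roughly twice the per-direction bandwidth of a Gen2 x16 link.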
POWER8 CAPI (Coherent Accelerator Processor Interface)
[Figure: POWER8 (Coherence Bus, CAPP) linked over PCIe Gen 3 to an FPGA or ASIC (PSL, Custom Hardware Application)]
Customizable Hardware Application Accelerator
• Specific system SW, middleware, or user application
• Written to durable interface provided by PSL
PCIe Gen 3
• Transport for encapsulated messages
Processor Service Layer (PSL)
• Present robust, durable interfaces to applications
• Offload complexity / content from CAPP
Virtual Addressing
• Accelerator can work with the same memory addresses that the processors use
• Pointers de-referenced same as the host application
• Removes OS & device driver overhead
Hardware Managed Cache Coherence
• Enables the accelerator to participate in “Locks” as a normal thread
• Lowers latency over I/O communication model
Socket Performance
Beta Program
Client Experience
• Hands-on testing with POWER8 hardware
• Advocate/ESP support team
• Extended team will monitor client testing progress against test matrix & collect feedback/experience
ESP Execution
• Weekly interlock meeting for extended ESP team
• Program to include support for:
– AIX
– IBM i
– Linux / PowerLinux
– Simplify PowerVM
Client Requirements
• Perform meaningful testing
• Weekly calls
• Some minimal education
Contact: Marianne Golden, Austin TX, marigold@us.ibm.com, 512-296-4264