Eserver pSeries 
"Any sufficiently advanced technology will 
have the appearance of magic." 
…Arthur C. Clarke 
© 2003 IBM Corporation 
Section 2: The Technology
^Eserver pSeries 
Section Objectives 
 On completion of this unit you should be able to: 
– Describe the relationship between technology and 
solutions. 
– List key IBM technologies that are part of the POWER5 
products. 
– Be able to describe the functional benefits that these 
technologies provide. 
– Be able to discuss the appropriate use of these 
technologies. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
IBM and Technology 
Solutions 
Products 
Technology 
Science 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Technology and innovation 
 Having technology available is a necessary first 
step. 
 Finding creative new ways to use the technology 
for the benefit of our clients is what innovation is 
about. 
 Solution design is an opportunity for innovative 
application of technology. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
When technology won’t ‘fix’ the problem 
 When the technology is not related to the problem. 
 When the client has unreasonable expectations. 
© 2003 Concepts of Solution Design IBM Corporation
Eserver pSeries 
© 2003 IBM Corporation 
POWER5 Technology
^Eserver pSeries 
POWER4 and POWER5 Cores 
POWER4 Core POWER5 Core 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
POWER5 
 Designed for entry and high-end 
servers 
 Enhanced memory subsystem 
 Improved performance 
 Simultaneous Multi-Threading 
 Hardware support for Shared 
Processor Partitions (Micro- 
Partitioning) 
 Dynamic power management 
 Compatibility with existing 
POWER4 systems 
 Enhanced reliability, 
availability, serviceability 
[Diagram: POWER5 chip with two SMT cores, 1.9 MB L2 cache, L3 directory, memory controller, enhanced distributed switch, chip-chip / MCM-MCM / SMP link, and GX+ bus]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Enhanced memory subsystem 
 Improved L1 cache design 
– 2-way set associative i-cache 
– 4-way set associative d-cache 
– New replacement algorithm (LRU vs. FIFO) 
 Larger L2 cache 
– 1.9 MB, 10-way set associative 
 Improved L3 cache design 
– 36 MB, 12-way set associative 
– L3 on the processor side of the fabric 
– Satisfies L2 cache misses more frequently 
– Avoids traffic on the interchip fabric 
 On-chip L3 directory and memory controller 
– L3 directory on the chip reduces off-chip delays 
after an L2 miss 
– Reduced memory latencies 
 Improved pre-fetch algorithms 
[Diagram: POWER5 chip with two SMT cores, 1.9 MB L2 cache, L3 directory, memory controller, enhanced distributed switch, and chip-chip / MCM-MCM / SMP link]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Enhanced memory subsystem 
POWER4 system structure vs. POWER5 system structure
[Diagram: in the POWER4 structure, the L3 cache sits between each chip's fabric controller and the memory controller; in the POWER5 structure, the L3 cache and memory controller attach directly to the processor chip]
Benefits shown: reduced L3 latency, faster access to memory, larger SMPs (64-way), number of chips cut in half
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Simultaneous Multi-Threading (SMT) 
 What is it? 
 Why would I want it? 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
POWER4 pipeline
[Diagram: POWER4 instruction pipeline (with the POWER5 pipeline shown for comparison): instruction fetch, instruction crack and group formation, branch redirects, interrupts and flushes, and out-of-order processing through the branch, load/store, fixed-point, and floating-point pipelines]
POWER4 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit)
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Multi-threading evolution
 Execution unit utilization is low in today’s microprocessors
 Average execution unit utilization is about 25% across a broad spectrum of environments
[Diagram: execution units (FX0, FX1, LS0, LS1, FP0, FP1, BFX, CRL) across processor cycles for a single instruction stream fed from the i-cache and memory; next evolution step]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Coarse-grained multi-threading
 Two instruction streams, one thread executing at any instant
 Hardware swaps in the second thread when a long-latency event occurs
 Swap requires several cycles
[Diagram: execution units (FX0, FX1, LS0, LS1, FP0, FP1, BFX, CRL) across processor cycles, with swaps between the two instruction streams fed from the i-cache and memory; next evolution step]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Coarse-grained multi-threading (Cont.) 
 Processor (for example, RS64-IV) is able to store context for 
two threads 
– Rapid switching between threads minimizes lost cycles due 
to I/O waits and cache misses. 
– Can yield ~20% improvement for OLTP workloads. 
 Coarse-grained multi-threading is only beneficial where the number of active threads exceeds twice the number of CPUs
– AIX must create a “dummy” thread if there are insufficient 
numbers of real threads. 
• Unnecessary switches to “dummy” threads can degrade 
performance ~20% 
• Does not work with dynamic CPU deallocation 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Fine-grained multi-threading
 Variant of coarse-grained multi-threading
 Threads execute in round-robin fashion
 A cycle remains unused when a thread encounters a long-latency event
[Diagram: execution units (FX0, FX1, LS0, LS1, FP0, FP1, BFX, CRL) across processor cycles, with instruction streams alternating each cycle, fed from the i-cache and memory; next evolution step]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
POWER5 pipeline
[Diagram: POWER5 instruction pipeline (with the POWER4 pipeline shown for comparison): instruction fetch, instruction crack and group formation, branch redirects, interrupts and flushes, and out-of-order processing through the branch, load/store, fixed-point, and floating-point pipelines]
POWER5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit)
© 2003 Concepts of Solution Design IBM Corporation 
^Eserver pSeries 
Simultaneous multi-threading (SMT)
 Reduction in unused execution units results in a 25-40% performance boost, and sometimes more
[Diagram: execution units (FX0, FX1, LS0, LS1, FP0, FP1, BFX, CRL) across processor cycles, with two instruction streams issuing in the same cycle, fed from the i-cache and memory; first evolution step]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Simultaneous multi-threading (SMT) (Cont.) 
 Each chip appears as a 4-way SMP to software 
– Allows instructions from two threads to execute 
simultaneously 
 Processor resources optimized for enhanced SMT 
performance 
– No context switching, no dummy threads 
 Hardware, POWER Hypervisor, or OS controlled thread 
priority 
– Dynamic feedback of shared resources allows for balanced 
thread execution 
 Dynamic switching between single and multithreaded mode 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Dynamic resource balancing 
 Threads share many 
resources 
– Global Completion Table, 
Branch History Table, 
Translation Lookaside Buffer, 
and so on 
 Higher performance realized 
when resources balanced 
across threads 
– Tendency to drift toward 
extremes accompanied by 
reduced performance 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Adjustable thread priority
 Instances when unbalanced execution is desirable
– No work for opposite thread
– Thread waiting on lock
– Software-determined non-uniform balance
– Power management
 Control instruction decode rate
– Software/hardware controls eight priority levels for each thread
[Chart: instructions per cycle for thread 0 and thread 1 across hardware thread priority pairs (0,7 2,7 4,7 6,7 7,7 7,6 7,4 7,2 7,0 1,1), including single-threaded operation and power save mode]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Single-threaded operation 
 Advantageous for execution unit 
limited applications 
– Floating or fixed point intensive 
workloads 
 Execution unit limited applications 
provide minimal performance 
leverage for SMT 
– Extra resources necessary for SMT 
provide higher performance benefit 
when dedicated to single thread 
 Determined dynamically on a per 
processor basis 
[Diagram: thread states (active, dormant, null) with hardware- or software-initiated transitions between them]
© 2003 Concepts of Solution Design IBM Corporation
Eserver pSeries 
© 2003 IBM Corporation 
Micro-Partitioning
^Eserver pSeries 
Micro-Partitioning overview 
 Mainframe-inspired technology
 Virtualized resources shared by multiple partitions
 Benefits
– Finer-grained resource allocation
– More partitions (up to 254)
– Higher resource utilization 
 New partitioning model 
– POWER Hypervisor 
– Virtual processors 
– Fractional processor capacity partitions 
– Operating system optimized for Micro-Partitioning exploitation 
– Virtual I/O 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Processor terminology 
[Diagram: processor terminology layers, from installed physical processors (deconfigured, inactive CUoD, dedicated, shared) through the shared processor pool and entitled capacity to virtual and logical (SMT) processors, for a dedicated processor partition with SMT off and shared processor partitions with SMT on or off]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Shared processor partitions 
 Micro-Partitioning allows for multiple partitions to 
share one physical processor 
 Up to 10 partitions per physical processor 
 Up to 254 partitions active at the same time 
 Partition’s resource definition 
– Minimum, desired, and maximum values for each 
resource 
– Processor capacity 
– Virtual processors 
– Capped or uncapped 
• Capacity weight 
– Dedicated memory 
• Minimum of 128 MB and 16 MB increments 
– Physical or virtual I/O resources 
[Diagram: several LPARs (LPAR 1 through LPAR 6) sharing a pool of physical CPUs]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Understanding min/max/desired resource values 
 The desired value for a resource is given to a 
partition if enough resource is available. 
 If there is not enough resource to meet the desired 
value, then a lower amount is allocated. 
 If there is not enough resource to meet the min 
value, the partition will not start. 
 The maximum value is only used as an upper limit 
for dynamic partitioning operations. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Partition capacity entitlement 
 Processing units 
– 1.0 processing unit represents one 
physical processor 
 Entitled processor capacity 
– Commitment of capacity that is 
reserved for the partition 
– Set upper limit of processor 
utilization for capped partitions 
– Each virtual processor must be 
granted at least 1/10 of a 
processing unit of entitlement 
 Shared processor capacity is 
always delivered in terms of whole 
physical processors 
[Diagram: processing capacity examples; 1 physical processor = 1.0 processing units, sample allocations of 0.5 and 0.4 processing units, minimum requirement 0.1 processing units]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Capped and uncapped partitions 
 Capped partition 
– Not allowed to exceed its entitlement 
 Uncapped partition 
– Is allowed to exceed its entitlement 
 Capacity weight 
– Used for prioritizing uncapped partitions 
– Value 0-255 
– Value of 0 referred to as a “soft cap” 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Partition capacity entitlement example 
 Shared pool has 2.0 processing units 
available 
 LPARs activated in sequence 
 Partition 1 activated 
– Min = 1.0, max = 2.0, desired = 1.5 
– Starts with 1.5 allocated processing units 
 Partition 2 activated 
– Min = 1.0, max = 2.0, desired = 1.0 
– Does not start 
 Partition 3 activated 
– Min = 0.1, max = 1.0, desired = 0.8 
– Starts with 0.5 allocated processing units 
© 2003 Concepts of Solution Design IBM Corporation
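The activation rule in this example can be sketched in a few lines of shell. This is purely illustrative (not HMC or Hypervisor code); it works in tenths of a processing unit to avoid floating-point arithmetic, and the partition names and values simply mirror the example above.

#!/bin/ksh
# Illustrative sketch: a partition gets its desired capacity if it fits,
# otherwise whatever is left (if that still meets its minimum),
# otherwise it does not start. Values are tenths of a processing unit.
pool=20                               # 2.0 processing units in the shared pool

activate() {                          # activate <name> <min> <desired>
    name=$1; min=$2; desired=$3
    if [ "$pool" -ge "$desired" ]; then
        alloc=$desired
    elif [ "$pool" -ge "$min" ]; then
        alloc=$pool
    else
        echo "$name: does not start (only $pool tenths left, min is $min)"
        return
    fi
    pool=$((pool - alloc))
    echo "$name: starts with $alloc tenths of a processing unit"
}

activate "Partition 1" 10 15    # min 1.0, desired 1.5 -> starts with 1.5
activate "Partition 2" 10 10    # min 1.0, desired 1.0 -> does not start
activate "Partition 3"  1  8    # min 0.1, desired 0.8 -> starts with 0.5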
^Eserver pSeries 
Understanding capacity allocation – An example 
 A workload is run under different configurations. 
 The size of the shared pool (number of physical 
processors) is fixed at 16. 
 The capacity entitlement for the partition is fixed 
at 9.5. 
 No other partitions are active. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Uncapped – 16 virtual processors 
Uncapped (16PPs/16VPs/9.5CE) 
15 
10 
5 
 16 virtual processors. 
 Uncapped. 
 Can use all available resource. 
 The workload requires 26 minutes to complete. 
© 2003 Concepts of Solution Design IBM Corporation 
0 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 
Elapsed time
^Eserver pSeries 
Uncapped – 12 virtual processors 
Uncapped (16PPs/12VPs/9.5CE) 
15 
10 
5 
 12 virtual processors. 
 Even though the partition is uncapped, it can only use 12 
processing units. 
 The workload now requires 27 minutes to complete. 
© 2003 Concepts of Solution Design IBM Corporation 
0 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 
Elapsed time
^Eserver pSeries 
Capped (16PPs/12VPs/9.5E) 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 
Elapses time 
© 2003 Concepts of Solution Design IBM Corporation 
Capped 
15 
10 
5 
0 
 The partition is now capped and resource utilization is 
limited to the capacity entitlement of 9.5. 
– Capping limits the amount of time each virtual processor is 
scheduled. 
– The workload now requires 28 minutes to complete.
^Eserver pSeries 
Dynamic partitioning operations 
 Add, move, or remove processor capacity 
– Remove, move, or add entitled shared processor capacity 
– Change between capped and uncapped processing 
– Change the weight of an uncapped partition 
– Add and remove virtual processors 
• Provided CE / VP > 0.1 
 Add, move, or remove memory 
– 16 MB logical memory block 
 Add, move, or remove physical I/O adapter slots 
 Add or remove virtual I/O adapter slots 
 Min/max values defined for LPARs set the bounds within 
which DLPAR can work 
© 2003 Concepts of Solution Design IBM Corporation
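For reference, dynamic LPAR operations like those above are typically driven from the HMC command line. The following is a hedged sketch only: the managed system name and partition name are placeholders, and the exact chhwres options are assumptions that can vary by HMC release, so verify them against your HMC documentation.

# Add 0.5 processing units and one virtual processor to partition lpar1
# (managed system "p5-system" and partition "lpar1" are placeholders).
chhwres -m p5-system -r proc -o a -p lpar1 --procunits 0.5 --procs 1

# Add 256 MB of memory (16 MB logical memory blocks) to the same partition.
chhwres -m p5-system -r mem -o a -p lpar1 -q 256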
^Eserver pSeries 
Dynamic LPAR 
 Standard on all new systems
[Diagram: four live partitions (Production on AIX 5L, Test/Dev on AIX 5L, File/Print on Linux, Legacy Apps on AIX 5L) running above the Hypervisor and managed from an HMC, with resources moving between the live partitions]
© 2003 Concepts of Solution Design IBM Corporation
Eserver pSeries 
© 2003 IBM Corporation 
Firmware 
POWER Hypervisor
^Eserver pSeries 
POWER Hypervisor strategy 
 New Hypervisor for POWER5 systems 
– Further convergence with iSeries 
– But brands will retain unique value propositions 
– Reduced development effort 
– Faster time to market 
 New capabilities on pSeries servers 
– Shared processor partitions 
– Virtual I/O 
 New capability on iSeries servers 
– Can run AIX 5L 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
POWER Hypervisor component sourcing 
[Diagram: POWER Hypervisor components and their pSeries or iSeries heritage: H-call interface, nucleus (SLIC), virtual I/O, virtual Ethernet/VLAN, shared processor LPAR, Capacity on Demand, partition on demand (255 partitions), message passing, load from flash, location codes, I/O configuration, bus recovery, dump, drawer and slot/tower concurrent maintenance, FSP, NVRAM, HMC, and LAN/VLAN/SCSI IOAs]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
POWER Hypervisor functions 
 Same functions as POWER4 Hypervisor.
– Dynamic LPAR
– Capacity Upgrade on Demand
 New, active functions.
– Dynamic Micro-Partitioning
– Shared processor pool
– Virtual I/O
– Virtual LAN
 Machine is always in LPAR mode.
– Even with all resources dedicated to one OS
[Diagram: dynamic LPAR, dynamic Micro-Partitioning, shared processor pools (CPU 0-3 built from POWER5 chips), virtual I/O (disk and LAN), and Capacity Upgrade on Demand (planned versus actual client capacity growth)]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
POWER Hypervisor implementation 
 Design enhancements to previous POWER4 
implementation enable the sharing of processors 
by multiple partitions 
– Hypervisor decrementer (HDECR) 
– New Processor Utilization Resource Register (PURR) 
– Refine virtual processor objects 
• Does not include physical characteristics of the processor 
– New Hypervisor calls 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
POWER Hypervisor processor dispatch 
 Manages a set of processors on the machine (the shared processor pool).
 POWER5 generates a 10 ms dispatch window.
– Minimum allocation is 1 ms per physical processor.
 Each virtual processor is guaranteed to get its entitled share of processor cycles during each 10 ms dispatch window.
– ms/VP = CE * 10 / VPs
 The partition entitlement is evenly distributed among the online virtual processors.
 Once a capped partition has received its CE within a dispatch interval, it becomes not-runnable.
 A VP dispatched within 1 ms of the end of the dispatch interval will receive half its CE at the start of the next dispatch interval.
[Diagram: the POWER Hypervisor's processor dispatch mapping virtual processor capacity entitlement for six shared processor partitions onto the shared processor pool (CPU 0-3)]
© 2003 Concepts of Solution Design IBM Corporation
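As a quick worked check of the ms/VP formula above, using the capacity entitlements from the dispatch example later in this section, bc on AIX can do the arithmetic:

# ms per virtual processor in each 10 ms dispatch window: ms/VP = CE * 10 / VPs
echo "scale=1; 0.8 * 10 / 2" | bc    # LPAR1: CE 0.8, 2 VPs -> 4.0 ms per VP
echo "scale=1; 0.6 * 10 / 3" | bc    # LPAR3: CE 0.6, 3 VPs -> 2.0 ms per VP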
^Eserver pSeries 
Dispatching and interrupt latencies 
 Virtual processors have dispatch latency. 
 Dispatch latency is the time between a virtual 
processor becoming runnable and being actually 
dispatched. 
 Timers have latency issues also. 
 External interrupts have latency issues also. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Shared processor pool 
 Processors not associated with 
dedicated processor partitions. 
 No fixed relationship between virtual 
processors and physical processors. 
 The POWER Hypervisor attempts to 
use the same physical processor. 
– Affinity scheduling 
– Home node 
[Diagram: the POWER Hypervisor's processor dispatch mapping virtual processor capacity entitlement for six shared processor partitions onto the shared processor pool (CPU 0-3)]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Affinity scheduling 
 When dispatching a VP, the POWER Hypervisor attempts to 
preserve affinity by using: 
– Same physical processor as before, or 
– Same chip, or 
– Same MCM 
 When a physical processor becomes idle, the POWER 
Hypervisor looks for a runnable VP that: 
– Has affinity for it, or 
– Has affinity to no-one, or 
– Is uncapped 
 Similar to AIX affinity scheduling 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Operating system support 
 Micro-Partitioning capable operating systems need to be modified 
to cede a virtual processor when they have no runnable work 
– Failure to do this results in wasted CPU resources 
• For example, a partition spends its CE waiting for I/O
– Results in better utilization of the pool 
 May confer the remainder of their timeslice to another VP 
– For example, a VP holding a lock 
 Can be redispatched if they become runnable again during the 
same dispatch interval 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Example
LPAR1: capacity entitlement = 0.8 processing units; virtual processors = 2 (capped)
LPAR2: capacity entitlement = 0.2 processing units; virtual processors = 1 (capped)
LPAR3: capacity entitlement = 0.6 processing units; virtual processors = 3 (capped)
[Diagram: dispatch of the LPAR virtual processors on physical processors 0 and 1 across two consecutive 10 ms POWER Hypervisor dispatch intervals (0-10 ms and 10-20 ms), with idle gaps where no virtual processor is runnable]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
POWER Hypervisor and virtual I/O 
 I/O operations without dedicating resources to an individual 
partition 
 POWER Hypervisor’s virtual I/O related operations 
– Provide control and configuration structures for virtual 
adapter images required by the logical partitions 
– Operations that allow partitions controlled and secure access 
to physical I/O adapters in a different partition 
– The POWER Hypervisor does not own any physical I/O 
devices; they are owned by an I/O hosting partition 
 I/O types supported 
– SCSI 
– Ethernet 
– Serial console 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Performance monitoring and accounting 
 CPU utilization is measured against CE.
– An uncapped partition receiving more than its CE will record 100% but will be using more.
 SMT
– Thread priorities compound the variable speed rate.
– Twice as many logical CPUs.
 For accounting, intervals may be incorrectly allocated.
– New hardware support is required.
 The Processor Utilization Resource Register (PURR) records actual clock ticks spent executing a partition.
– Used by performance commands (for example, new flags) and accounting modules.
– Third-party tools will need to be modified.
© 2003 Concepts of Solution Design IBM Corporation
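On AIX 5L V5.3, PURR-based utilization shows up in the updated performance commands. A minimal sketch using lparstat follows; the interpretation of the output columns is my reading of the AIX 5.3 tools and should be checked on the target system.

# Partition utilization every 2 seconds, 5 samples; %entc reports the share of
# entitled capacity consumed (an uncapped partition can exceed 100%).
lparstat 2 5

# Static partition configuration: capped/uncapped mode, entitlement,
# online virtual processors, SMT state.
lparstat -i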
Eserver pSeries 
© 2003 IBM Corporation 
Virtual I/O Server
^Eserver pSeries 
Virtual I/O Server 
 Provides an operating environment for virtual I/O administration 
– Virtual I/O server administration 
– Restricted scriptable command line user interface (CLI) 
 Minimum hardware requirements 
– POWER5 VIO capable machine 
– Hardware management console 
– Storage adapter 
– Physical disk 
– Ethernet adapter 
– At least 128 MB of memory 
 Capabilities of the Virtual I/O Server 
– Ethernet Adapter Sharing 
– Virtual SCSI disk 
• Virtual I/O Server Version 1.1 supports selected configurations, which include specific
models of EMC, HDS, and STK disk subsystems attached using Fibre Channel
– Interacts with AIX and Linux partitions 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Virtual I/O Server (Cont.) 
 Installation CD when Advanced POWER 
Virtualization feature is ordered 
 Configuration approaches for high availability 
– Virtual I/O Server 
• LVM mirroring 
• Multipath I/O 
• EtherChannel 
– Second virtual I/O server instance in another partition 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Virtual SCSI 
 Allows sharing of storage devices 
 Vital for shared processor partitions 
– Overcomes potential limit of adapter slots due to Micro- 
Partitioning 
– Allows the creation of logical partitions without the need for 
additional physical resources 
 Allows attachment of previously unsupported storage 
solutions 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
VSCSI server and client architecture overview 
 Virtual SCSI is based on a 
client/server relationship. 
 The virtual I/O resources are assigned 
using an HMC. 
 Virtual SCSI enables sharing of 
adapters as well as disk devices. 
 Dynamic LPAR operations allowed. 
 Dynamic mapping between physical and virtual resources on the Virtual I/O Server.
[Diagram: AIX and Linux client partitions with VSCSI client adapters connected through the POWER Hypervisor to VSCSI server adapters in the Virtual I/O Server partition, which maps logical volumes (through the LVM) and physical disks (SCSI, FC) behind a physical adapter to the clients as hdisks]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Virtual devices
 Are defined as LVs in the I/O server partition
– Normal LV rules apply
 Appear as real devices (hdisks) in the hosted partition
 Can be manipulated using the Logical Volume Manager just like an ordinary physical disk
 Can be used as a boot device and as a NIM target
 Can be shared by multiple clients
[Diagram: a logical volume in the Virtual I/O Server partition, exported through a VSCSI server adapter and the POWER Hypervisor, appears as a virtual disk (hdisk) in the client partition's LVM]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
SCSI RDMA and Logical Remote Direct Memory Access 
 SCSI transport protocols define the 
rules for exchanging information 
between SCSI initiators and targets. 
 Virtual SCSI uses the SCSI RDMA 
Protocol (SRP). 
– SCSI initiators and targets have the 
ability to directly transfer information 
between their respective address 
spaces. 
 SCSI requests and responses are 
sent using the Virtual SCSI adapters. 
 The actual data transfer, however, is 
done using the Logical Redirected 
DMA protocol. 
[Diagram: the VSCSI device driver (initiator) in the AIX client partition and the VSCSI device driver (target) in the Virtual I/O Server partition exchange SCSI requests over the reliable command/response transport, while data buffers move by Logical Remote Direct Memory Access through the POWER Hypervisor to the physical adapter device driver and physical adapter]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Virtual SCSI security 
 Only the owning partition has access to its data. 
 Data is copied directly from the PCI adapter to the client's memory.
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Performance considerations 
 Twice as many processor cycles to do VSCSI as a locally attached 
disk I/O (evenly distributed on the client partition and virtual I/O 
server) 
– The path of each virtual I/O request involves several sources of 
overhead that are not present in a non-virtual I/O request. 
– For a virtual disk backed by the LVM, there is also the performance 
impact of going through the LVM and disk device drivers twice. 
 If multiple partitions are competing for resources from a VSCSI 
server, care must be taken to ensure enough server resources 
(CPU, memory, and disk) are allocated to do the job. 
 If not constrained by CPU performance, dedicated partition 
throughput is comparable to doing local I/O. 
 Because there is no caching in memory on the server I/O partition,
its memory requirements should be modest.
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Limitations 
 Hosting partition must be available before hosted 
partition boot. 
 Virtual SCSI supports FC, parallel SCSI, and SCSI 
RAID. 
 Maximum of 65535 virtual slots in the I/O server 
partition. 
 Maximum of 256 virtual slots on a single partition. 
 Support for all mandatory SCSI commands. 
 Not all optional SCSI commands are supported. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Implementation guideline 
 Partitions with high performance and disk I/O 
requirements are not recommended for 
implementing VSCSI. 
 Partitions with very low performance and disk I/O 
requirements can be configured at minimum 
expense to use only a portion of a logical volume. 
 Boot disks for the operating system. 
 Web servers that will typically cache a lot of data. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
LVM mirroring
 This configuration protects virtual disks in a client partition against failure of:
– One physical disk
– One physical adapter
– One virtual I/O server
 Many possibilities exist to exploit this great function!
[Diagram: a client partition with two VSCSI client adapters, each connected through the POWER Hypervisor to a VSCSI server adapter in a different Virtual I/O Server partition, each server backed by its own physical SCSI adapter and physical disk; the client mirrors the two virtual disks with the LVM]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Multipath I/O
 This configuration protects virtual disks in a client partition against:
– Failure of one physical FC adapter in one I/O server
– Failure of one Virtual I/O server
 Physical disk is assigned as a whole to the client partition
 Many possibilities exist to exploit this great function!
[Diagram: a client partition with two VSCSI client adapters, each connected through the POWER Hypervisor to a VSCSI server adapter in a different Virtual I/O Server partition; each server reaches the same ESS disk through its own physical FC adapter and a SAN switch]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Virtual LAN overview 
 Virtual network segments on top of 
physical switch devices. 
 All nodes in the VLAN can 
communicate without any L3 
routing or inter-VLAN bridging. 
 VLANs provide: 
– Increased LAN security 
– Flexible network deployment over 
traditional network devices 
 VLAN support in AIX is based on 
the IEEE 802.1Q VLAN 
implementation. 
– VLAN ID tagging to Ethernet 
frames 
– VLAN ID restricted switch ports 
[Diagram: nodes on VLAN 1 and VLAN 2 spread across switches A, B, and C; nodes in one VLAN cannot reach nodes in the other without L3 routing]
© 2003 Concepts of Solution Design IBM Corporation
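On AIX, an 802.1Q VLAN is configured as a pseudo-adapter layered over a physical Ethernet adapter. A hedged sketch follows; the adapter name and VLAN ID are illustrative, and the attribute names are my reading of the AIX 5L VLAN support described above (the same dialog is reachable through smitty vlan), so verify them on the target system.

# Create a VLAN pseudo-device with VLAN tag 2 on physical adapter ent0.
# AIX returns a new adapter (for example ent1) that is then configured
# like any other Ethernet interface.
mkdev -c adapter -s vlan -t eth -a base_adapter=ent0 -a vlan_tag_id=2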
^Eserver pSeries 
Virtual Ethernet 
 Enables inter-partition communication. 
– In-memory point to point connections 
 Physical network adapters are not needed. 
 Similar to high-bandwidth Ethernet connections. 
 Supports multiple protocols (IPv4, IPv6, and ICMP). 
 No Advanced POWER Virtualization feature required. 
– POWER5 Systems 
– AIX 5L V5.3 or appropriate Linux level 
– Hardware management console (HMC) 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Virtual Ethernet connections 
 VLAN technology implementation 
– Partitions can only access data directed to 
them. 
 Virtual Ethernet switch provided by the 
POWER Hypervisor 
 Virtual LAN adapters appear to the OS as physical adapters 
– MAC-Address is generated by the HMC. 
 1-3 Gb/s transmission speed 
– Support for large MTUs (~64K) on AIX. 
 Up to 256 virtual Ethernet adapters 
– Up to 18 VLANs. 
 Bootable device support for NIM OS 
installations 
[Diagram: AIX and Linux partitions, each with a virtual Ethernet adapter, connected to the virtual Ethernet switch provided by the POWER Hypervisor]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Virtual Ethernet switch 
 Based on IEEE 802.1Q VLAN standard 
– OSI-Layer 2 
– Optional Virtual LAN ID (VID) 
– 4094 virtual LANs supported 
– Up to 18 VIDs per virtual LAN port 
 Switch configuration through HMC 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
How it works 
[Flowchart: a frame sent through a virtual Ethernet adapter arrives at its virtual VLAN switch port; the POWER Hypervisor caches the source MAC, checks for an IEEE VLAN header and inserts one with the configured port VLAN ID if absent, verifies the port is allowed for that VLAN, looks up the destination MAC in its table, and then either delivers the frame, passes it to a defined trunk adapter, or drops the packet]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Performance considerations 
 Virtual Ethernet performance 
– Throughput scales nearly linear with the 
allocated capacity entitlement 
 Virtual LAN vs. Gigabit Ethernet 
throughput 
– Virtual Ethernet adapter has higher raw 
throughput at all MTU sizes 
– In-memory copy is more efficient at larger 
MTU 
[Charts: virtual Ethernet throughput per 0.1 CPU entitlement (Mb/s) for MTU sizes 1500, 9000, and 65394 at entitlements from 0.1 to 1.0; and TCP_STREAM throughput (Mb/s) for virtual LAN versus Gigabit Ethernet at MTU 1500, 9000, and 65394, simplex and duplex]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Limitations 
 Virtual Ethernet can be used in both shared and 
dedicated processor partitions provided with the 
appropriate OS levels. 
 A mixture of Virtual Ethernet connections, real network
adapters, or both is permitted within a partition. 
 Virtual Ethernet can only connect partitions within a 
single system. 
 A system’s processor load is increased when using 
virtual Ethernet. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Implementation guideline 
 Know your environment and the network traffic. 
 Choose a high MTU size if it makes sense for the network traffic in the Virtual LAN.
 Use MTU size 65394 if you expect a large amount of data to be copied inside your Virtual LAN.
 Enable tcp_pmtu_discover and udp_pmtu_discover in conjunction with MTU size 65394 (see the sketch after this list).
 Do not turn off SMT. 
 No dedicated CPUs are required for virtual Ethernet 
performance. 
© 2003 Concepts of Solution Design IBM Corporation
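A short sketch of those last few guidelines on an AIX client partition; the interface name en1 is an assumption for the virtual Ethernet interface on that partition.

# Use the large virtual Ethernet MTU for partition-to-partition traffic
# (en1 assumed to be the virtual Ethernet interface).
chdev -l en1 -a mtu=65394

# Enable path MTU discovery so traffic leaving the virtual LAN is sized correctly.
no -o tcp_pmtu_discover=1
no -o udp_pmtu_discover=1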
^Eserver pSeries 
Connecting Virtual Ethernet to external networks 
 Routing 
– The partition that routes the traffic to the external network does not necessarily have to be the Virtual I/O Server.
[Diagram: two systems, each with AIX and Linux partitions on a virtual Ethernet switch in the POWER Hypervisor; in each system an AIX partition with a virtual adapter (3.1.1.1 or 4.1.1.1) and a physical adapter (1.1.1.100 or 2.1.1.100) routes between the internal subnet and the external IP subnets 1.1.1.x and 2.1.1.x, which are joined by an IP router]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Shared Ethernet Adapter 
 Connects internal and external VLANs using one physical 
adapter. 
 SEA is a new service that acts as a layer 2 network switch. 
– Securely bridges network traffic from a virtual Ethernet 
adapter to a real network adapter 
 SEA service runs in the Virtual I/O Server partition. 
– Advanced POWER Virtualization feature required 
– At least one physical Ethernet adapter required 
 No physical I/O slot and network adapter required in the 
client partition. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Shared Ethernet Adapter (Cont.) 
 Virtual Ethernet MAC are visible to outside systems. 
 Broadcast/multicast is supported. 
 ARP (Address Resolution Protocol) and NDP (Neighbor Discovery 
Protocol) can work across a shared Ethernet. 
 One SEA can be shared by multiple VLANs and multiple subnets 
can connect using a single adapter on the Virtual I/O Server. 
 Virtual Ethernet adapter configured in the Shared Ethernet Adapter 
must have the trunk flag set. 
– The trunk Virtual Ethernet adapter enables a layer-2 bridge to a 
physical adapter 
 IP fragmentation is performed or an ICMP packet too big message 
is sent when the shared Ethernet adapter receives IP (or IPv6) 
packets that are larger than the MTU of the adapter that the packet 
is forwarded through. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Virtual Ethernet and Shared Ethernet Adapter security 
 VLAN (virtual local area network) tagging description taken 
from the IEEE 802.1Q standard. 
 The implementation of this VLAN standard ensures that the 
partitions have no access to foreign data. 
 Only the network adapters (virtual or physical) that are 
connected to a port (virtual or physical) that belongs to the 
same VLAN can receive frames with that specific VLAN ID. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Performance considerations 
 Virtual I/O-Server 
performance 
– Adapters stream data at 
media speed if the Virtual 
I/O server has enough 
capacity entitlement. 
– CPU utilization per Gigabit 
of throughput is higher with 
a Shared Ethernet adapter. 
[Charts: Virtual I/O Server TCP_STREAM throughput (Mb/s) and normalized CPU utilization (% CPU per Gb) at MTU 1500 and 9000, simplex and duplex]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Limitations 
 System processors are used for all communication 
functions, leading to a significant amount of system 
processor load. 
 One of the virtual adapters in the SEA on the Virtual I/O 
server must be defined as a default adapter with a default 
PVID. 
 Up to 16 Virtual Ethernet adapters with 18 VLANs on each 
can be shared on a single physical network adapter. 
 Shared Ethernet Adapter requires: 
– POWER Hypervisor component of POWER5 
systems 
– AIX 5L Version 5.3 or appropriate Linux level 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Implementation guideline 
 Know your environment and the network traffic. 
 Use a dedicated network adapter if you expect heavy 
network traffic between Virtual Ethernet and local 
networks. 
 If possible, use dedicated CPUs for the Virtual I/O 
Server. 
 Choose 9000 for MTU size, if this makes sense for 
your network traffic. 
 Don’t use Shared Ethernet Adapter functionality for 
latency critical applications. 
 With MTU size 1500, you need about 1 CPU per 
gigabit Ethernet adapter streaming at media speed. 
 With MTU size 9000, 2 Gigabit Ethernet adapters can 
stream at media speed per CPU. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Shared Ethernet Adapter configuration 
 The Virtual I/O Server is 
configured with at least one 
physical Ethernet adapter. 
 One Shared Ethernet Adapter 
can be shared by multiple 
VLANs. 
 Multiple subnets can connect 
using a single adapter on the 
Virtual I/O Server. 
[Diagram: an AIX partition on VLAN 1 (10.1.1.11) and a Linux partition on VLAN 2 (10.1.2.11) reach external AIX (10.1.1.14) and Linux (10.1.2.15) servers through a Shared Ethernet Adapter in the Virtual I/O Server that bridges both VLANs over one physical adapter (ent0)]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Multiple Shared Ethernet Adapter configuration 
 Maximizing throughput 
– Using several Shared Ethernet 
Adapters 
– More queues 
– More performance 
[Diagram: the same AIX and Linux partitions on VLAN 1 and VLAN 2, but the Virtual I/O Server bridges each VLAN through its own Shared Ethernet Adapter (ent0 and ent1), each with its own physical adapter]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Multipath routing with dead gateway detection 
 This configuration protects 
your access to the external 
network against: 
– Failure of one physical 
network adapter in one I/O 
server 
– Failure of one Virtual I/O 
server 
– Failure of one gateway 
[Diagram: an AIX partition with interfaces on VLAN 1 (9.3.5.11) and VLAN 2 (9.3.5.21) uses multipath routing with dead gateway detection (default routes to gateway 9.3.5.10 via 9.3.5.12 and to gateway 9.3.5.20 via 9.3.5.22) through two Virtual I/O Servers, each bridging one VLAN to the external network with its own Shared Ethernet Adapter and physical adapter]
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Shared Ethernet Adapter commands 
 Virtual I/O Server commands 
– lsdev -type adapter: Lists all the virtual and physical adapters. 
– Choose the virtual Ethernet adapter you want to map to the physical
Ethernet adapter. 
– Make sure the physical and virtual interfaces are unconfigured 
(down or detached). 
– mkvdev: Maps the physical adapter to the virtual adapter, creates a 
layer 2 bridge, and defines the default virtual adapter with its default 
VLAN ID. It creates a new Ethernet interface (for example, ent5). 
– The mktcpip command is used for TCP/IP configuration on the new 
Ethernet interface (for example, ent5). 
 Client partition commands 
– No new commands are needed; the typical TCP/IP configuration is 
done on the virtual Ethernet interface that it is defined in the client 
partition profile on the HMC. 
© 2003 Concepts of Solution Design IBM Corporation
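Putting the Virtual I/O Server steps above together, a hedged example follows; the device names (ent0 for the physical adapter, ent2 for the trunk virtual adapter, ent5/en5 for the resulting interface) and the addresses are illustrative only.

lsdev -type adapter
# choose the physical adapter (here ent0) and the trunk virtual adapter (here ent2),
# make sure both interfaces are unconfigured, then bridge them;
# mkvdev creates a new Ethernet interface, for example ent5
mkvdev -sea ent0 -vadapter ent2 -default ent2 -defaultid 1
# configure TCP/IP on the new interface
mktcpip -hostname vios1 -inetaddr 10.1.1.2 -interface en5 -netmask 255.255.255.0 -gateway 10.1.1.1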
^Eserver pSeries 
Virtual SCSI commands 
 Virtual I/O Server commands 
– To map a LV: 
• mkvg: Creates the volume group, where a new LV will be created using 
the mklv command. 
• lsdev: Shows the virtual SCSI server adapters that could be used for 
mapping with the LV. 
• mkvdev: Maps the virtual SCSI server adapter to the LV. 
• lsmap -all: Shows the mapping information. 
– To map a physical disk: 
• lsdev: Shows the virtual SCSI server adapters that could be used for 
mapping with a physical disk. 
• mkvdev: Maps the virtual SCSI server adapter to a physical disk. 
• lsmap -all: Shows the mapping information. 
 Client partition commands 
– No new commands needed; the typical device configuration uses 
the cfgmgr command. 
© 2003 Concepts of Solution Design IBM Corporation
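And a hedged end-to-end example of the virtual SCSI commands above on the Virtual I/O Server; the volume group, logical volume, vhost adapter, and disk names are illustrative.

# Map a logical volume to a virtual SCSI server adapter.
mkvg -vg rootvg_clients hdisk2
mklv -lv lv_client1 rootvg_clients 10G
lsdev -virtual                          # find the vhost (VSCSI server) adapters
mkvdev -vdev lv_client1 -vadapter vhost0
lsmap -all                              # verify the mapping

# Or map a whole physical disk instead.
mkvdev -vdev hdisk3 -vadapter vhost0

# On the client partition, the new hdisk appears after running:
cfgmgr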
^Eserver pSeries 
Section Review Questions 
1. Any technology improvement will boost 
performance of any client solution. 
a. True 
b. False 
2. The application of technology in a creative way 
to solve client’s business problems is one 
definition of innovation. 
a. True 
b. False 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Section Review Questions 
3. Client’s satisfaction with your solution can be 
enhanced by which of the following? 
a. Setting expectations appropriately. 
b. Applying technology appropriately. 
c. Communicating the benefits of the technology to the 
client. 
d. All of the above. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Section Review Questions 
4. Which of the following are available with 
POWER5 architecture? 
a. Simultaneous Multi-Threading. 
b. Micro-Partitioning. 
c. Dynamic power management. 
d. All of the above. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Section Review Questions 
5. Simultaneous Multi-Threading is the same as 
hyperthreading, IBM just gave it a different 
name. 
a. True. 
b. False. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Section Review Questions 
6. In order to bridge network traffic between the 
Virtual Ethernet and external networks, the 
Virtual I/O Server has to be configured with at 
least one physical Ethernet adapter. 
a. True. 
b. False. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Review Question Answers 
© 2003 Concepts of Solution Design IBM Corporation 
1. b 
2. a 
3. d 
4. d 
5. b 
6. a
^Eserver pSeries 
Unit Summary 
 You should now be able to: 
– Describe the relationship between technology and 
solutions. 
– List key IBM technologies that are part of the POWER5 
products. 
– Be able to describe the functional benefits that these 
technologies provide. 
– Be able to discuss the appropriate use of these 
technologies. 
© 2003 Concepts of Solution Design IBM Corporation
^Eserver pSeries 
Reference 
 You may find more information here: 
IBM eServer pSeries AIX 5L Support for Micro-Partitioning 
and Simultaneous Multi-threading White Paper 
Introduction to Advanced POWER Virtualization on IBM 
eServer p5 Servers SG24-7940 
IBM eServer p5 Virtualization – Performance 
Considerations SG24-5768 
© 2003 Concepts of Solution Design IBM Corporation
More Related Content

What's hot

Parallel Sysplex Implement2
Parallel Sysplex Implement2Parallel Sysplex Implement2
Parallel Sysplex Implement2ggddggddggdd
 
Scalable Elastic Systems Architecture (SESA)
Scalable Elastic Systems Architecture (SESA)Scalable Elastic Systems Architecture (SESA)
Scalable Elastic Systems Architecture (SESA)Eric Van Hensbergen
 
OMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing WorldOMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing WorldAllan Cantle
 
Presentation oracle on power power advantages and license optimization
Presentation   oracle on power power advantages and license optimizationPresentation   oracle on power power advantages and license optimization
Presentation oracle on power power advantages and license optimizationsolarisyougood
 
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Slide_N
 
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAsScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAsShinya Takamaeda-Y
 

What's hot (13)

Sakar jain
Sakar jainSakar jain
Sakar jain
 
Parallel Sysplex Implement2
Parallel Sysplex Implement2Parallel Sysplex Implement2
Parallel Sysplex Implement2
 
Scalable Elastic Systems Architecture (SESA)
Scalable Elastic Systems Architecture (SESA)Scalable Elastic Systems Architecture (SESA)
Scalable Elastic Systems Architecture (SESA)
 
OMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing WorldOMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing World
 
Ludden q3 2008_boston
Ludden q3 2008_bostonLudden q3 2008_boston
Ludden q3 2008_boston
 
Ludden power7 verification
Ludden power7 verificationLudden power7 verification
Ludden power7 verification
 
101 cd 1415-1445
101 cd 1415-1445101 cd 1415-1445
101 cd 1415-1445
 
Eldo_Premier_2015
Eldo_Premier_2015Eldo_Premier_2015
Eldo_Premier_2015
 
Presentation oracle on power power advantages and license optimization
Presentation   oracle on power power advantages and license optimizationPresentation   oracle on power power advantages and license optimization
Presentation oracle on power power advantages and license optimization
 
Larrabee
LarrabeeLarrabee
Larrabee
 
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
 
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAsScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
 
Ph.D. Thesis presentation
Ph.D. Thesis presentationPh.D. Thesis presentation
Ph.D. Thesis presentation
 

Similar to Technology (1)

Power 7 Overview
Power 7 OverviewPower 7 Overview
Power 7 Overviewlambertt
 
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...Michael Gschwind
 
Nodes and Networks for HPC computing
Nodes and Networks for HPC computingNodes and Networks for HPC computing
Nodes and Networks for HPC computingrinnocente
 
Orcl siebel-sun-s282213-oow2006
Orcl siebel-sun-s282213-oow2006Orcl siebel-sun-s282213-oow2006
Orcl siebel-sun-s282213-oow2006Sal Marcus
 
Decoupling Compute from Memory, Storage and IO with OMI
Decoupling Compute from Memory, Storage and IO with OMIDecoupling Compute from Memory, Storage and IO with OMI
Decoupling Compute from Memory, Storage and IO with OMIAllan Cantle
 
Ics21 workshop decoupling compute from memory, storage & io with omi - ...
Ics21 workshop   decoupling compute from memory, storage & io with omi - ...Ics21 workshop   decoupling compute from memory, storage & io with omi - ...
Ics21 workshop decoupling compute from memory, storage & io with omi - ...Vaibhav R
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsAnand Haridass
 
Collaborate07kmohiuddin
Collaborate07kmohiuddinCollaborate07kmohiuddin
Collaborate07kmohiuddinSal Marcus
 
Presentation best practices for optimal configuration of oracle databases o...
Presentation   best practices for optimal configuration of oracle databases o...Presentation   best practices for optimal configuration of oracle databases o...
Presentation best practices for optimal configuration of oracle databases o...xKinAnx
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architectureinside-BigData.com
 
Ibm symp14 referent_marcus alexander mac dougall_ibm x6 und flex system
Ibm symp14 referent_marcus alexander mac dougall_ibm x6 und flex systemIbm symp14 referent_marcus alexander mac dougall_ibm x6 und flex system
Ibm symp14 referent_marcus alexander mac dougall_ibm x6 und flex systemIBM Switzerland
 
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...In-Memory Computing Summit
 

Similar to Technology (1) (20)

Technology
TechnologyTechnology
Technology
 
Technology
TechnologyTechnology
Technology
 
Power 7 Overview
Power 7 OverviewPower 7 Overview
Power 7 Overview
 
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
 
OpenPOWER Webinar
OpenPOWER Webinar OpenPOWER Webinar
OpenPOWER Webinar
 
Nodes and Networks for HPC computing
Nodes and Networks for HPC computingNodes and Networks for HPC computing
Nodes and Networks for HPC computing
 
CISC & RISC Architecture
CISC & RISC Architecture CISC & RISC Architecture
CISC & RISC Architecture
 
11136442.ppt
11136442.ppt11136442.ppt
11136442.ppt
 
Orcl siebel-sun-s282213-oow2006
Orcl siebel-sun-s282213-oow2006Orcl siebel-sun-s282213-oow2006
Orcl siebel-sun-s282213-oow2006
 
The Cell Processor
The Cell ProcessorThe Cell Processor
The Cell Processor
 
Decoupling Compute from Memory, Storage and IO with OMI
Decoupling Compute from Memory, Storage and IO with OMIDecoupling Compute from Memory, Storage and IO with OMI
Decoupling Compute from Memory, Storage and IO with OMI
 
Ics21 workshop decoupling compute from memory, storage & io with omi - ...
Ics21 workshop   decoupling compute from memory, storage & io with omi - ...Ics21 workshop   decoupling compute from memory, storage & io with omi - ...
Ics21 workshop decoupling compute from memory, storage & io with omi - ...
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of Systems
 
Collaborate07kmohiuddin
Collaborate07kmohiuddinCollaborate07kmohiuddin
Collaborate07kmohiuddin
 
Presentation best practices for optimal configuration of oracle databases o...
Presentation   best practices for optimal configuration of oracle databases o...Presentation   best practices for optimal configuration of oracle databases o...
Presentation best practices for optimal configuration of oracle databases o...
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
 
@IBM Power roadmap 8
@IBM Power roadmap 8 @IBM Power roadmap 8
@IBM Power roadmap 8
 
Palestra IBM-Mack Zvm linux
Palestra  IBM-Mack Zvm linux  Palestra  IBM-Mack Zvm linux
Palestra IBM-Mack Zvm linux
 
Ibm symp14 referent_marcus alexander mac dougall_ibm x6 und flex system
Ibm symp14 referent_marcus alexander mac dougall_ibm x6 und flex systemIbm symp14 referent_marcus alexander mac dougall_ibm x6 und flex system
Ibm symp14 referent_marcus alexander mac dougall_ibm x6 und flex system
 
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
 

Technology (1)

  • 11. ^Eserver pSeries Simultaneous Multi-Threading (SMT)  What is it?  Why would I want it? © 2003 Concepts of Solution Design IBM Corporation
  • 12. ^Eserver pSeries Out-of-order processing. Figure: the POWER4 instruction pipeline, showing instruction fetch, instruction crack and group formation, branch redirects, interrupts and flushes, and the branch, load/store, fixed-point, and floating-point pipelines (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit). © 2003 Concepts of Solution Design IBM Corporation
  • 13. ^Eserver pSeries Multi-threading evolution  Execution unit utilization is low in today’s microprocessors  Average execution unit utilization is roughly 25% across a broad spectrum of environments. Figure: occupancy of the execution units (FX0, FX1, LS0, LS1, FP0, FP1, BFX, CRL) over successive processor cycles for a single instruction stream fed from the i-cache and memory. © 2003 Concepts of Solution Design IBM Corporation
  • 14. ^Eserver pSeries Coarse-grained multi-threading  Two instruction streams, but only one thread executes at any instant  Hardware swaps in the second thread when a long-latency event occurs  A swap requires several cycles. Figure: occupancy of the execution units (FX0, FX1, LS0, LS1, FP0, FP1, BFX, CRL) over processor cycles, with swaps between the two instruction streams at long-latency events. © 2003 Concepts of Solution Design IBM Corporation
  • 15. ^Eserver pSeries Coarse-grained multi-threading (Cont.)  Processor (for example, RS64-IV) is able to store context for two threads – Rapid switching between threads minimizes lost cycles due to I/O waits and cache misses. – Can yield ~20% improvement for OLTP workloads.  Coarse-grained multi-threading is only beneficial where the number of active threads exceeds twice the number of CPUs – AIX must create a “dummy” thread if there is an insufficient number of real threads. • Unnecessary switches to “dummy” threads can degrade performance ~20% • Does not work with dynamic CPU deallocation © 2003 Concepts of Solution Design IBM Corporation
  • 16. ^Eserver pSeries Fine-grained multi-threading  Variant of coarse-grained multi-threading  Threads execute in round-robin fashion, one per cycle  A cycle remains unused when a thread encounters a long-latency event. Figure: occupancy of the execution units (FX0, FX1, LS0, LS1, FP0, FP1, BFX, CRL) over processor cycles, alternating between the two instruction streams. © 2003 Concepts of Solution Design IBM Corporation
  • 17. ^Eserver pSeries POWER5 pipeline. Figure: the POWER5 instruction pipeline, identical in structure to the POWER4 pipeline, showing instruction fetch, instruction crack and group formation, branch redirects, interrupts and flushes, and the branch, load/store, fixed-point, and floating-point pipelines (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit). © 2003 Concepts of Solution Design IBM Corporation
  • 18. ^Eserver pSeries Simultaneous multi-threading (SMT)  Reducing the number of unused execution unit slots yields a 25-40% performance boost, and sometimes more. Figure: occupancy of the execution units (FX0, FX1, LS0, LS1, FP0, FP1, BFX, CRL) over processor cycles, with instructions from both instruction streams issued in the same cycle. © 2003 Concepts of Solution Design IBM Corporation
  • 19. ^Eserver pSeries Simultaneous multi-threading (SMT) (Cont.)  Each chip appears as a 4-way SMP to software – Allows instructions from two threads to execute simultaneously  Processor resources optimized for enhanced SMT performance – No context switching, no dummy threads  Hardware, POWER Hypervisor, or OS controlled thread priority – Dynamic feedback of shared resources allows for balanced thread execution  Dynamic switching between single and multithreaded mode © 2003 Concepts of Solution Design IBM Corporation
  • 20. ^Eserver pSeries Dynamic resource balancing  Threads share many resources – Global Completion Table, Branch History Table, Translation Lookaside Buffer, and so on  Higher performance realized when resources balanced across threads – Tendency to drift toward extremes accompanied by reduced performance © 2003 Concepts of Solution Design IBM Corporation
  • 21. ^Eserver pSeries Adjustable thread priority  Instances when unbalanced execution is desirable – No work for opposite thread – Thread waiting on lock – Software determined non-uniform balance – Power management  Control instruction decode rate – Software/hardware controls eight priority levels for each thread. Figure: instructions per cycle for thread 0 and thread 1 at hardware thread priority pairs ranging from 0,7 through 7,7 to 7,0 and at 1,1, with single-threaded operation and power save mode indicated. © 2003 Concepts of Solution Design IBM Corporation
  • 22. ^Eserver pSeries Single-threaded operation  Advantageous for execution-unit-limited applications – Floating-point or fixed-point intensive workloads  Execution-unit-limited applications provide minimal performance leverage for SMT – Extra resources necessary for SMT provide a higher performance benefit when dedicated to a single thread  Determined dynamically on a per-processor basis. Figure: thread states (Dormant, Null, Active) with hardware- or software-initiated transitions between them. © 2003 Concepts of Solution Design IBM Corporation
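On AIX 5L V5.3, the SMT mode of a partition can be switched dynamically with the smtctl command. A minimal sketch, assuming a POWER5 partition running AIX 5L V5.3 (check the smtctl documentation on your level for the exact options):

    smtctl                   # show the current SMT status of the partition
    smtctl -m off -w now     # switch the partition to single-threaded mode immediately
    smtctl -m on -w boot     # re-enable SMT at the next operating system reboot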
  • 23. Eserver pSeries © 2003 IBM Corporation Micro-Partitioning
  • 24. ^Eserver pSeries Micro-Partitioning overview  Mainframe inspired technology  Virtualized resources shared by multiple partitions  Benefits – Finer grained resource allocation – More partitions (Up to 254) – Higher resource utilization  New partitioning model – POWER Hypervisor – Virtual processors – Fractional processor capacity partitions – Operating system optimized for Micro-Partitioning exploitation – Virtual I/O © 2003 Concepts of Solution Design IBM Corporation
  • 25. ^Eserver pSeries Processor terminology. Figure: processor terminology, relating the installed physical processors (including deconfigured and inactive CUoD processors) to the dedicated processors used by dedicated processor partitions and to the shared processor pool, from which shared processor partitions draw their entitled capacity through virtual processors; with SMT on, each virtual processor appears as two logical processors. © 2003 Concepts of Solution Design IBM Corporation
  • 26. ^Eserver pSeries Shared processor partitions  Micro-Partitioning allows multiple partitions to share one physical processor  Up to 10 partitions per physical processor  Up to 254 partitions active at the same time  Partition’s resource definition – Minimum, desired, and maximum values for each resource – Processor capacity – Virtual processors – Capped or uncapped • Capacity weight – Dedicated memory • Minimum of 128 MB, in 16 MB increments – Physical or virtual I/O resources. Figure: six LPARs spread across a pool of four physical CPUs. © 2003 Concepts of Solution Design IBM Corporation
  • 27. ^Eserver pSeries Understanding min/max/desired resource values  The desired value for a resource is given to a partition if enough resource is available.  If there is not enough resource to meet the desired value, then a lower amount is allocated.  If there is not enough resource to meet the min value, the partition will not start.  The maximum value is only used as an upper limit for dynamic partitioning operations. © 2003 Concepts of Solution Design IBM Corporation
  • 28. ^Eserver pSeries Partition capacity entitlement  Processing units – 1.0 processing unit represents one physical processor  Entitled processor capacity – Commitment of capacity that is reserved for the partition – Sets the upper limit of processor utilization for capped partitions – Each virtual processor must be granted at least 1/10 of a processing unit of entitlement  Shared processor capacity is always delivered in terms of whole physical processors. Figure: the processing capacity of one physical processor (1.0 processing units) divided into entitlements of 0.5 and 0.4 processing units plus the 0.1 minimum requirement. © 2003 Concepts of Solution Design IBM Corporation
  • 29. ^Eserver pSeries Capped and uncapped partitions  Capped partition – Not allowed to exceed its entitlement  Uncapped partition – Is allowed to exceed its entitlement  Capacity weight – Used for prioritizing uncapped partitions – Value 0-255 – Value of 0 referred to as a “soft cap” © 2003 Concepts of Solution Design IBM Corporation
  • 30. ^Eserver pSeries Partition capacity entitlement example  Shared pool has 2.0 processing units available  LPARs activated in sequence  Partition 1 activated – Min = 1.0, max = 2.0, desired = 1.5 – Starts with 1.5 allocated processing units  Partition 2 activated – Min = 1.0, max = 2.0, desired = 1.0 – Does not start  Partition 3 activated – Min = 0.1, max = 1.0, desired = 0.8 – Starts with 0.5 allocated processing units © 2003 Concepts of Solution Design IBM Corporation
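The activation rule behind this example (give the desired value if it fits, otherwise whatever is left as long as it meets the minimum, otherwise do not start) can be sketched in a few lines. This is an illustration only, not HMC or POWER Hypervisor code; the partition profiles are the ones listed on the slide above:

# Illustrative sketch of the min/desired activation rule for shared
# processor partitions (not actual POWER Hypervisor or HMC logic).

def activate(pool_free, minimum, desired):
    """Return the processing units allocated to a partition, or None if it cannot start."""
    if pool_free >= desired:
        return desired          # enough resource: the desired value is given
    if pool_free >= minimum:
        return pool_free        # less than desired but at least min: allocate what is left
    return None                 # below min: the partition does not start

pool = 2.0
profiles = [("Partition 1", 1.0, 1.5),   # (name, min, desired)
            ("Partition 2", 1.0, 1.0),
            ("Partition 3", 0.1, 0.8)]

for name, mn, want in profiles:
    got = activate(pool, mn, want)
    if got is None:
        print(f"{name}: does not start (only {pool:.1f} units free, min is {mn:.1f})")
    else:
        pool -= got
        print(f"{name}: starts with {got:.1f} allocated processing units")

Run in sequence, this reproduces the slide: Partition 1 starts with 1.5 units, Partition 2 does not start, and Partition 3 starts with the remaining 0.5 units.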
  • 31. ^Eserver pSeries Understanding capacity allocation – An example  A workload is run under different configurations.  The size of the shared pool (number of physical processors) is fixed at 16.  The capacity entitlement for the partition is fixed at 9.5.  No other partitions are active. © 2003 Concepts of Solution Design IBM Corporation
  • 32. ^Eserver pSeries Uncapped – 16 virtual processors  16 virtual processors.  Uncapped.  Can use all available resource.  The workload requires 26 minutes to complete. Figure: Uncapped (16 PPs / 16 VPs / 9.5 CE), processing units consumed (0-15) plotted against elapsed time (0-30 minutes). © 2003 Concepts of Solution Design IBM Corporation
  • 33. ^Eserver pSeries Uncapped – 12 virtual processors  12 virtual processors.  Even though the partition is uncapped, it can use at most 12 processing units.  The workload now requires 27 minutes to complete. Figure: Uncapped (16 PPs / 12 VPs / 9.5 CE), processing units consumed (0-15) plotted against elapsed time (0-30 minutes). © 2003 Concepts of Solution Design IBM Corporation
  • 34. ^Eserver pSeries Capped  The partition is now capped and resource utilization is limited to the capacity entitlement of 9.5. – Capping limits the amount of time each virtual processor is scheduled. – The workload now requires 28 minutes to complete. Figure: Capped (16 PPs / 12 VPs / 9.5 CE), processing units consumed (0-15) plotted against elapsed time (0-30 minutes). © 2003 Concepts of Solution Design IBM Corporation
  • 35. ^Eserver pSeries Dynamic partitioning operations  Add, move, or remove processor capacity – Remove, move, or add entitled shared processor capacity – Change between capped and uncapped processing – Change the weight of an uncapped partition – Add and remove virtual processors • Provided CE / VP remains at least 0.1  Add, move, or remove memory – 16 MB logical memory block  Add, move, or remove physical I/O adapter slots  Add or remove virtual I/O adapter slots  Min/max values defined for LPARs set the bounds within which DLPAR can work © 2003 Concepts of Solution Design IBM Corporation
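These operations are normally driven from the HMC. The commands below are a sketch of the POWER5-era HMC command line, with a hypothetical managed system p570 and partition lpar1; confirm the options against your HMC release before relying on them:

    chhwres -m p570 -r proc -o a -p lpar1 --procunits 0.5    # add 0.5 processing units to the partition
    chhwres -m p570 -r mem -o r -p lpar1 -q 256              # remove 256 MB of memory (a multiple of the 16 MB logical memory block)
    lshwres -m p570 -r proc --level lpar --filter "lpar_names=lpar1"   # verify the new processor allocation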
  • 36. ^Eserver pSeries Dynamic LPAR – Standard on all new systems. Move resources between live partitions. Figure: an HMC managing four partitions (Part#1 Production, Part#2 Test/Dev, Part#3 File/Print, Part#4 Legacy Apps) running AIX 5L and Linux on top of the Hypervisor. © 2003 Concepts of Solution Design IBM Corporation
  • 37. Eserver pSeries © 2003 IBM Corporation Firmware POWER Hypervisor
  • 38. ^Eserver pSeries POWER Hypervisor strategy  New Hypervisor for POWER5 systems – Further convergence with iSeries – But brands will retain unique value propositions – Reduced development effort – Faster time to market  New capabilities on pSeries servers – Shared processor partitions – Virtual I/O  New capability on iSeries servers – Can run AIX 5L © 2003 Concepts of Solution Design IBM Corporation
  • 39. ^Eserver pSeries POWER Hypervisor component sourcing. Figure: POWER Hypervisor components and whether they come from the pSeries or iSeries heritage – H-Call Interface, Nucleus (SLIC), Virtual I/O, Virtual Ethernet, VLAN, bus recovery, dump, drawer concurrent maintenance, slot/tower concurrent maintenance, shared processor LPAR, Capacity on Demand, partition on demand, 255 partitions, location codes, load from flash, message passing, I/O configuration, LAN/VLAN/SCSI IOAs, FSP, NVRAM, HMC, and HSC. © 2003 Concepts of Solution Design IBM Corporation
  • 40. ^Eserver pSeries POWER Hypervisor functions  Same functions as the POWER4 Hypervisor – Dynamic LPAR – Capacity Upgrade on Demand  New, active functions – Dynamic Micro-Partitioning – Shared processor pool – Virtual I/O – Virtual LAN  Machine is always in LPAR mode – Even with all resources dedicated to one OS. Figure: the functions illustrated around the POWER5 chips – dynamic LPAR, dynamic Micro-Partitioning, shared processor pools (CPU 0-3), virtual I/O (disk and LAN), and Capacity Upgrade on Demand (planned versus actual client capacity growth). © 2003 Concepts of Solution Design IBM Corporation
  • 41. ^Eserver pSeries POWER Hypervisor implementation  Design enhancements to previous POWER4 implementation enable the sharing of processors by multiple partitions – Hypervisor decrementer (HDECR) – New Processor Utilization Resource Register (PURR) – Refine virtual processor objects • Does not include physical characteristics of the processor – New Hypervisor calls © 2003 Concepts of Solution Design IBM Corporation
  • 42. ^Eserver pSeries POWER Hypervisor processor dispatch  Manages a set of processors on the machine (the shared processor pool).  POWER5 generates a 10 ms dispatch window. – Minimum allocation is 1 ms per physical processor.  Each virtual processor is guaranteed to get its entitled share of processor cycles during each 10 ms dispatch window. – ms/VP = CE * 10 / VPs  The partition entitlement is evenly distributed among the online virtual processors.  Once a capped partition has received its CE within a dispatch interval, it becomes not-runnable.  A VP dispatched within 1 ms of the end of the dispatch interval will receive half its CE at the start of the next dispatch interval. Figure: virtual processor capacity entitlement for six shared processor partitions dispatched by the POWER Hypervisor onto the shared processor pool (CPU 0-3). © 2003 Concepts of Solution Design IBM Corporation
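A worked illustration of the ms/VP = CE * 10 / VPs rule, using the partition profiles from the dispatch example a few slides further on (a sketch only, not Hypervisor code):

# Per-virtual-processor dispatch time in a 10 ms POWER Hypervisor window,
# using ms/VP = CE * 10 / VPs. Partition profiles are the ones from the
# dispatch example slide; any other values are equally valid inputs.

DISPATCH_WINDOW_MS = 10

partitions = {          # name: (capacity entitlement, virtual processors)
    "LPAR1": (0.8, 2),
    "LPAR2": (0.2, 1),
    "LPAR3": (0.6, 3),
}

for name, (ce, vps) in partitions.items():
    ms_per_vp = ce * DISPATCH_WINDOW_MS / vps
    print(f"{name}: {vps} VPs, CE {ce:.1f} -> {ms_per_vp:.1f} ms per VP per {DISPATCH_WINDOW_MS} ms window")

So LPAR1 gets 4 ms per virtual processor per window, while LPAR2 and LPAR3 each get 2 ms per virtual processor.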
  • 43. ^Eserver pSeries Dispatching and interrupt latencies  Virtual processors have dispatch latency.  Dispatch latency is the time between a virtual processor becoming runnable and actually being dispatched.  Timer interrupts are also affected by this latency.  External interrupts are also affected by this latency. © 2003 Concepts of Solution Design IBM Corporation
  • 44. ^Eserver pSeries Shared processor pool  Processors not associated with dedicated processor partitions.  No fixed relationship between virtual processors and physical processors.  The POWER Hypervisor attempts to use the same physical processor. – Affinity scheduling – Home node. Figure: virtual processor capacity entitlement for six shared processor partitions dispatched by the POWER Hypervisor onto the shared processor pool (CPU 0-3). © 2003 Concepts of Solution Design IBM Corporation
  • 45. ^Eserver pSeries Affinity scheduling  When dispatching a VP, the POWER Hypervisor attempts to preserve affinity by using: – Same physical processor as before, or – Same chip, or – Same MCM  When a physical processor becomes idle, the POWER Hypervisor looks for a runnable VP that: – Has affinity for it, or – Has affinity to no-one, or – Is uncapped  Similar to AIX affinity scheduling © 2003 Concepts of Solution Design IBM Corporation
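The idle-processor search described above can be sketched as a simple preference list. This is illustrative pseudocode only, not the actual POWER Hypervisor dispatcher, and the data structures are assumptions:

# Illustrative sketch of the affinity-scheduling preferences described on
# the slide above; not the actual POWER Hypervisor dispatcher.

def pick_vp_for_idle_processor(processor, runnable_vps):
    """Choose a runnable virtual processor for an idle physical processor."""
    # 1. Prefer a VP that last ran on this physical processor (has affinity for it).
    for vp in runnable_vps:
        if vp.get("home") == processor:
            return vp
    # 2. Otherwise take a VP with no affinity to any processor yet.
    for vp in runnable_vps:
        if vp.get("home") is None:
            return vp
    # 3. Otherwise take an uncapped VP, which may run beyond its entitlement.
    for vp in runnable_vps:
        if vp.get("uncapped"):
            return vp
    return None

# Example: CPU 2 becomes idle with three runnable virtual processors queued.
runnable = [{"name": "lpar3-vp0", "home": 1, "uncapped": False},
            {"name": "lpar1-vp1", "home": 2, "uncapped": False},
            {"name": "lpar2-vp0", "home": None, "uncapped": True}]
print(pick_vp_for_idle_processor(2, runnable)["name"])   # -> lpar1-vp1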
  • 46. ^Eserver pSeries Operating system support  Micro-Partitioning capable operating systems need to be modified to cede a virtual processor when they have no runnable work – Failure to do this results in wasted CPU resources • For example, a partition spends its CE waiting for I/O – Ceding results in better utilization of the pool  May confer the remainder of their timeslice to another VP – For example, to a VP holding a lock  Can be redispatched if they become runnable again during the same dispatch interval © 2003 Concepts of Solution Design IBM Corporation
  • 47. ^Eserver pSeries Example. Figure: two POWER Hypervisor dispatch interval passes (0-10 msec and 10-20 msec) on physical processors 0 and 1, showing when each LPAR’s virtual processors are dispatched and when the processors are idle. LPAR1: capacity entitlement = 0.8 processing units; virtual processors = 2 (capped). LPAR2: capacity entitlement = 0.2 processing units; virtual processors = 1 (capped). LPAR3: capacity entitlement = 0.6 processing units; virtual processors = 3 (capped). © 2003 Concepts of Solution Design IBM Corporation
  • 48. ^Eserver pSeries POWER Hypervisor and virtual I/O  I/O operations without dedicating resources to an individual partition  POWER Hypervisor’s virtual I/O related operations – Provide control and configuration structures for virtual adapter images required by the logical partitions – Operations that allow partitions controlled and secure access to physical I/O adapters in a different partition – The POWER Hypervisor does not own any physical I/O devices; they are owned by an I/O hosting partition  I/O types supported – SCSI – Ethernet – Serial console © 2003 Concepts of Solution Design IBM Corporation
  • 49. ^Eserver pSeries Performance monitoring and accounting  CPU utilization is measured against CE. – An uncapped partition receiving more than its CE will record 100% but will actually be using more.  SMT – Thread priorities compound the variable execution rate. – There are twice as many logical CPUs.  For accounting, processor time within an interval may be incorrectly allocated. – New hardware support is required.  The Processor Utilization Resource Register (PURR) records the actual clock ticks spent executing a partition. – Used by performance commands (for example, new flags) and accounting modules. – Third-party tools will need to be modified. © 2003 Concepts of Solution Design IBM Corporation
  • 50. Eserver pSeries © 2003 IBM Corporation Virtual I/O Server
  • 51. ^Eserver pSeries Virtual I/O Server  Provides an operating environment for virtual I/O administration – Virtual I/O server administration – Restricted scriptable command line user interface (CLI)  Minimum hardware requirements – POWER5 VIO capable machine – Hardware management console – Storage adapter – Physical disk – Ethernet adapter – At least 128 MB of memory  Capabilities of the Virtual I/O Server – Ethernet Adapter Sharing – Virtual SCSI disk • Virtual I/O Server Version 1.1 is supported for selected configurations, which include specific models of EMC, HDS, and STK disk subsystems attached using Fibre Channel – Interacts with AIX and Linux partitions © 2003 Concepts of Solution Design IBM Corporation
  • 52. ^Eserver pSeries Virtual I/O Server (Cont.)  Installation CD when Advanced POWER Virtualization feature is ordered  Configuration approaches for high availability – Virtual I/O Server • LVM mirroring • Multipath I/O • EtherChannel – Second virtual I/O server instance in another partition © 2003 Concepts of Solution Design IBM Corporation
  • 53. ^Eserver pSeries Virtual SCSI  Allows sharing of storage devices  Vital for shared processor partitions – Overcomes potential limit of adapter slots due to Micro-Partitioning – Allows the creation of logical partitions without the need for additional physical resources  Allows attachment of previously unsupported storage solutions © 2003 Concepts of Solution Design IBM Corporation
  • 54. ^Eserver pSeries VSCSI server and client architecture overview  Virtual SCSI is based on a client/server relationship.  The virtual I/O resources are assigned using an HMC.  Virtual SCSI enables sharing of adapters as well as disk devices.  Dynamic LPAR operations are allowed.  Dynamic mapping between physical and virtual resources on the Virtual I/O Server. Figure: a Virtual I/O Server partition exporting logical volumes and a physical disk (SCSI, FC) through VSCSI server adapters to VSCSI client adapters in AIX and Linux client partitions, with the POWER Hypervisor providing the transport. © 2003 Concepts of Solution Design IBM Corporation
  • 55. ^Eserver pSeries Virtual devices  Are defined as LVs in the I/O Server partition – Normal LV rules apply  Appear as real devices (hdisks) in the hosted partition  Can be manipulated using the Logical Volume Manager just like an ordinary physical disk  Can be used as a boot device and as a NIM target  Can be shared by multiple clients. Figure: a logical volume in the Virtual I/O Server partition surfaced through a VSCSI server adapter, the POWER Hypervisor, and a VSCSI client adapter as a virtual disk (hdisk) in the client partition. © 2003 Concepts of Solution Design IBM Corporation
  • 56. ^Eserver pSeries SCSI RDMA and Logical Remote Direct Memory Access  SCSI transport protocols define the rules for exchanging information between SCSI initiators and targets.  Virtual SCSI uses the SCSI RDMA Protocol (SRP). – SCSI initiators and targets have the ability to directly transfer information between their respective address spaces.  SCSI requests and responses are sent using the virtual SCSI adapters.  The actual data transfer, however, is done using the Logical Remote Direct Memory Access (LRDMA) protocol. Figure: the VSCSI initiator device driver in the client partition and the VSCSI target device driver in the Virtual I/O Server partition exchange commands and responses over the POWER Hypervisor’s reliable command/response transport, while data buffers are moved by LRDMA to the physical adapter and its device driver. © 2003 Concepts of Solution Design IBM Corporation
  • 57. ^Eserver pSeries Virtual SCSI security  Only the owning partition has access to its data.  Data is copied directly from the PCI adapter to the client’s memory. © 2003 Concepts of Solution Design IBM Corporation
  • 58. ^Eserver pSeries Performance considerations  Virtual SCSI requires roughly twice as many processor cycles as a locally attached disk I/O (distributed evenly between the client partition and the Virtual I/O Server) – The path of each virtual I/O request involves several sources of overhead that are not present in a non-virtual I/O request. – For a virtual disk backed by the LVM, there is also the performance impact of going through the LVM and disk device drivers twice.  If multiple partitions are competing for resources from a VSCSI server, care must be taken to ensure enough server resources (CPU, memory, and disk) are allocated to do the job.  If not constrained by CPU performance, dedicated partition throughput is comparable to doing local I/O.  Because there is no caching in memory on the server I/O partition, its memory requirements should be modest. © 2003 Concepts of Solution Design IBM Corporation
  • 59. ^Eserver pSeries Limitations  The hosting partition must be available before the hosted partition boots.  Virtual SCSI supports FC, parallel SCSI, and SCSI RAID.  Maximum of 65535 virtual slots in the I/O server partition.  Maximum of 256 virtual slots on a single partition.  Support for all mandatory SCSI commands.  Not all optional SCSI commands are supported. © 2003 Concepts of Solution Design IBM Corporation
  • 60. ^Eserver pSeries Implementation guideline  Partitions with high performance and disk I/O requirements are not recommended for implementing VSCSI.  Partitions with very low performance and disk I/O requirements can be configured at minimum expense to use only a portion of a logical volume.  Boot disks for the operating system.  Web servers that will typically cache a lot of data. © 2003 Concepts of Solution Design IBM Corporation
  • 61. ^Eserver pSeries LVM mirroring  This configuration protects virtual disks in a client partition against failure of: – One physical disk – One physical adapter – One Virtual I/O Server  Many possibilities exist to exploit this great function! Figure: a client partition mirroring its virtual disks across two Virtual I/O Server partitions, each with its own physical SCSI adapter and physical disk, connected through VSCSI server and client adapters over the POWER Hypervisor. © 2003 Concepts of Solution Design IBM Corporation
  • 62. ^Eserver pSeries Multipath I/O  This configuration protects virtual disks in a client partition against: – Failure of one physical FC adapter in one I/O server – Failure of one Virtual I/O Server  The physical disk is assigned as a whole to the client partition  Many possibilities exist to exploit this great function! Figure: a client partition using multipath I/O across two VSCSI paths, each through a separate Virtual I/O Server partition with its own physical FC adapter, connected through SAN switches to an ESS physical disk. © 2003 Concepts of Solution Design IBM Corporation
  • 63. ^Eserver pSeries Virtual LAN overview  Virtual network segments on top of physical switch devices.  All nodes in a VLAN can communicate without any L3 routing or inter-VLAN bridging.  VLANs provide: – Increased LAN security – Flexible network deployment over traditional network devices  VLAN support in AIX is based on the IEEE 802.1Q VLAN implementation. – VLAN ID tagging of Ethernet frames – VLAN ID restricted switch ports. Figure: nodes A-1 and A-2 on Switch A, B-1 to B-3 on Switch B, and C-1 and C-2 on Switch C, grouped into VLAN 1 and VLAN 2, with traffic between the two VLANs blocked. © 2003 Concepts of Solution Design IBM Corporation
  • 64. ^Eserver pSeries Virtual Ethernet  Enables inter-partition communication. – In-memory point to point connections  Physical network adapters are not needed.  Similar to high-bandwidth Ethernet connections.  Supports multiple protocols (IPv4, IPv6, and ICMP).  No Advanced POWER Virtualization feature required. – POWER5 Systems – AIX 5L V5.3 or appropriate Linux level – Hardware management console (HMC) © 2003 Concepts of Solution Design IBM Corporation
  • 65. ^Eserver pSeries Virtual Ethernet connections  VLAN technology implementation – Partitions can only access data directed to them.  Virtual Ethernet switch provided by the POWER Hypervisor.  Virtual LAN adapters appear to the OS as physical adapters. – The MAC address is generated by the HMC.  1-3 Gb/s transmission speed. – Support for large MTUs (~64K) on AIX.  Up to 256 virtual Ethernet adapters. – Up to 18 VLANs on each.  Bootable device support for NIM OS installations. Figure: AIX and Linux partitions, each with a virtual Ethernet adapter, connected through the virtual Ethernet switch in the POWER Hypervisor. © 2003 Concepts of Solution Design IBM Corporation
  • 66. ^Eserver pSeries Virtual Ethernet switch  Based on IEEE 802.1Q VLAN standard – OSI-Layer 2 – Optional Virtual LAN ID (VID) – 4094 virtual LANs supported – Up to 18 VIDs per virtual LAN port  Switch configuration through HMC © 2003 Concepts of Solution Design IBM Corporation
  • 67. ^Eserver pSeries How it works. Figure: frame handling at the virtual VLAN switch port – the POWER Hypervisor caches the source MAC; if the frame has no IEEE VLAN header, one is inserted with the VLAN ID configured for the associated switch port, otherwise the VLAN header is checked and the frame is dropped if the port does not allow that VLAN; the destination MAC is then looked up in the table and, if a match for the VLAN number is found, the frame is delivered; otherwise it is passed to the trunk adapter if one is defined, or the packet is dropped. © 2003 Concepts of Solution Design IBM Corporation
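A compact sketch of that forwarding decision (illustrative only; the real logic is implemented inside the POWER Hypervisor, and the data structures here are assumptions):

# Illustrative sketch of the virtual Ethernet switch forwarding decision
# described on the "How it works" slide; not actual POWER Hypervisor code.

def forward(frame, port, mac_table, trunk_defined):
    """Return the action taken for a frame arriving on a virtual switch port."""
    # Untagged frames get the VLAN ID configured for the ingress port.
    vlan = frame.get("vlan")
    if vlan is None:
        vlan = port["pvid"]
    # Tagged frames must carry a VLAN ID that the port is allowed to use.
    elif vlan not in port["allowed_vlans"]:
        return "drop"
    # Known destination MAC on the same VLAN: deliver to that adapter.
    if mac_table.get((frame["dest_mac"], vlan)):
        return "deliver"
    # Unknown destination: hand to the trunk adapter (SEA) if one exists.
    return "pass to trunk adapter" if trunk_defined else "drop"

port = {"pvid": 1, "allowed_vlans": {1, 2}}
table = {("0a:0b:0c:0d:0e:0f", 1): "lpar2-ent0"}
print(forward({"dest_mac": "0a:0b:0c:0d:0e:0f", "vlan": None}, port, table, trunk_defined=True))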
  • 68. ^Eserver pSeries Performance considerations  Virtual Ethernet performance – Throughput scales nearly linearly with the allocated capacity entitlement  Virtual LAN vs. Gigabit Ethernet throughput – The virtual Ethernet adapter has higher raw throughput at all MTU sizes – The in-memory copy is more efficient at larger MTU sizes. Figures: TCP_STREAM throughput per 0.1 entitlement (Mb/s) for CPU entitlements of 0.1 to 1.0 at MTU sizes 1500, 9000, and 65394; and virtual LAN versus Gigabit Ethernet throughput (Mb/s), simplex and duplex, at MTU sizes 1500, 9000, and 65394. © 2003 Concepts of Solution Design IBM Corporation
  • 69. ^Eserver pSeries Limitations  Virtual Ethernet can be used in both shared and dedicated processor partitions provided with the appropriate OS levels.  A mixture of Virtual Ethernet connections, real network adapters, or both are permitted within a partition.  Virtual Ethernet can only connect partitions within a single system.  A system’s processor load is increased when using virtual Ethernet. © 2003 Concepts of Solution Design IBM Corporation
  • 70. ^Eserver pSeries Implementation guideline  Know your environment and the network traffic.  Choose a high MTU size if it makes sense for the network traffic in the Virtual LAN.  Use the MTU size 65394 if you expect a large amount of data to be copied inside your Virtual LAN.  Enable tcp_pmtu_discover and udp_pmtu_discover in conjunction with MTU size 65394.  Do not turn off SMT.  No dedicated CPUs are required for virtual Ethernet performance. © 2003 Concepts of Solution Design IBM Corporation
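On an AIX 5L V5.3 partition, those guidelines might translate into commands like the following. This is a sketch only; en0 is a hypothetical virtual Ethernet interface, and any tunable change should be validated for your workload:

    chdev -l en0 -a mtu=65394     # use the large MTU for traffic that stays inside the Virtual LAN
    no -o tcp_pmtu_discover=1     # enable TCP path MTU discovery
    no -o udp_pmtu_discover=1     # enable UDP path MTU discovery
    smtctl                        # confirm that SMT is still enabled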
  • 71. ^Eserver pSeries Connecting Virtual Ethernet to external networks  Routing – The partition that routes the traffic to the external network does not necessarily have to be the Virtual I/O Server. Figure: two systems whose virtual Ethernet segments (3.1.1.x and 4.1.1.x) reach the external IP subnets 1.1.1.x and 2.1.1.x through an AIX or Linux partition that owns a physical adapter and acts as an IP router. © 2003 Concepts of Solution Design IBM Corporation
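With the routing approach, the partition that owns the physical adapter only needs IP forwarding enabled, and the other partitions point a route at it. A sketch using the addresses from the figure above (adjust to your own subnets):

    no -o ipforwarding=1                                     # on the routing partition: forward packets between its interfaces
    route add -net 1.1.1.0 -netmask 255.255.255.0 3.1.1.1    # on a client partition: reach the external subnet via the router's virtual address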
  • 72. ^Eserver pSeries Shared Ethernet Adapter  Connects internal and external VLANs using one physical adapter.  SEA is a new service that acts as a layer 2 network switch. – Securely bridges network traffic from a virtual Ethernet adapter to a real network adapter  SEA service runs in the Virtual I/O Server partition. – Advanced POWER Virtualization feature required – At least one physical Ethernet adapter required  No physical I/O slot and network adapter required in the client partition. © 2003 Concepts of Solution Design IBM Corporation
  • 73. ^Eserver pSeries Shared Ethernet Adapter (Cont.)  Virtual Ethernet MAC addresses are visible to outside systems.  Broadcast/multicast is supported.  ARP (Address Resolution Protocol) and NDP (Neighbor Discovery Protocol) can work across a shared Ethernet.  One SEA can be shared by multiple VLANs, and multiple subnets can connect using a single adapter on the Virtual I/O Server.  The virtual Ethernet adapter configured in the Shared Ethernet Adapter must have the trunk flag set. – The trunk virtual Ethernet adapter enables a layer-2 bridge to a physical adapter  IP fragmentation is performed, or an ICMP packet too big message is sent, when the Shared Ethernet Adapter receives IP (or IPv6) packets that are larger than the MTU of the adapter that the packet is forwarded through. © 2003 Concepts of Solution Design IBM Corporation
  • 74. ^Eserver pSeries Virtual Ethernet and Shared Ethernet Adapter security  VLAN (virtual local area network) tagging is used as described in the IEEE 802.1Q standard.  The implementation of this VLAN standard ensures that the partitions have no access to foreign data.  Only the network adapters (virtual or physical) that are connected to a port (virtual or physical) that belongs to the same VLAN can receive frames with that specific VLAN ID. © 2003 Concepts of Solution Design IBM Corporation
  • 75. ^Eserver pSeries Performance considerations  Virtual I/O Server performance – Adapters stream data at media speed if the Virtual I/O Server has enough capacity entitlement. – CPU utilization per gigabit of throughput is higher with a Shared Ethernet Adapter. Figures: Virtual I/O Server TCP_STREAM throughput (Mb/s) and normalized CPU utilisation (%cpu/Gb) at MTU sizes 1500 and 9000, simplex and duplex. © 2003 Concepts of Solution Design IBM Corporation
  • 76. ^Eserver pSeries Limitations  System processors are used for all communication functions, leading to a significant amount of system processor load.  One of the virtual adapters in the SEA on the Virtual I/O server must be defined as a default adapter with a default PVID.  Up to 16 Virtual Ethernet adapters with 18 VLANs on each can be shared on a single physical network adapter.  Shared Ethernet Adapter requires: – POWER Hypervisor component of POWER5 systems – AIX 5L Version 5.3 or appropriate Linux level © 2003 Concepts of Solution Design IBM Corporation
  • 77. ^Eserver pSeries Implementation guideline  Know your environment and the network traffic.  Use a dedicated network adapter if you expect heavy network traffic between Virtual Ethernet and local networks.  If possible, use dedicated CPUs for the Virtual I/O Server.  Choose 9000 for MTU size, if this makes sense for your network traffic.  Don’t use Shared Ethernet Adapter functionality for latency critical applications.  With MTU size 1500, you need about 1 CPU per gigabit Ethernet adapter streaming at media speed.  With MTU size 9000, 2 Gigabit Ethernet adapters can stream at media speed per CPU. © 2003 Concepts of Solution Design IBM Corporation
  • 78. ^Eserver pSeries Shared Ethernet Adapter configuration  The Virtual I/O Server is configured with at least one physical Ethernet adapter.  One Shared Ethernet Adapter can be shared by multiple VLANs.  Multiple subnets can connect using a single adapter on the Virtual I/O Server. Figure: an AIX partition on VLAN 1 (10.1.1.11) and a Linux partition on VLAN 2 (10.1.2.11) bridged by one Shared Ethernet Adapter over the physical adapter ent0 in the Virtual I/O Server to the external AIX server 10.1.1.14 (VLAN 1) and Linux server 10.1.2.15 (VLAN 2). © 2003 Concepts of Solution Design IBM Corporation
  • 79. ^Eserver pSeries Multiple Shared Ethernet Adapter configuration  Maximizing throughput – Using several Shared Ethernet Adapters – More queues – More performance. Figure: the same AIX (VLAN 1, 10.1.1.11) and Linux (VLAN 2, 10.1.2.11) partitions, with each VLAN bridged by its own Shared Ethernet Adapter and physical adapter (ent0 and ent1) in the Virtual I/O Server to the external AIX server 10.1.1.14 and Linux server 10.1.2.15. © 2003 Concepts of Solution Design IBM Corporation
  • 80. ^Eserver pSeries Multipath routing with dead gateway detection  This configuration protects your access to the external network against: – Failure of one physical network adapter in one I/O server – Failure of one Virtual I/O Server – Failure of one gateway. Figure: an AIX partition with two default routes (to gateway 9.3.5.10 via its 9.3.5.12 interface on VLAN 1, and to gateway 9.3.5.20 via its 9.3.5.22 interface on VLAN 2), each path passing through a different Virtual I/O Server’s Shared Ethernet Adapter (9.3.5.11 and 9.3.5.21) and physical adapter to the external network. © 2003 Concepts of Solution Design IBM Corporation
  • 81. ^Eserver pSeries Shared Ethernet Adapter commands  Virtual I/O Server commands – lsdev -type adapter: Lists all the virtual and physical adapters. – Choose the virtual Ethernet adapter we want to map to the physical Ethernet adapter. – Make sure the physical and virtual interfaces are unconfigured (down or detached). – mkvdev: Maps the physical adapter to the virtual adapter, creates a layer 2 bridge, and defines the default virtual adapter with its default VLAN ID. It creates a new Ethernet interface (for example, ent5). – The mktcpip command is used for TCP/IP configuration on the new Ethernet interface (for example, ent5).  Client partition commands – No new commands are needed; the typical TCP/IP configuration is done on the virtual Ethernet interface that it is defined in the client partition profile on the HMC. © 2003 Concepts of Solution Design IBM Corporation
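Put together, the Shared Ethernet Adapter setup on the Virtual I/O Server CLI looks roughly like the following sketch. The adapter names ent0 and ent2, the resulting SEA interface en5, and the TCP/IP values are hypothetical examples:

    lsdev -type adapter                                          # identify the physical (ent0) and virtual (ent2) Ethernet adapters
    mkvdev -sea ent0 -vadapter ent2 -default ent2 -defaultid 1   # bridge them; a new SEA device such as ent5 is created
    mktcpip -hostname vios1 -inetaddr 9.3.5.150 -interface en5 -netmask 255.255.255.0 -gateway 9.3.5.1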
  • 82. ^Eserver pSeries Virtual SCSI commands  Virtual I/O Server commands – To map a LV: • mkvg: Creates the volume group, where a new LV will be created using the mklv command. • lsdev: Shows the virtual SCSI server adapters that could be used for mapping with the LV. • mkvdev: Maps the virtual SCSI server adapter to the LV. • lsmap -all: Shows the mapping information. – To map a physical disk: • lsdev: Shows the virtual SCSI server adapters that could be used for mapping with a physical disk. • mkvdev: Maps the virtual SCSI server adapter to a physical disk. • lsmap -all: Shows the mapping information.  Client partition commands – No new commands needed; the typical device configuration uses the cfgmgr command. © 2003 Concepts of Solution Design IBM Corporation
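For the logical-volume case, the Virtual I/O Server side of that sequence might look like this sketch (the names rootvg_clients, rootvg_lv1, hdisk2, and vhost0 are hypothetical):

    mkvg -f -vg rootvg_clients hdisk2         # create a volume group on a free physical disk
    mklv -lv rootvg_lv1 rootvg_clients 10G    # create the logical volume that will back the client disk
    lsdev -virtual                            # list virtual devices to find the virtual SCSI server adapter, for example vhost0
    mkvdev -vdev rootvg_lv1 -vadapter vhost0  # map the logical volume to the virtual SCSI server adapter
    lsmap -all                                # verify the mapping

On the AIX client partition, running cfgmgr then discovers the new hdisk as usual.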
  • 83. ^Eserver pSeries Section Review Questions 1. Any technology improvement will boost performance of any client solution. a. True b. False 2. The application of technology in a creative way to solve client’s business problems is one definition of innovation. a. True b. False © 2003 Concepts of Solution Design IBM Corporation
  • 84. ^Eserver pSeries Section Review Questions 3. Client’s satisfaction with your solution can be enhanced by which of the following? a. Setting expectations appropriately. b. Applying technology appropriately. c. Communicating the benefits of the technology to the client. d. All of the above. © 2003 Concepts of Solution Design IBM Corporation
  • 85. ^Eserver pSeries Section Review Questions 4. Which of the following are available with POWER5 architecture? a. Simultaneous Multi-Threading. b. Micro-Partitioning. c. Dynamic power management. d. All of the above. © 2003 Concepts of Solution Design IBM Corporation
  • 86. ^Eserver pSeries Section Review Questions 5. Simultaneous Multi-Threading is the same as hyperthreading, IBM just gave it a different name. a. True. b. False. © 2003 Concepts of Solution Design IBM Corporation
  • 87. ^Eserver pSeries Section Review Questions 6. In order to bridge network traffic between the Virtual Ethernet and external networks, the Virtual I/O Server has to be configured with at least one physical Ethernet adapter. a. True. b. False. © 2003 Concepts of Solution Design IBM Corporation
  • 88. ^Eserver pSeries Review Question Answers © 2003 Concepts of Solution Design IBM Corporation 1. b 2. a 3. d 4. d 5. b 6. a
  • 89. ^Eserver pSeries Unit Summary  You should now be able to: – Describe the relationship between technology and solutions. – List key IBM technologies that are part of the POWER5 products. – Be able to describe the functional benefits that these technologies provide. – Be able to discuss the appropriate use of these technologies. © 2003 Concepts of Solution Design IBM Corporation
  • 90. ^Eserver pSeries Reference  You may find more information here: – IBM eServer pSeries AIX 5L Support for Micro-Partitioning and Simultaneous Multi-threading White Paper – Introduction to Advanced POWER Virtualization on IBM eServer p5 Servers, SG24-7940 – IBM eServer p5 Virtualization – Performance Considerations, SG24-5768 © 2003 Concepts of Solution Design IBM Corporation

Editor's Notes

  1. The pursuit of scientific discovery provides the basis for new technologies which can be incorporated into new and better products which can then enable clients to solve business problems. In this section we will look at some of the technologies that have been introduced recently with the intention of considering how these technologies may be taken into account in our solution design process. This section is divided into two parts and will take approximately two hours to complete. IBM's rich history of discovery and innovation has brought international recognition. In addition to five Nobel prizes, IBM researchers have been recognized with five U.S. National Medals of Technology, five National Medals of Science and 19 memberships in the National Academy of Sciences. IBM Research has more than 46 members of the National Academy of Engineering and well over 300 industry organization fellows. Over the years, we have received international recognition for our discoveries and produced 22,357 patents - nearly 7,000 more than the nearest competitor. But what's more important than the statistics is the effect these discoveries and patents are having in the marketplace -- and that's what really makes something innovative. Our ability to apply advanced technologies rapidly for our clients distinguishes IBM from all other companies. During the past ten years, our notable breakthroughs in technologies such as copper chips, Web caching, data mining and silicon germanium have helped our clients gain competitive advantage. Our continued innovation springs from our creative, dedicated people whose work continues to shape the future for our customers, the I/T industry and the world.
  2. What do you think of when you think of innovation? Can you provide examples from your own experience? How would you define innovation?
  3. That advances in technology can be applied to problems confronting our clients is not an issue; this is what we do! But consider the case where the technology that we provide fails to solve the problem to the client’s satisfaction. When this happens, what might be one of the possible causes of dissatisfaction? It has been demonstrated that the degree of benefit you get from applying a particular technology is directly related to its appropriateness in the situation. Amdahl’s Law shows this relationship. Secondly, the client may have unreasonable expectations. Setting expectations is certainly part of a successful solution design process, but for the purpose of this section we will focus more on the technologies that are available and consider what problems might be solved by them. We will also look at some of the possible misapplications of the technology and their consequences. Can you think of examples in your own experience where expectations were not met by the technology that was provided? Was the reason a failure of the technology, the expectations of the client or both?
  4. This chart shows the POWER4 and POWER5 chips. POWER4: 415 mm2, 115 W @ 1.1 GHz, 156 W @ 1.3 GHz, 174M transistors. POWER4+: 267 mm2, 75 W @ 1.2 GHz, 95 W @ 1.45 GHz, 125 W @ 1.7 GHz, 184M transistors. POWER5: 389 mm2, 167 W @ 1.65 GHz, 276M transistors.
  5. Featuring single- and multithreaded execution, the POWER5 provides higher performance in the single-threaded mode than its POWER4 predecessor at equivalent frequencies. Enhancements include dynamic resource balancing to efficiently allocate system resources to each thread, software-controlled thread prioritization, and dynamic power management to reduce power consumption without affecting performance. The POWER5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. Each processor core has a separate 64 KB L1 instruction cache and a 32 KB L1 data cache. The L1 cache is shared by the two hardware threads of the processor core. Both the processor cores in a chip share a 1.88 MB unified L2. The processor chip houses an L3 cache controller, which provides for an L3 cache directory on the chip. However, the L3 cache itself is on a separate Merged Logic DRAM (MLD) cache chip. The L3 is a 36 MB victim cache of the L2 cache. The L3 cache is shared by both the processor cores of the POWER5 chip. Needless to say, the L2 and L3 caches are shared by all the hardware threads of both processor cores on the chip. Unlike POWER4, which was specifically aimed at high-end server applications, design features of POWER5 are targeted at a broad range of applications from low-end 1-2-way servers to high-end 64-way super-servers. SMPLink is a very low latency switchless interconnect technology that allows nodes to be interconnected as flat SMPs. The actual SMPLink ports come directly off of the POWER5 chip. When connected, the SMPLinks provide a direct path between each POWER5 chip. With the introduction of SMT, more instructions execute per cycle per processor core, thus increasing the core’s and the chip’s total switching power. POWER5 was designed to maintain both binary and structural compatibility with existing POWER4 systems to ensure that binaries continue executing properly and all application optimizations carry forward to newer systems. The rest of the improvements and new features, such as enhancements to the memory subsystem and SMT, are discussed on later charts.
  6. The L1 instruction cache is 2-way set associative with LRU (Least Recently Used) replacement policy. The L1 Instruction cache is also kept coherent with the L2 cache. The L1 data cache is 4-way set associative with LRU replacement policy. The L1 data cache is a store-through design. It never holds modified data. The POWER5 L2 cache is accessed by both cores of the chip. It maintains full hardware coherence within the system and can supply intervention data to cores on other POWER5 chips. L2 is an in-line cache, unlike L1s, which are store-through. It is fully inclusive of the two L1 data caches and L1 instruction caches (one L1 data and instruction cache per core). The 1.88 MB (1,920 KB) L2 is physically implemented in three slices, each 640 KB in size. Each of these three slices have separate L2 cache controllers. Either processor core of the chip can independently access each L2 controller. The L2 slices are 10-way set-associative. 10-way set associativity (vs. 8-way on POWER4) helps to reduce cache contention by allowing more potential storage locations for a given cache line. L3 is a unified 36 MB cache accessed by both cores on the POWER5 processor chip. It maintains full hardware coherence with the system and can supply intervention data to cores on other POWER5 processor chips. Logically, L3 is an inline cache. Actually, L3 is a victim cache of the L2 - that is, all valid cache lines evicted out of the L2 due to associativity (victimized) will be cast out to L3. The L3 is not inclusive of L2; the same line will never reside in both L2 and L3 at the same time. The L3 cache is implemented off-chip as a separate MLD cache chip, but its directory is on the processor chip itself. This helps the processor check the directory after an L2 miss without experiencing off-chip delays. The L3 cache in POWER5 is on the processor side and not on the memory side of the fabric as in POWER4. This is well depicted in the previous chart. This design lets the POWER5 satisfy L2 cache misses more frequently, with hits on the off chip 36 MB MLD L3, thus avoiding traffic on the interchip fabric. References to data not on the on chip L2 cause the system to check the L3 cache before sending requests onto the interchip fabric. The memory controller is also on the POWER5 chip and helps to reduce memory latencies by eliminating driver and receiver delays to an external controller.
  7. The figure shows the high-level structures of POWER4- and POWER5-based systems. The POWER4 handles up to a 32-way symmetric multiprocessor. Going beyond 32 processors increases interprocessor communication, resulting in high traffic on the interconnection fabric. This can cause greater contention and negatively affect system scalability. Moving the level-three (L3) cache from the memory side to the processor side of the fabric allows POWER5 to satisfy level-two (L2) cache misses more frequently, with hits in the 36 MB off-chip L3 cache, and avoiding traffic on the interchip fabric. References to data not resident in the on-chip L2 cache cause the system to check the L3 cache before sending requests on to the interconnection fabric. Moving the L3 cache provides significantly more cache on the processor side than previously available, thus reducing traffic on the fabric and allowing POWER5-based systems to scale to higher levels of symmetric multiprocessing. Initial POWER5 systems support 64 physical processors. The POWER4 includes a 1.41 MB on-chip L2 cache. POWER4+ chips are similar in design to the POWER4, but are fabricated in 130 nm technology rather than the POWER4’s 180 nm technology. The POWER4+ includes a 1.5 MB on-chip L2 cache, whereas the POWER5 supports a 1.875 MB on-chip L2 cache. POWER4 and POWER4+ systems both have 32 MB L3 caches, whereas POWER5 systems have a 36 MB L3 cache. The L3 cache operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In POWER4 and POWER4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the POWER5’s 130 nm technology, the memory controller was moved on chip, eliminating a chip previously needed for the memory controller function. These two changes in the POWER5 also have the significant side benefits of reducing latency to the L3 cache and main memory, as well as reducing the number of chips necessary to build a system.
  8. Simultaneous Multi-Threading is a new technology which is part of the POWER5 architecture. You need to know how it works and what benefits it can provide to your clients. It is not a cure-all! Being able to articulate the advantages clearly is one part of understanding it, being able to set client’s expectations appropriately is another. In this topic we will discuss the evolution of SMT, its function and some guidelines for appropriate use in solution design.
  9. The POWER4 microprocessor is a high-frequency, speculative superscalar machine with out-of-order instruction execution capabilities. Eight independent execution units are capable of executing instructions in parallel, providing a significant performance attribute known as superscalar execution. These include two identical floating-point execution units, each capable of completing a multiply/add instruction each cycle (for a total of four floating-point operations per cycle), two load-store execution units, two fixed-point execution units, a branch execution unit, and a conditional register unit used to perform logical operations on the condition register. To keep these execution units supplied with work, each processor can fetch up to eight instructions per cycle and can dispatch and complete instructions at a rate of up to five per cycle. A processor is capable of tracking over 200 instructions in-flight at any point in time. Instructions may issue and execute out-of-order with respect to the initial instruction stream, but are carefully tracked so as to complete in program order. In addition, instructions may execute speculatively to improve performance when accurate predictions can be made about conditional scenarios. The figure in this chart depicts the POWER4 processor execution pipeline. The deeply pipelined structure of the machine’s design is shown. Each small box represents a stage of the pipeline (a stage is the logic that is performed in a single processor cycle). Note that there is a common pipeline which first handles instruction fetching and group formation, and this then divides into four different pipelines corresponding to four of the five types of execution units in the machine (the CR execution unit is not shown, which is similar to the fixed-point execution unit). All pipelines have a common termination stage, which is the group completion (CP) stage. Instruction fetch, group formation, and dispatch: The instructions that make up a program are read in from storage and are executed by the processor. During each cycle, up to eight instructions may be fetched from cache according to the address in the instruction fetch address register (IFAR) and the fetched instructions are scanned for branches (corresponding to the IF, IC, and BP stages in the figure). Since instructions may be executed out of order, it is necessary to keep track of the program order of all instructions in-flight. In the POWER4 microprocessor, instructions are tracked in groups of one to five instructions rather than as individual instructions. Groups are formed in the pipeline stages D0, D1, D2, and D3. This requires breaking some of the more complex PowerPC instructions down into two or more simpler instructions.
  10. Modern processors have multiple specialized execution units, each of which is capable of handling a small subset of the instruction set architecture – some will handle integer operations, some floating point, and so on. These execution units are capable of operating in parallel and so several instructions of a program may be executing simultaneously. However, conventional processors execute instructions from a single instruction stream. Despite microarchitectural advances, execution unit utilization remains low in today’s microprocessors. It is not unusual to see average execution unit utilization rates of approximately 25% across a broad spectrum of environments. To increase execution unit utilization, designers use thread-level parallelism, in which the physical processor core executes instructions from more than one instruction stream. To the operating system, the physical processor core appears as if it is a symmetric multiprocessor containing two logical processors. There are at least three different methods for handling multiple threads: coarse-grained multi-threading, fine-grained multi-threading, and simultaneous multi-threading (SMT). Let’s take a look at these methods.
  11. In coarse-grained multi-threading, only one thread executes at any instant. When a thread encounters a long-latency event, such as a cache miss, the hardware swaps in a second thread to use the machine’s resources, rather than letting the machine remain idle. By allowing other work to use what otherwise would be idle cycles, this scheme increases overall system throughput. To conserve resources, both threads share many system resources, such as architectural registers. Hence, swapping program control from one thread to another requires several cycles. IBM implemented coarse-grained multi-threading in the IBM pSeries Model 680.
  12. Coarse-grained multi-threading was introduced in IBM’s Star series of processors (for example, the RS64-IV, available in the S85) to improve system performance for many workloads. A multi-threaded processor improves the resource utilization of a processor core by running several hardware threads in parallel. For the Star series, the number of concurrent threads was two. The basic idea is that when one or more threads of a processor are stalled on a long latency event (for example, waiting on a cache miss), other threads try to keep the core busy. However, AIX needed to be aware of the difference between logical and physical processors and had the responsibility for making sure that each logical processor had a dispatchable thread - even to the point of creating idle threads. Note that coarse-grained multi-threading was never widely used by customers. This was partly because it was not enabled by default and required a reboot to activate it, and partly because performance was variable and could, in fact, be negatively affected. For workloads with high thread:processor ratios (for example, TPC-C), HMT can deliver roughly 20% increased performance. In other workloads, for example, Business Intelligence, where the thread:processor ratio is less than 2:1, AIX must create dummy threads for the processor context switch to take place. Switching to and from these dummy threads costs about six machine cycles, whereas without coarse-grained multi-threading active, AIX would not have performed a context switch at all. The other disadvantage of coarse-grained multi-threading was that it disabled Dynamic CPU Deallocation.
  13. A variant of coarse-grained multi-threading is fine-grained multi-threading. Machines of this class execute threads in successive cycles, in round-robin fashion. Accommodating this design requires duplicate hardware facilities. When a thread encounters a long-latency event, its cycles remain unused. POWER4 processors implemented an SMP on a chip, but this is not considered fine-grained multi-threading.
  14. The POWER5 processor core supports both enhanced SMT and single-threaded (ST) operation modes. This chart shows the POWER5’s instruction pipeline, which is identical to the POWER4’s. All pipeline latencies in the POWER5, including the branch misprediction penalty and load-to-use latency with an L1 data cache hit, are the same as in the POWER4. The identical pipeline structure lets optimizations designed for POWER4-based systems perform equally well on POWER5-based systems. In SMT mode, the POWER5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the POWER5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread. Some differences from the POWER4 are: There are 120 physical general purpose registers (GPRs) and 120 physical floating-point registers (FPRs). In single-threaded operation, the POWER5 makes all physical registers available to the single thread, allowing higher instruction-level parallelism. Two groups can commit per cycle, one from each thread. The L1 instruction and data caches are the same size as in the POWER4 (64 KB and 32 KB), but their associativity has doubled to two-way and four-way. The first-level data translation table is now fully associative, but the size remains at 128 entries.
  15. In simultaneous multi-threading (SMT), as in other multithreaded implementations, the processor fetches instructions from more than one thread. What differentiates this implementation is its ability to schedule instructions for execution from all threads concurrently. With SMT, the system dynamically adjusts to the environment, allowing instructions to execute from each thread if possible, and allowing instructions from one thread to utilize all the execution units if the other thread encounters a long latency event. The POWER5 design implements two-way SMT on each of the chip’s two processor cores. Although a higher level of multi-threading is possible, our simulations showed that the added complexity was unjustified. As designers add simultaneous threads to a single physical processor, the marginal performance benefit decreases. In fact, additional multi-threading might decrease performance because of cache thrashing, as data from one thread displaces data needed by another thread.
  16. Which Workloads are Likely to Benefit From Simultaneous Multi-threading? This is a very difficult question to answer, because the performance benefit of simultaneous multi-threading is workload dependent. Most measurements of commercial workloads have shown a 25-40% boost, and a few have been even greater. These measurements were taken in a dedicated partition. Simultaneous multi-threading is also expected to help shared processor partitions. The extra threads give the partition a boost after it is dispatched, because they enable the partition to recover its working set more quickly. Subsequently, they perform as they would in a dedicated partition. It may be somewhat non-intuitive, but simultaneous multi-threading is at its best when the performance of the cache is at its worst. The question may also be answered with the following generalities. Any workload where the majority of individual software threads highly utilize any resource in the processor or memory will benefit little from simultaneous multi-threading. For example, workloads that are heavily floating-point intensive are likely to gain little from simultaneous multi-threading and are the ones most likely to lose performance, because they tend to heavily utilize either the floating-point units or the memory bandwidth. In contrast, workloads that have a very high Cycles Per Instruction (CPI) count tend to utilize processor and memory resources poorly and usually see the greatest simultaneous multi-threading benefit. These large CPIs are usually caused by high cache miss rates from a very large working set. Large commercial workloads typically have this characteristic, although it is somewhat dependent upon whether the two hardware threads share instructions or data or are completely distinct. Workloads that share instructions or data, which would include those that run a lot in the operating system or within a single application, tend to see better SMT benefits. Workloads with low CPI and low cache miss rates tend to see a benefit, but a smaller one.
  17. The objective of dynamic resource balancing is to ensure that the two threads executing on the same processor flow smoothly through the system. Dynamic resource-balancing logic monitors resources such as the GCT and the load miss queue to determine if one thread is hogging resources. For example, if one thread encounters multiple L2 cache load misses, dependent instructions can back up in the issue queues, preventing additional groups from dispatching and slowing down the other thread. To prevent this, the resource-balancing logic detects that a thread has reached a threshold of L2 cache misses and throttles that thread. The other thread can then flow through the machine without encountering congestion from the stalled thread. The POWER5 resource-balancing logic also monitors how many GCT entries each thread is using. If one thread starts to use too many GCT entries, the resource-balancing logic throttles it back to prevent it from blocking the other thread. Depending on the situation, the POWER5 resource-balancing logic has three thread-throttling mechanisms: reducing the thread’s priority; inhibiting the thread’s instruction decoding until the congestion clears; or flushing all the thread’s instructions that are waiting for dispatch and holding the thread’s decoding until the congestion clears.
  18. Adjustable thread priority lets software determine when one thread should have a greater (or lesser) share of execution resources. (All software layers — operating systems, middleware, and applications — can set the thread priority. Some priority levels are reserved for setting by a privileged instruction only.) Reasons for choosing an imbalanced thread priority include the following: a thread is in a spin loop waiting for a lock; a thread has no immediate work to do and is waiting in an idle loop; or one application must run faster than another. The POWER5 microprocessor supports eight software-controlled priority levels for each thread. Level 0 is in effect when a thread is not running. Levels 1 (the lowest) through 7 apply to running threads. The POWER5 chip observes the difference in priority levels between the two threads and gives the one with higher priority additional decode cycles. The figure shows how the difference in thread priority affects the relative performance of each thread. If both threads are at the lowest running priority (level 1), the microprocessor assumes that neither thread is doing meaningful work and throttles the decode rate to conserve power.
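To make the priority mechanism concrete, the small C sketch below models how a priority gap might skew decode bandwidth between the two threads. The 2^(priority difference) weighting is an assumption chosen purely for illustration; the actual POWER5 decode allocation is hardware logic that is not specified here.

```c
/* Toy model of POWER5 adjustable thread priority (illustration only).
 * Assumption: the higher-priority thread receives decode cycles in
 * proportion to 2^(priority difference). This weighting is NOT taken
 * from the source material; it only shows that a larger priority gap
 * skews decode bandwidth further toward the favored thread. */
#include <stdio.h>

static double decode_share(int prio_a, int prio_b)   /* running priorities 1..7 */
{
    int delta = prio_a - prio_b;
    double wa = (delta >= 0) ? (double)(1u << delta) : 1.0;
    double wb = (delta <= 0) ? (double)(1u << -delta) : 1.0;
    return wa / (wa + wb);          /* fraction of decode cycles for thread A */
}

int main(void)
{
    /* Equal priorities: both threads decode equally. */
    printf("prio 4 vs 4 -> thread A gets %.0f%% of decode cycles\n",
           100.0 * decode_share(4, 4));
    /* A thread spinning on a lock lowers its priority and yields decode slots. */
    printf("prio 2 vs 4 -> thread A gets %.0f%% of decode cycles\n",
           100.0 * decode_share(2, 4));
    return 0;
}
```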
  19. Not all applications benefit from SMT. Having two threads executing on the same processor will not increase the performance of applications with execution-unit-limited performance or applications that consume all the chip’s memory bandwidth. For this reason, the POWER5 supports the ST execution mode. In this mode, the POWER5 gives all the physical resources, including the GPR and FPR rename pools, to the active thread, allowing it to achieve higher performance than a POWER4 system at equivalent frequencies. The POWER5 supports two types of single-threaded operation: the inactive thread can be in either a dormant or a null state. From a hardware perspective, the only difference between these states is whether or not the thread awakens on an external or decrementer interrupt. In the dormant state, the operating system boots up in SMT mode, but instructs the hardware to put the thread into the dormant state when there is no work for that thread. To make a dormant thread active, either the active thread executes a special instruction or an external or decrementer interrupt targets the dormant thread. The hardware detects these scenarios and changes the dormant thread to the active state. It is software’s responsibility to restore the architected state of a thread transitioning from the dormant to the active state. When a thread is in the null state, the operating system is unaware of the thread’s existence. As in the dormant state, the operating system does not allocate resources to a null thread. This mode is advantageous if all the system’s executing tasks perform better in ST mode.
  20. Micro-partitioning is a mainframe-inspired technology that is based on two major advances in the area of server virtualization: physical processors and I/O devices have been virtualized, enabling these resources to be shared by multiple partitions. There are several advantages associated with this technology, including finer-grained resource allocations, more partitions, and higher resource utilization. The virtualization of processors requires a new partitioning model, since it is fundamentally different from the partitioning model used on POWER4 processor-based servers, where whole processors are assigned to partitions. These processors are owned by the partition and are not easily shared with other partitions; they may be assigned through manual dynamic logical partitioning (LPAR) procedures. In the new micro-partitioning model, physical processors are abstracted into virtual processors, which are assigned to partitions. These virtual processor objects cannot be shared, but the underlying physical processors are shared, since they are used to actualize virtual processors at the platform level. This sharing is the primary feature of this new partitioning model, and it happens automatically. Note that the virtual processor abstraction is implemented in the hardware and the POWER Hypervisor, a component of firmware. From an operating system perspective, a virtual processor is indistinguishable from a physical processor, unless the operating system has been enhanced to be aware of the difference. The key benefit of implementing partitioning in the hardware and firmware is to allow any operating system to run on POWER5 technology with little or no change. Optionally, for optimal performance, the operating system can be enhanced to exploit micro-partitioning more fully, for example, by voluntarily relinquishing CPU cycles to the POWER Hypervisor when they are not needed. AIX 5L V5.3 is the first version of AIX 5L that includes such enhancements. The system administrator defines the number of virtual processors that may be utilized by a partition as well as the actual physical processor capacity that should be applied to actualize those virtual processors. The system administrator may specify that a fraction of a physical processor be applied to a partition, enabling fractional processor capacity partitions to be created.
  21. The diagram in this chart shows the relationships and new concepts regarding the Micro-Partitioning processor terminology used in this presentation.
Virtual processors: These are the whole number of concurrent operations that the operating system can use on a partition. The processing power can be conceptualized as being spread equally across these virtual processors. Selecting the optimal number of virtual processors depends on the workload in the partition. Some partitions benefit from greater concurrency, while other partitions require greater power. The maximum number of virtual processors per partition is 64.
Dedicated processors: Dedicated processors are whole processors that are assigned to a single partition. If you choose to assign dedicated processors to a logical partition, you must assign at least one processor to that partition. By default, a powered-off logical partition using dedicated processors will have its processors available to the shared processing pool. When the processors are in the shared processing pool, an uncapped partition that needs more processing power can use the idle processing resources. However, when you power on the dedicated partition while the uncapped partition is using the processors, the activated partition will regain all of its processing resources. If you want to prevent dedicated processors from being used in the shared processing pool, you can disable this function using the logical partition profile properties panels on the Hardware Management Console.
Shared processor pool: The POWER Hypervisor schedules shared processor partitions from a set of physical processors that is called the shared processor pool. By definition, these processors are not associated with dedicated partitions.
Deconfigured processor: This is a failing processor left outside the system’s configuration after a dynamic processor deallocation has occurred.
  22. Micro-partitioning allows multiple partitions to share one physical processor. A partition may be defined with a processor capacity as small as 10 processor units, which represents 1/10 of a physical processor. Each processor can be shared by up to 10 shared processor partitions. The shared processor partitions are dispatched and time-sliced on the physical processors under control of the POWER Hypervisor. Micro-partitioning is supported across the entire POWER5 product line, from the entry to the high-end systems. Shared processor partitions still need dedicated memory, but the partition's I/O requirements can be supported through Virtual Ethernet and the Virtual SCSI Server. Utilizing all virtualization features, support for up to 254 shared processor partitions is possible. The shared processor partitions are created and managed by the HMC. When you start creating a partition, you have to choose between a shared processor partition and a dedicated processor partition. When setting up a partition, you have to define the resources that belong to the partition, such as memory and I/O resources. For shared processor partitions, you have to specify the following partition attributes, which are used to define the dimensions and performance characteristics of shared partitions: minimum, desired, and maximum processor capacity; minimum, desired, and maximum number of virtual processors; capped or uncapped; and variable capacity weight.
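The attributes just listed can be pictured as a simple record attached to each shared processor partition. The C structure below is a hypothetical illustration of that grouping; the field names are invented for this sketch and do not correspond to any HMC or Hypervisor interface.

```c
/* Hypothetical grouping of the shared-partition attributes listed above.
 * Field names are invented for this sketch only. */
#include <stdbool.h>
#include <stdio.h>

struct shared_partition_profile {
    double min_capacity, desired_capacity, max_capacity;  /* processing units            */
    int    min_vps, desired_vps, max_vps;                 /* virtual processors          */
    bool   capped;          /* capped partitions cannot exceed their entitlement         */
    int    variable_weight; /* 0-255, meaningful for uncapped partitions only            */
};

int main(void)
{
    /* Example: an uncapped partition entitled to between 0.5 and 2.0 processors. */
    struct shared_partition_profile p = {
        .min_capacity = 0.5, .desired_capacity = 1.5, .max_capacity = 2.0,
        .min_vps = 1, .desired_vps = 3, .max_vps = 4,
        .capped = false, .variable_weight = 128
    };
    printf("desired capacity %.1f processing units on %d virtual processors\n",
           p.desired_capacity, p.desired_vps);
    return 0;
}
```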
  23. Processor capacity attributes are specified in terms of processing units. One processing unit represents one physical processor, and 1.5 processing units are equivalent to one and a half physical processors. For example, a shared processor partition with 2.2 processing units has the equivalent power of 2.2 physical processors. Processor units are also used; they represent the processor percentage allocated to a partition. One processor unit represents one percent of one physical processor, so one hundred processor units are equivalent to one physical processor. Shared processor partitions may be defined with a processor capacity as small as 1/10 of a physical processor. A maximum of 10 partitions may be started for each physical processor in the platform, and a maximum of 254 partitions may be active at the same time. When a partition is started, the system chooses the partition’s entitled processor capacity from the specified capacity range. The value that is chosen represents a commitment of capacity that is reserved for the partition. This capacity cannot be used to start another shared partition; otherwise, capacity could be overcommitted. Preference is given to the desired value, but this value cannot always be used, because there may not be enough unassigned capacity in the system. In that event, a different value is chosen, which must be greater than or equal to the minimum capacity attribute; otherwise, the partition cannot be started. The same basic process applies for selecting the number of online virtual processors, with the extra restriction that each virtual processor must be granted at least 1/10 of a processing unit of entitlement. In this way, the entitled processor capacity may affect the number of virtual processors that are automatically brought online by the system during boot. The maximum number of virtual processors per partition is 64. The POWER Hypervisor saves and restores all necessary processor state when preempting or dispatching virtual processors, which for simultaneous multi-threading-enabled processors means two active thread contexts. The result for shared processors is that two of the logical CPUs are always scheduled together on the same physical processor; these sibling threads are always scheduled in the same partition.
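A short worked example, assuming only the rules stated above (at least 0.1 processing units per online virtual processor and a ceiling of 64 virtual processors per partition), shows how entitled capacity bounds the number of virtual processors that can usefully be brought online. The helper name is illustrative.

```c
/* Sketch of the capacity rules described above: each online virtual
 * processor must be backed by at least 0.1 processing units, and a
 * partition may have at most 64 virtual processors. */
#include <stdio.h>

/* Largest number of virtual processors the entitlement can support. */
static int max_online_vps(double entitled_capacity)
{
    /* Work in whole processor units (1% of a CPU) to avoid floating-point
     * truncation surprises; 10 processor units back one virtual processor. */
    int processor_units = (int)(entitled_capacity * 100.0 + 0.5);
    return processor_units / 10;
}

int main(void)
{
    double entitled = 2.2;        /* the equivalent of 2.2 physical processors */
    int vps = max_online_vps(entitled);
    if (vps > 64)
        vps = 64;                 /* architectural limit per partition         */
    printf("%.1f processing units can back at most %d virtual processors\n",
           entitled, vps);
    return 0;
}
```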
  24. A capped partition is not allowed to exceed its capacity entitlement, while an uncapped partition is; in fact, it may even exceed its maximum processor capacity. An uncapped partition is only limited in its ability to consume cycles by the number of online virtual processors and its variable capacity weight attribute. The variable capacity weight attribute is a number between 0 and 255, which represents the relative share of extra capacity that the partition is eligible to receive. This parameter applies only to uncapped partitions. A partition’s share is computed by dividing its variable capacity weight by the sum of the variable capacity weights of all uncapped partitions. Therefore, a value of 0 may be used to prevent a partition from receiving extra capacity; this is sometimes referred to as a “soft cap”. There is overhead associated with the maintenance of online virtual processors, so clients should carefully consider their capacity requirements before choosing values for these attributes. In general, the values of the minimum, desired, and maximum virtual processor attributes should parallel those of the minimum, desired, and maximum capacity attributes in some fashion. A special allowance should be made for uncapped partitions, since they are allowed to consume more than their entitlement. If the partition is uncapped, the administrator may want to define the desired and maximum virtual processor attributes some percentage above the corresponding entitlement attributes. The exact percentage is installation specific, but 25-50% is a reasonable starting point.
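The weight arithmetic can be shown with a small sketch. The weights and spare capacity below are made-up sample values; the point is only that each uncapped partition's share of unused cycles is its weight divided by the sum of all uncapped weights, and that a weight of 0 yields no extra capacity.

```c
/* Sketch of the variable capacity weight arithmetic described above. */
#include <stdio.h>

int main(void)
{
    int weights[] = { 128, 64, 0 };          /* three uncapped partitions (sample values) */
    int n = sizeof(weights) / sizeof(weights[0]);
    double spare_capacity = 1.5;             /* unused processing units in the pool       */

    int total = 0;
    for (int i = 0; i < n; i++)
        total += weights[i];

    for (int i = 0; i < n; i++) {
        double share = (total > 0) ? (double)weights[i] / total : 0.0;
        /* A weight of 0 acts as a "soft cap": the partition gets no extra capacity. */
        printf("partition %d receives %.2f extra processing units\n",
               i + 1, share * spare_capacity);
    }
    return 0;
}
```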
  25. The following sequence of charts shows the relationship between the different parameters used for controlling processor capacity attributes for a partition. In the example, the size of the shared pool is fixed – as is the capacity entitlement for the partition in which the workload is running. No other partitions are active – this allows the example workload to use all available resources and means that we are ignoring the effects of capacity weights.
  26. This is the baseline for our example. The partition is configured to have 16 virtual processors and is uncapped. Assuming, as we are, that there are no other partitions active, then this workload can use all 16 real processors in the pool. Note that the partition could have more than 16 virtual processors allocated. If that were the case, then all virtual processors would be scheduled and would be time-sliced across the available real processors. We’ll discuss scheduling in detail later. The dark area shows the number of available virtual processors. The lighter area shows the total amount of CPU resource being consumed. The workload completes in 26 minutes.
  27. This is exactly the same workload as before and uses exactly the same total amount of CPU resource. However, the number of virtual processors has been reduced to 12. Consequently, the workload is limited to using the equivalent of 12 real processors' worth of power; that is, a virtual processor cannot use more than one real processor's worth of power. Because of the reduced amount of CPU power available within any given time interval, the workload now requires 27 minutes to complete.
  28. Exactly the same workload as before, but now the partition is capped. For the first time, the capacity entitlement becomes effective, and the total amount of resource available within any given time interval (actually, every 10 ms) is limited to 9.5 processing units, that is, the equivalent of 9.5 real processors' worth of power. Note that all 12 of the virtual processors are being dispatched, but the scheduling algorithm in the POWER Hypervisor limits the amount of time each can be executing. The workload now requires 28 minutes to complete.
  29. One of the advantages of the shared processor architecture is that processor capacity can be changed without impacting applications or middleware. This is accomplished by modifying the entitled capacity or the variable capacity weight of the partition; however, the ability of the partition to utilize this extra capacity is restricted by the number of online virtual processors, so the user may have to increase this number in some cases to take advantage of the extra capacity. The main restriction here is that the CE per VP must remain at least 0.1. The variable capacity weight parameter applies to uncapped partitions. It controls the ability of the partition to receive cycles beyond its entitlement, which is dependent on there being unutilized capacity at the platform level. The client may want to modify this parameter if a partition is getting too much processing capacity or not enough. Real processors can, of course, only be added to or removed from the shared pool itself. If you recall the discussion on defining a partition, you will realize that removal of a processor from the shared pool may mean that the POWER Hypervisor can no longer guarantee the CE for all active partitions. Before the DLPAR operation can be honored, it may therefore be necessary to reduce the CE for some, or all, of the active partitions. Dynamic memory addition and removal is also supported. The only change in this area is that the size of the logical memory block (LMB) has been reduced from 256 MB to 16 MB to allow for thinner partitions. There is no impact associated with these changes, and the new LMB size applies to dedicated partitions also. The size of the LMB can be set at the service console. Notification of changes to these parameters will be provided so that applications, such as license managers, performance analysis tools, and high-level schedulers, can monitor and control the allocation and use of system resources in shared processor partitions. This may be accomplished through scripts, APIs, or kernel services. Other DLPAR operations perform as expected.
  30. Allocate processors, memory, and I/O to create virtual servers:
- Minimum 128 MB memory, one CPU, one PCI-X adapter slot
- All resources can be allocated independently
- Resources can be moved between live partitions
- Applications notified of configuration changes
- Movement can be automated using Partition Load Manager
- Works with AIX 5.2+ or Linux 2.4+
  31. This section provides a description of the new POWER Hypervisor.
  32. A major feature of the new POWER5 machines is a new, active Hypervisor that represents a convergence with iSeries systems. iSeries and pSeries machines will now have a common Hypervisor and common functionality, which will mean reduced development effort and faster time to market for new functions. However, each brand will retain a unique value proposition. New functions provided for pSeries are Shared Processor Partitions and Virtual I/O. Both of these have been available for iSeries on POWER4 systems, and pSeries gets the benefit of using tried and tested microcode to implement these functions on POWER5. iSeries benefits from the POWER Hypervisor convergence as well and gains the ability to run AIX in an LPAR (rather than the more limited PASE environment available today). There are some restrictions for the AIX environment on iSeries (for example, device support), and the primary reason for offering this function is to broaden the range of software applications available to iSeries customers.
  33. This is a simplified diagram showing the sourcing of different elements in the converged POWER Hypervisor. The blue boxes show functions that have been sourced either directly from the existing pSeries POWER4 Hypervisor or from the pSeries architecture. Purple boxes (lighter shading) show those sourced directly from the iSeries SLIC (System Licensed Internal Code) – which is part of OS/400. Some boxes are gradated, and these represent functions that combine elements of the pSeries and iSeries implementation models.
  34. The POWER Hypervisor provides the same basic functions as the POWER4 Hypervisor, plus some new functions designed for shared processor LPARs and virtual I/O. Combined with features designed into the POWER5 processor, the POWER Hypervisor delivers functions that enable other system technologies, including micro-partitioning, virtualized processors, an IEEE VLAN compatible virtual switch, virtual SCSI adapters, and virtual consoles. The POWER Hypervisor is a component of the system’s firmware that is always installed and activated, regardless of system configuration. It operates as a hidden partition, with no entitled capacity assigned to it. Newly architected Hypervisor calls (hcalls) provide a means for the operating system to communicate with the POWER Hypervisor, allowing more efficient usage of physical processor capacity by supporting the scheduling heuristic of minimizing idle time. The POWER Hypervisor is a key component of the functions shown in the chart. It performs the following tasks:
- Provides an abstraction layer between the physical hardware resources and the logical partitions using them
- Enforces partition integrity by providing a security layer between logical partitions
- Controls the dispatch of virtual processors to physical processors
- Saves and restores all processor state information during a logical processor context switch
- Controls hardware I/O interrupt management facilities for logical partitions
  35. The POWER4 processor introduced support for logical partitioning with a new privileged processor state called Hypervisor mode. It is accessed via a Hypervisor call function, which is generated by the operating system kernel running in a partition. Hypervisor mode allows for a secure mode of operation that is required for various system functions where logical partition integrity and security are required. The Hypervisor validates that the partition has ownership of the resources it is attempting to access, such as processor, memory, and I/O, then completes the function. This mechanism allows for complete isolation of partition resources. In the POWER5 processor, further design enhancements are introduced that enable the sharing of processors by multiple partitions. The Hypervisor decrementer (HDECR) is a new hardware facility in the POWER5 design that provides the POWER Hypervisor with a timed interrupt independent of partition activity. HDECR interrupts are routed directly to the POWER Hypervisor, and use only POWER Hypervisor resources to capture state information from the partition. The HDECR is used for fine grained dispatching of multiple partitions on shared processors. It also provides a means for the POWER Hypervisor to dispatch physical processor resources for its own execution. With the addition of shared partitions and SMT, a mechanism was required to track physical processor resource utilization at a processor thread level. System architecture for POWER5 introduces a new register called the processor utilization resource register (PURR) to accomplish this. It provides the partition with an accurate cycle count to measure activity during timeslices dispatched on a physical processor. The PURR is a POWER Hypervisor resource, assigned one per processor thread, that is incremented at a fixed rate whenever the thread running on a virtual processor is dispatched on a physical processor.
  36. Multiple logical partitions configured to run with a pool of shared physical processors require a robust mechanism to guarantee the distribution of available processing cycles. The POWER Hypervisor manages this task in the POWER5 processor-based servers. Each Micro-partition is configured with a specific processor entitlement, based on a quantity of processing units, which is referred to as the partition’s entitled capacity or capacity entitlement (CE). The entitled capacity, along with a defined number of virtual processors, defines the physical processor resource that will be allotted to the partition. The POWER Hypervisor uses the POWER5 HDECR, which is programmed to generate an interrupt every 10 ms, as a timing mechanism for controlling the dispatch of physical processors to system partitions. Each virtual processor is guaranteed to get its entitled share of processor cycles during each 10 ms dispatch window. The minimum amount of resource that the POWER Hypervisor will allocate to a virtual processor within a dispatch cycle is 1 ms of execution time per VP. This gives rise to the current restriction of 10 Micro-Partitions per physical processor. The POWER Hypervisor calculates the amount of time each VP will execute by reference to the CE (as shown on the slide). Note that the calculation for uncapped partitions is more complicated: it involves their capacity weight and depends on there being unused capacity available. The amount of time that a virtual processor runs before it is timesliced is based on the partition entitlement, which is specified indirectly by the system administrator. The partition entitlement is evenly distributed amongst the online virtual processors, so the number of online virtual processors impacts the length of each virtual processor’s dispatch cycle. The POWER Hypervisor uses the architectural metaphor of a “dispatch wheel” with a fixed rotation period of 10 milliseconds to guarantee that each virtual processor receives its share of the entitlement in a timely fashion. Virtual processors are time-sliced through the use of the hardware decrementer, much like the operating system time-slices threads. In general, the POWER Hypervisor uses a very simple scheduling model: processor entitlement is distributed with each turn of the POWER Hypervisor’s dispatch wheel, so each partition is guaranteed a relatively constant stream of service.
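Assuming only the figures given above (a 10 ms dispatch window and a 1 ms minimum slice), the sketch below computes the slice of each window that a virtual processor receives. The real Hypervisor scheduler also factors in uncapped weights and affinity, so this is illustrative arithmetic, not the dispatch algorithm itself.

```c
/* Sketch of the dispatch-wheel arithmetic described above: a virtual
 * processor's slice of a 10 ms window is its share of the partition's
 * entitled capacity, never less than the 1 ms minimum. */
#include <stdio.h>

#define DISPATCH_WINDOW_MS 10.0
#define MIN_SLICE_MS        1.0

static double vp_slice_ms(double entitled_capacity, int online_vps)
{
    double slice = (entitled_capacity / online_vps) * DISPATCH_WINDOW_MS;
    return (slice < MIN_SLICE_MS) ? MIN_SLICE_MS : slice;
}

int main(void)
{
    /* A partition entitled to 0.8 processing units spread over 2 VPs gets
     * 4 ms of physical processor time per VP in every 10 ms window. */
    printf("0.8 CE, 2 VPs -> %.1f ms per VP per window\n", vp_slice_ms(0.8, 2));
    printf("0.1 CE, 1 VP  -> %.1f ms per VP per window\n", vp_slice_ms(0.1, 1));
    return 0;
}
```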
  37. Virtual processors have dispatch latency, since they are scheduled. When a virtual processor is made runnable, it is placed on a run queue by the POWER Hypervisor, where it sits until it is dispatched. The time between these two events is referred to as dispatch latency. The dispatch latency of a virtual processor is a function of the partition entitlement and the number of virtual processors that are online in the partition. Entitlement is equally divided among these online virtual processors, so the number of online virtual processors impacts the length of each virtual processor’s dispatch. The smaller the dispatch cycle, the greater the dispatch latency. Timers have latency issues also. The hardware decrementer is virtualized by the POWER Hypervisor at the virtual processor level, so that timers will interrupt the initiating virtual processor at the designated time. If a virtual processor is not running, then the timer interrupt has to be queued with the virtual processor, since it is delivered in the context of the running virtual processor. External interrupts have latency issues also. External interrupts are routed directly to a partition. When the operating system makes the accept-pending-interrupt Hypervisor call, the POWER Hypervisor, if necessary, dispatches a virtual processor of the target partition to process the interrupt. The POWER Hypervisor provides a mechanism for queuing up external interrupts that is also associated with virtual processors. Whenever this queuing mechanism is used, latencies are introduced. These latency issues are not expected to cause functional problems, but they may present performance problems for real-time applications. To quantify matters, the worst case virtual processor dispatch latency is 18 milliseconds, since the minimum dispatch cycle that is supported at the virtual processor level is one millisecond. This figure is based on the minimum partition entitlement of 1/10 of a physical processor and the 10 millisecond rotation period of the Hypervisor's dispatch wheel. It can be easily visualized by imagining that a virtual processor is scheduled in the first and last portions of two 10 millisecond intervals. In general, if these latencies are too great, then clients may increase entitlement, minimize the number of online virtual processors without reducing entitlement, or use dedicated processor partitions.
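The 18 millisecond figure can be reproduced from the numbers already given: a virtual processor entitled to the minimum 1 ms slice might run at the very start of one 10 ms rotation and at the very end of the next.

```c
/* Worked check of the 18 ms worst-case dispatch latency quoted above. */
#include <stdio.h>

int main(void)
{
    double window_ms = 10.0;   /* dispatch wheel rotation period            */
    double slice_ms  = 1.0;    /* minimum per-VP dispatch cycle             */

    /* Gap between the end of the early slice and the start of the late one. */
    double worst_case_latency = 2.0 * window_ms - 2.0 * slice_ms;
    printf("worst-case dispatch latency: %.0f ms\n", worst_case_latency);
    return 0;
}
```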
  38. The POWER Hypervisor schedules shared processor partitions from a set of physical processors that is called the shared processor pool. By definition, these processors are not associated with dedicated partitions. In shared partitions, there is not a fixed relationship between virtual processors and the physical processors that actualize them. The POWER Hypervisor may use any physical processor in the shared processor pool when it schedules the virtual processor. By default, it attempts to use the same physical processor, but this cannot always be guaranteed. The POWER Hypervisor employs the notion of a home node for virtual processors, enabling it to select the best available physical processor from a memory affinity perspective for the virtual processor that is to be scheduled.
  39. Affinity scheduling is designed to preserve the content of memory caches, so that the working data set of a job can be read or written in the shortest time period possible. Affinity is actively managed by the POWER Hypervisor, since each partition has a completely different context. Currently, there is one shared processor pool, so all virtual processors are implicitly associated with the same pool. The POWER Hypervisor attempts to dispatch work in a way that maximizes processor, cache, and memory affinity. When the POWER Hypervisor is dispatching a VP (for example, at the start of a dispatch interval) it will attempt to use the same physical CPU as this VP was previously dispatched on, or a processor on the same chip, or on the same MCM (or in the same node). If a CPU becomes idle, the POWER Hypervisor will look for work for that processor. Priority will be given to runnable VPs that have an affinity for that processor. If none can be found, then the POWER Hypervisor will select a VP that has affinity to no real processor (for example, because previous affinity has expired) and, finally, will select a VP that is uncapped. The objective of this strategy is to try to improve system scalability by minimizing inter-cache communication.
  40. In general, operating systems and applications running in shared partitions need not be aware that they are sharing processors. However, overall system performance can be significantly improved by minor operating system changes. The main problem is that the POWER Hypervisor cannot distinguish between the OS doing useful work and, for example, spinning on a lock. The result is that the OS may waste much of its CE doing nothing of value. AIX 5L provides support for optimizing overall system performance of shared processor partitions. An OS therefore needs to be modified so that it can signal to the POWER Hypervisor when it is no longer able to schedule work and can give up the remainder of its time slice. This results in better utilization of the real processors in the shared processor pool. The dispatch mechanism utilizes hcalls to communicate between the operating system and the POWER Hypervisor. When a virtual processor is active on a physical processor and the operating system detects an inability to utilize processor cycles, it may cede or confer its cycles back to the POWER Hypervisor, enabling it to schedule another virtual processor on the physical processor for the remainder of the dispatch cycle. Reasons for a cede or confer may include the virtual processor running out of work and becoming idle, entering a spin loop to wait for a resource to free, or waiting for a long-latency access to complete. There is no concept of credit for cycles that are ceded or conferred; entitled cycles not used during a dispatch interval are lost. A virtual processor that has ceded cycles back to the POWER Hypervisor can be reactivated using a prod Hypervisor call. If the operating system running on another virtual processor within the logical partition detects that work is available for one of its idle processors, it can use the prod Hypervisor call to signal the POWER Hypervisor to make the prodded virtual processor runnable again. Once dispatched, this virtual processor resumes execution at the return from the cede Hypervisor call. The “payback” for the OS is that the POWER Hypervisor will redispatch it if it becomes runnable again during the same dispatch interval, allocating it the remainder of its CE if possible. While not required, the use of these primitives is highly desirable for performance reasons, because they improve locking and minimize idle time. Response time and throughput should be improved if these primitives are used. Their use is not required, because the POWER Hypervisor time-slices virtual processors, which enables it to sequence through each virtual processor in a continuous fashion. Forward progress is thus assured without the use of the primitives.
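The sketch below illustrates the shape of such a cooperative idle path. The h_cede() and h_prod() functions are hypothetical stand-ins for the Hypervisor calls an enlightened kernel would issue; the stubs only print what would happen, so this is an illustration of the idea rather than AIX or Linux source.

```c
/* Illustrative shape of an idle loop that cooperates with the Hypervisor. */
#include <stdio.h>

static int pending_work = 0;

static void h_cede(void)             /* give the rest of the dispatch cycle back */
{
    printf("h_cede: remaining entitlement returned to the shared pool\n");
}

static void h_prod(void)             /* another VP signals that work has arrived */
{
    printf("h_prod: ceded virtual processor made runnable again\n");
    pending_work = 1;
}

static void vp_idle_loop(void)
{
    while (!pending_work) {
        /* Nothing to run: ceding avoids burning entitled cycles in an idle
         * spin that the Hypervisor cannot tell apart from useful work. */
        h_cede();
        h_prod();                    /* simulate work arriving on another VP */
    }
    printf("idle loop exits: dispatch the newly runnable thread\n");
}

int main(void)
{
    vp_idle_loop();
    return 0;
}
```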
  41. In this example, there are three logical partitions defined, sharing the processor cycles from two physical processors, spanning two 10 ms Hypervisor dispatch intervals. Logical partition 1 is defined with an entitled capacity of 0.8 processing units and two virtual processors. This allows the partition 80% of one physical processor for each 10 ms dispatch window of the shared processor pool. For each dispatch window, the workload is shown to use 40% of each physical processor during each dispatch interval. It is possible for a virtual processor to be dispatched more than one time during a dispatch interval. Note that in the first dispatch interval, the workload executing on virtual processor 1 is not a continuous utilization of physical processor resource. This can happen if the operating system confers cycles and is then reactivated by a prod Hypervisor call. Logical partition 2 is configured with one virtual processor and a capacity of 0.2 processing units, entitling it to 20% usage of a physical processor during each dispatch interval. In this example, a worst-case dispatch latency is shown for this virtual processor, where the 2 ms are used at the beginning of dispatch interval 1 and at the end of dispatch interval 2, leaving 16 ms between processor allocations. Logical partition 3 contains three virtual processors, with an entitled capacity of 0.6 processing units. Each of the partition’s three virtual processors consumes 20% of a physical processor in each dispatch interval, but in the case of virtual processors 0 and 2, the physical processor they run on changes between dispatch intervals. The POWER Hypervisor does attempt to maintain physical processor affinity when dispatching virtual processors. It will always first try to dispatch the virtual processor on the same physical processor it last ran on and, depending on resource utilization, will broaden its search out to the other processor on the POWER5 chip, then to another chip on the same MCM, then to a chip on another MCM.
  42. This chart introduces POWER Hypervisor involvement in the virtual I/O functions described later. With the introduction of micro-partitioning, the ability to dedicate physical hardware adapter slots to each partition becomes impractical. Virtualization of I/O devices allows many partitions to communicate with each other, and to access networks and storage devices external to the server, without dedicating I/O to an individual partition. Many of the I/O virtualization capabilities introduced with the POWER5 processor-based IBM eServer products are accomplished by functions designed into the POWER Hypervisor. The POWER Hypervisor does not own any physical I/O devices, and it does not provide virtual interfaces to them. All physical I/O devices in the system are owned by logical partitions. Virtual I/O devices are owned by an I/O hosting partition, which provides access to the real hardware that the virtual device is based on. The POWER Hypervisor implements the following operations required by system partitions to support virtual I/O: providing control and configuration structures for the virtual adapter images required by the logical partitions, and operations that allow partitions controlled and secure access to physical I/O adapters in a different partition. Along with the operations listed above, the POWER Hypervisor allows for the virtualization of I/O interrupts. To maintain partition isolation, the POWER Hypervisor controls the hardware interrupt management facilities. Each logical partition is provided controlled access to the interrupt management facilities using hcalls. Virtual I/O adapters and real I/O adapters use the same set of Hypervisor call interfaces. Virtual I/O adapters are defined by system administrators during logical partition definition. Configuration information for the virtual adapters is presented to the partition operating system by the system firmware. Virtual TTY console support: each partition needs to have access to a system console. Tasks such as operating system installation, network setup, and some problem analysis activities require a dedicated system console. The POWER Hypervisor provides a virtual console using a virtual TTY or serial adapter and a set of Hypervisor calls to operate on them. Depending on the system configuration, the operating system console can be provided by the Hardware Management Console (HMC) virtual TTY or by a terminal emulator connected to physical serial ports on the system’s service processor.
  43. Processor utilization is a critical component of metering, performance monitoring, and capacity planning. With respect to POWER5 technologies, two new advances that will be commonly used combine to make the concept of utilization much more complex: partitioning (specifically, shared processor partitioning) and simultaneous multi-threading. Individually, they add complexity to this concept, but together they multiply the complexity. Some changes will be required to performance monitoring and accounting tools to support Micro-Partitioning. One issue that will need to be addressed is that CPU utilization (using traditional monitoring methods) will be recorded against CE. Clearly, an uncapped partition may exceed its CE and may therefore use more than 100% of its entitlement. Similarly, accounting tools (which rely on the 10 ms timer interrupt) may incorrectly record resource utilization for partitions that cede part of their dispatch interval (or which have picked up part of another via a confer Hypervisor call). The POWER5 processor architecture attempts to deal with these complex issues by introducing a new processor register that is intended for measuring utilization. This new register, the Processor Utilization Resource Register (PURR), is used to approximate the time that a virtual processor is actually running on a physical processor. The register advances automatically so that the operating system can always get the current, up-to-date value. The Hypervisor saves and restores the register across virtual processor context switches to simulate a monotonically increasing atomic clock at the virtual processor level.
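A minimal sketch of the accounting idea follows, using made-up register samples: the fraction of an interval that a virtual processor actually spent on a physical processor is the PURR delta divided by the elapsed-time delta over the same interval. Using the processor timebase as the wall-clock reference is an assumption of this sketch, not something stated above.

```c
/* Sketch of PURR-based utilization accounting. Register values are
 * made-up sample numbers, not real measurements. */
#include <stdio.h>

int main(void)
{
    /* Hypothetical samples taken at the start and end of an interval. */
    unsigned long long purr_start = 1000000ULL, purr_end = 1450000ULL;
    unsigned long long tb_start   = 2000000ULL, tb_end   = 3000000ULL;

    double busy_fraction = (double)(purr_end - purr_start) /
                           (double)(tb_end - tb_start);
    /* 45% of the interval was spent dispatched on a physical processor.
     * Compared against entitled capacity instead of elapsed time, an
     * uncapped partition can legitimately exceed 100%. */
    printf("physical processor utilization: %.0f%%\n", busy_fraction * 100.0);
    return 0;
}
```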
  44. The Virtual I/O server is an appliance that provides virtual storage and shared Ethernet capability to client logical partitions on a POWER5 system. It allows a physical adapter on one partition to be shared by one or more partitions, enabling clients to consolidate and potentially minimize the number of physical adapters.
  45. The Virtual I/O Server provides a restricted, scriptable command line user interface (CLI). All aspects of Virtual I/O Server administration are accomplished through the CLI, including device management (physical, virtual, LVM), network configuration, software installation and update, security, user management, installation of OEM software, and maintenance tasks. The creation and deletion of the virtual client and server adapters is managed by the HMC GUI and POWER5 server firmware. The association between the client and server adapters is defined when the virtual adapters are created. The optional Advanced POWER Virtualization hardware feature, which enables micro-partitioning on POWER5 servers, is required to activate the Virtual I/O Server. A small logical partition with enough resources to share with other partitions is required. The following is a list of minimum hardware requirements to create the Virtual I/O Server partition:
- POWER5 server: a Virtual I/O capable machine
- Hardware Management Console: used to create the partition and assign resources
- Storage adapter: the server partition needs at least one storage adapter
- Physical disk: a disk large enough to make sufficiently sized logical volumes on it
- Ethernet adapter: used to securely route network traffic from a virtual Ethernet to a real network adapter
- Memory: at least 128 MB of memory
The Virtual I/O Server provides the Virtual SCSI (VSCSI) target and Shared Ethernet adapter virtual I/O functions to client partitions. This is accomplished by assigning physical devices to the Virtual I/O Server partition, then configuring virtual adapters on the clients to allow communication between the client and the Virtual I/O Server.
  46. Installation of the Virtual I/O Server partition is performed from a special mksysb CD that is provided to customers who order the Advanced POWER Virtualization feature. This is dedicated software for virtual I/O server operations only, so the virtual I/O server software is supported only in virtual I/O server partitions. The Virtual I/O Server partition itself is configured using a command line interface. Defining partition resources, such as virtual Ethernet or virtual disk connections to client systems, requires use of the HMC. The Virtual I/O Server supports the following operating systems as virtual I/O clients: AIX 5L Version 5.3, SUSE LINUX Enterprise Server 9 for POWER, and Red Hat Enterprise Linux AS for POWER Version 3. When we talk about providing high availability for the virtual I/O server, we are talking about incorporating the I/O resources (physical and virtual) on the virtual I/O server, as well as the client partitions, into a configuration that is designed to eliminate single points of failure. The virtual I/O server per se is not highly available. If there is a problem in the virtual I/O server or if it should crash, the client partitions will see I/O errors and will not be able to access the adapters and devices that are backed by the virtual I/O server. However, redundancy can be built into the configuration of the physical and virtual I/O resources at several stages. Since the virtual I/O server is an AIX-based appliance, redundancy for physical devices attached to the virtual I/O server can be provided by using capabilities like LVM mirroring, Multipath I/O, and EtherChannel. When running two instances of the virtual I/O server, you can use LVM mirroring, Multipath I/O, EtherChannel, or multipath routing with dead gateway detection in the client partition to provide highly available access to virtual resources hosted in the separate virtual I/O server partitions.
  47. The virtualization features of the POWER5 platform support up to 254 partitions, while the biggest planned server only provides up to 160 I/O slots. With each partition requiring at least one I/O slot for disk attachment and another one for network attachment, this puts a constraint on the number of partitions. To overcome these physical limitations, I/O resources have to be virtualized. Virtual SCSI provides the means to do this for storage devices. Beyond that, virtual I/O has a value proposition of its own. It allows the creation of logical partitions without the need for additional physical resources. This facilitates on demand computing and server consolidation. Virtual I/O also provides a more economic I/O model by using physical resources more efficiently through sharing. Furthermore, virtual I/O allows attachment of previously unsupported storage solutions. As long as the virtual I/O server supports the attachment of a storage resource, any client partition can access this storage by using virtual SCSI adapters. For example, at the time of writing, there is no native support for EMC storage devices on Linux. By running Linux in a logical partition of a POWER5 server, this becomes possible: a Linux client partition can access the EMC storage through a virtual SCSI adapter. Requests from the virtual adapters are mapped to the physical resources in the virtual I/O server partition. Driver support for the physical resources is therefore only needed in the virtual I/O server partition.
  48. Virtual SCSI is based on a client/server relationship. The virtual I/O server owns the physical resources and acts as the server. The logical partitions that access the virtual I/O resources provided by the virtual I/O server are the clients. The virtual I/O resources are assigned using an HMC. The virtual I/O server partition is often also referred to as the hosting partition, and the client partitions as hosted partitions. Virtual SCSI enables sharing of adapters as well as disk devices. To make a physical or a logical volume available to a client partition, it is assigned to a virtual SCSI server adapter in the virtual I/O server partition. The client partition accesses its assigned disks through a virtual SCSI client adapter; it sees standard SCSI devices and LUNs through this virtual adapter. Virtual SCSI resources can be assigned and removed dynamically. On the HMC, virtual SCSI client and server adapters can be assigned to and removed from a partition using dynamic logical partitioning. The mapping between physical and virtual resources on the virtual I/O server can also be done dynamically. This chart shows an example where one physical disk is split up into two logical volumes inside the virtual I/O server. Each of the two client partitions is assigned one logical volume, which it accesses through a virtual I/O adapter (vSCSI Client Adapter). Inside the partition, the disk is seen as a normal hdisk.
  49. A disk owned by the virtual I/O server can either be exported and assigned to a client partition as a whole or it can be split into several logical volumes. Each of these logical volumes can then be assigned to a different partition. A virtual disk device is mapped by the server VSCSI adapter to a logical volume and presented to the hosted partition as a physical direct access device. There can be many virtual disk devices mapped onto a single physical disk. The system administrator will create a virtual disk device by choosing a logical volume and binding it to a VSCSI server adapter. The virtual I/O adapters are connected to a virtual host bridge, which AIX treats much like a PCI host bridge. It is represented in the ODM as a bus device whose parent is sysplanar0. The virtual I/O adapters are represented as adapter devices with the virtual host bridge as their parent. On the virtual I/O server, each logical volume or physical volume that is exported to a client partition is represented by a virtual target device, which is a child of a virtual SCSI server adapter. On the client partition, the exported disks are visible as normal hdisks; however, they are defined in subclass vscsi. They have a virtual SCSI client adapter as parent. Note that virtual disks can be used as boot devices and as NIM targets. Virtual disks can be shared by multiple clients, allowing for configurations using concurrent LVM, for example.
  50. The SCSI family of standards provides many different transport protocols that define the rules for exchanging information between SCSI initiators and targets. Virtual SCSI uses the SCSI RDMA Protocol (SRP), which defines the rules for exchanging SCSI information in an environment where the SCSI initiators and targets have the ability to directly transfer information between their respective address spaces. SCSI requests and responses are sent using the Virtual SCSI adapters that communicate through the POWER Hypervisor. The actual data transfer however is done directly between a data buffer in the client partition and the physical adapter in the Virtual I/O Server by using the Logical Remote Direct Memory Access (LRDMA) protocol. This chart shows how the data transfer using LRDMA works.
  51. Using Virtual SCSI means the Virtual I/O Server acts like a storage box to provide the data. Instead of a SCSI or Fibre Channel cable, the connection is made by the POWER Hypervisor. The Virtual SCSI device drivers of the I/O Server and the POWER Hypervisor ensure that only the owning partition has access to its data; neither other partitions nor the I/O Server itself can make the client's data visible. Only the control information goes through the I/O Server; the data, however, is copied directly from the PCI adapter to the client's memory.
  52. Enabling VSCSI may not result in a performance benefit. This is because there is an overhead associated with Hypervisor calls, and because of the several steps involved in moving an I/O request from the initiator to the target partition, VSCSI will use additional CPU cycles when processing I/O requests. VSCSI devices will therefore not give the same performance as dedicated devices. The use of Virtual SCSI will roughly double the amount of CPU time needed to perform I/O compared to using directly attached storage; this CPU load is split between the Virtual I/O Server and the Virtual SCSI client. Performance is expected to degrade when multiple partitions are sharing a physical disk, and the actual impact on overall system performance will vary by environment. The base-case configuration is one physical disk dedicated to a partition. The following are general performance considerations when using Virtual SCSI:
- Since VSCSI is a client/server model, CPU utilization will always be higher than doing local I/O. A reasonable expectation is a total of twice as many cycles to do VSCSI as a locally attached disk I/O, more or less evenly distributed between the client and server.
- If multiple partitions are competing for resources from a VSCSI server, care must be taken to ensure enough server resources (CPU, memory, and disk) are allocated to do the job.
- If not constrained by CPU performance, dedicated partition throughput is comparable to doing local I/O.
- There is no data caching in memory on the server partition. Thus, all I/Os that it services are essentially synchronous disk I/Os. Because there is no caching in memory on the server partition, its memory requirements should be modest.
- The path of each virtual I/O request involves several sources of overhead that are not present in a non-virtual I/O request. For a virtual disk backed by the LVM, there is also the performance impact of going through the LVM and disk device drivers twice. (IBM eServer p5 Virtualization - Performance Considerations, SG24-5768)
  53. Supported devices: At the time of writing, virtual SCSI supports Fibre Channel, parallel SCSI, and SCSI RAID devices. Other devices, such as SSA, tape, or CD-ROM, are not supported. Number of adapters: Virtual SCSI itself has no limitation on the number of supported devices or adapters. However, the virtual I/O server partition supports a maximum of 65535 virtual I/O slots, and a maximum of 256 virtual I/O slots can be assigned to a single partition. Every I/O slot needs some resources to be instantiated, so the size of the virtual I/O server limits the number of virtual adapters that can be configured. SCSI commands: The SCSI protocol defines mandatory and optional commands. Virtual SCSI supports all the mandatory commands, but not all optional commands.
  54. VSCSI is not recommended for partitions with high performance and disk I/O requirements. Partitions with very low performance and disk I/O requirements can be configured at minimum expense to use only a logical volume. Using a logical volume for virtual storage means that the number of partitions is no longer limited by the hardware, but the trade-off is that some of the partitions will have less than optimal storage performance. Suitable uses for VSCSI include operating system boot disks and Web servers, which typically cache a lot of data.
  55. This chart shows a virtual I/O server configuration using LVM mirroring on the client partition. The client partition mirrors its logical volumes with LVM across two virtual SCSI client adapters, each assigned to a separate virtual I/O server partition. The two physical disks are each attached to a separate virtual I/O server partition and made available to the client partition through a virtual SCSI server adapter. A sketch of the client-side setup follows.
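A minimal sketch of the corresponding client-side commands, assuming the two virtual disks are seen as hdisk0 and hdisk1 and that rootvg is the volume group being mirrored (names are illustrative):

   # On the client partition: add the second virtual disk to rootvg and
   # mirror the logical volumes across both Virtual I/O Servers.
   extendvg rootvg hdisk1
   mirrorvg rootvg hdisk1
   bosboot -ad /dev/hdisk1            # make the second disk bootable
   bootlist -m normal hdisk0 hdisk1   # allow booting from either disk

With this setup, the client keeps running on the surviving copy if one virtual I/O server or its disk becomes unavailable.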
  56. This chart shows a configuration using Multipath I/O (MPIO) to access an ESS disk. The client partition sees two paths to the physical disk through MPIO. Each path uses a different virtual SCSI client adapter, and each of these adapters is backed by a separate virtual I/O server. This type of configuration only works when the physical disk is assigned as a whole to the client partition; you cannot split the physical disk into logical volumes at the virtual I/O server level. Depending on your SAN topology, each physical adapter could be connected to a separate SAN switch to provide redundancy, and at the physical disk level the ESS provides redundancy because it RAIDs the disks internally.
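On the client, the two paths can be checked and tuned roughly as follows (the disk name and attribute values are examples; this is a sketch, not a tuning recommendation):

   # On the client partition: show both paths to the MPIO-managed disk.
   lspath -l hdisk0

   # Enable periodic path health checking so a failed path is detected
   # and recovered automatically; -P applies the change at the next reboot.
   chdev -l hdisk0 -a hcheck_interval=60 -a hcheck_mode=nonactive -P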
  57. Virtual LAN (VLAN) is a technology for establishing virtual network segments on top of physical switch devices. If configured appropriately, a VLAN definition can straddle multiple switches. Typically, a VLAN is a broadcast domain that lets all nodes in the VLAN communicate with each other without any L3 routing or inter-VLAN bridging. In the diagram shown in this chart, two VLANs (VLAN 1 and 2) are defined on three switches (Switch A, B, and C). Although nodes C-1 and C-2 are physically connected to the same switch C, traffic between the two nodes can be blocked if they belong to different VLANs. To enable communication between VLAN 1 and 2, L3 routing or inter-VLAN bridging must be established between them; this is typically provided by an L3 device. The use of VLANs provides increased LAN security and more flexible network deployment than traditional network devices. VLAN support in AIX is based on the IEEE 802.1Q VLAN implementation, in which a VLAN ID tag is added to each Ethernet frame and the Ethernet switches restrict the frames to ports that are authorized to receive frames with that VLAN ID. Switches also restrict broadcasts to the logical network by ensuring that a broadcast packet is delivered only to ports configured to receive frames with the VLAN ID that the broadcast frame was tagged with. A port on a VLAN-capable switch has a default port VLAN ID (PVID) that indicates the default VLAN the port belongs to; the switch adds the PVID tag to untagged frames received on that port. In addition to the PVID, a port may belong to additional VLANs and have those VLAN IDs assigned to it. A port only accepts untagged packets or packets tagged with a VLAN ID (the PVID or an additional VID) of a VLAN the port belongs to. A port configured in untagged mode is only allowed to have a PVID and receives untagged packets or packets tagged with the PVID; this untagged-port feature helps systems that do not understand VLAN tagging communicate with other systems using standard Ethernet. Each VLAN ID is associated with a separate Ethernet interface to the upper layers (IP and so on) and creates a unique logical Ethernet adapter instance per VLAN (for example, ent1, ent2, and so on). You can configure multiple VLAN logical devices on a single system; each VLAN logical device constitutes an additional Ethernet adapter instance. These logical devices can be used to configure the same Ethernet IP interfaces as are used with physical Ethernet adapters. An illustrative example of creating such a device follows.
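As an illustrative sketch only (the base adapter ent0 and VLAN ID 100 are assumptions), a VLAN logical device can be created on AIX roughly as follows; the smitty vlan fast path offers the same function through SMIT:

   # Create a VLAN logical Ethernet device on top of physical adapter ent0
   # for VLAN ID 100; AIX creates a new adapter instance (for example, ent1).
   mkdev -c adapter -s vlan -t eth -a base_adapter=ent0 -a vlan_tag_id=100
   lsdev -Cc adapter             # the new VLAN adapter instance is listed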
  58. Virtual Ethernet enables inter-partition communication without the need for physical network adapters in each partition. It allows the administrator to define in-memory point-to-point connections between partitions. These connections exhibit characteristics similar to high-bandwidth Ethernet connections and support multiple protocols (IPv4, IPv6, and ICMP). Virtual Ethernet requires a POWER5 system with either AIX 5L V5.3 or the appropriate level of Linux, and a Hardware Management Console (HMC) to define the Virtual Ethernet devices. Virtual Ethernet does not require the purchase of any additional features or software, such as the Advanced POWER Virtualization feature. Virtual Ethernet is also called Virtual LAN or even VLAN, which can be confusing because these terms are also used in network topology; the Virtual Ethernet described here, which uses virtual devices, is not the same as the VLAN concept from network topology, which divides a LAN into sub-LANs.
  59. The Virtual Ethernet connections supported in POWER5 systems use VLAN technology to ensure that partitions can only access data directed to them. The POWER Hypervisor provides a Virtual Ethernet switch function based on the IEEE 802.1Q VLAN standard, which allows partitions to communicate within the same server. Partitions that want to communicate through a Virtual Ethernet channel need an in-memory channel, which the user requests through the HMC. The kernel creates a virtual adapter for each memory channel indicated by the firmware, and the normal AIX configuration routines create the device special files. A virtual LAN adapter appears to the operating system in the same way as a physical adapter. A unique Media Access Control (MAC) address is generated when the user creates a Virtual Ethernet adapter; a prefix value can be assigned for the system so that the generated MAC addresses consist of a common system prefix plus an algorithmically generated part that is unique per adapter. The MAC address of the virtual adapter is generated by the HMC. The transmission speed of Virtual Ethernet adapters is in the range of 1-3 gigabits per second, depending on the maximum transmission unit (MTU) size. Like Gigabit (Gb) Ethernet, the Virtual Ethernet adapter supports the standard MTU size of 1500 bytes and jumbo frames of 9000 bytes. In addition, Virtual Ethernet supports an MTU size of 65280 bytes, which is not available on physical Gb Ethernet and can therefore only be used inside a Virtual Ethernet. A partition can support up to 256 Virtual Ethernet adapters, with each Virtual Ethernet adapter capable of being associated with up to 18 VLANs. The Virtual Ethernet can also be used as a bootable device, which allows tasks such as operating system installation to be performed using NIM.
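Once the virtual adapter defined on the HMC is discovered by AIX (as, say, ent1 with interface en1), configuring it is no different from configuring a physical adapter; the host name and addresses below are placeholders:

   # On the client partition: configure TCP/IP on the virtual Ethernet interface.
   mktcpip -h lpar1 -a 9.3.5.12 -m 255.255.255.0 -i en1
   entstat -d ent1               # confirms the adapter type and its VLAN settings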
  60. The POWER Hypervisor switch is consistent with IEEE 802.1Q. This standard defines the operation of virtual LAN (VLAN) bridges that permit the definition, operation, and administration of VLAN topologies within a bridged LAN infrastructure. The switch works at OSI Layer 2 and supports up to 4096 networks (4096 VLAN IDs). The Hypervisor acts as a virtual Ethernet switch and maintains queues for each VLAN in its own memory. IEEE 802.1Q requires a VLAN ID (VID); in this implementation, specifying a VID is optional. When this option is selected while adding a new Virtual LAN interface at the HMC, a VID can be chosen. Up to 4094 Virtual LANs are supported, and up to 18 VIDs can be configured per Virtual LAN port. The authority to communicate between LPARs is granted by configuring ports on the virtual Ethernet switch maintained by the Hypervisor; the switch configuration is defined using the HMC. When frames are sent across the network, a tag header indicates to which VLAN a frame belongs, which ensures that the switch forwards the frame only to those ports that belong to that VLAN. Untagged packets are handled by adding the port VLAN identifier (PVID) to each frame.
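The VLAN configuration assigned to the virtual switch port on the HMC can be verified from inside a partition; for a virtual Ethernet adapter, the detailed entstat output includes the port VLAN ID and any additional VLAN tag IDs (the adapter name is an example):

   # Display detailed adapter statistics, including VLAN information.
   entstat -d ent1 | grep -i vlan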
  61. When a message arrives at a Logical LAN switch port from a Logical LAN adapter, the POWER Hypervisor caches the message's source MAC address to use as a filter for future messages to the adapter. If the port is configured for VLAN headers, the VLAN header is checked against the port's allowable VLAN list; if the VLAN specified in the message is not in the port's configuration, the message is dropped. Once the message passes the VLAN header check, it proceeds to destination MAC address processing. If the port is not configured for VLAN headers, the Hypervisor (conceptually) inserts a two-byte VLAN header based on the port's configured VLAN number. Next, the destination MAC address is processed by searching the table of cached MAC addresses built from messages received at Logical LAN switch ports (see above). If no match for the MAC address is found and there is no Trunk Adapter defined for the specified VLAN number, the message is dropped; if no match is found but a Trunk Adapter is defined for the specified VLAN number, the message is passed on to the Trunk Adapter. If a MAC address match is found, the associated switch port's configured allowable VLAN number table is scanned for a match to the VLAN number contained in the message's VLAN header; if no match is found, the message is dropped. Finally, the VLAN header configuration of the destination switch port is checked: if the port is configured for VLAN headers, the message is delivered to the destination Logical LAN adapter with any inserted VLAN header included; if the port is configured for no VLAN headers, the VLAN header is removed before the message is delivered to the destination Logical LAN adapter.
  62. The measurements shown were taken on a 4-way POWER5 system running AIX 5L V5.3 with several partitioning configurations. SMT (Simultaneous Multi-Threading) was turned on, and default settings were used for the Virtual LAN adapters and the Gigabit Ethernet adapter. Virtual Ethernet connections generally take more processor time than a local adapter to move a packet (a memory copy instead of DMA). For shared processor partitions, performance will be gated by the partition definitions (for example, entitled capacity and number of processors), and small partitions communicating with each other will experience more packet latency due to partition context switching. In general, high-bandwidth applications should not be deployed in small shared processor partitions. For dedicated partitions, throughput should be comparable to 1 Gigabit Ethernet for small packets and much better than 1 Gigabit Ethernet for large packets; for large packets, Virtual Ethernet communication is limited by copy bandwidth. The throughput of the Virtual Ethernet scales nearly linearly with the allocated capacity entitlement, which shows that there is no measurable overhead when using shared processors instead of dedicated processors for traffic between Virtual LANs. Throughput increases, as expected, with growing MTU sizes (by a factor of roughly 3 from MTU 1500 to 9000, and by a factor of more than 7 from 1500 to 65394). The Virtual Ethernet adapter has higher raw throughput at all MTU sizes; at MTU 9000 the difference in throughput is very large because the in-memory copy that Virtual Ethernet uses to transfer data is more efficient at larger MTU sizes.
  63. The following limitation must be considered when implementing Virtual Ethernet: Virtual Ethernet uses the system processors for all communication functions instead of offloading that load to processors on network adapter cards. As a result, the use of Virtual Ethernet increases the load on the system processors. (Introduction to Advanced POWER Virtualization on IBM eServer p5 Servers, SG24-7940)
  64. Because there is still only limited experience with Virtual LANs, these guidelines should not be taken as a performance guarantee; they are intended for orientation only. Know your environment and the network traffic. Choose as high an MTU size as makes sense for the traffic in the Virtual LAN. Use an MTU size of 65394 if you expect a large amount of data to be copied inside your Virtual LAN. Enable tcp_pmtu_discover and udp_pmtu_discover in conjunction with MTU size 65394 if there is communication with physical adapters. Do not turn off SMT (Simultaneous Multi-Threading) unless your applications demand it. Throughput in Virtual LANs scales linearly with CPU entitlement, so there is no need to give partitions dedicated CPUs solely for Virtual LAN performance. A sketch of these settings follows.
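A sketch of the settings above, assuming the virtual Ethernet interface is en1 (the changes shown take effect immediately but are not made persistent here):

   # Use the large Virtual Ethernet MTU on the interface.
   chdev -l en1 -a mtu=65394

   # Enable path MTU discovery so connections that leave the system through
   # a physical adapter do not try to use the oversized MTU end to end.
   no -o tcp_pmtu_discover=1
   no -o udp_pmtu_discover=1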
  65. There are two ways to connect the Virtual Ethernet, which enables communication between logical partitions on the same server, to an external network: routing and the Shared Ethernet Adapter. By enabling the AIX routing capabilities (the ipforwarding network option), one partition with a physical Ethernet adapter connected to an external network can act as a router. In this type of configuration, the partition that routes the traffic to the external network does not have to be the virtual I/O server; it can be any partition with a connection to the outside world. The client partitions have their default route set to the partition that routes traffic to the external network. This example shows two systems with VLANs. The first one has an internal VLAN with subnet 3.1.1.x, and the other has subnet 4.1.1.x. The first system has a partition that routes the internal VLAN to an external LAN on subnet 1.1.1.x; another server is connected to this subnet as well (1.1.1.10). Similarly, the other system has a partition that routes that system's internal VLAN to the external 2.1.1.x subnet. An external IP router connects the two external subnets together. The sketch below illustrates the configuration on the routing partition and its clients.
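A minimal sketch of this routing approach, assuming the routing partition's internal VLAN address is 3.1.1.1 (a hypothetical value; only the subnets are given in the example):

   # On the routing partition (the one with the physical adapter):
   no -o ipforwarding=1          # enable IP forwarding between interfaces

   # On each client partition on the internal VLAN: send outbound traffic
   # through the routing partition.
   route add default 3.1.1.1

Note that settings made with no and route in this way do not survive a reboot unless they are also made persistent through the usual AIX mechanisms.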
  66. Using a Shared Ethernet Adapter (SEA), you can connect internal and external VLANs using one physical adapter. Shared Ethernet Adapter is a new service that acts as a layer 2 network switch to securely bridge network traffic from a Virtual Ethernet to a real network adapter. The Shared Ethernet Adapter service runs in the Virtual I/O server partition.
  67. The Shared Ethernet Adapter allows partitions to communicate outside the system without having to dedicate a physical I/O slot and a physical network adapter to a client partition. The Shared Ethernet Adapter has the following characteristics: Virtual Ethernet MAC addresses are visible to outside systems; broadcast and multicast are supported; ARP and NDP work across a shared Ethernet. In order to bridge network traffic between the Virtual Ethernet and external networks, the Virtual I/O Server partition has to be configured with at least one physical Ethernet adapter. One Shared Ethernet Adapter can be shared by multiple VLANs, and multiple subnets can connect using a single adapter on the Virtual I/O Server. A Virtual Ethernet adapter configured into a Shared Ethernet Adapter must have the trunk flag set. Once an Ethernet frame is sent from the Virtual Ethernet adapter on a client partition to the POWER Hypervisor, the POWER Hypervisor searches for the destination MAC address within the VLAN; if no such MAC address exists within the VLAN, it forwards the frame to the trunk Virtual Ethernet adapter that is defined on the same VLAN. The trunk Virtual Ethernet adapter enables a layer-2 bridge to a physical adapter. The Shared Ethernet Adapter directs packets based on the VLAN ID tags, which it learns by observing the packets originating from the virtual adapters. One of the virtual adapters in the Shared Ethernet Adapter is designated as the default PVID adapter; Ethernet frames without any VLAN ID tag are directed to this adapter and assigned the default PVID. When the Shared Ethernet Adapter receives IP (or IPv6) packets that are larger than the MTU of the adapter that the packet is forwarded through, either IP fragmentation is performed and the fragments are forwarded, or an ICMP "packet too big" message is returned to the source when the packet cannot be fragmented. A sketch of the Shared Ethernet Adapter setup follows.
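A sketch of creating the Shared Ethernet Adapter on the Virtual I/O Server, assuming ent0 is the physical adapter and ent2 is the trunk virtual Ethernet adapter defined on the HMC with PVID 1 (adapter names are illustrative):

   # On the Virtual I/O Server (padmin): bridge the trunk virtual adapter
   # to the physical adapter; a new SEA device (for example, ent3) is created.
   mkvdev -sea ent0 -vadapter ent2 -default ent2 -defaultid 1
   lsmap -all -net               # verify the Shared Ethernet Adapter mapping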
  68. As with Virtual SCSI, the POWER Hypervisor also provides the connection between partitions when using Virtual Ethernet. Inside the server, the POWER Hypervisor acts like an Ethernet switch. The connection to the external network is made by the Virtual I/O Server's Shared Ethernet function, which acts as a Layer 2 bridge to the physical adapters. The Virtual Ethernet implementation follows the IEEE 802.1Q standard, which describes VLAN (virtual local area network) tagging: a VLAN ID tag is inserted into every Ethernet frame, and the Ethernet switch restricts the frames to the ports that are authorized to receive frames with that VLAN ID. Every port of an Ethernet switch can be configured to be a member of several VLANs. Only the network adapters, virtual or physical, that are connected to a port (virtual or physical) belonging to the same VLAN can receive these frames. The implementation of this VLAN standard ensures that partitions have no access to other partitions' data.
  69. The measurements shown were taken on a 4-way POWER5 system running AIX 5L V5.3 with several partitioning configurations. SMT (Simultaneous Multi-Threading) was turned on, and default settings were used for the Virtual LAN adapters and the Gigabit Ethernet adapter. The Shared Ethernet Adapter allows the physical adapters to stream data at media speed as long as the Virtual I/O Server has enough CPU entitlement. This chart shows the throughput of the Virtual I/O Server at MTU sizes of 1500 and 9000, in both simplex and duplex modes. CPU utilization per gigabit of throughput is higher with the Shared Ethernet Adapter because it has to receive on one side and send out the other, and because of the bridging functionality in the Virtual I/O Server.
  70. You must consider the following limitations when implementing Shared Ethernet Adapters in the Virtual I/O Server: Because the Shared Ethernet Adapter depends on Virtual Ethernet, which uses the system processors for all communication functions, a significant amount of system processor load can be generated by the use of Virtual Ethernet and the Shared Ethernet Adapter. One of the virtual adapters in the Shared Ethernet Adapter on the Virtual I/O Server must be defined as the default adapter with a default PVID; this virtual adapter is designated as the PVID adapter, and Ethernet frames without any VLAN ID tag are assigned the default PVID and directed to it. Up to 16 Virtual Ethernet adapters, each with up to 18 VLANs, can be shared on a single physical network adapter. There is no limit on the number of partitions that can attach to a VLAN, so the theoretical limit is very high; in practice, the amount of network traffic limits the number of clients that can be served through a single adapter. The Shared Ethernet Adapter requires the POWER Hypervisor component of POWER5 systems and therefore cannot be used on POWER4 systems. It also cannot be used with AIX 5L Version 5.2, because the device drivers for Virtual Ethernet are only available for AIX 5L Version 5.3 and Linux; thus, there is no way to connect an AIX 5L Version 5.2 system to a Shared Ethernet Adapter.
  71. Because there is still only limited experience with the Virtual I/O Server and the Shared Ethernet Adapter, these guidelines should not be taken as a performance guarantee; they are intended for orientation only. Know your environment and the network traffic. Do not use the Shared Ethernet Adapter functionality of the Virtual I/O Server if you expect heavy network traffic between Virtual LANs and local networks; use a dedicated network adapter instead. If possible, use dedicated CPUs for the Virtual I/O Server (no shared processors). Choose an MTU size of 9000 if this makes sense for your network traffic. Do not use the Shared Ethernet Adapter functionality of the Virtual I/O Server for latency-critical applications. With an MTU size of 1500, you need about one CPU per Gigabit Ethernet adapter streaming at media speed; with an MTU size of 9000, two Gigabit Ethernet adapters can stream at media speed per CPU.
  72. In order to bridge network traffic between the Virtual Ethernet and external networks, the Virtual I/O Server has to be configured with at least one physical Ethernet adapter. One Shared Ethernet Adapter can be shared by multiple VLANs and multiple subnets can connect using a single adapter on the Virtual I/O Server. The chart shows a configuration example. A Shared Ethernet Adapter can include up to 16 Virtual Ethernet adapters that share the physical access.
  73. There are several different ways to configure physical and Virtual Ethernet adapters into Shared Ethernet Adapters to maximize throughput. Using several Shared Ethernet Adapters provides more queues and therefore more performance. An example of this configuration is shown in this chart.
  74. This chart shows a configuration using multipath routing and dead gateway detection. The client partition has two virtual Ethernet adapters, each assigned to a different VLAN (using the PVID). Each virtual I/O server is configured with a Shared Ethernet Adapter that bridges traffic between the virtual Ethernet and the external network, and each of these Shared Ethernet Adapters is assigned to a different VLAN (using the PVID). By using two VLANs, network traffic is separated so that each virtual Ethernet adapter in the client partition appears to be connected to a different virtual I/O server. In the client partition, two default routes with dead gateway detection are defined: one route goes to gateway 9.3.5.10 through the virtual Ethernet adapter with address 9.3.5.12, and the second default route goes to gateway 9.3.5.20 through the virtual Ethernet adapter with address 9.3.5.22. In case of a failure of the primary route, access to the external network is provided through the second route; AIX detects the route failure and adjusts the cost of the route accordingly. Restriction: It is important to note that multipath routing and dead gateway detection do not make an IP address highly available. In case of the failure of one path, dead gateway detection routes traffic through an alternate path; the network adapters and their IP addresses remain unchanged. Therefore, with multipath routing and dead gateway detection, only your access to the network becomes redundant, not the IP addresses themselves. A sketch of the route setup follows.
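A sketch of the client-side route definitions described above, using the AIX active dead gateway detection route flag and the addresses from the chart (treat the exact flag usage as an assumption to verify against your AIX level):

   # On the client partition: define two default routes, one per virtual
   # Ethernet adapter, with active dead gateway detection enabled.
   route add -active_dgd default 9.3.5.10
   route add -active_dgd default 9.3.5.20
   netstat -rn                   # both default routes should be listed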
  75. For more details, refer to the Introduction to Advanced POWER Virtualization on IBM eServer p5 Servers, SG24-7940 redbook.
  76. For more details, refer to the Introduction to Advanced POWER Virtualization on IBM eServer p5 Servers, SG24-7940 redbook.