Improving Throughput of Simultaneous
            Multithreading (SMT) Processors using
         Application Signatures and Thread Priorities

                     Mitesh R. Meswani
             University of Texas at El Paso (UTEP)



11/20/2008                By Mitesh R. Meswani          1
Simultaneous Multithreading (SMT) Utilization

 [Figure: occupancy of the execution units (FP, FX, LSU) over six processor
  cycles. Top: Single-Threaded execution, leaving many unit/cycle slots idle.
  Bottom: SMT execution, where Thread X uses otherwise unused resources but
  must wait when a shared resource is busy.
  Legend: Thread-X executing, Thread-Y executing, no thread executing.]
       SMT with two hardware threads
           • SMT hardware contexts share most of the processor resources
           • Potential of 2x throughput with perfect resource sharing
            • Throughput gains limited by contention for shared resources
11/20/2008                                  By Mitesh R. Meswani                                              2
Research Question and Hypothesis
 • SMT-performance Tunables:
       – Enable or disable SMT mode
       – Prioritize one hardware thread over the other


 • Research Question: What are the optimal priority
   settings for best processor throughput?

 • Hypothesis: Resource usage observed in Single-Threaded
   mode provides hints for choosing priority settings

11/20/2008                 By Mitesh R. Meswani          3
Dissertation Contributions
 1. Showed that prioritization of threads improves throughput:
    equal priorities (default) are not best for nearly 47% of the SPEC
    CPU2000/CPU2006, Stream, and Lmbench benchmark co-schedules

 2. Defined and captured application “signatures”, which characterize an
    application's resource usage

 3. Showed that a small set of signatures is present in real-world
    applications: 16 signatures are sufficient to represent 95.5% of the
    execution time of SPEC CPU2006 (20) benchmarks, NAS NPB3.2 Serial
    (9) benchmarks, and the PETSc KSP (119) and PETSc Matrix (180) libraries

 4. Developed a prediction methodology using microbenchmarks
    that represent signatures, and showed that the predictions have the
    potential to improve throughput: 87% of PETSc KSP co-schedules
    achieve better throughput with predicted priorities than with the default

11/20/2008                       By Mitesh R. Meswani                       4
Thread Priorities in IBM POWER5
 • Six of the eight priorities are available to the operating system
   in normal mode of operation: 1, 2, 3, 4 (default), 5, and 6
 • The difference between the hardware thread priorities controls
   decode-cycle sharing:
      Thread X      Thread Y      Priority      Thread X        Thread Y
      Priority      Priority      Difference    Decode Cycles   Decode Cycles
      6             1             5             63/64           1/64
      6             2             4             31/32           1/32
      6             3             3             15/16           1/16
      6             4             2             7/8             1/8
      6             5             1             3/4             1/4
      4 (default)   4 (default)   0             1/2             1/2
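
 The decode-cycle split above follows a simple pattern: for a positive
 priority difference d, the favored thread receives 1 - 1/2^(d+1) of the
 decode cycles and the other thread receives 1/2^(d+1). A minimal C sketch
 of this relationship, derived only from the table above rather than from
 POWER5 documentation:

   #include <stdio.h>

   /* Decode-cycle shares for a hardware-thread priority difference d,
    * matching the table above: d = 0 gives 1/2 each, d = 5 gives 63/64 vs 1/64. */
   static void decode_shares(int d, double *favored, double *other)
   {
       *other   = 1.0 / (double)(1u << (d + 1));
       *favored = 1.0 - *other;
   }

   int main(void)
   {
       for (int d = 0; d <= 5; d++) {
           double favored, other;
           decode_shares(d, &favored, &other);
           printf("priority difference %d: %.4f vs %.4f\n", d, favored, other);
       }
       return 0;
   }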

11/20/2008                          By Mitesh R. Meswani                         5
Signatures
 1. Identify Significant Resources: Floating-point unit (FPU),
    Fixed-point unit (FXU), L2 unified cache, and L2 unified TLB


 2. Capture using performance counters


 3. Define utilization levels of resources in Single-Threaded
    mode, forming a signature
      – Ten utilization levels L1 to L10 per resource
      – Example: L1L2L3L9, L9L6L7L8, L2L3L10L6…
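
 As a concrete illustration of step 3, the sketch below builds a signature
 string from four measured utilization percentages, using the ten 10%-wide
 levels defined later in the deck (L1 = 0-10%, ..., L10 = 90-100%); the
 boundary handling and helper names are assumptions, not code from the
 dissertation:

   #include <stdio.h>

   /* Map a utilization percentage (0-100) to a level 1..10, assuming the
    * 10%-wide bins from the "Methodology Overview - 1" slide. Exact
    * boundary handling is an assumption. */
   static int utilization_level(double pct)
   {
       if (pct <= 10.0) return 1;
       if (pct > 90.0)  return 10;
       return (int)((pct - 0.001) / 10.0) + 1;
   }

   /* Build a signature string such as "L1L2L3L9" from the Single-Threaded
    * utilizations of the FPU, FXU, L2 cache, and L2 TLB. */
   static void make_signature(double fpu, double fxu, double l2, double tlb,
                              char *out, size_t len)
   {
       snprintf(out, len, "L%dL%dL%dL%d",
                utilization_level(fpu), utilization_level(fxu),
                utilization_level(l2),  utilization_level(tlb));
   }

   int main(void)
   {
       char sig[16];
       make_signature(7.0, 18.0, 25.0, 88.0, sig, sizeof sig);
       printf("%s\n", sig);   /* prints L1L2L3L9 */
       return 0;
   }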



11/20/2008                    By Mitesh R. Meswani                 6
Work Flow

 Step 1: Find Signatures of Real Applications
   • In Single-Threaded mode, run each serial application and periodically
     sample its performance counters (using the chosen counter settings)
   • Store the resulting signatures in a signature database

 Step 2: Create Signature Microbenchmarks for Frequently Appearing
 Signatures and Empirically Find Priority Predictions
   • Run each signature-microbenchmark pair X, Y in SMT mode under all
     priority combinations i, j and record the CPI of each run
   • Identify the best-case priority for pair X, Y and store the prediction
     in a prediction database

 Step 3: Execute Application Pairs Using Predicted Priorities
   • For an application pair A, B, read the signatures of A and B from the
     signature database
   • If dominating signatures are found for both, read the predicted
     priorities for A and B from the prediction database and run the pair
     with those priorities in SMT mode
   • Otherwise, run pair A, B with equal priorities in SMT mode
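
 A minimal C sketch of the Step 3 decision logic: the tiny in-memory tables
 stand in for the signature and prediction databases, and the single
 prediction entry (bicg/lsqr with signatures L1L2L1L1/L1L3L1L1 mapped to
 priorities 4-6) is taken from the results table later in the deck, while
 the function and structure names are illustrative:

   #include <stdio.h>
   #include <string.h>

   /* Stand-ins for the signature database (Step 1) and the prediction
    * database (Step 2). */
   struct sig_entry  { const char *app;  const char *sig; };
   struct pred_entry { const char *sig_x, *sig_y; int prio_x, prio_y; };

   static const struct sig_entry  sig_db[]  = { { "bicg", "L1L2L1L1" },
                                                { "lsqr", "L1L3L1L1" } };
   static const struct pred_entry pred_db[] = { { "L1L2L1L1", "L1L3L1L1", 4, 6 } };

   /* Return the application's dominating signature, or NULL if none is known. */
   static const char *dominating_signature(const char *app)
   {
       for (size_t i = 0; i < sizeof sig_db / sizeof sig_db[0]; i++)
           if (strcmp(sig_db[i].app, app) == 0) return sig_db[i].sig;
       return NULL;
   }

   /* Step 3: choose priorities for application pair (a, b); fall back to the
    * default equal priorities (4, 4) when no prediction is available. */
   static void choose_priorities(const char *a, const char *b,
                                 int *prio_a, int *prio_b)
   {
       *prio_a = *prio_b = 4;
       const char *sa = dominating_signature(a), *sb = dominating_signature(b);
       if (!sa || !sb) return;
       for (size_t i = 0; i < sizeof pred_db / sizeof pred_db[0]; i++)
           if (strcmp(pred_db[i].sig_x, sa) == 0 &&
               strcmp(pred_db[i].sig_y, sb) == 0) {
               *prio_a = pred_db[i].prio_x;
               *prio_b = pred_db[i].prio_y;
               return;
           }
   }

   int main(void)
   {
       int pa, pb;
       choose_priorities("bicg", "lsqr", &pa, &pb);
       printf("run bicg/lsqr in SMT mode with priorities %d-%d\n", pa, pb);
       return 0;
   }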

11/20/2008                                            By Mitesh R. Meswani                                                    7
Details of Step 1
 • Four groups of counters were measured
 • Each group was measured in a separate run
 • Counters were sampled at one-second intervals

   [Figure: sampling timeline with samples 0-21 for each of the four runs
    (Run 1 - Run 4), starting at Interval 0]

 • The difference in execution time across the 4 runs was negligible
 • For 99% of samples, the differences in instruction and cycle counts
   across the 4 runs were negligible
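
 Per-second sampling of one counter group could look like the sketch below.
 The slides do not name the counter interface used on the POWER5 machine, so
 the old PAPI high-level calls (available in PAPI 5.x and earlier) and the
 chosen events are stand-ins, not the actual measurement code:

   #include <stdio.h>
   #include <unistd.h>
   #include <papi.h>

   int main(void)
   {
       int events[2] = { PAPI_TOT_INS, PAPI_TOT_CYC };   /* illustrative group */
       long long values[2];

       if (PAPI_start_counters(events, 2) != PAPI_OK) {
           fprintf(stderr, "could not start counters\n");
           return 1;
       }
       for (int sample = 0; sample <= 21; sample++) {    /* samples 0..21 */
           sleep(1);                                     /* one-second interval */
           PAPI_read_counters(values, 2);                /* read and reset */
           printf("sample %d: %lld instructions, %lld cycles\n",
                  sample, values[0], values[1]);
       }
       PAPI_stop_counters(values, 2);
       return 0;
   }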

11/20/2008                       By Mitesh R. Meswani                              8
Different Signatures are Present in Real Applications

 [Figure: signature histogram (% of total cycles) for four SPEC CPU2006
  benchmarks (429.mcf, 416.gamess, 444.namd, 462.libquantum) and two PETSc
  KSP library functions (cgs, gmres); the legend lists the signatures
  observed, e.g., L1L1L1L1, L1L2L1L1, L1L2L7L4, L1L1L9L5]
11/20/2008                                                 By Mitesh R. Meswani                                      9
Conclusions
 1. Showed that equal priorities (default) are not the best
    for nearly 47% of applications studied

 2. Only 16 Signatures are sufficient to represent 95.5% of
    execution time of 20 SPEC CPU2006 benchmarks, 9 NAS
    NPB3.2 Serial benchmarks, 119 PETSc KSP, and 180
    PETSc Matrix libraries

 3. Priority predictions using signature benchmarks
    improve throughput over default settings for 87% of the
    15 PETSc KSP coschedules.

11/20/2008               By Mitesh R. Meswani                 10
Applications with Multiple Signatures




11/20/2008                 By Mitesh R. Meswani      11
Future Work and References
 Future Work:
 • Identify applications with multiple signatures
 • Dynamic adaptation of priorities
 • Detecting signatures on the fly
 • Phase detection and Prediction for a truly adaptive system

 References:
 • M. R. Meswani, P. J. Teller, and S. Arunangiri, “A Study of the Influence
    of the POWER5 Dynamic Resource Balancing Hardware on Optimal
    Hardware Thread Priorities,” to appear in the Proceedings of the 2008
    Live Virtual Constructive Conference, Jan. 2009, El Paso, TX.
 • M. R. Meswani and P. J. Teller, “Evaluating the Performance Impact of
    Hardware Thread Priorities in Simultaneous Multithreaded Processors
    using SPEC CPU2000,” Proceedings of the 2nd International Workshop on
    Operating Systems Interference in High Performance Applications, held in
    conjunction with the 15th International Conference on Parallel
    Architectures and Compilation Techniques (PACT06), sponsored by ACM
    and IEEE, September 2006, Seattle, WA.
11/20/2008                     By Mitesh R. Meswani                             12
Acknowledgements
 • This work is supported by AHPCRC Grant W11NF-
   07-2-2007

 • Amir Simon (IBM), for his valuable assistance with
   fixing the firmware of the p550 machine




11/20/2008           By Mitesh R. Meswani             13
Questions?




11/20/2008    By Mitesh R. Meswani   14
EXTRA SLIDES




11/20/2008      By Mitesh R. Meswani   15
Simultaneous Multithreading (SMT)

 [Figure: SMT pipeline diagram. Per-thread resources: Program Counter-X/-Y,
  Instruction Buffer-X/-Y, and Write Back-X/-Y. Shared resources: Instruction
  TLB, Instruction Cache, Instruction Fetch, Decode, FPU, FXU, LSU, Data TLB,
  and Data Cache.
  Legend: Thread-X resource, Thread-Y resource, shared resource.]

               SMT hardware contexts share most of the processor resources.
11/20/2008                                       By Mitesh R. Meswani                                     16
Methodology Overview - 1
 1. Identify significant subset of shared resources
       – Resources Identified: L2 unified cache, L2 unified
         TLB, Floating-point unit (FPU), and Fixed-point unit (FXU)
 2. Identify and validate performance counters
 3. Define utilization levels of resources in Single-Threaded
    mode, forming a signature
       – Ten utilization levels L1 to L10 per resource: L1 is 0%-10%, L2 is
         11%-20%, …, L10 is 90%-100%
       – A signature is represented as utilization levels (L1-L10) of
         FPU, FXU, L2 cache, and L2 TLB.
       – Example: L1L2L3L9, L9L6L7L8, …
 4. An application is said to have a dominating signature if one
    signature is associated with at least 80% of the application's
    execution time
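
 A sketch of the dominating-signature test in item 4, assuming that cycles
 have already been attributed to each distinct signature over the sampled
 intervals (the data layout and names are illustrative):

   #include <stdio.h>

   /* Cycles attributed to each distinct signature observed for an application. */
   struct sig_cycles { const char *signature; long long cycles; };

   /* Return the dominating signature, i.e., the single signature covering at
    * least 80% of the application's execution time, or NULL if none exists. */
   static const char *dominating_signature(const struct sig_cycles *s, int n)
   {
       long long total = 0;
       for (int i = 0; i < n; i++) total += s[i].cycles;
       for (int i = 0; i < n; i++)
           if (s[i].cycles * 100 >= 80 * total)
               return s[i].signature;
       return NULL;
   }

   int main(void)
   {
       struct sig_cycles app[] = { { "L1L2L1L1", 900 }, { "L1L1L1L1", 100 } };
       const char *sig = dominating_signature(app, 2);
       printf("%s\n", sig ? sig : "no dominating signature");
       return 0;
   }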

11/20/2008                      By Mitesh R. Meswani                          17
Results – 2: A Small Subset of Signatures is Sufficient to
             Represent the Majority of the Execution Time of Applications
      16 Signatures are Sufficient to Represent 95.6% of the Execution Time of 20 SPEC
    CPU2006, 9 NAS NPB3.2 Serial, 119 PETSc KSP, and 180 PETSc Matrix Benchmarks

 [Pie chart: share of execution time per signature. The 16 most frequent
  signatures (L1L3L1L1, L1L1L1L1, L1L2L1L1, L2L3L1L1, L3L1L3L1, L4L1L1L1,
  L2L1L1L1, L2L2L1L1, L3L1L1L1, L1L2L2L1, L2L2L2L1, L1L2L3L1, L1L1L2L1,
  L2L1L2L1, L5L1L1L1, L3L1L2L1) individually account for between 1.4% and
  16.3% of execution time; the remaining 19 signatures ("Others") together
  account for 4.4%]
11/20/2008                                     By Mitesh R. Meswani                         18
Results – Priority Predictions using Signature
               Benchmarks can Potentially Improve Throughput
  Prediction  Thread X  Thread X    Thread Y    Thread Y   Best Case  Worst Case
                        Signature               Signature
     6-5       bicg     L1L2L1L1    bicg        L1L2L1L1      6-6        3-6
     4-6       bicg     L1L2L1L1    lsqr        L1L3L1L1      6-6        2-6
     5-6       bicg     L1L2L1L1    tcqmr       L1L1L1L1      6-2        1-6
     6-6       lsqr     L1L3L1L1    lsqr        L1L3L1L1      6-5        1-6
     5-6       lsqr     L1L3L1L1    tcqmr       L1L1L1L1      6-2        1-6
     6-5       tcqmr    L1L1L1L1    tcqmr       L1L1L1L1      6-5        3-6
     6-5       bcgs     L1L1L1L1    bcgs        L1L1L1L1      6-5        2-6
     6-5       bcgs     L1L1L1L1    bicg        L1L2L1L1      6-5        2-6
     6-5       bcgs     L1L1L1L1    cgs         L1L1L1L1      6-5        3-6
     6-5       bcgs     L1L1L1L1    chebychev   L1L1L1L1      6-1        3-6
     6-5       bcgs     L1L1L1L1    cr          L1L1L1L1      6-1        1-6
     6-5       bcgs     L1L1L1L1    gmres       L1L1L1L1      6-1        2-6
     6-5       bcgs     L1L1L1L1    lsqr        L1L3L1L1      6-5        1-6
     6-5       bcgs     L1L1L1L1    richardson  L1L1L1L1      6-1        3-6
     6-5       bcgs     L1L1L1L1    tcqmr       L1L1L1L1      6-1        1-6
                  For 15 PETSc KSP co-schedules, predicted settings
                  • improved throughput over default for 87% of co-schedules,
                  • are the best for 33% of co-schedules, and
                  • are never the worst case settings
11/20/2008                             By Mitesh R. Meswani                                  19
Signatures in Applications
 •   PETSc linear solvers
 •   Identify signatures using performance counters
 •   Results (story):
       – Using a simulator, showed that intelligent settings of
         hardware thread priorities can enhance workload
         performance
       – Critical microarchitecture resource usage “signatures” can
         be used to determine the “best” priorities
       – Different signatures exist in real-world applications and
         have been shown to be useful in enhancing utilization
         and throughput


11/20/2008                   By Mitesh R. Meswani                 20
Signatures and Application Phases

 [Figure: timeline of intervals illustrating phase transitions and
  consecutive phases]

 • Application executions are composed of multiple phases
 • For each phase in Single-Threaded mode, monitor the utilization of shared
   resources (the phase's signature)
 • Resource utilization can be used to estimate the availability of resources
   for other threads
 • Given the signatures of two threads, predict the thread priorities that
   maximize overall throughput
11/20/2008                   By Mitesh R. Meswani                        21
POWER5 Chip
 • POWER5 chip: two identical cores, each core with two SMT
   threads, a 64KB L1 ICache, a 32KB L1 DCache, a shared unified
   1.92MB L2 cache, an off-chip 36MB L3 cache, a 128-entry L1
   ITLB, a 128-entry L1 DTLB, and a 1024-entry unified L2 TLB




11/20/2008          By Mitesh R. Meswani           22
FPU and FXU Benchmark

  FPU Benchmark for Maximum Utilization (99%):
    Loop:
      fadd R0,R0,R0
      …
      fadd R31,R31,R31
      (above block copied four times)
      count++;
      branch to Loop if count<max

  FXU Benchmark for Maximum Utilization (70%):
    Loop:
      addi R0,R0,0
      …
      addi R31,R31,31
      (above block copied six times)
      count++;
      branch to Loop if count<max

 • Each benchmark runs for 100s in Single-Threaded mode
 • Data dependencies and no-ops are introduced to lower utilization levels
 • Utilization achieved:
        – FPU : 10% to 99%
        – FXU : 10% to 70%
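
 For reference, a rough C analogue of the FPU benchmark's structure is shown
 below; the real benchmark issues PowerPC fadd instructions on registers
 R0-R31 directly, so a compiled C loop like this only approximates it, and
 the utilization it reaches depends on the compiler and machine:

   #include <stdio.h>

   #define MAX_ITER 100000000L

   int main(void)
   {
       double r[32];
       for (int i = 0; i < 32; i++) r[i] = (double)i + 0.5;

       for (long count = 0; count < MAX_ITER; count++) {
           /* 32 independent additions, mirroring "fadd Ri,Ri,Ri" */
           for (int i = 0; i < 32; i++)
               r[i] = r[i] + r[i];
           /* keep the values bounded so they never overflow to infinity */
           for (int i = 0; i < 32; i++)
               if (r[i] > 1e100) r[i] = (double)i + 0.5;
       }

       double sum = 0.0;
       for (int i = 0; i < 32; i++) sum += r[i];
       printf("%f\n", sum);   /* use the results so the loop is not removed */
       return 0;
   }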
11/20/2008                             By Mitesh R. Meswani                                       23
L2 Cache and L2 TLB Benchmark

  L2 Cache Benchmark for Maximum Utilization (99%):
  1. Allocate an array bigger than the L2 cache
  2. The first element of cache line 1 points to the first element of line 4,
     which points to the first element of line 7, and so on; the stride is 3
     cache lines
  3. The main body implements pointer chasing:

     for(j=0;j<1000000;j++)
     { elem=(int *)arr[0];      // initialize to point to first element
       while(elem!=NULL)        // continue while not last line
         elem=(int *)*elem;     // load address of line + stride
     }

  L2 TLB Benchmark for Maximum Utilization (99%):
  1. Allocate an array bigger than the number of pages mapped by the TLB
     entries
  2. The first element of a page points to the first element of the next
     page; the stride is one page
  3. The main body implements pointer chasing:

     for(j=0;j<400000;j++)
     { elem=(int *)arr[0];      // initialize to point to first element
       while(elem!=NULL)        // continue while not last page
         elem=(int *)*elem;     // load address of next page
     }

 • Each benchmark runs for 100s in Single-Threaded mode
 • Repeated accesses to an element are introduced in the while loop to
   reduce utilization levels
 • Utilization was achieved in the range of 10% to 99%
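
 The pointer-chain setup described in step 2 can be sketched as follows; the
 array size, line size, and stride below are illustrative parameters rather
 than the exact values used on the POWER5 system:

   #include <stdio.h>
   #include <stdlib.h>

   int main(void)
   {
       const size_t array_bytes = 8u * 1024 * 1024;   /* larger than the L2 cache */
       const size_t stride      = 3u * 128;           /* 3 cache lines of 128B    */
       const size_t n           = array_bytes / sizeof(void *);
       const size_t step        = stride / sizeof(void *);

       void **arr = malloc(n * sizeof(void *));
       if (!arr) return 1;

       /* Element i points to element i + step; the last element in the chain
        * is NULL, terminating the chase. */
       for (size_t i = 0; i < n; i++)
           arr[i] = (i + step < n) ? (void *)&arr[i + step] : NULL;

       /* Pointer chase: every load depends on the previous one, so each
        * iteration touches a new cache line (or page, for the TLB variant). */
       for (int j = 0; j < 1000; j++) {
           void **elem = (void **)arr[0];
           while (elem != NULL)
               elem = (void **)*elem;
       }

       printf("done\n");
       free(arr);
       return 0;
   }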

11/20/2008                                         By Mitesh R. Meswani                                                   24
Multi-resource Signature Benchmark

  LLHH Benchmark:
  1. Allocate an array bigger than the number of pages mapped by the TLB
     entries
  2. The first element of a page points to the first element of the next
     page; the stride is one page
  3. The main body implements pointer chasing and a few floating-point and
     integer operations:

     for(j=0;j<390000;j++)
     { elem=(int *)arr[0];      // initialize to point to first element
       while(elem!=NULL)        // continue while not last page
       { elem=(int *)*elem;     // load address of next page
         8 floating-point additions;
         8 integer additions;
       }
     }

  HHLL Benchmark:
  1. Allocate an array bigger than the L2 cache
  2. The first element of a line points to the first element of the next
     line; the stride is one cache line
  3. The main body consists of floating-point and integer operations and
     pointer chasing:

     for(j=0;j<9000;j++)
     { elem=(int *)arr[0];      // initialize to point to first element
       while(elem!=NULL)        // continue while not last line
       { 168 floating-point additions;
         168 integer additions;
         elem=(int *)*elem;     // load address of next line
       }
     }

 •   The loop body varies the number of FPU and FXU operations and the
     stride of accesses to achieve the desired signature
 •   Each benchmark runs for 100s in Single-Threaded mode
 •   A total of 12 of the 16 possible signatures were developed:
       – Signatures developed: LLLL, LLHL, LLHH, LHLL, LHHL, LHHH, HLLL, HLHL,
         HLHH, HHLL, HHHL, HHHH
       – Signatures with low utilization of the L2 cache and high utilization
         of the TLB were not developed, namely LLLH, LHLH, HLLH, HHLH

11/20/2008                                        By Mitesh R. Meswani                                                    25
