Improving Throughput of Simultaneous
            Multithreading (SMT) Processors using
         Application Signatures and Thread Priorities

                     Mitesh R. Meswani
             University of Texas at El Paso (UTEP)



11/20/2008                By Mitesh R. Meswani          1
Simultaneous Multithreading (SMT) Utilization

 [Figure: occupancy of the execution units (FP, FX, LSU) over six processor
  cycles. Top: Single-Threaded execution, leaving many unit/cycle slots idle.
  Bottom: SMT execution, where Thread X uses otherwise unused resources but
  must wait when a shared resource is busy.
  Legend: Thread-X executing, Thread-Y executing, no thread executing.]
       SMT with two hardware threads
           • SMT hardware contexts share most of the processor resources
           • Potential of 2x throughput with perfect resource sharing
            • Throughput gains limited by contention for shared resources
11/20/2008                                  By Mitesh R. Meswani                                              2
Research Question and Hypothesis
 • SMT-performance Tunables:
       – Enable or disable SMT mode
       – Prioritize one hardware thread over the other


 • Research Question: What are the optimal priority
   settings for best processor throughput?

 • Hypothesis: Resource usage observed in Single-Threaded
   mode provides hints for choosing priority settings

11/20/2008                 By Mitesh R. Meswani          3
Dissertation Contributions
 1. Showed that prioritization of threads improves throughput:
    equal priorities (default) are not best for nearly 47% of the SPEC
    CPU2000/CPU2006, Stream, and Lmbench benchmark co-schedules

 2. Defined and captured application “signatures”, which characterize an
    application's resource usage

 3. Showed that a small set of signatures is present in real-world
    applications: 16 signatures are sufficient to represent 95.5% of the
    execution time of SPEC CPU2006 (20) benchmarks, NAS NPB3.2 Serial
    (9) benchmarks, and the PETSc KSP (119) and PETSc Matrix (180) libraries

 4. Developed a prediction methodology using microbenchmarks
    that represent signatures, and showed that the predictions have the
    potential to improve throughput: 87% of PETSc KSP co-schedules
    achieve better throughput with predicted priorities than with the default

11/20/2008                       By Mitesh R. Meswani                       4
Thread Priorities in IBM POWER5
 • Six of the eight priorities are available to the operating system
   in normal mode of operation: 1, 2, 3, 4 (default), 5, and 6
 • The difference between the hardware thread priorities controls
   decode-cycle sharing:
      Thread X      Thread Y      Priority      Thread X        Thread Y
      Priority      Priority      Difference    Decode Cycles   Decode Cycles
      6             1             5             63/64           1/64
      6             2             4             31/32           1/32
      6             3             3             15/16           1/16
      6             4             2             7/8             1/8
      6             5             1             3/4             1/4
      4 (default)   4 (default)   0             1/2             1/2
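
 The decode-cycle split above follows a simple pattern: for a positive
 priority difference d, the favored thread receives 1 - 1/2^(d+1) of the
 decode cycles and the other thread receives 1/2^(d+1). A minimal C sketch
 of this relationship, derived only from the table above rather than from
 POWER5 documentation:

   #include <stdio.h>

   /* Decode-cycle shares for a hardware-thread priority difference d,
    * matching the table above: d = 0 gives 1/2 each, d = 5 gives 63/64 vs 1/64. */
   static void decode_shares(int d, double *favored, double *other)
   {
       *other   = 1.0 / (double)(1u << (d + 1));
       *favored = 1.0 - *other;
   }

   int main(void)
   {
       for (int d = 0; d <= 5; d++) {
           double favored, other;
           decode_shares(d, &favored, &other);
           printf("priority difference %d: %.4f vs %.4f\n", d, favored, other);
       }
       return 0;
   }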

11/20/2008                          By Mitesh R. Meswani                         5
Signatures
 1. Identify Significant Resources: Floating-point unit (FPU),
    Fixed-point unit (FXU), L2 unified cache, and L2 unified TLB


 2. Capture using performance counters


 3. Define utilization levels of resources in Single-Threaded
    mode, forming a signature
      – Ten utilization levels L1 to L10 per resource
      – Example: L1L2L3L9, L9L6L7L8, L2L3L10L6…
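
 As a concrete illustration of step 3, the sketch below builds a signature
 string from four measured utilization percentages, using the ten 10%-wide
 levels defined later in the deck (L1 = 0-10%, ..., L10 = 90-100%); the
 boundary handling and helper names are assumptions, not code from the
 dissertation:

   #include <stdio.h>

   /* Map a utilization percentage (0-100) to a level 1..10, assuming the
    * 10%-wide bins from the "Methodology Overview - 1" slide. Exact
    * boundary handling is an assumption. */
   static int utilization_level(double pct)
   {
       if (pct <= 10.0) return 1;
       if (pct > 90.0)  return 10;
       return (int)((pct - 0.001) / 10.0) + 1;
   }

   /* Build a signature string such as "L1L2L3L9" from the Single-Threaded
    * utilizations of the FPU, FXU, L2 cache, and L2 TLB. */
   static void make_signature(double fpu, double fxu, double l2, double tlb,
                              char *out, size_t len)
   {
       snprintf(out, len, "L%dL%dL%dL%d",
                utilization_level(fpu), utilization_level(fxu),
                utilization_level(l2),  utilization_level(tlb));
   }

   int main(void)
   {
       char sig[16];
       make_signature(7.0, 18.0, 25.0, 88.0, sig, sizeof sig);
       printf("%s\n", sig);   /* prints L1L2L3L9 */
       return 0;
   }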



11/20/2008                    By Mitesh R. Meswani                 6
Work Flow

 Step 1: Find Signatures of Real Applications
   • In Single-Threaded mode, run each serial application and periodically
     sample its performance counters (using the chosen counter settings)
   • Store the resulting signatures in a signature database

 Step 2: Create Signature Microbenchmarks for Frequently Appearing
 Signatures and Empirically Find Priority Predictions
   • Run each signature-microbenchmark pair X, Y in SMT mode under all
     priority combinations i, j and record the CPI of each run
   • Identify the best-case priority for pair X, Y and store the prediction
     in a prediction database

 Step 3: Execute Application Pairs Using Predicted Priorities
   • For an application pair A, B, read the signatures of A and B from the
     signature database
   • If dominating signatures are found for both, read the predicted
     priorities for A and B from the prediction database and run the pair
     with those priorities in SMT mode
   • Otherwise, run pair A, B with equal priorities in SMT mode
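
 A minimal C sketch of the Step 3 decision logic: the tiny in-memory tables
 stand in for the signature and prediction databases, and the single
 prediction entry (bicg/lsqr with signatures L1L2L1L1/L1L3L1L1 mapped to
 priorities 4-6) is taken from the results table later in the deck, while
 the function and structure names are illustrative:

   #include <stdio.h>
   #include <string.h>

   /* Stand-ins for the signature database (Step 1) and the prediction
    * database (Step 2). */
   struct sig_entry  { const char *app;  const char *sig; };
   struct pred_entry { const char *sig_x, *sig_y; int prio_x, prio_y; };

   static const struct sig_entry  sig_db[]  = { { "bicg", "L1L2L1L1" },
                                                { "lsqr", "L1L3L1L1" } };
   static const struct pred_entry pred_db[] = { { "L1L2L1L1", "L1L3L1L1", 4, 6 } };

   /* Return the application's dominating signature, or NULL if none is known. */
   static const char *dominating_signature(const char *app)
   {
       for (size_t i = 0; i < sizeof sig_db / sizeof sig_db[0]; i++)
           if (strcmp(sig_db[i].app, app) == 0) return sig_db[i].sig;
       return NULL;
   }

   /* Step 3: choose priorities for application pair (a, b); fall back to the
    * default equal priorities (4, 4) when no prediction is available. */
   static void choose_priorities(const char *a, const char *b,
                                 int *prio_a, int *prio_b)
   {
       *prio_a = *prio_b = 4;
       const char *sa = dominating_signature(a), *sb = dominating_signature(b);
       if (!sa || !sb) return;
       for (size_t i = 0; i < sizeof pred_db / sizeof pred_db[0]; i++)
           if (strcmp(pred_db[i].sig_x, sa) == 0 &&
               strcmp(pred_db[i].sig_y, sb) == 0) {
               *prio_a = pred_db[i].prio_x;
               *prio_b = pred_db[i].prio_y;
               return;
           }
   }

   int main(void)
   {
       int pa, pb;
       choose_priorities("bicg", "lsqr", &pa, &pb);
       printf("run bicg/lsqr in SMT mode with priorities %d-%d\n", pa, pb);
       return 0;
   }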

11/20/2008                                            By Mitesh R. Meswani                                                    7
Details of Step 1
 • Four groups of counters were measured
 • Each group was measured in a separate run
 • Counters were sampled at one-second intervals

   [Figure: sampling timeline with samples 0-21 for each of the four runs
    (Run 1 - Run 4), starting at Interval 0]

 • The difference in execution time across the 4 runs was negligible
 • For 99% of samples, the differences in instruction and cycle counts
   across the 4 runs were negligible
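
 Per-second sampling of one counter group could look like the sketch below.
 The slides do not name the counter interface used on the POWER5 machine, so
 the old PAPI high-level calls (available in PAPI 5.x and earlier) and the
 chosen events are stand-ins, not the actual measurement code:

   #include <stdio.h>
   #include <unistd.h>
   #include <papi.h>

   int main(void)
   {
       int events[2] = { PAPI_TOT_INS, PAPI_TOT_CYC };   /* illustrative group */
       long long values[2];

       if (PAPI_start_counters(events, 2) != PAPI_OK) {
           fprintf(stderr, "could not start counters\n");
           return 1;
       }
       for (int sample = 0; sample <= 21; sample++) {    /* samples 0..21 */
           sleep(1);                                     /* one-second interval */
           PAPI_read_counters(values, 2);                /* read and reset */
           printf("sample %d: %lld instructions, %lld cycles\n",
                  sample, values[0], values[1]);
       }
       PAPI_stop_counters(values, 2);
       return 0;
   }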

11/20/2008                       By Mitesh R. Meswani                              8
Different Signatures are Present in Real Applications

 [Figure: signature histogram (% of total cycles) for four SPEC CPU2006
  benchmarks (429.mcf, 416.gamess, 444.namd, 462.libquantum) and two PETSc
  KSP library functions (cgs, gmres); the legend lists the signatures
  observed, e.g., L1L1L1L1, L1L2L1L1, L1L2L7L4, L1L1L9L5]
11/20/2008                                                 By Mitesh R. Meswani                                      9
Conclusions
 1. Showed that equal priorities (default) are not the best
    for nearly 47% of applications studied

 2. Only 16 Signatures are sufficient to represent 95.5% of
    execution time of 20 SPEC CPU2006 benchmarks, 9 NAS
    NPB3.2 Serial benchmarks, 119 PETSc KSP, and 180
    PETSc Matrix libraries

 3. Priority predictions using signature benchmarks
    improve throughput over default settings for 87% of the
    15 PETSc KSP coschedules.

11/20/2008               By Mitesh R. Meswani                 10
Applications with Multiple Signatures




11/20/2008                 By Mitesh R. Meswani      11
Future Work and References
 Future Work:
 • Identify applications with multiple signatures
 • Dynamic adaptation of priorities
 • Detecting signatures on the fly
 • Phase detection and Prediction for a truly adaptive system

 References:
 • M. R. Meswani, P. J. Teller, and S. Arunangiri, “A Study of the Influence
    of the POWER5 Dynamic Resource Balancing Hardware on Optimal
    Hardware Thread Priorities,” to appear in the Proceedings of the 2008
    Live Virtual Constructive Conference, Jan. 2009, El Paso, TX.
 • M. R. Meswani and P. J. Teller, “Evaluating the Performance Impact of
    Hardware Thread Priorities in Simultaneous Multithreaded Processors
    using SPEC CPU2000,” Proceedings of the 2nd International Workshop on
    Operating Systems Interference in High Performance Applications, held in
    conjunction with the 15th International Conference on Parallel
    Architectures and Compilation Techniques (PACT06), sponsored by ACM
    and IEEE, September 2006, Seattle, WA.
11/20/2008                     By Mitesh R. Meswani                             12
Acknowledgements
 • This work is supported by AHPCRC Grant W11NF-
   07-2-2007

 • Amir Simon (IBM), for his valuable assistance with
   fixing the firmware of the p550 machine




11/20/2008           By Mitesh R. Meswani             13
Questions?




11/20/2008    By Mitesh R. Meswani   14
EXTRA SLIDES




11/20/2008      By Mitesh R. Meswani   15
Simultaneous Multithreading (SMT)

 [Figure: SMT pipeline diagram. Per-thread resources: Program Counter-X/-Y,
  Instruction Buffer-X/-Y, and Write Back-X/-Y. Shared resources: Instruction
  TLB, Instruction Cache, Instruction Fetch, Decode, FPU, FXU, LSU, Data TLB,
  and Data Cache.
  Legend: Thread-X resource, Thread-Y resource, shared resource.]

               SMT hardware contexts share most of the processor resources.
11/20/2008                                       By Mitesh R. Meswani                                     16
Methodology Overview - 1
 1. Identify significant subset of shared resources
       – Resources Identified: L2 unified cache, L2 unified
         TLB, Floating-point unit (FPU), and Fixed-point unit (FXU)
 2. Identify and validate performance counters
 3. Define utilization levels of resources in Single-Threaded
    mode, forming a signature
       – Ten utilization levels L1 to L10 per resource: L1 is 0%-10%, L2 is
         11%-20%, …, L10 is 90%-100%
       – A signature is represented as utilization levels (L1-L10) of
         FPU, FXU, L2 cache, and L2 TLB.
       – Example: L1L2L3L9, L9L6L7L8, …
 4. An application is said to have a dominating signature if one
    signature is associated with at least 80% of the application's
    execution time
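
 A sketch of the dominating-signature test in item 4, assuming that cycles
 have already been attributed to each distinct signature over the sampled
 intervals (the data layout and names are illustrative):

   #include <stdio.h>

   /* Cycles attributed to each distinct signature observed for an application. */
   struct sig_cycles { const char *signature; long long cycles; };

   /* Return the dominating signature, i.e., the single signature covering at
    * least 80% of the application's execution time, or NULL if none exists. */
   static const char *dominating_signature(const struct sig_cycles *s, int n)
   {
       long long total = 0;
       for (int i = 0; i < n; i++) total += s[i].cycles;
       for (int i = 0; i < n; i++)
           if (s[i].cycles * 100 >= 80 * total)
               return s[i].signature;
       return NULL;
   }

   int main(void)
   {
       struct sig_cycles app[] = { { "L1L2L1L1", 900 }, { "L1L1L1L1", 100 } };
       const char *sig = dominating_signature(app, 2);
       printf("%s\n", sig ? sig : "no dominating signature");
       return 0;
   }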

11/20/2008                      By Mitesh R. Meswani                          17
Results – 2: A Small Subset of Signatures is Sufficient to
             Represent the Majority of the Execution Time of Applications
      16 Signatures are Sufficient to Represent 95.6% of the Execution Time of 20 SPEC
    CPU2006, 9 NAS NPB3.2 Serial, 119 PETSc KSP, and 180 PETSc Matrix Benchmarks

 [Pie chart: share of execution time per signature. The 16 most frequent
  signatures (L1L3L1L1, L1L1L1L1, L1L2L1L1, L2L3L1L1, L3L1L3L1, L4L1L1L1,
  L2L1L1L1, L2L2L1L1, L3L1L1L1, L1L2L2L1, L2L2L2L1, L1L2L3L1, L1L1L2L1,
  L2L1L2L1, L5L1L1L1, L3L1L2L1) individually account for between 1.4% and
  16.3% of execution time; the remaining 19 signatures ("Others") together
  account for 4.4%]
11/20/2008                                     By Mitesh R. Meswani                         18
Results – Priority Predictions using Signature
               Benchmarks can Potentially Improve Throughput
  Prediction  Thread X  Thread X    Thread Y    Thread Y   Best Case  Worst Case
                        Signature               Signature
     6-5       bicg     L1L2L1L1    bicg        L1L2L1L1      6-6        3-6
     4-6       bicg     L1L2L1L1    lsqr        L1L3L1L1      6-6        2-6
     5-6       bicg     L1L2L1L1    tcqmr       L1L1L1L1      6-2        1-6
     6-6       lsqr     L1L3L1L1    lsqr        L1L3L1L1      6-5        1-6
     5-6       lsqr     L1L3L1L1    tcqmr       L1L1L1L1      6-2        1-6
     6-5       tcqmr    L1L1L1L1    tcqmr       L1L1L1L1      6-5        3-6
     6-5       bcgs     L1L1L1L1    bcgs        L1L1L1L1      6-5        2-6
     6-5       bcgs     L1L1L1L1    bicg        L1L2L1L1      6-5        2-6
     6-5       bcgs     L1L1L1L1    cgs         L1L1L1L1      6-5        3-6
     6-5       bcgs     L1L1L1L1    chebychev   L1L1L1L1      6-1        3-6
     6-5       bcgs     L1L1L1L1    cr          L1L1L1L1      6-1        1-6
     6-5       bcgs     L1L1L1L1    gmres       L1L1L1L1      6-1        2-6
     6-5       bcgs     L1L1L1L1    lsqr        L1L3L1L1      6-5        1-6
     6-5       bcgs     L1L1L1L1    richardson  L1L1L1L1      6-1        3-6
     6-5       bcgs     L1L1L1L1    tcqmr       L1L1L1L1      6-1        1-6
                  For 15 PETSc KSP co-schedules, predicted settings
                  • improved throughput over default for 87% of co-schedules,
                  • are the best for 33% of co-schedules, and
                  • are never the worst case settings
11/20/2008                             By Mitesh R. Meswani                                  19
Signatures in Applications
 •   PETSc linear solvers
 •   Identify signatures using performance counters
 •   Results (story):
       – Using a simulator, showed that intelligent settings of
         hardware thread priorities can enhance workload
         performance
       – Critical microarchitecture resource usage “signatures” can
         be used to determine the “best” priorities
       – Different signatures exist in real-world applications and
         have been shown to be useful in enhancing utilization
         and throughput


11/20/2008                   By Mitesh R. Meswani                 20
Signatures and Application Phases

 [Figure: timeline of intervals illustrating phase transitions and
  consecutive phases]

 • Application executions are composed of multiple phases
 • For each phase in Single-Threaded mode, monitor the utilization of shared
   resources (the phase's signature)
 • Resource utilization can be used to estimate the availability of resources
   for other threads
 • Given the signatures of two threads, predict the thread priorities that
   maximize overall throughput
11/20/2008                   By Mitesh R. Meswani                        21
POWER5 Chip
 • POWER5 chip: two identical cores, each core with two SMT
   threads, a 64KB L1 ICache, a 32KB L1 DCache, a shared unified
   1.92MB L2 cache, an off-chip 36MB L3 cache, a 128-entry L1
   ITLB, a 128-entry L1 DTLB, and a 1024-entry unified L2 TLB




11/20/2008          By Mitesh R. Meswani           22
FPU and FXU Benchmark

  FPU Benchmark for Maximum Utilization (99%):
    Loop:
      fadd R0,R0,R0
      …
      fadd R31,R31,R31
      (above block copied four times)
      count++;
      branch to Loop if count<max

  FXU Benchmark for Maximum Utilization (70%):
    Loop:
      addi R0,R0,0
      …
      addi R31,R31,31
      (above block copied six times)
      count++;
      branch to Loop if count<max

 • Each benchmark runs for 100s in Single-Threaded mode
 • Data dependencies and no-ops are introduced to lower utilization levels
 • Utilization achieved:
        – FPU : 10% to 99%
        – FXU : 10% to 70%
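
 For reference, a rough C analogue of the FPU benchmark's structure is shown
 below; the real benchmark issues PowerPC fadd instructions on registers
 R0-R31 directly, so a compiled C loop like this only approximates it, and
 the utilization it reaches depends on the compiler and machine:

   #include <stdio.h>

   #define MAX_ITER 100000000L

   int main(void)
   {
       double r[32];
       for (int i = 0; i < 32; i++) r[i] = (double)i + 0.5;

       for (long count = 0; count < MAX_ITER; count++) {
           /* 32 independent additions, mirroring "fadd Ri,Ri,Ri" */
           for (int i = 0; i < 32; i++)
               r[i] = r[i] + r[i];
           /* keep the values bounded so they never overflow to infinity */
           for (int i = 0; i < 32; i++)
               if (r[i] > 1e100) r[i] = (double)i + 0.5;
       }

       double sum = 0.0;
       for (int i = 0; i < 32; i++) sum += r[i];
       printf("%f\n", sum);   /* use the results so the loop is not removed */
       return 0;
   }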
11/20/2008                             By Mitesh R. Meswani                                       23
L2 Cache and L2 TLB Benchmark

  L2 Cache Benchmark for Maximum Utilization (99%):
  1. Allocate an array bigger than the L2 cache
  2. The first element of cache line 1 points to the first element of line 4,
     which points to the first element of line 7, and so on; the stride is 3
     cache lines
  3. The main body implements pointer chasing:

     for(j=0;j<1000000;j++)
     { elem=(int *)arr[0];      // initialize to point to first element
       while(elem!=NULL)        // continue while not last line
         elem=(int *)*elem;     // load address of line + stride
     }

  L2 TLB Benchmark for Maximum Utilization (99%):
  1. Allocate an array bigger than the number of pages mapped by the TLB
     entries
  2. The first element of a page points to the first element of the next
     page; the stride is one page
  3. The main body implements pointer chasing:

     for(j=0;j<400000;j++)
     { elem=(int *)arr[0];      // initialize to point to first element
       while(elem!=NULL)        // continue while not last page
         elem=(int *)*elem;     // load address of next page
     }

 • Each benchmark runs for 100s in Single-Threaded mode
 • Repeated accesses to an element are introduced in the while loop to
   reduce utilization levels
 • Utilization was achieved in the range of 10% to 99%
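
 The pointer-chain setup described in step 2 can be sketched as follows; the
 array size, line size, and stride below are illustrative parameters rather
 than the exact values used on the POWER5 system:

   #include <stdio.h>
   #include <stdlib.h>

   int main(void)
   {
       const size_t array_bytes = 8u * 1024 * 1024;   /* larger than the L2 cache */
       const size_t stride      = 3u * 128;           /* 3 cache lines of 128B    */
       const size_t n           = array_bytes / sizeof(void *);
       const size_t step        = stride / sizeof(void *);

       void **arr = malloc(n * sizeof(void *));
       if (!arr) return 1;

       /* Element i points to element i + step; the last element in the chain
        * is NULL, terminating the chase. */
       for (size_t i = 0; i < n; i++)
           arr[i] = (i + step < n) ? (void *)&arr[i + step] : NULL;

       /* Pointer chase: every load depends on the previous one, so each
        * iteration touches a new cache line (or page, for the TLB variant). */
       for (int j = 0; j < 1000; j++) {
           void **elem = (void **)arr[0];
           while (elem != NULL)
               elem = (void **)*elem;
       }

       printf("done\n");
       free(arr);
       return 0;
   }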

11/20/2008                                         By Mitesh R. Meswani                                                   24
Multi-resource Signature Benchmark

  LLHH Benchmark:
  1. Allocate an array bigger than the number of pages mapped by the TLB
     entries
  2. The first element of a page points to the first element of the next
     page; the stride is one page
  3. The main body implements pointer chasing and a few floating-point and
     integer operations:

     for(j=0;j<390000;j++)
     { elem=(int *)arr[0];      // initialize to point to first element
       while(elem!=NULL)        // continue while not last page
       { elem=(int *)*elem;     // load address of next page
         8 floating-point additions;
         8 integer additions;
       }
     }

  HHLL Benchmark:
  1. Allocate an array bigger than the L2 cache
  2. The first element of a line points to the first element of the next
     line; the stride is one cache line
  3. The main body consists of floating-point and integer operations and
     pointer chasing:

     for(j=0;j<9000;j++)
     { elem=(int *)arr[0];      // initialize to point to first element
       while(elem!=NULL)        // continue while not last line
       { 168 floating-point additions;
         168 integer additions;
         elem=(int *)*elem;     // load address of next line
       }
     }

 •   The loop body varies the number of FPU and FXU operations and the
     stride of accesses to achieve the desired signature
 •   Each benchmark runs for 100s in Single-Threaded mode
 •   A total of 12 of the 16 possible signatures were developed:
       – Signatures developed: LLLL, LLHL, LLHH, LHLL, LHHL, LHHH, HLLL, HLHL,
         HLHH, HHLL, HHHL, HHHH
       – Signatures with low utilization of the L2 cache and high utilization
         of the TLB were not developed, namely LLLH, LHLH, HLLH, HHLH

11/20/2008                                        By Mitesh R. Meswani                                                    25
