Revisiting Power5/Power5+
 Performance Differences


          Mike Page
          ScicomP 14
    Poughkeepsie, New York
         May 23, 2008
     NCAR/CISL/HSS/CSG
   Consulting Services Group
       mpage@ucar.edu
Model Performance: Bluevista vs. Blueice
MODEL       PROCS       BV        BL  FAST  100*BL/BV  DIFF %
cam_waccm     256     5.61      5.96  BL       106.23   -6.23
cam_waccm     128     3.34      3.12  BV        93.41    6.58
cam_waccm      64     2.13      2.13  SAME        100       0
cam_waccm      32     1.13      1.15  BL       101.76   -1.76
POP           128    22.29     21.67  BL        97.21   -2.78
POP            64    34.96     36.24  BV       103.66    3.66
POP            48     42.2     43.76  BV       103.69    3.69
POP            32     56.8     65.34  BV       115.03   15.03
POP            24    71.83     79.96  BV       111.31   11.31
POP            16   103.96    112.75  BV       108.45    8.45
POP             8   197.11    231.32  BV       117.35   17.35
hd3D          128   0.1857    0.2188  BV       117.82   17.82
hd3D           64   0.3046    0.3961  BV       130.03   30.03
hd3D           32   0.5261    0.5655  BV       107.48    7.48
hd3D           16   0.9408    0.9459  BV       100.54    0.54
hd3D            8   1.7917    1.7028  BL        95.03   -4.96
WRF           256    0.158    0.1466  BL        92.78   -7.21
WRF           128   0.2316     0.239  BV       103.19    3.19
WRF            64   0.3849    0.3934  BV       102.20    2.20
WRF            32   0.6736    0.7274  BV       107.98    7.98
WRF            16   1.3584     1.407  BV       103.57    3.57
WRF             8   2.6308    2.5783  BL        98.00   -1.99
WRF             4   4.9209    4.8376  BL        98.30   -1.69
WRF             2   9.8538    9.5474  BL        96.89   -3.10
WRF             1  19.8319   17.6361  BL        88.92  -11.07
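The 100*BL/BV and DIFF % columns can be recomputed from the BV and BL values. The sketch below is illustrative only: the function name `compare` is hypothetical, the three rows are copied from the table, and the sign convention (positive DIFF % when BL is larger than BV) is the one the POP, hd3D and WRF rows follow. The table's values appear to be truncated rather than rounded, so the last digit may differ by one.

```python
# Values copied from the table: (model, procs, BV, BL).
rows = [
    ("POP", 64, 34.96, 36.24),
    ("hd3D", 64, 0.3046, 0.3961),
    ("WRF", 1, 19.8319, 17.6361),
]


def compare(bv, bl):
    """Return (100*BL/BV, DIFF %) for one pair of benchmark values."""
    ratio = 100.0 * bl / bv
    return ratio, ratio - 100.0


results = {(model, procs): compare(bv, bl) for model, procs, bv, bl in rows}

for (model, procs), (ratio, diff) in results.items():
    print(f"{model:5s} {procs:4d}  100*BL/BV = {ratio:7.2f}  DIFF % = {diff:+6.2f}")
```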
A Graphical Look at the Data:
[Figure: four panels (CAM_WACCM, POP, HD3D, WRF), each plotting ICESS Benchmark Time (sec) against Processor Count for Blueice and Bluevista.]
A Graphical Look at the Data:
[Figure: Blueice/Bluevista performance. Ratio of ICESS Benchmark Times (0.800 to 1.400) against Processor Count (0 to 256) for cam_waccm, POP, hd3D and WRF, with an "Equal" line at 1.000. Points above the line are labeled "Bluevista Faster", points below it "Blueice Faster".]
Huang/Ghosh
• Analysis of POP performance
   • Varied run configuration (ptile) for POP
• Conclusions
   • POP performance on Blueice improves by 13% when nodes are undersubscribed
       • Undersubscription uses only 8 of 16 processors on a Blueice node
       • Undersubscription avoids sharing L3 cache
   • POP performance on Blueice exceeds that on Bluevista if Blueice nodes are undersubscribed
   • The POP performance difference between Blueice and fully subscribed Bluevista is mainly due to L2 cache misses
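As a rough illustration of what undersubscription costs in nodes, the sketch below counts the nodes a job occupies for a given ptile. It assumes ptile means MPI tasks per node (as in an LSF `span[ptile=...]` request), which the slides do not spell out; `nodes_needed` and the 64-task job size are hypothetical choices for the example.

```python
import math

CORES_PER_NODE = 16  # processors on a Blueice node, per the slide


def nodes_needed(tasks, ptile):
    """Nodes occupied when at most `ptile` tasks are placed on each node."""
    if not 1 <= ptile <= CORES_PER_NODE:
        raise ValueError("ptile must be between 1 and the cores per node")
    return math.ceil(tasks / ptile)


# A 64-task run: fully subscribed (16 per node) vs. undersubscribed (8 per node).
layouts = {ptile: nodes_needed(64, ptile) for ptile in (16, 8)}

for ptile, nodes in layouts.items():
    idle = nodes * CORES_PER_NODE - 64
    print(f"ptile={ptile:2d}: {nodes} nodes, {idle} processors left idle")
```

Halving ptile doubles the node count for the same task count, which is the price paid for not sharing the L3 cache.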
But here’s what caught my interest!
[Slide: the Blueice/Bluevista ratio chart (ratio of ICESS benchmark times against processor count) shown beside the "Model Performance: Bluevista vs. Blueice" table, both repeated from the earlier slides.]

hd3D shows the largest performance variation of all the apps in the ICESS suite.
hd3D needed to be studied too
With the new Power5+ system and an AIX upgrade, there were many new factors that could affect performance:
   • SMT
   • Varied page sizes
   • Processor Binding
hd3D Details
HD3D is a pseudospectral three-dimensional periodic hydrodynamic/magnetohydrodynamic/Hall-MHD turbulence model.

The results presented here are derived from a numerical solution of the incompressible Navier-Stokes equations in 3 dimensions with periodic boundary conditions on a 256 x 256 x 256 grid.

hd3D uses a pseudospectral method to compute spatial derivatives, while an adjustable-order Runge-Kutta method evolves the system in time.

This benchmark does a free-decay simulation of Taylor-Green vortices.
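To make "pseudospectral derivative" concrete, here is a minimal one-dimensional sketch. It is not the hd3D code: the function names are hypothetical, and a naive O(n^2) DFT stands in for the FFT a production code would use. The idea is to transform periodic samples to Fourier space, multiply each mode by i*k, and transform back.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive discrete Fourier transform; the inverse divides by n."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = 0j
        for m, v in enumerate(values):
            total += v * cmath.exp(sign * 2j * math.pi * m * k / n)
        out.append(total / n if inverse else total)
    return out


def spectral_derivative(samples):
    """Differentiate a 2*pi-periodic function sampled at n evenly spaced points."""
    n = len(samples)
    coeffs = dft(samples)
    for k in range(n):
        # Map the array index to a signed wavenumber.
        wavenumber = k if k < n / 2 else k - n
        # The Nyquist mode has no well-defined derivative on an even grid.
        if n % 2 == 0 and k == n // 2:
            wavenumber = 0
        coeffs[k] *= 1j * wavenumber
    return [c.real for c in dft(coeffs, inverse=True)]


n = 16
xs = [2 * math.pi * m / n for m in range(n)]
du = spectral_derivative([math.sin(x) for x in xs])
max_err = max(abs(d - math.cos(x)) for d, x in zip(du, xs))
print(f"max error against cos(x): {max_err:.2e}")
```

For a smooth periodic input such as sin(x) the result matches the exact derivative to roundoff, which is why spectral methods suit periodic turbulence boxes like the 256 x 256 x 256 grid used here.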
A Closer Look at hd3D - Bluevista runs
[Figure: Bluevista, single core, private L3 (smt degrades performance). ICESS Benchmark Time (sec) against Processor Count (32 to 128); series: non-smt, smt, Huang_Ghosh.]
A Closer Look at hd3D - Blueice runs
[Figure: Blueice HD3D, shared L3 (smt degrades performance). ICESS Benchmark (sec) against Processor Count (32 to 128); series: non-smt, smt, Huang_Ghosh.]
Huang/Ghosh Run Configurations
Model Performance: Bluevista vs. Blueice
MODEL       PROCS       BV        BL  SMT?           FAST  DIFF %
cam_waccm     128     3.34      3.12  2TPP (16 OMP)  BV      6.58
cam_waccm      64     2.13      2.13  2TPP (16 OMP)  SAME       0
cam_waccm      32     1.13      1.15  2TPP (16 OMP)  BL     -1.76
POP           128    22.29     21.67  1TPP           BL     -2.78
POP            64    34.96     36.24  2TPP           BV      3.66
POP            48     42.2     43.76  2TPP           BV      3.69
POP            32     56.8     65.34  2TPP           BV     15.03
hd3D          128   0.1857    0.2188  1TPP           BV     17.82
hd3D           64   0.3046    0.3961  1TPP           BV     30.03
hd3D           32   0.5261    0.5655  1TPP           BV      7.48
WRF           128   0.2316     0.239  2TPP           BV      3.19
WRF            64   0.3849    0.3934  2TPP           BV      2.20
WRF            32   0.6736    0.7274  2TPP           BV      7.98

Give up on SMT for hd3D; look at shared L3 effects.
A Closer Look at hd3D - Blueice runs
[Figure: Blueice HD3D (no smt), private L3 improves performance. ICESS Benchmark (sec) against Processor Count (32 to 128); series: Private L3, Shared L3, Huang/Ghosh.]

This supports the conclusion that undersubscribing Blueice nodes, which makes the L3 cache private, improves Blueice performance.
A Closer Look at hd3D
[Figure: Bluevista (no smt) vs. Blueice (no smt, private L3). ICESS Benchmark Time (sec) against Processor Count (32 to 128); series: Bluevista, Blueice.]

    • Undersubscribing Blueice improves hd3D performance. It is close to, but still slower than, Bluevista.
    • For hd3D, Bluevista still outperforms Blueice, with a ~6% difference for 32 and 64 processors and ~2% for 128.
    • While Blueice POP is 13% faster than Bluevista POP on 16 logical(?) cpus, hd3D shows the opposite behavior.
A Closer Look at hd3D - memory
[Figure: HD3D Memory Footprint. Mem Req per Task (Mb), axis range 20 to 70, against Processor Count (32 to 128).]
A Closer Look at hd3D
Two issues to investigate:
  - 1TPP/2TPP differences
    [Figure repeated: Bluevista, single core, private L3 (smt degrades performance); series: non-smt, smt, Huang_Ghosh.]
  - Blueice/Bluevista 1TPP differences
    [Figure repeated: Bluevista (no smt) vs. Blueice (no smt, private L3).]
A Closer Look at hd3D
           CPI Breakdown Analysis
• Uses multiple Hardware Performance Counters on the processor to:
    • Track processor cycles required to complete a given workload
        • hd3D computational kernel with hpmcount API
    • Track events in processor core
    • Track events in the memory subsystem
• 17 counters required for Power5/Power5+ CPI Breakdown
      PM_IOPS_CMPL              PM_CMPLU_STALL_LSU
      PM_INST_CMPL              PM_CMPLU_STALL_REJECT
      PM_RUN_CYC                PM_CMPLU_STALL_DCACHE_MISS
      PM_GRP_CMPL               PM_CMPLU_STALL_ERAT_MISS
      PM_GCT_NOSLOT_CYC         PM_CMPLU_STALL_FXU
      PM_GCT_NOSLOT_IC_MISS     PM_CMPLU_STALL_DIV
      PM_GCT_NOSLOT_SRQ_FULL    PM_CMPLU_STALL_FDIV
      PM_GCT_NOSLOT_BR_MPRED    PM_CMPLU_STALL_FPU
      PM_1PLUS_PPC_CMPL
A Closer Look at hd3D
    2TPP/1TPP performance differences
[Figure: Bluevista HPM Counters, hd3D on 32 cpus. Bar chart of raw counts (0.0E+00 to 3.0E+12) for the 17 CPI-breakdown counters; series: nosmt, smt.]
A Closer Look at hd3D
 2TPP/1TPP performance differences
Ratio of smt counters to non-smt counters - Bluevista
     Ratios: smt/nosmt               32 tasks       64 tasks       128 tasks
     PM_IOPS_CMPL                            0.94           0.96            1.02
     PM_INST_CMPL                            0.94           0.96            1.02
     PM_RUN_CYC                              1.67           1.66            1.67
     PM_GRP_CMPL                             0.94           0.95            1.03
     PM_GCT_NOSLOT_CYC                       3.21           2.72            2.94
     PM_GCT_NOSLOT_IC_MISS                   1.61           1.74            1.71
     PM_GCT_NOSLOT_SRQ_FULL               5101.47        2380.99         470.68
     PM_GCT_NOSLOT_BR_MPRED                  3.67           3.27            3.78
     PM_CMPLU_STALL_LSU                      2.48           2.62            2.28
     PM_CMPLU_STALL_REJECT                   2.82           3.90            3.21
     PM_CMPLU_STALL_DCACHE_MISS              2.42           2.36            2.21
     PM_CMPLU_STALL_ERAT_MISS                2.32           2.67            3.06
     PM_CMPLU_STALL_FXU                      1.43           1.48            1.61
     PM_CMPLU_STALL_DIV                      1.13           1.20            1.17
     PM_CMPLU_STALL_FDIV                     1.15           1.30            1.11
     PM_CMPLU_STALL_FPU                      2.11           2.15            2.05


     pmlist -d -c 2,244
     PM_LSU_SRQ_FULL_CYC,Cycles SRQ full
      Cycles the Store Request Queue is full.
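One way to read the ratio table is to ask which counters grew faster than run cycles did under SMT. The sketch below does that for the 32-task column. Normalizing by PM_RUN_CYC is an illustrative choice for this example, not a method taken from the slides, and `rank_by_growth` is a hypothetical helper.

```python
# smt/nosmt ratios for 32 tasks, copied from the table.
ratios_32 = {
    "PM_IOPS_CMPL": 0.94,
    "PM_INST_CMPL": 0.94,
    "PM_RUN_CYC": 1.67,
    "PM_GRP_CMPL": 0.94,
    "PM_GCT_NOSLOT_CYC": 3.21,
    "PM_GCT_NOSLOT_IC_MISS": 1.61,
    "PM_GCT_NOSLOT_SRQ_FULL": 5101.47,
    "PM_GCT_NOSLOT_BR_MPRED": 3.67,
    "PM_CMPLU_STALL_LSU": 2.48,
    "PM_CMPLU_STALL_REJECT": 2.82,
    "PM_CMPLU_STALL_DCACHE_MISS": 2.42,
    "PM_CMPLU_STALL_ERAT_MISS": 2.32,
    "PM_CMPLU_STALL_FXU": 1.43,
    "PM_CMPLU_STALL_DIV": 1.13,
    "PM_CMPLU_STALL_FDIV": 1.15,
    "PM_CMPLU_STALL_FPU": 2.11,
}


def rank_by_growth(ratios, baseline="PM_RUN_CYC"):
    """Sort counters by their smt/nosmt ratio relative to the baseline counter."""
    base = ratios[baseline]
    relative = {name: value / base for name, value in ratios.items() if name != baseline}
    return sorted(relative.items(), key=lambda item: item[1], reverse=True)


ranked = rank_by_growth(ratios_32)
for name, growth in ranked[:5]:
    print(f"{name:28s} {growth:10.2f}x the growth in run cycles")
```

A large ratio does not imply a large share of cycles: PM_GCT_NOSLOT_SRQ_FULL tops this ranking, yet it accounts for 0.000 of run cycles in the CPI breakdown.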
A Closer Look at hd3D
2TPP/1TPP performance differences
hd3D Bluevista                                                 nosmt_32   smt_32
                       T           PM_RUN_CYC
Completed(A)           A           PM_GRP_CMPL                    0.373    0.210
Completion Cycles(A1)  A1          PM_1PLUS_PPC_CMPL              0.355    0.200
Completed(A1A)         A1A         PM_INST_CMPL/5                 0.251    0.142
Overhead(A1B)          A1-A1A                                     0.104    0.058
Overhead(A2)           A-A1                                       0.018    0.011
Empty(B)               B           PM_GCT_NOSLOT_CYC              0.025    0.048
Miss(B1)               B1          PM_GCT_NOSLOT_IC_MISS          0.003    0.003
Mispredict(B2)         B2          PM_GCT_NOSLOT_BR_MPRED         0.012    0.027
                       B3          PM_GCT_NOSLOT_SRQ_FULL         0.000    0.000
                       B-B1-B2-B3                                 0.009    0.017
T-A-B                  C                                          0.602    0.742
                       C1          PM_CMPLU_STALL_LSU             0.260    0.387
                       C1A         PM_CMPLU_STALL_REJECT          0.086    0.146
Miss(C1A1)             C1A1        PM_CMPLU_STALL_ERAT_MISS       0.022    0.031
                       C1A-C1A1                                   0.064    0.115
Miss(C1B)              C1B         PM_CMPLU_STALL_DCACHE_MISS     0.088    0.128
                       C1-C1A-C1B                                 0.428    0.469
                       C2          PM_CMPLU_STALL_FXU             0.057    0.049
                       C2A         PM_CMPLU_STALL_DIV             0.015    0.010
                       C2-C2A                                     0.042    0.039
                       C3          PM_CMPLU_STALL_FPU             0.205    0.260
                       C3A         PM_CMPLU_STALL_FDIV            0.046    0.032
                       C3-C3A                                     0.159    0.229
                       C-C1-C2-C3                                 0.254    0.287
                       CPI                                        3.480    3.638
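The top-level derived rows of the breakdown follow from the measured ones by subtraction. The sketch below recomputes A1-A1A, A-A1 and T-A-B for both columns. It assumes every value is a fraction of PM_RUN_CYC (so T = 1), which is what the T-A-B row implies but the slide does not state; `derive` is a hypothetical helper.

```python
# Measured fractions copied from the table (hd3D on Bluevista, 32 tasks).
measured = {
    "nosmt_32": {"A": 0.373, "A1": 0.355, "A1A": 0.251, "B": 0.025},
    "smt_32": {"A": 0.210, "A1": 0.200, "A1A": 0.142, "B": 0.048},
}


def derive(fractions):
    """Compute the rows the table labels A1-A1A, A-A1 and T-A-B."""
    total = 1.0  # assumption: values are already normalized by PM_RUN_CYC
    return {
        "A1B": fractions["A1"] - fractions["A1A"],       # completion overhead
        "A2": fractions["A"] - fractions["A1"],          # group overhead
        "C": total - fractions["A"] - fractions["B"],    # completion stall cycles
    }


derived = {run: derive(values) for run, values in measured.items()}

for run, rows in derived.items():
    print(run, {name: round(value, 3) for name, value in rows.items()})
```

The stall share C rises from about 0.60 without SMT to about 0.74 with it, consistent with the slide's observation that SMT degrades hd3D performance.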
A Closer Look at hd3D
     Blueice/Bluevista 1TPP differences
[Figure: Bv and Bl HPM Counters, nosmt_32, private L3. Bar chart of raw counts (0.E+00 to 3.E+12) for the CPI-breakdown counters; series: Bluevista, Blueice.]
A Closer Look at hd3D
 Blueice/Bluevista 1TPP differences
[Figure: Ratio of PM Counters. Radar chart, radial scale 0.000 to 6.000, with one axis per counter: PM_IOPS_CMPL, PM_INST_CMPL, PM_RUN_CYC, PM_GRP_CMPL, PM_GCT_NOSLOT_CYC, PM_GCT_NOSLOT_IC_MISS, PM_GCT_NOSLOT_SRQ_FULL, PM_GCT_NOSLOT_BR_MPRED, PM_CMPLU_STALL_LSU, PM_CMPLU_STALL_REJECT, PM_CMPLU_STALL_DCACHE_MISS, PM_CMPLU_STALL_ERAT_MISS, PM_CMPLU_STALL_FXU, PM_CMPLU_STALL_DIV, PM_CMPLU_STALL_FDIV, PM_CMPLU_STALL_FPU.]

pmlist -p POWER5 -d -c 4,8
PM_CMPLU_STALL_ERAT_MISS,Completion stall caused by ERAT miss
  Following a completion stall (any period when no groups completed) the last instruction to finish before
completion resumes suffered an ERAT miss. This is a subset of PM_CMPLU_STALL_REJECT.
A Closer Look at hd3D
Blueice/Bluevista 1TPP differences
hd3D Bluevista                                                 nosmt_32   smt_32
                       T           PM_RUN_CYC
Completed(A)           A           PM_GRP_CMPL                    0.373    0.210
Completion Cycles(A1)  A1          PM_1PLUS_PPC_CMPL              0.355    0.200
Completed(A1A)         A1A         PM_INST_CMPL/5                 0.251    0.142
Overhead(A1B)          A1-A1A                                     0.104    0.058
Overhead(A2)           A-A1                                       0.018    0.011
Empty(B)               B           PM_GCT_NOSLOT_CYC              0.025    0.048
Miss(B1)               B1          PM_GCT_NOSLOT_IC_MISS          0.003    0.003
Mispredict(B2)         B2          PM_GCT_NOSLOT_BR_MPRED         0.012    0.027
                       B3          PM_GCT_NOSLOT_SRQ_FULL         0.000    0.000
                       B-B1-B2-B3                                 0.009    0.017
T-A-B                  C                                          0.602    0.742
                       C1          PM_CMPLU_STALL_LSU             0.260    0.387
                       C1A         PM_CMPLU_STALL_REJECT          0.086    0.146
Miss(C1A1)             C1A1        PM_CMPLU_STALL_ERAT_MISS       0.022    0.031
                       C1A-C1A1                                   0.064    0.115
Miss(C1B)              C1B         PM_CMPLU_STALL_DCACHE_MISS     0.088    0.128
                       C1-C1A-C1B                                 0.428    0.469
                       C2          PM_CMPLU_STALL_FXU             0.057    0.049
                       C2A         PM_CMPLU_STALL_DIV             0.015    0.010
                       C2-C2A                                     0.042    0.039
                       C3          PM_CMPLU_STALL_FPU             0.205    0.260
                       C3A         PM_CMPLU_STALL_FDIV            0.046    0.032
                       C3-C3A                                     0.159    0.229
                       C-C1-C2-C3                                 0.254    0.287
                       CPI                                        3.480    3.638
Conclusions
             > Nothing new <
          Sharing cache can degrade performance
But lots of questions remain:
• Gathering and processing the data from performance counters was extremely tedious.
     • Is there an easier way?
     • Does difficulty increase exponentially with level of detail?
• Are the Power5/Power5+ performance counters accurate?
     • Some say not
     • Eyerman et al. (ASPLOS, 2004)
• What do the counters mean?
     • Are there expanded references besides pmlist?
• What is an ERAT miss?
     • What does it say about code performance?
• Will ACTC tools give more info than what's available via pmlist?
?

Editor's Notes

  1. Did not use hpmcount because focus was on a particular code kernel.