SlideShare a Scribd company logo
1 of 28
1.6
                       Minimum TLP   Optimal TLP
Normalized IPC


                 1.2


                 0.8


                 0.4
C C C              C C C
 L1   L1   L1       L1   L1   L1



       On Chip Network


 L2         L2     L2         L2
DRAM       DRAM   DRAM    DRAM
Application

Kernel    ...     Kernel      ...   Kernel


     CTA
           …       CTA
                           …    CTA

   Warp                         Warp




                                         Thread
                   …
Active Time
Ratio (RACT)

   0%
 100%



  20%
  40%
  80%
  60%
IPC       lat_rt       util
1.4

1.2                               AES

 1

0.8

0.6

0.4

0.2

 0
      1         2            3          4
IPC           lat_rt       util
1.4

1.2                                   CP

 1

0.8

0.6

0.4

0.2

 0
      1   2     3   4       5    6     7    8
IPC           lat_rt       util
1.4

1.2                                   JPEG

 1

0.8

0.6

0.4

0.2

 0
      1   2     3   4       5    6     7     8
1.05
           IPC           lat_rt       util
                                             MM
  1

0.95

 0.9

0.85

 0.8

0.75

 0.7
       1   2     3   4       5    6    7     8
1.4
Normalized IPC   1.2
                   1
                 0.8
                 0.6
                 0.4
                 0.2                       AES        MM
                   0                       JPEG       CP
                       1   2   3   4   5    6     7    8

                               Number of CTAs
1.4
Normalized latency   1.2
                       1
                     0.8
                     0.6
                     0.4
                     0.2                       AES        MM
                       0                       JPEG       CP
                           1   2   3   4   5    6     7    8

                                   Number of CTAs
1.4
Normalized core   1.2
  utilization       1
                  0.8
                  0.6
                  0.4
                  0.2                       AES        MM
                    0                       JPEG       CP
                        1   2   3   4   5     6    7    8

                                Number of CTAs
CTA 1
              CTA 2
Core 1
                CTA 3           Idle
         CTA 4



           CTA 5
                CTA 6
Core 2




                        CTA 7
              CTA 8
Core 1
         CTA 1          CTA 6
           CTA 2                CTA 7




                                        I
Core 2




             CTA 3              CTA 8   d
         CTA 4       CTA 5
                                        l
                                        e
C_idle                           C_mem
H       L                     L     M     H
                          L
                                - -
                C_stall
                          M
                              - -
                          H
                              -
: Increment n
: Decrement n
1
Relative performance
                       0.9

                       0.8

                       0.7                                Type1
                       0.6                                Type2
                       0.5
                              0    4    8    12   16

                             Number of cores turned off
1
Relative performance
                       0.9

                       0.8

                       0.7                                Type1
                       0.6                                Type2
                       0.5
                              0    4    8    12   16

                             Number of cores turned off
1.6
                       DYNCTA   Optimal TLP
Normalized IPC

                 1.4

                 1.2

                  1

                 0.8
1.6
                       TL   DYNCTA   Optimal TLP
Normalized IPC

                 1.4

                 1.2

                  1

                 0.8
1.6
                         Round Trip Fetch Latency   Core Utilization
Normalized Value

                   1.2

                   0.8

                   0.4

                    0
Average n   N   opt   RACT
                 4.5                         1




                                                   Active time ratio (RACT)
                  4                          0.9
Number of CTAs


                 3.5                         0.8
                  3                          0.7
                                             0.6
                 2.5
                                             0.5
                  2
                                             0.4
                 1.5                         0.3
                  1                          0.2
                 0.5                         0.1
                  0                          0
Average n   N   opt   RACT
                 9                         1




                                                 Active time ratio (RACT)
                 8                         0.9
Number of CTAs


                 7                         0.8
                 6                         0.7
                                           0.6
                 5
                                           0.5
                 4
                                           0.4
                 3                         0.3
                 2                         0.2
                 1                         0.1
                 0                         0
Average n   N   opt   RACT
                 7                         1




                                                 Active time ratio (RACT)
                                           0.9
                 6
Number of CTAs


                                           0.8
                 5                         0.7
                 4                         0.6
                                           0.5
                 3                         0.4
                 2                         0.3
                                           0.2
                 1
                                           0.1
                 0                         0
Average n   N   opt   RACT
                 9                         1




                                                 Active time ratio (RACT)
                 8                         0.9
Number of CTAs


                 7                         0.8
                 6                         0.7
                                           0.6
                 5
                                           0.5
                 4
                                           0.4
                 3                         0.3
                 2                         0.2
                 1                         0.1
                 0                         0
Average number of
      CTAs




0
    2
        4
            6
                8
                N
                DYNCTA
                Optimal
1.4
                         DYNCTA   DYNCORE with power gating   DYNCORE without power gating
Normalized Power

                   1.2
                    1
                   0.8
                   0.6
                   0.4
3
                              DYNCTA    DYNCORE with power gating   DYNCORE without power gating
                        2.9
Normalized Energy

                2.6
                                  2.6
                2.2
   Efficiency


                1.8
                1.4
                    1
                0.6
IPC         power         energy
                   2
Normalized Value



           1.5

                   1

           0.5

                   0
                        4          8        12           16
                            Number of Cores Turned Off
IPC        power      energy
                   2
Normalized Value



           1.5

                   1

           0.5

                   0
                             64              121
                       Number of Nodes in the System

More Related Content

Similar to Figures src1

14157565 embedded-programming
14157565 embedded-programming14157565 embedded-programming
14157565 embedded-programmingPRADEEP
 
HPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialHPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialJeff Larkin
 
B tech Final Year Projects & Embedded Systems Training
B tech Final Year Projects & Embedded Systems Training B tech Final Year Projects & Embedded Systems Training
B tech Final Year Projects & Embedded Systems Training Technogroovy India
 
Hardware assited x86 emulation on godson 3
Hardware assited x86 emulation on godson 3Hardware assited x86 emulation on godson 3
Hardware assited x86 emulation on godson 3Takuya ASADA
 
Performance and Availability Tradeoffs in Replicated File Systems
Performance and Availability Tradeoffs in Replicated File SystemsPerformance and Availability Tradeoffs in Replicated File Systems
Performance and Availability Tradeoffs in Replicated File Systemspeterhoneyman
 
メニーコアプロセッサにおけるコア間通信レイテンシ隠蔽手法の検討 @IPSJ72
メニーコアプロセッサにおけるコア間通信レイテンシ隠蔽手法の検討 @IPSJ72メニーコアプロセッサにおけるコア間通信レイテンシ隠蔽手法の検討 @IPSJ72
メニーコアプロセッサにおけるコア間通信レイテンシ隠蔽手法の検討 @IPSJ72Shinya Takamaeda-Y
 
Jag Trasgo Helsinki091002
Jag Trasgo Helsinki091002Jag Trasgo Helsinki091002
Jag Trasgo Helsinki091002Miguel Morales
 
Mexico 3070 user group meeting 2012 test coverage john
Mexico 3070 user group meeting 2012  test coverage johnMexico 3070 user group meeting 2012  test coverage john
Mexico 3070 user group meeting 2012 test coverage johnInterlatin
 
Parallel and Distributed Computing on Low Latency Clusters
Parallel and Distributed Computing on Low Latency ClustersParallel and Distributed Computing on Low Latency Clusters
Parallel and Distributed Computing on Low Latency ClustersVittorio Giovara
 
Day1 02.Introduction
Day1 02.IntroductionDay1 02.Introduction
Day1 02.IntroductionRon Liu
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Jeff Larkin
 
8051microcontroller
8051microcontroller 8051microcontroller
8051microcontroller manish080
 
Dynamo 100107092845-phpapp02
Dynamo 100107092845-phpapp02Dynamo 100107092845-phpapp02
Dynamo 100107092845-phpapp02Takefumi MIYOSHI
 

Similar to Figures src1 (16)

14157565 embedded-programming
14157565 embedded-programming14157565 embedded-programming
14157565 embedded-programming
 
8051 MICROCONTROLLER
8051 MICROCONTROLLER 8051 MICROCONTROLLER
8051 MICROCONTROLLER
 
HPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialHPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorial
 
9th ICCS Noordwijkerhout
9th ICCS Noordwijkerhout9th ICCS Noordwijkerhout
9th ICCS Noordwijkerhout
 
B tech Final Year Projects & Embedded Systems Training
B tech Final Year Projects & Embedded Systems Training B tech Final Year Projects & Embedded Systems Training
B tech Final Year Projects & Embedded Systems Training
 
At mega16 summary
At mega16 summaryAt mega16 summary
At mega16 summary
 
Hardware assited x86 emulation on godson 3
Hardware assited x86 emulation on godson 3Hardware assited x86 emulation on godson 3
Hardware assited x86 emulation on godson 3
 
Performance and Availability Tradeoffs in Replicated File Systems
Performance and Availability Tradeoffs in Replicated File SystemsPerformance and Availability Tradeoffs in Replicated File Systems
Performance and Availability Tradeoffs in Replicated File Systems
 
メニーコアプロセッサにおけるコア間通信レイテンシ隠蔽手法の検討 @IPSJ72
メニーコアプロセッサにおけるコア間通信レイテンシ隠蔽手法の検討 @IPSJ72メニーコアプロセッサにおけるコア間通信レイテンシ隠蔽手法の検討 @IPSJ72
メニーコアプロセッサにおけるコア間通信レイテンシ隠蔽手法の検討 @IPSJ72
 
Jag Trasgo Helsinki091002
Jag Trasgo Helsinki091002Jag Trasgo Helsinki091002
Jag Trasgo Helsinki091002
 
Mexico 3070 user group meeting 2012 test coverage john
Mexico 3070 user group meeting 2012  test coverage johnMexico 3070 user group meeting 2012  test coverage john
Mexico 3070 user group meeting 2012 test coverage john
 
Parallel and Distributed Computing on Low Latency Clusters
Parallel and Distributed Computing on Low Latency ClustersParallel and Distributed Computing on Low Latency Clusters
Parallel and Distributed Computing on Low Latency Clusters
 
Day1 02.Introduction
Day1 02.IntroductionDay1 02.Introduction
Day1 02.Introduction
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
 
8051microcontroller
8051microcontroller 8051microcontroller
8051microcontroller
 
Dynamo 100107092845-phpapp02
Dynamo 100107092845-phpapp02Dynamo 100107092845-phpapp02
Dynamo 100107092845-phpapp02
 

Figures src1

  • 1. 1.6 Minimum TLP Optimal TLP Normalized IPC 1.2 0.8 0.4
  • 2. C C C C C C L1 L1 L1 L1 L1 L1 On Chip Network L2 L2 L2 L2 DRAM DRAM DRAM DRAM
  • 3. Application Kernel ... Kernel ... Kernel CTA … CTA … CTA Warp Warp Thread …
  • 4. Active Time Ratio (RACT) 0% 100% 20% 40% 80% 60%
  • 5. IPC lat_rt util 1.4 1.2 AES 1 0.8 0.6 0.4 0.2 0 1 2 3 4
  • 6. IPC lat_rt util 1.4 1.2 CP 1 0.8 0.6 0.4 0.2 0 1 2 3 4 5 6 7 8
  • 7. IPC lat_rt util 1.4 1.2 JPEG 1 0.8 0.6 0.4 0.2 0 1 2 3 4 5 6 7 8
  • 8. 1.05 IPC lat_rt util MM 1 0.95 0.9 0.85 0.8 0.75 0.7 1 2 3 4 5 6 7 8
  • 9. 1.4 Normalized IPC 1.2 1 0.8 0.6 0.4 0.2 AES MM 0 JPEG CP 1 2 3 4 5 6 7 8 Number of CTAs
  • 10. 1.4 Normalized latency 1.2 1 0.8 0.6 0.4 0.2 AES MM 0 JPEG CP 1 2 3 4 5 6 7 8 Number of CTAs
  • 11. 1.4 Normalized core 1.2 utilization 1 0.8 0.6 0.4 0.2 AES MM 0 JPEG CP 1 2 3 4 5 6 7 8 Number of CTAs
  • 12. CTA 1 CTA 2 Core 1 CTA 3 Idle CTA 4 CTA 5 CTA 6 Core 2 CTA 7 CTA 8
  • 13. Core 1 CTA 1 CTA 6 CTA 2 CTA 7 I Core 2 CTA 3 CTA 8 d CTA 4 CTA 5 l e
  • 14. C_idle C_mem H L L M H L - - C_stall M - - H - : Increment n : Decrement n
  • 15. 1 Relative performance 0.9 0.8 0.7 Type1 0.6 Type2 0.5 0 4 8 12 16 Number of cores turned off
  • 16. 1 Relative performance 0.9 0.8 0.7 Type1 0.6 Type2 0.5 0 4 8 12 16 Number of cores turned off
  • 17. 1.6 DYNCTA Optimal TLP Normalized IPC 1.4 1.2 1 0.8
  • 18. 1.6 TL DYNCTA Optimal TLP Normalized IPC 1.4 1.2 1 0.8
  • 19. 1.6 Round Trip Fetch Latency Core Utilization Normalized Value 1.2 0.8 0.4 0
  • 20. Average n N opt RACT 4.5 1 Active time ratio (RACT) 4 0.9 Number of CTAs 3.5 0.8 3 0.7 0.6 2.5 0.5 2 0.4 1.5 0.3 1 0.2 0.5 0.1 0 0
  • 21. Average n N opt RACT 9 1 Active time ratio (RACT) 8 0.9 Number of CTAs 7 0.8 6 0.7 0.6 5 0.5 4 0.4 3 0.3 2 0.2 1 0.1 0 0
  • 22. Average n N opt RACT 7 1 Active time ratio (RACT) 0.9 6 Number of CTAs 0.8 5 0.7 4 0.6 0.5 3 0.4 2 0.3 0.2 1 0.1 0 0
  • 23. Average n N opt RACT 9 1 Active time ratio (RACT) 8 0.9 Number of CTAs 7 0.8 6 0.7 0.6 5 0.5 4 0.4 3 0.3 2 0.2 1 0.1 0 0
  • 24. Average number of CTAs 0 2 4 6 8 N DYNCTA Optimal
  • 25. 1.4 DYNCTA DYNCORE with power gating DYNCORE without power gating Normalized Power 1.2 1 0.8 0.6 0.4
  • 26. 3 DYNCTA DYNCORE with power gating DYNCORE without power gating 2.9 Normalized Energy 2.6 2.6 2.2 Efficiency 1.8 1.4 1 0.6
  • 27. IPC power energy 2 Normalized Value 1.5 1 0.5 0 4 8 12 16 Number of Cores Turned Off
  • 28. IPC power energy 2 Normalized Value 1.5 1 0.5 0 64 121 Number of Nodes in the System

Editor's Notes

  1. Slides 2-3-4-5 show some motivational figures. We are showing what happens to round trip latency and core utilization and IPC when we vary the number of CTAs on cores.
  2. This will be used in motivation section, as an illustrative example to the core utilization problem caused by executing more concurrent CTAs.
  3. This is used as a motivation to our power algorithm. It shows how the applications respond to turing off cores.Type 1 applications are memory intensive, Type 2 are compute intensive.We see that Type 1 applications can tolerate execution with smaller number of cores better compared to Type 2 applications.
  4. This is the main performance result. We are comparing the baseline, our algorithm and static best results.
  5. Slides 8-9 show the CTA modulation of six benchmarks over time. They are compared with the default CTA number, static best and the overall average of our algorithm over time.Secondary axis shows the ratio of active cycles to the total cycles, indicating how compute intensive that application is at that point of time.
  6. This figure shows the power consumption of benchmarks after we apply our power algorithm (smaller is better).First bar shows the power consumption of baseline, second bar shows the power consumption of the CTA algorithm,Third bar shows the power consumption when we can apply power gating And fourth bar shows the power consumption when there is no power gating.
  7. This is similar to the figure in 11. Instead of power, it is showing performance/power ratio (larger is better).So we can say that this is the energy efficiency graph.