Analysis Tools for Evaluation and Performance

Mourad Bouache
PhD, Computer Architecture
bouache@gmail.com

Oracle, November 14, 2011
Introduction

Processors are increasingly complex
  • Microarchitecture design becomes more difficult.

Simulator: a very important tool
  • Understand how an instruction behaves while it executes in
    the processor.

Complex simulator:
  • Takes time to prepare and modify.
Introduction

Simulation
Simulator




Tool
  • Simulator : very important tool
  • test new concepts


Three characteristics
Complexity of microarchitectures
Modular Simulation
Speed decreases as complexity increases
Contribution : vectorization methodology
Monolithic Simulation




SimpleScalar is the most widely used simulator (in 70% of articles).
Like most other simulators, it has a serious
drawback: it is monolithic.

Advantage
  • simulation speed


Disadvantages
  • Difficult to update.
  • Difficult to extract and compare the simulator components.
Monolithic vs Modular
Modular simulation




Advantages
  • Reuse/ exchange and compare simulator modules,
  • Better confidence in simulation (closer to HW),
  • Easier to read.

Main drawback :
  • Simulation speed slowdown
Outline




1   Modular simulation environment
2   Acceleration techniques
3   Vectorization of Simulator Modules
4   Experimental framework
5   Results
6   Scheduling process in SystemC
7   Conclusion & future work
Modular simulation environments




• A modular simulation environment describes the system to simulate
  hierarchically and structurally.
  To simulate the entire system, the environment includes a
  scheduler that controls the execution of the different components.
• Key benefits: simulator modules can be reused, compared, and shared.
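The hierarchical description and the scheduler can be pictured with a minimal sketch. It is not UNISIM code: `Module`, `Counter`, `Composite`, and `run_cycles` are hypothetical names, and the scheduler is reduced to calling every component once per simulated cycle.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class: one simulated hardware block.
struct Module {
    explicit Module(std::string n) : name(std::move(n)) {}
    virtual ~Module() = default;
    virtual void cycle() = 0;  // work done in one simulated clock cycle
    std::string name;
};

// A leaf module that only counts the cycles it has been run for.
struct Counter : Module {
    using Module::Module;
    void cycle() override { ++ticks; }
    long ticks = 0;
};

// A composite module: the hierarchy is a tree of sub-modules.
struct Composite : Module {
    using Module::Module;
    Module& add(std::unique_ptr<Module> m) {
        children.push_back(std::move(m));
        return *children.back();
    }
    void cycle() override {
        for (auto& child : children) child->cycle();
    }
    std::vector<std::unique_ptr<Module>> children;
};

// Toy scheduler: runs the whole hierarchy for a number of cycles.
inline void run_cycles(Module& top, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) top.cycle();
}
```

Because every block shares the same small interface, a module can be swapped for another implementation or moved to a different simulator without touching its neighbours, which is where the reuse, compare, and share benefits come from.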
Simulation models
acceleration techniques




Acceleration techniques
  • Reducing inputs and simulated programs: MinneSPEC,
  • Optimizing the simulation engine: FastSysC [1] (2x speedup),
  • Distributing the simulation: DisT,
  • Sampling techniques: representative, periodic, and
       random sampling,
  • Moving to Timed Transaction Level Modeling (TTLM).




  [1] Daniel Gracia Perez et al. FastSysC: a fast SystemC engine.
acceleration techniques




Acceleration techniques
    • Compromise between accuracy and simulation speed,
    • Vectorization [2] is a methodology that can be combined with
       these acceleration techniques.




     [2] David Parello, Mourad Bouache, and Bernard Goossens. Improving cycle-level modular simulation by
vectorization. In Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO'09).
Modular simulation environment




UNISIM [3]: a modular simulation framework

  • UNISIM is a modular simulation framework: each simulator
    is divided into several modules, and each module corresponds to
    a hardware block.
  • A module is composed of two parts: state and processes.




  [3] http://www.unisim.org/
Modular simulation environment

UNISIM : A modular simulation framework

  • A process is defined in a .sim file as a C++ class
UNISIM : Communication protocol




Communication protocol
  • Ports : inports and outports
  • Signals


3 signals :
  • Processes can be sensitive to
     the data, the accept and
     the enable signals.
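A minimal sketch of how the three signals could cooperate during one cycle, under the assumption (taken from the Functional Unit example in this deck) that the sender simply mirrors accept onto enable. `Link`, `Receiver`, and `handshake_cycle` are hypothetical names, not UNISIM types.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <queue>

// Hypothetical link carrying the three signals of one connection.
struct Link {
    std::optional<int> data;  // sender -> receiver: value offered this cycle
    bool accept = false;      // receiver -> sender: "I can take it"
    bool enable = false;      // sender -> receiver: "the transfer is confirmed"
};

// Hypothetical receiver with a bounded queue.
struct Receiver {
    explicit Receiver(std::size_t cap) : capacity(cap) {}
    std::size_t capacity;
    std::queue<int> fifo;
};

// One cycle of the handshake; returns true if a value was transferred.
inline bool handshake_cycle(Link& link, std::optional<int> offered, Receiver& rx) {
    assert(rx.capacity > 0);
    // 1. The sender drives data (or nothing).
    link.data = offered;
    // 2. The receiver accepts only if something is offered and it has room.
    link.accept = link.data.has_value() && rx.fifo.size() < rx.capacity;
    // 3. The sender confirms the transfer.
    link.enable = link.accept;
    // 4. The receiver latches the value once the transfer is enabled.
    if (link.enable) rx.fifo.push(*link.data);
    return link.enable;
}
```

Each of the three assignments corresponds to a signal write that may wake a process on the other side, which is why the cost of this protocol matters for simulation speed.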
Communication protocol



UNISIM : signals

  • The simulation engine (SystemC) wakes up the module
    processes.
UNISIM : Communication protocol


Communication between modules
Communication protocol




Communication between modules
Scalability is difficult in modular simulation because of two factors:
  • the cost of communication between simulator modules,
  • the cost of waking up a process for each communicating module.
Communication costs




Monolithic Simulator
  • Write/read a variable.


Modular Simulator
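The difference in communication cost can be sketched as follows. This is a simplified illustration, not the SystemC or UNISIM implementation: `MonolithicSim`, `Signal`, `ModularSim`, and `wakeups_for` are hypothetical names. In the monolithic case a transfer is a variable write and read; in the modular case a write goes through a signal that wakes every process sensitive to it.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Monolithic style: the two pipeline stages share plain variables.
struct MonolithicSim {
    int stage_out = 0;
    int stage_in = 0;
    void transfer(int v) {
        stage_out = v;         // write a variable
        stage_in = stage_out;  // read a variable
    }
};

// Modular style (simplified): a signal notifies every process sensitive to it.
struct Signal {
    int value = 0;
    std::vector<std::function<void()>> sensitive;  // processes to wake on write
    long wakeups = 0;                              // bookkeeping for the comparison
    void write(int v) {
        value = v;
        for (auto& process : sensitive) {  // the engine wakes each sensitive process
            ++wakeups;
            process();
        }
    }
};

struct ModularSim {
    Signal wire;
    int stage_in = 0;
    ModularSim() {
        wire.sensitive.push_back([this] { stage_in = wire.value; });
    }
    ModularSim(const ModularSim&) = delete;  // the process captures `this`
    ModularSim& operator=(const ModularSim&) = delete;
    void transfer(int v) { wire.write(v); }
};

// Number of process wake-ups needed for a given number of transfers.
inline long wakeups_for(int transfers) {
    assert(transfers >= 0);
    ModularSim sim;
    for (int i = 0; i < transfers; ++i) sim.transfer(i);
    return sim.wire.wakeups;
}
```

Both styles deliver the same value; the modular one pays an indirect call and a wake-up per transfer, and that overhead grows with the number of signals.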
A New Communication Protocol



Signals Array
  • Reduce the number of signals,
  • Several values of data, accept, enable temporarily stored in
    signals array.
A New Communication Protocol



Signals Array
  • Extending the communication protocol between modules
    is one way to increase simulation speed.
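A signal array can be sketched as a container that stores several values and notifies its listener once per send() rather than once per value. `SignalArray`, `on_send`, and `sends` are hypothetical names chosen for illustration; the actual UNISIM classes may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical signal array: N values travel together and trigger a single notification.
template <typename T, std::size_t N>
class SignalArray {
public:
    T& operator[](std::size_t i) {
        assert(i < N);
        return values_[i];
    }
    const T& operator[](std::size_t i) const {
        assert(i < N);
        return values_[i];
    }

    // Registers the process to wake when the array is sent.
    void on_send(std::function<void()> listener) { listener_ = std::move(listener); }

    // Called once after all N entries have been written.
    void send() {
        ++sends_;
        if (listener_) listener_();
    }

    long sends() const { return sends_; }

private:
    std::array<T, N> values_{};
    std::function<void()> listener_;
    long sends_ = 0;
};
```

With N separate scalar signals the receiver would be woken up to N times per cycle; with the array it is woken once and then iterates over the N stored values.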
Module Vectorization




A simple and systematic procedure
 1   vectorize the module state and ports,
 2   add a loop around each process,
 3   add calls to send() after the newly added for
     loops.
Example : Functional Unit


    class FunctionalUnit : public module
    { public:
        inclock clock;
        inport<instr> in;
        outport<instr> out;

        FunctionalUnit(const char* name) : module(name)
        { // sensitivity list
          sensitive_pos_method(start_of_cycle) << clock;
          sensitive_neg_method(end_of_cycle) << clock;
          sensitive_method(on_data_accept) << in.data << out.accept;
        }

        // Rising clock edge: offer the oldest ready instruction, if any.
        void start_of_cycle()
        { if (pipeline.is_ready())
            out.data = pipeline.get();
          else out.data.nothing();
        }

        // Combinational: answer the handshake once both inputs are known.
        void on_data_accept()
        { if (in.data.know() && out.accept.know())
          { if (!pipeline.is_full() || out.accept)
              in.accept = true;
            else in.accept = false;
            out.enable = out.accept;
          }
        }

        // Falling clock edge: update the module state.
        void end_of_cycle()
        { if (out.accept) pipeline.pop();
          if (in.enable) pipeline.push(in.data);
          pipeline.run();
        }

      private:
        Fifo<instr> pipeline;
    };
Module Vectorization




    Vectorization procedure
    1. Vectorize module state and ports.

    Before:
        class FunctionalUnit : public module
        { public:
            inclock clock;
            inport<instr> in;
            outport<instr> out;
          ...
          private:
            Fifo<instr> pipeline;

    After:
        class FunctionalUnit : public module
        { public:
            inclock clock;
            inport<instr, NBCFG> in;
            outport<instr, NBCFG> out;
          ...
          private:
            Fifo<instr> pipeline[NBCFG];
Module Vectorization



    Vectorization procedure
    2. Add a loop around the process.

    Before:
        ...
        void start_of_cycle()
        { if (pipeline.is_ready())
            out.data = pipeline.get();
          else out.data.nothing();
        }
        void on_data_accept()
        { if (in.data.know() && out.accept.know())
          { if (!pipeline.is_full() || out.accept)
              in.accept = true;
            else in.accept = false;
            out.enable = out.accept;
          }
        }
        ...

    After:
        ...
        void start_of_cycle()
        { for (int cfg = 0; cfg < NBCFG; cfg++)
          {
            if (pipeline[cfg].is_ready())
              out.data[cfg] = pipeline[cfg].get();
            else out.data[cfg].nothing();
            ...
          }
        }
        void on_data_accept()
        { if (in.data.know() && out.accept.know())
          { for (int cfg = 0; cfg < NBCFG; cfg++)
            { if (!pipeline[cfg].is_full()
                  || out.accept[cfg])
                in.accept[cfg] = true;
              else in.accept[cfg] = false;
              out.enable[cfg] = out.accept[cfg];
              ...
            }
          }
        }
        ...
Module Vectorization



    Vectorization procedure
    3. Add calls to send() after the newly added for loops.

    Before:
        ...
        void start_of_cycle()
        { if (pipeline.is_ready())
            out.data = pipeline.get();
          else out.data.nothing();
        }
        void on_data_accept()
        { if (in.data.know() && out.accept.know())
          { if (!pipeline.is_full() || out.accept)
              in.accept = true;
            else in.accept = false;
            out.enable = out.accept;
          }
        }
        ...

    After:
        ...
        void start_of_cycle()
        { for (int cfg = 0; cfg < NBCFG; cfg++)
          {
            if (pipeline[cfg].is_ready())
              out.data[cfg] = pipeline[cfg].get();
            else out.data[cfg].nothing();
          }
          out.data.send();
        }
        void on_data_accept()
        { if (in.data.know() && out.accept.know())
          { for (int cfg = 0; cfg < NBCFG; cfg++)
            { if (!pipeline[cfg].is_full()
                  || out.accept[cfg])
                in.accept[cfg] = true;
              else in.accept[cfg] = false;
              out.enable[cfg] = out.accept[cfg];
            }
            in.accept.send();
            out.enable.send();
          }
        }
        ...
Example : Vectorized Functional Unit

    class FunctionalUnit : public module
    { public:
        inclock clock;
        inport<instr, NBCFG> in;
        outport<instr, NBCFG> out;

        FunctionalUnit(const char* name) : module(name)
        { // sensitivity list
          sensitive_pos_method(start_of_cycle) << clock;
          sensitive_neg_method(end_of_cycle) << clock;
          sensitive_method(on_data_accept) << in.data << out.accept;
        }

        // Rising clock edge: fill all NBCFG entries, then send once.
        void start_of_cycle()
        { for (int cfg = 0; cfg < NBCFG; cfg++)
          {
            if (pipeline[cfg].is_ready())
              out.data[cfg] = pipeline[cfg].get();
            else out.data[cfg].nothing();
          }
          out.data.send();
        }

        // Combinational: answer the handshake for every configuration.
        void on_data_accept()
        { if (in.data.know() && out.accept.know())
          { for (int cfg = 0; cfg < NBCFG; cfg++)
            { if (!pipeline[cfg].is_full() || out.accept[cfg])
                in.accept[cfg] = true;
              else in.accept[cfg] = false;
              out.enable[cfg] = out.accept[cfg];
            }
            in.accept.send();
            out.enable.send();
          }
        }

        // Falling clock edge: update the state of every configuration.
        void end_of_cycle()
        { for (int cfg = 0; cfg < NBCFG; cfg++)
          { if (out.accept[cfg]) pipeline[cfg].pop();
            // index the input data per configuration, like the other ports
            if (in.enable[cfg]) pipeline[cfg].push(in.data[cfg]);
            pipeline[cfg].run();
          }
        }

      private:
        Fifo<instr> pipeline[NBCFG];
    };
Simulator Vectorization




Multi-cores Simulation
  • In our study, we simulated multi-core configurations with 2, 4, 8,
    16, 32, and 64 cores.
OoOSim : Out of Order Simulator




OoOSim [4] models a generic superscalar out-of-order processor.
The baseline simulator includes a 4-way superscalar core with an L1
instruction cache, an L1 write-back data cache, a bus, and a DRAM.




    [4] Mourad Bouache, David Parello, Bernard Goossens. Acceleration of Modular simulation. In International
Supercomputing Conference (ISC09), Hamburg, Germany, June 2009.
OoOSim : Out of Order Simulator


OoOSim : 12 modules
 1 Fetcher,
 2 AllocatorRenamer,
 3 Dispatcher,
 4 Scheduler,
 5 RegisterFile,
 6 Ret-Broadcast and CDBA:Common Data Bus Arbiter,
 7 IntegerUnit, FloatingPointUnit and AddressGenerationUnit,
 8 LoadStoreQueue,
 9 Data caches L1 and L2,
 10 Instruction cache L1,
 11 Memory DRAM,
 12 Reorder Buffer.
OoOSim : Out of Order Simulator




More than 15,000 lines of code; 12 modules connected through 187 signals.
Benchmarks



Benchmarks : MiBench

  • Simulations were carried out with MiBench, which is divided into six
    suites, each targeting a specific market area for embedded
    applications:
    Automotive, Network, Security, Consumer Devices,
    Office Automation, and Telecommunications.

     Auto./Industrial    Consumer   Office         Network    Security   Telecomm.
     susan (edges)       jpeg       stringsearch   dijkstra   sha        FFT
     susan (corners)     -          -              -          rijndael   -
     susan (smoothing)   -          -              -          -          -
Performance evaluation




Simulation machine
  • Performance evaluation has been carried out on a cluster of
    30 Intel Xeon 5148 dual-core processors clocked at
    2.33GHz with a 4MBytes L2 cache.
Results : simulation speed (without vectorization)
simulation speed (with vectorization)
Results : speedup
Why ... ?




Instrumentation of the FastSysC code

  • Cycle counters (RDTSC: Read Time-Stamp Counter) measure:
      1 the time spent in the FastSysC scheduler (transit time),
      2 the time spent in the processes.
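This kind of instrumentation can be sketched as follows. It is an illustration, not the FastSysC instrumentation itself: `read_ticks`, `CycleProfile`, `timed_process`, and `profiled_run` are hypothetical names. On x86 with GCC or Clang the sketch reads the time-stamp counter through the `__rdtsc()` intrinsic; elsewhere it falls back to `std::chrono::steady_clock`, which is an assumption made only to keep the example portable.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

// Returns a tick count: the TSC on x86 (GCC/Clang), a steady clock elsewhere.
inline std::uint64_t read_ticks() {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Hypothetical accumulator splitting time between scheduler and processes.
struct CycleProfile {
    std::uint64_t scheduler_ticks = 0;
    std::uint64_t process_ticks = 0;
    std::uint64_t process_calls = 0;
};

// Runs one process and attributes the elapsed ticks to the process bucket.
template <typename Process>
void timed_process(CycleProfile& p, Process&& process) {
    const std::uint64_t t0 = read_ticks();
    process();
    p.process_ticks += read_ticks() - t0;
    ++p.process_calls;
}

// Toy scheduler loop: everything outside timed_process counts as scheduler time.
template <typename Process>
void profiled_run(CycleProfile& p, int cycles, Process&& process) {
    assert(cycles >= 0);
    const std::uint64_t start = read_ticks();
    const std::uint64_t before = p.process_ticks;
    for (int c = 0; c < cycles; ++c) timed_process(p, process);
    const std::uint64_t total = read_ticks() - start;
    const std::uint64_t in_process = p.process_ticks - before;
    p.scheduler_ticks += total > in_process ? total - in_process : 0;
}
```

Comparing `scheduler_ticks` with `process_ticks` for a scalar and a vectorized simulator shows where the time goes; the absolute values depend on the machine and are only meaningful relative to each other.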
FastSysC transit time(without/with vectorization)
Conclusion




Results
  • To address the need for faster simulation, we proposed a
    methodology for developing modules in a modular
    simulator.
  • This methodology is based on a new signal communication
    protocol.



The vectorial simulation improves scalability.
Results Discussion




Vectorization ...
  • speeds up simulation,
  • allows resources to be duplicated while limiting the
    scheduler's share of simulation time,
  • can be combined with other speed-up techniques such as
    sampling or reduction of test programs.
Conclusion




Conclusion
Our contribution aims to improve the simulation speed in
modular simulators, offering a simple and systematic
development based on the vectorization of the simulator
modules.
Conclusion



SimpleScalar is not a multi-core simulator
In focus




Other idea ...
  • Vectorization
    We would like to compare the results of this methodology with
    those obtained using Timed Transaction Level Modeling (TTLM).
Merci, Thank you, Tack




QUESTIONS ?
Back-up slides



Post-doc research work
  • Instruction-Level Parallelism (ILP)
    Goal: understand the general structure of an execution and
    the parallelism it offers.
  • PerPi: a tool to measure instruction-level parallelism
      •   http://kenny.univ-perp.fr/PerPi/
      •   A Pin tool (Pin is a free, programmable instrumentation tool from Intel),
      •   computes the instruction dependency graph,
      •   computes, for each instruction in the run, the cycle in which it executes on the ideal
          machine,
      •   analyzes the structure of instruction-level parallelism,
      •   parallelism in loops,
      •   local and global parallelism,
      •   parallelism across function calls ("CALL").
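The ideal-machine computation can be sketched as follows. This is not PerPi code, and it rests on simplifying assumptions: only register read-after-write dependencies are tracked, every instruction has unit latency, and resources are unlimited. `Instr`, `IlpResult`, and `ideal_ilp` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_map>
#include <vector>

struct Instr {
    int dest;                  // destination register, or -1 if none
    std::vector<int> sources;  // source registers
};

struct IlpResult {
    std::vector<int> cycle;  // earliest cycle of each instruction (1-based)
    int critical_path = 0;   // length of the longest dependency chain
    double ilp = 0.0;        // instructions per cycle on the ideal machine
};

// Walks the trace in program order and schedules each instruction one cycle
// after the latest producer of its source registers.
inline IlpResult ideal_ilp(const std::vector<Instr>& trace) {
    IlpResult r;
    std::unordered_map<int, int> ready;  // register -> cycle of its last writer
    for (const Instr& in : trace) {
        assert(in.dest >= -1);
        int c = 1;
        for (int s : in.sources) {
            auto it = ready.find(s);
            if (it != ready.end()) c = std::max(c, it->second + 1);
        }
        if (in.dest >= 0) ready[in.dest] = c;
        r.cycle.push_back(c);
        r.critical_path = std::max(r.critical_path, c);
    }
    if (!trace.empty()) {
        r.ilp = static_cast<double>(trace.size()) / r.critical_path;
    }
    return r;
}
```

For a six-instruction trace whose longest dependency chain has three instructions, the sketch reports an ILP of 2: six instructions complete in three ideal cycles.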
Back-up slides


Pin Tool
Back-up slides




TTLM
Back-up slides


SystemC and FastSysC
SystemC contains a scheduler that manages signals and decides which
processes to start. It supports sequential processes (sensitive to
the clock) and combinational processes (sensitive to input ports).
FastSysC uses a mixture of static and dynamic scheduling to avoid
waking processes unnecessarily, thereby optimizing the simulation
engine.
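The two kinds of processes and the idea of avoiding unnecessary wake-ups can be illustrated with a heavily simplified dynamic scheduler. It is neither SystemC nor FastSysC code: `MiniScheduler` and its methods are hypothetical, there is no static scheduling, and a combinational loop would never settle in this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

class MiniScheduler {
public:
    using Process = std::function<void()>;

    // Creates a signal and returns its identifier.
    std::size_t add_signal(int initial) {
        values_.push_back(initial);
        sensitive_.emplace_back();
        return values_.size() - 1;
    }

    // Combinational process: woken whenever `signal` changes.
    void sensitive_to(std::size_t signal, Process p) {
        assert(signal < values_.size());
        processes_.push_back(std::move(p));
        sensitive_[signal].push_back(processes_.size() - 1);
    }

    // Sequential process: woken once per clock cycle.
    void on_clock(Process p) { clocked_.push_back(std::move(p)); }

    int read(std::size_t signal) const { return values_.at(signal); }

    void write(std::size_t signal, int v) {
        if (values_.at(signal) == v) return;  // no change: nobody is woken
        values_[signal] = v;
        for (std::size_t id : sensitive_[signal]) pending_.push(id);
    }

    // One clock cycle: run clocked processes, then settle combinational ones.
    void cycle() {
        for (auto& p : clocked_) {
            ++wakeups_;
            p();
        }
        while (!pending_.empty()) {
            const std::size_t id = pending_.front();
            pending_.pop();
            ++wakeups_;
            processes_[id]();
        }
    }

    long wakeups() const { return wakeups_; }

private:
    std::vector<int> values_;
    std::vector<std::vector<std::size_t>> sensitive_;
    std::vector<Process> processes_;
    std::vector<Process> clocked_;
    std::queue<std::size_t> pending_;
    long wakeups_ = 0;
};
```

Skipping the notification when a write does not change the value is one simple way to avoid useless wake-ups; the wake-up counter makes the saving visible when a signal stops changing.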
Back-up slides


Monolithic
Back-up slides


Modular
Back-up slides




Parallel Simulation
Back-up slides




Sampling I
Back-up slides




Sampling II
Back-up slides

MiBench
Back-up slides


Use of OoOSim
Back-up slides

Stringsearch
Back-up slides

flight-trace simulation
Back-up slides




execution-driven simulation
Back-up slides




trace-driven simulation
Back-up slides


Unisim Example
Back-up slides


UNISIM History

More Related Content

What's hot

Processes, Threads and Scheduler
Processes, Threads and SchedulerProcesses, Threads and Scheduler
Processes, Threads and Scheduler
Munazza-Mah-Jabeen
 
OS Process and Thread Concepts
OS Process and Thread ConceptsOS Process and Thread Concepts
OS Process and Thread Concepts
sgpraju
 
UVM ARCHITECTURE FOR VERIFICATION
UVM ARCHITECTURE FOR VERIFICATIONUVM ARCHITECTURE FOR VERIFICATION
UVM ARCHITECTURE FOR VERIFICATION
IAEME Publication
 
Windows process-scheduling
Windows process-schedulingWindows process-scheduling
Windows process-scheduling
Talha Shaikh
 
Windows process scheduling presentation
Windows process scheduling presentationWindows process scheduling presentation
Windows process scheduling presentation
Talha Shaikh
 
Esl basics
Esl basicsEsl basics
Esl basics
敬倫 林
 

What's hot (6)

Processes, Threads and Scheduler
Processes, Threads and SchedulerProcesses, Threads and Scheduler
Processes, Threads and Scheduler
 
OS Process and Thread Concepts
OS Process and Thread ConceptsOS Process and Thread Concepts
OS Process and Thread Concepts
 
UVM ARCHITECTURE FOR VERIFICATION
UVM ARCHITECTURE FOR VERIFICATIONUVM ARCHITECTURE FOR VERIFICATION
UVM ARCHITECTURE FOR VERIFICATION
 
Windows process-scheduling
Windows process-schedulingWindows process-scheduling
Windows process-scheduling
 
Windows process scheduling presentation
Windows process scheduling presentationWindows process scheduling presentation
Windows process scheduling presentation
 
Esl basics
Esl basicsEsl basics
Esl basics
 

Viewers also liked

860 presentation
860 presentation860 presentation
860 presentation
Chandler_Carver
 
Beauty service for men
Beauty service for menBeauty service for men
Beauty service for men
Schiller International University
 
Water issues in mozambique
Water issues in mozambiqueWater issues in mozambique
Water issues in mozambique
Schiller International University
 
Rp groom lake
Rp groom lakeRp groom lake
Rp groom lake
RebelLeader
 
Chandler carver 890 presentation
Chandler carver 890 presentationChandler carver 890 presentation
Chandler carver 890 presentation
Chandler_Carver
 
Chandler carver presentation
Chandler carver presentationChandler carver presentation
Chandler carver presentation
Chandler_Carver
 
Jfc fuller
Jfc fullerJfc fuller
Dreams work animations vs goldman sachs
Dreams work animations vs goldman sachsDreams work animations vs goldman sachs
Dreams work animations vs goldman sachs
Schiller International University
 
International marketing complete
International marketing completeInternational marketing complete
International marketing complete
Schiller International University
 
Spansion FL-S Serial NOR Flash Memory
Spansion FL-S Serial NOR Flash MemorySpansion FL-S Serial NOR Flash Memory
Spansion FL-S Serial NOR Flash Memory
Spansion
 
5 cr 2
5 cr 25 cr 2
5 cr 2
tcast
 
LBIX Presentation
LBIX PresentationLBIX Presentation
LBIX Presentation
appointmentset
 
5 cr 1
5 cr 15 cr 1
5 cr 1
tcast
 
5 cr 1
5 cr 15 cr 1
5 cr 1
tcast
 
Fdi presentation ib_final fully
Fdi presentation ib_final fullyFdi presentation ib_final fully
Fdi presentation ib_final fully
Schiller International University
 
Zara
Zara Zara
Junaid jamshed
Junaid jamshedJunaid jamshed
Réseaux Sociaux : Quand le marketing et la DSI partagent leurs expériences.
Réseaux Sociaux : Quand le marketing et la DSI partagent leurs expériences. Réseaux Sociaux : Quand le marketing et la DSI partagent leurs expériences.
Réseaux Sociaux : Quand le marketing et la DSI partagent leurs expériences.
Marie_Estager
 
These pro enass cedric tang 2012
These pro enass cedric tang 2012These pro enass cedric tang 2012
These pro enass cedric tang 2012
cedric1975
 

Viewers also liked (20)

860 presentation
860 presentation860 presentation
860 presentation
 
Beauty service for men
Beauty service for menBeauty service for men
Beauty service for men
 
Water issues in mozambique
Water issues in mozambiqueWater issues in mozambique
Water issues in mozambique
 
Presentac..
Presentac..Presentac..
Presentac..
 
Rp groom lake
Rp groom lakeRp groom lake
Rp groom lake
 
Chandler carver 890 presentation
Chandler carver 890 presentationChandler carver 890 presentation
Chandler carver 890 presentation
 
Chandler carver presentation
Chandler carver presentationChandler carver presentation
Chandler carver presentation
 
Jfc fuller
Jfc fullerJfc fuller
Jfc fuller
 
Dreams work animations vs goldman sachs
Dreams work animations vs goldman sachsDreams work animations vs goldman sachs
Dreams work animations vs goldman sachs
 
International marketing complete
International marketing completeInternational marketing complete
International marketing complete
 
Spansion FL-S Serial NOR Flash Memory
Spansion FL-S Serial NOR Flash MemorySpansion FL-S Serial NOR Flash Memory
Spansion FL-S Serial NOR Flash Memory
 
5 cr 2
5 cr 25 cr 2
5 cr 2
 
LBIX Presentation
LBIX PresentationLBIX Presentation
LBIX Presentation
 
5 cr 1
5 cr 15 cr 1
5 cr 1
 
5 cr 1
5 cr 15 cr 1
5 cr 1
 
Fdi presentation ib_final fully
Fdi presentation ib_final fullyFdi presentation ib_final fully
Fdi presentation ib_final fully
 
Zara
Zara Zara
Zara
 
Junaid jamshed
Junaid jamshedJunaid jamshed
Junaid jamshed
 
Réseaux Sociaux : Quand le marketing et la DSI partagent leurs expériences.
Réseaux Sociaux : Quand le marketing et la DSI partagent leurs expériences. Réseaux Sociaux : Quand le marketing et la DSI partagent leurs expériences.
Réseaux Sociaux : Quand le marketing et la DSI partagent leurs expériences.
 
These pro enass cedric tang 2012
These pro enass cedric tang 2012These pro enass cedric tang 2012
These pro enass cedric tang 2012
 

Similar to Tools for analysis and evaluation of CPU Performance

NoC simulators presentation
NoC simulators presentationNoC simulators presentation
NoC simulators presentation
Hossam Hassan
 
Intro to LV in 3 Hours for Control and Sim 8_5.pptx
Intro to LV in 3 Hours for Control and Sim 8_5.pptxIntro to LV in 3 Hours for Control and Sim 8_5.pptx
Intro to LV in 3 Hours for Control and Sim 8_5.pptx
DeepakJangid87
 
Making Model-Driven Verification Practical and Scalable: Experiences and Less...
Making Model-Driven Verification Practical and Scalable: Experiences and Less...Making Model-Driven Verification Practical and Scalable: Experiences and Less...
Making Model-Driven Verification Practical and Scalable: Experiences and Less...
Lionel Briand
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
lucenerevolution
 
A_Brief_Summary_on_Summer_Courses[1]
A_Brief_Summary_on_Summer_Courses[1]A_Brief_Summary_on_Summer_Courses[1]
A_Brief_Summary_on_Summer_Courses[1]
Gayatri Kindo
 
Fault tolerance
Fault toleranceFault tolerance
Fault tolerance
Michał Waleszczuk
 
Unit i
Unit iUnit i
Discrete event simulation
Discrete event simulationDiscrete event simulation
Discrete event simulation
ssusera970cc
 
Enabling Model Testing of Cyber Physical Systems
Enabling Model Testing of Cyber Physical SystemsEnabling Model Testing of Cyber Physical Systems
Enabling Model Testing of Cyber Physical Systems
Lionel Briand
 
8051 Embedded Programming in C - Book-II
8051 Embedded Programming in C - Book-II8051 Embedded Programming in C - Book-II
8051 Embedded Programming in C - Book-II
handson28
 
ScilabTEC 2015 - Noesis Solutions
ScilabTEC 2015 - Noesis SolutionsScilabTEC 2015 - Noesis Solutions
ScilabTEC 2015 - Noesis Solutions
Scilab
 
Cloud data management
Cloud data managementCloud data management
Cloud data management
ambitlick
 
UVM_Full_Print_n.pptx
UVM_Full_Print_n.pptxUVM_Full_Print_n.pptx
UVM_Full_Print_n.pptx
nikitha992646
 
Soc.pptx
Soc.pptxSoc.pptx
Soc.pptx
Jagu Mounica
 
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...
Lionel Briand
 
Scan insertion
Scan insertionScan insertion
Scan insertion
kumar gavanurmath
 
OPAL-RT HYPERSIM Features applied for Relay Testing
OPAL-RT HYPERSIM Features applied for Relay TestingOPAL-RT HYPERSIM Features applied for Relay Testing
OPAL-RT HYPERSIM Features applied for Relay Testing
OPAL-RT TECHNOLOGIES
 
DCS_Check-Out_and_Operator_Training_with_HYSYS_Dynamics_White_v1.3.pdf
DCS_Check-Out_and_Operator_Training_with_HYSYS_Dynamics_White_v1.3.pdfDCS_Check-Out_and_Operator_Training_with_HYSYS_Dynamics_White_v1.3.pdf
DCS_Check-Out_and_Operator_Training_with_HYSYS_Dynamics_White_v1.3.pdf
Okeke Livinus
 
SOC System Design Approach
SOC System Design ApproachSOC System Design Approach
SOC System Design Approach
A B Shinde
 
Fuzzing101 - webinar on Fuzzing Performance
Fuzzing101 - webinar on Fuzzing PerformanceFuzzing101 - webinar on Fuzzing Performance
Fuzzing101 - webinar on Fuzzing Performance
Codenomicon
 

Similar to Tools for analysis and evaluation of CPU Performance (20)

NoC simulators presentation
NoC simulators presentationNoC simulators presentation
NoC simulators presentation
 
Intro to LV in 3 Hours for Control and Sim 8_5.pptx
Intro to LV in 3 Hours for Control and Sim 8_5.pptxIntro to LV in 3 Hours for Control and Sim 8_5.pptx
Intro to LV in 3 Hours for Control and Sim 8_5.pptx
 
Making Model-Driven Verification Practical and Scalable: Experiences and Less...
Making Model-Driven Verification Practical and Scalable: Experiences and Less...Making Model-Driven Verification Practical and Scalable: Experiences and Less...
Making Model-Driven Verification Practical and Scalable: Experiences and Less...
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
A_Brief_Summary_on_Summer_Courses[1]
A_Brief_Summary_on_Summer_Courses[1]A_Brief_Summary_on_Summer_Courses[1]
A_Brief_Summary_on_Summer_Courses[1]
 
Fault tolerance

Tools for analysis and evaluation of CPU Performance

  • 1. Analysis tools for Evaluation and Performance Mourad Bouache PhD, Computer Architecture bouache@gmail.com Oracle - Nov, 14-2011
  • 2. Introduction Processors are increasingly complex • Microarchitecture design is more difficult. Simulator: a very important tool • Helps understand the behavior of an instruction during its execution in the processor. Complex simulator: • Takes time to prepare and modify.
  • 5. Simulator Tool • Simulator : very important tool • test new concepts Three characteristics
  • 8. Speed decreases as complexity increases
  • 11. Monolithic Simulation SimpleScalar is the most widely used simulator (in 70% of articles). This simulator, like most others, has a serious drawback: it is monolithic. Advantage • Simulation speed. Disadvantages • Difficult to update. • Difficult to extract and compare the simulator components.
  • 16. Modular simulation Advantages • Simulator modules can be reused, exchanged, and compared, • Better confidence in the simulation (closer to HW), • Easier to read. Main drawback: • Slower simulation.
  • 17. Outline 1 Modular simulation environment 2 Acceleration techniques 3 Vectorization of Simulator Modules 4 Experimental framework 5 Results 6 Scheduling process in SystemC 7 Conclusion & future works.
  • 19. Modular simulation environments • A modular simulation environment describes the system to simulate hierarchically and structurally. To simulate the entire system, the environment includes a scheduler that controls the execution of the different components. • Key benefits :
  • 24. Acceleration techniques • Reduction of inputs and simulation programs: MinneSPEC, • Simulation engine optimization: FastSysC [1] (2x speedup), • Distribution of simulation: DisT, • Sampling techniques: representative, periodic, and random sampling, • Transition to TTLM (Timed Transaction Level Modeling). [1] Daniel Gracia Perez et al. FastSysC: a fast SystemC engine.
  • 26. Acceleration techniques • Compromise between accuracy and simulation speed, • Vectorization [2] is a methodology that can be used together with one of these acceleration techniques. [2] David Parello, Mourad Bouache, and Bernard Goossens. Improving cycle-level modular simulation by vectorization. In Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO'09).
  • 28. Modular simulation environment UNISIM [3]: a modular simulation framework • UNISIM is a modular simulation framework: each simulator is divided into several modules, and each module corresponds to a hardware block. • A module is composed of two parts: state and processes. [3] http://www.unisim.org/
  • 29. Modular simulation environment UNISIM : A modular simulation framework • A process is defined in a .sim file as a C++ class
  • 31. UNISIM: Communication protocol • Ports: inports and outports • Signals: 3 signals • Processes can be sensitive to the data, accept, and enable signals.
  • 32. Communication protocol UNISIM: signals • The simulation engine (SystemC) wakes up the module processes.
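
The three-signal handshake can be illustrated with a small standalone sketch. It is not UNISIM code: `Link`, `Producer`, `Consumer`, and `run_cycle` are hypothetical names, the payload is a plain `int`, and the processes are called in a fixed order instead of being woken by SystemC. The producer offers a value on data, the consumer answers on accept, the producer confirms on enable, and the transfer is committed at the end of the cycle.

```cpp
#include <cstddef>
#include <queue>

// Hypothetical stand-in for one connection: the three signals of the protocol.
struct Link {
    bool has_data = false;  // data signal carries something this cycle
    int data = 0;           // producer -> consumer: value offered
    bool accept = false;    // consumer -> producer: "I can take it"
    bool enable = false;    // producer -> consumer: "the transfer is confirmed"
};

struct Producer {
    std::queue<int> pending;
    // Offer the oldest pending value, if any.
    void start_of_cycle(Link& link) {
        link.has_data = !pending.empty();
        if (link.has_data) link.data = pending.front();
    }
    // Confirm the transfer only if something was offered and accepted.
    void on_accept(Link& link) { link.enable = link.has_data && link.accept; }
    void end_of_cycle(const Link& link) { if (link.enable) pending.pop(); }
};

struct Consumer {
    std::size_t capacity = 1;
    std::queue<int> buffer;
    // Accept only when there is data and room to store it.
    void on_data(Link& link) { link.accept = link.has_data && buffer.size() < capacity; }
    void end_of_cycle(const Link& link) { if (link.enable) buffer.push(link.data); }
};

// One simulated cycle, in the order an engine would wake the processes.
inline void run_cycle(Producer& p, Consumer& c, Link& link) {
    p.start_of_cycle(link);
    c.on_data(link);
    p.on_accept(link);
    p.end_of_cycle(link);
    c.end_of_cycle(link);
}
```

With a consumer of capacity 1, the first call to `run_cycle` moves one value; on the second call the consumer is full, so accept and enable stay false and the value remains with the producer.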
  • 33. UNISIM : Communication protocol Communication between modules
  • 42. Communication protocol Communication between modules Scalability is difficult to achieve in a modular simulation, for two reasons: • The cost of communication between the simulator modules. • The process wake-up required for each communicating module.
  • 44. Communication costs Monolithic Simulator • Write/read a variable. Modular Simulator
  • 45. A New Communication Protocol Signals Array • Reduce the number of signals, • Several values of data, accept, enable temporarily stored in signals array.
  • 46. A New Communication Protocol Signals Array • Extending the communication protocol between modules is one way to increase simulation speed.
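
A rough way to see why a signals array helps is to count process wake-ups. The sketch below is an illustration under simplifying assumptions, not the UNISIM or SystemC implementation: `Engine`, `Signal`, and `SignalArray` are hypothetical types, and the engine wakes a listener immediately on each write rather than going through SystemC's evaluate/update phases. It compares N module instances, each with its own scalar signal, against one vectorized module that stores N values and notifies once.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical mini-engine that only counts process wake-ups.
struct Engine {
    std::size_t wakeups = 0;
    void wake(const std::function<void()>& process) {
        ++wakeups;
        process();
    }
};

// Scalar signal: each write wakes the process sensitive to it.
struct Signal {
    Engine* engine = nullptr;
    std::function<void()> listener;
    int value = 0;
    void write(int v) {
        value = v;
        engine->wake(listener);
    }
};

// Signal array: values are stored temporarily; one send() wakes the listener once.
template <std::size_t N>
struct SignalArray {
    Engine* engine = nullptr;
    std::function<void()> listener;
    std::array<int, N> values{};
    int& operator[](std::size_t i) { return values[i]; }
    void send() { engine->wake(listener); }
};

// N module instances, one scalar signal and one write each.
template <std::size_t N>
std::size_t wakeups_with_scalar_signals() {
    Engine engine;
    std::vector<Signal> signals(N);
    for (Signal& s : signals) {
        s.engine = &engine;
        s.listener = [] {};
    }
    for (std::size_t i = 0; i < N; ++i) signals[i].write(static_cast<int>(i));
    return engine.wakeups;
}

// One vectorized module: N values written, a single notification.
template <std::size_t N>
std::size_t wakeups_with_signal_array() {
    Engine engine;
    SignalArray<N> signal;
    signal.engine = &engine;
    signal.listener = [] {};
    for (std::size_t i = 0; i < N; ++i) signal[i] = static_cast<int>(i);
    signal.send();
    return engine.wakeups;
}
```

In this toy model the scalar version performs one wake-up per configuration, while the array version performs one wake-up regardless of N.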
  • 47. Module Vectorization A simple and systematic procedure: 1. vectorize the module state and ports, 2. add a loop around each process, 3. add calls to send() after the added for loops.
  • 48. Example: Functional Unit
        class FunctionalUnit : public module
        { public:
            inclock clock;
            inport<instr> in;
            outport<instr> out;
            FunctionalUnit(const char *name) : module(name)
            { sensitive_pos_method(start_of_cycle) << clock;
              sensitive_neg_method(end_of_cycle) << clock;
              sensitive_method(on_data_accept) << in.data << out.accept;
            }
            void start_of_cycle()
            { if (pipeline.is_ready())
                out.data = pipeline.get();
              else out.data.nothing();
            }
            void on_data_accept()
            { if (in.data.know() && out.accept.know())
              { if (!pipeline.is_full() || out.accept)
                  in.accept = true;
                else in.accept = false;
                out.enable = out.accept;
              }
            }
            void end_of_cycle()
            { if (out.accept) pipeline.pop();
              if (in.enable) pipeline.push(in.data);
              pipeline.run();
            }
          private:
            Fifo<instr> pipeline;
        };
  • 49. Module Vectorization Vectorization procedure 1. vectorize module state and ports.
        Original:
        class FunctionalUnit : public module
        { public:
            inclock clock;
            inport<instr> in;
            outport<instr> out;
            ...
          private:
            Fifo<instr> pipeline;

        Vectorized:
        class FunctionalUnit : public module
        { public:
            inclock clock;
            inport<instr, NBCFG> in;
            outport<instr, NBCFG> out;
            ...
          private:
            Fifo<instr> pipeline[NBCFG];
  • 50. Module Vectorization Vectorization procedure 2. add a loop around the process.
        Original:
        ...
        void start_of_cycle()
        { if (pipeline.is_ready())
            out.data = pipeline.get();
          else out.data.nothing();
        }
        void on_data_accept()
        { if (in.data.know() && out.accept.know())
          { if (!pipeline.is_full() || out.accept)
              in.accept = true;
            else in.accept = false;
            out.enable = out.accept;
          }
        }
        ...

        Vectorized:
        ...
        void start_of_cycle()
        { for (int cfg = 0; cfg < NBCFG; cfg++)
          {
            if (pipeline[cfg].is_ready())
              out.data[cfg] = pipeline[cfg].get();
            else out.data[cfg].nothing();
            ...
          }
        }
        void on_data_accept()
        { if (in.data.know() && out.accept.know())
          { for (int cfg = 0; cfg < NBCFG; cfg++)
            { if (!pipeline[cfg].is_full()
                  || out.accept[cfg])
                in.accept[cfg] = true;
              else in.accept[cfg] = false;
              out.enable[cfg] = out.accept[cfg];
              ...
            }
          }
        }
        ...
  • 51. Module Vectorization Vectorization procedure 3. add method calls to send() following the addition of for loops.
        Original:
        ...
        void start_of_cycle()
        { if (pipeline.is_ready())
            out.data = pipeline.get();
          else out.data.nothing();
        }
        void on_data_accept()
        { if (in.data.know() && out.accept.know())
          { if (!pipeline.is_full() || out.accept)
              in.accept = true;
            else in.accept = false;
            out.enable = out.accept;
          }
        }
        ...

        Vectorized:
        ...
        void start_of_cycle()
        { for (int cfg = 0; cfg < NBCFG; cfg++)
          {
            if (pipeline[cfg].is_ready())
              out.data[cfg] = pipeline[cfg].get();
            else out.data[cfg].nothing();
          }
          out.data.send();
        }
        void on_data_accept()
        { if (in.data.know() && out.accept.know())
          { for (int cfg = 0; cfg < NBCFG; cfg++)
            { if (!pipeline[cfg].is_full()
                  || out.accept[cfg])
                in.accept[cfg] = true;
              else in.accept[cfg] = false;
              out.enable[cfg] = out.accept[cfg];
            }
            in.accept.send();
            out.enable.send();
          }
        }
        ...
  • 52. Example: Vectorized Functional Unit
        class FunctionalUnit : public module
        { public:
            inclock clock;
            inport<instr, NBCFG> in;
            outport<instr, NBCFG> out;
            FunctionalUnit(const char *name) : module(name)
            { // sensitivity list
              sensitive_pos_method(start_of_cycle) << clock;
              sensitive_neg_method(end_of_cycle) << clock;
              sensitive_method(on_data_accept) << in.data << out.accept;
            }
            void start_of_cycle()
            { for (int cfg = 0; cfg < NBCFG; cfg++)
              {
                if (pipeline[cfg].is_ready())
                  out.data[cfg] = pipeline[cfg].get();
                else out.data[cfg].nothing();
              }
              out.data.send();
            }
            void on_data_accept()
            { if (in.data.know() && out.accept.know())
              { for (int cfg = 0; cfg < NBCFG; cfg++)
                { if (!pipeline[cfg].is_full() || out.accept[cfg])
                    in.accept[cfg] = true;
                  else in.accept[cfg] = false;
                  out.enable[cfg] = out.accept[cfg];
                }
                in.accept.send();
                out.enable.send();
              }
            }
            void end_of_cycle()
            { for (int cfg = 0; cfg < NBCFG; cfg++)
              { if (out.accept[cfg]) pipeline[cfg].pop();
                if (in.enable[cfg]) pipeline[cfg].push(in.data[cfg]);
                pipeline[cfg].run();
              }
            }
          private:
            Fifo<instr> pipeline[NBCFG];
        };
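
The listings on these slides depend on UNISIM types (module, inport, outport, Fifo) and cannot be compiled on their own. As a rough standalone approximation, the sketch below applies the same three steps to a deliberately simplified unit and checks that the vectorized version behaves like NBCFG independent scalar units. All names (`ScalarUnit`, `VectorUnit`, `vectorized_matches_scalar`), the value of `NBCFG`, the `int` payload, and the simplified "emit the oldest entry each cycle" logic are assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <deque>

constexpr std::size_t NBCFG = 4;  // assumed number of simulated configurations

// Scalar module: one FIFO and one output slot, as before vectorization.
struct ScalarUnit {
    std::deque<int> pipeline;
    bool out_valid = false;
    int out_data = 0;

    void start_of_cycle() {
        out_valid = !pipeline.empty();
        if (out_valid) {
            out_data = pipeline.front();
            pipeline.pop_front();
        }
    }
};

// Vectorized module: the same logic applied to NBCFG configurations per call.
struct VectorUnit {
    std::array<std::deque<int>, NBCFG> pipeline;  // step 1: state becomes an array
    std::array<bool, NBCFG> out_valid{};          // step 1: ports become arrays
    std::array<int, NBCFG> out_data{};
    std::size_t sends = 0;                        // counts signal-array notifications

    void start_of_cycle() {
        for (std::size_t cfg = 0; cfg < NBCFG; ++cfg) {  // step 2: loop over configurations
            out_valid[cfg] = !pipeline[cfg].empty();
            if (out_valid[cfg]) {
                out_data[cfg] = pipeline[cfg].front();
                pipeline[cfg].pop_front();
            }
        }
        ++sends;  // step 3: stands in for a single send() after the loop
    }
};

// Runs both versions on identical inputs and checks that they agree every cycle.
inline bool vectorized_matches_scalar(std::size_t cycles) {
    std::array<ScalarUnit, NBCFG> scalar;
    VectorUnit vec;
    for (std::size_t cfg = 0; cfg < NBCFG; ++cfg) {
        for (std::size_t i = 0; i < cfg; ++i) {  // configuration cfg starts with cfg entries
            const int value = static_cast<int>(10 * cfg + i);
            scalar[cfg].pipeline.push_back(value);
            vec.pipeline[cfg].push_back(value);
        }
    }
    for (std::size_t c = 0; c < cycles; ++c) {
        vec.start_of_cycle();
        for (std::size_t cfg = 0; cfg < NBCFG; ++cfg) {
            scalar[cfg].start_of_cycle();
            if (scalar[cfg].out_valid != vec.out_valid[cfg]) return false;
            if (scalar[cfg].out_valid && scalar[cfg].out_data != vec.out_data[cfg]) return false;
        }
    }
    return vec.sends == cycles;  // one notification per cycle, whatever NBCFG is
}
```

Calling `vectorized_matches_scalar(5)` returns true when every configuration produces the same outputs in both versions and the vectorized unit has notified exactly once per cycle.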
  • 53. Simulator Vectorization Multi-core simulation • In our study, we simulated multi-core configurations with 2, 4, 8, 16, 32, and 64 cores.
  • 54. OoOSim: Out of Order Simulator OoOSim [4] models a generic superscalar out-of-order processor. The baseline simulator includes a 4-way superscalar core with an L1 instruction cache, an L1 write-back data cache, a bus, and a DRAM. [4] Mourad Bouache, David Parello, Bernard Goossens. Acceleration of Modular simulation. In International Supercomputing Conference (ISC09), Hamburg, Germany, June 2009.
  • 55. OoOSim: Out of Order Simulator OoOSim: 12 modules 1. Fetcher, 2. AllocatorRenamer, 3. Dispatcher, 4. Scheduler, 5. RegisterFile, 6. Ret-Broadcast and CDBA (Common Data Bus Arbiter), 7. IntegerUnit, FloatingPointUnit, and AddressGenerationUnit, 8. LoadStoreQueue, 9. Data caches L1 and L2, 10. Instruction cache L1, 11. Memory DRAM, 12. Reorder Buffer.
  • 56. OoOSim: Out of Order Simulator More than 15,000 lines of code; 12 modules connected through 187 signals.
  • 57. Benchmarks: MiBench • Simulations were carried out with MiBench, which is divided into six suites, each targeting a specific embedded-application market: Automotive, Network, Security, Consumer Devices, Office Automation, and Telecommunications.
        Auto./Industrial     Consumer   Office         Network    Security   Telecomm.
        susan (edges)        jpeg       stringsearch   dijkstra   sha        FFT
        susan (corners)      -          -              -          rijndael   -
        susan (smoothing)    -          -              -          -          -
  • 58. Performance evaluation Simulation machine • Performance evaluation has been carried out on a cluster of 30 Intel Xeon 5148 dual-core processors clocked at 2.33GHz with a 4MBytes L2 cache.
  • 59. Results : simulation speed (without vectorization)
  • 60. simulation speed (with vectorization)
  • 62. Why ... ? Instrumentation of the FastSysC code • Cycle counters (RDTSC: Read Time Stamp Counter) measure: 1. the time spent in the FastSysC scheduler, 2. the time spent in the processes.
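
Cycle-counter instrumentation of this kind can be sketched as follows. This is not the instrumentation actually added to FastSysC: `read_cycles`, `Profile`, and `timed` are hypothetical names. The sketch assumes GCC or Clang on x86, where `__rdtsc()` is provided by `<x86intrin.h>`; on other targets it falls back to `std::chrono::steady_clock`, which is an addition made here for portability and measures clock ticks rather than CPU cycles.

```cpp
#include <chrono>
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // __rdtsc() on GCC and Clang
#endif

// Reads a monotonically increasing counter: the time stamp counter on x86,
// steady_clock ticks elsewhere.
inline std::uint64_t read_cycles() {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Two accumulators, matching the two quantities measured on the slide.
struct Profile {
    std::uint64_t scheduler = 0;  // time spent in the scheduler
    std::uint64_t process = 0;    // time spent in module processes
};

// Runs fn and adds its duration, in counter units, to the given accumulator.
template <typename Fn>
void timed(std::uint64_t& bucket, Fn&& fn) {
    const std::uint64_t start = read_cycles();
    fn();
    bucket += read_cycles() - start;
}
```

A scheduler loop would wrap its own bookkeeping in `timed(profile.scheduler, ...)` and each process call in `timed(profile.process, ...)`; comparing the two totals shows where simulation time goes.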
  • 64. Conclusion Results • To address the need for faster simulation, we proposed a methodology for developing modules in a modular simulator. • This methodology is based on a new signal communication protocol. Vectorized simulation improves scalability.
  • 65. Results Discussion Vectorization ... • improves simulation speed, • allows resources to be duplicated while limiting the scheduler overhead in simulation time, • can be used in conjunction with other speed-up techniques, such as sampling or reduction of test programs.
  • 67. Conclusion Our contribution aims to improve simulation speed in modular simulators by offering a simple and systematic development methodology based on the vectorization of the simulator modules.
  • 68. Conclusion Simplescalar is not a multi-core simulator
  • 70. In focus Other idea ... • Vectorization We wish to compare the results of this methodology using TTLM modeling (Timed Transaction Level Modeling).
  • 71. Merci, Thank you, Tack QUESTIONS ?
  • 72. Back-up slides Post-doc research work • Instruction Level Parallelism (ILP). Goal: understand the general structure of an execution and the parallelism it offers. • PerPi: A Tool to Measure Instruction Level Parallelism • http://kenny.univ-perp.fr/PerPi/ • A Pin tool (Pin is a free programmable tool from Intel), • computes the instruction dependency graph, • computes, for each instruction in the run, its instruction cycle in the ideal machine, • Analysis of the structure of instruction-level parallelism, • Parallelism on loops, • Local and global parallelism, • Parallelism on function calls ("CALL").
  • 75. Back-up slides SystemC and FastSysC SystemC contains a scheduler that manages signals and decides which processes to start. It has sequential processes (sensitive to the clock) and combinational processes (sensitive to input ports). FastSysC uses a mixture of static and dynamic scheduling to avoid unnecessary process wake-ups, and thus optimizes the simulation engine.